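The hadoop-huaweicloud module provides support for integration with the HuaweiCloud Object Storage Service (OBS). This support is delivered through the JAR file hadoop-huaweicloud.jar.
File system paths are referenced by URLs that use the obs scheme. The module implements the standard Hadoop FileSystem interface and presents a hierarchical file system view. Some file system operations are only partially supported or not supported at all.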
OBSA depends on two JARs, alongside hadoop-common and its dependencies: the hadoop-huaweicloud JAR and the esdk-obs-java JAR.
The versions of hadoop-common and hadoop-huaweicloud must be identical.
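To import the library into a Maven build, add the hadoop-huaweicloud JAR to the build dependencies; it pulls in a compatible esdk-obs-java JAR automatically.
The hadoop-huaweicloud JAR does not declare any dependencies other than the OBS SDK JAR that is specific to it. This simplifies excluding or tuning Hadoop dependency JARs in downstream applications. A hadoop-client or hadoop-common dependency must therefore be declared explicitly.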
<properties>
<!-- Your exact Hadoop version here-->
<hadoop.version>3.4.0</hadoop.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-huaweicloud</artifactId>
<version>${hadoop.version}</version>
</dependency>
</dependencies>
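Before accessing a URL, configure the OBS implementation classes of FileSystem and AbstractFileSystem, and the region endpoint where the bucket is located, as follows: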
<property>
<name>fs.obs.impl</name>
<value>org.apache.hadoop.fs.obs.OBSFileSystem</value>
<description>The OBS implementation class of the Filesystem.</description>
</property>
<property>
<name>fs.AbstractFileSystem.obs.impl</name>
<value>org.apache.hadoop.fs.obs.OBS</value>
<description>The OBS implementation class of the AbstractFileSystem.</description>
</property>
<property>
<name>fs.obs.endpoint</name>
<value>obs.region.myhuaweicloud.com</value>
<description>OBS region endpoint where a bucket is located.</description>
</property>
OBS URLs can be accessed as follows:
obs://<bucket_name>/path
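The obs scheme identifies a URL on OBSFileSystem, the Hadoop-compatible file system backed by HuaweiCloud OBS. For example, the following FileSystem Shell commands demonstrate access to a bucket named mybucket.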
hadoop fs -mkdir obs://mybucket/testDir
hadoop fs -put testFile obs://mybucket/testDir/testFile
hadoop fs -cat obs://mybucket/testDir/testFile
test file content
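The obs://<bucket_name>/path layout maps onto the generic URI components: the scheme is obs, the authority is the bucket name, and the rest is the path inside the bucket. The following is a minimal sketch that uses only java.net.URI from the JDK, not the OBSA classes; the class name ObsUrlParts and its methods are hypothetical helpers introduced for illustration.

```java
import java.net.URI;

/** Hypothetical helper: splits an obs:// URL into bucket and path with java.net.URI. */
final class ObsUrlParts {

    /** Returns the bucket name, which is the authority component of the URL. */
    static String bucket(String url) {
        URI uri = URI.create(url);
        if (!"obs".equals(uri.getScheme())) {
            throw new IllegalArgumentException("not an obs URL: " + url);
        }
        return uri.getAuthority();
    }

    /** Returns the path inside the bucket, including the leading slash. */
    static String path(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String url = "obs://mybucket/testDir/testFile";
        System.out.println(bucket(url)); // prints "mybucket"
        System.out.println(path(url));   // prints "/testDir/testFile"
    }
}
```

This only illustrates how the URL is structured; actual reads and writes go through the Hadoop FileSystem API or the shell commands shown above.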
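For details on how to create a bucket, see Help Center > Object Storage Service > Getting Started > Basic Operation Procedure.
Except when interacting with public OBS buckets, the OBSA client needs credentials to interact with buckets. The client supports multiple authentication mechanisms. The simplest is to provide the OBS access key and secret key, as follows.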
<property>
<name>fs.obs.access.key</name>
<description>OBS access key. Omit for provider-based authentication.</description>
</property>
<property>
<name>fs.obs.secret.key</name>
<description>OBS secret key. Omit for provider-based authentication.</description>
</property>
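Do not share the access key, secret key, or session token. They must be kept strictly confidential.
A custom implementation of com.obs.services.IObsCredentialsProvider (see Creating an Instance of ObsClient) or of org.apache.hadoop.fs.obs.BasicSessionCredential can also be used for authentication.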
<property>
<name>fs.obs.security.provider</name>
<description>
Class name of security provider class which implements
com.obs.services.IObsCredentialsProvider, which will
be used to construct an OBS client instance as an input parameter.
</description>
</property>
<property>
<name>fs.obs.credentials.provider</name>
<description>
Class name of credential provider class which implements
org.apache.hadoop.fs.obs.BasicSessionCredential,
which must override three APIs: getOBSAccessKeyId(),
getOBSSecretKey(), and getSessionToken().
</description>
</property>
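The three methods named in the fs.obs.credentials.provider description can be sketched as follows. This is an illustration under stated assumptions, not the OBSA implementation: SessionCredential is a local stand-in for org.apache.hadoop.fs.obs.BasicSessionCredential so that the example compiles without Hadoop on the classpath, and MapSessionCredential as well as the key names OBS_ACCESS_KEY_ID, OBS_SECRET_KEY, and OBS_SESSION_TOKEN are hypothetical. How OBSA instantiates the configured class (for example, which constructor it calls) is not covered in this section.

```java
import java.util.Map;

/** Hypothetical provider that looks up temporary credentials in a key/value map. */
final class MapSessionCredential implements SessionCredential {
    private final Map<String, String> source;

    MapSessionCredential(Map<String, String> source) {
        this.source = source;
    }

    @Override
    public String getOBSAccessKeyId() {
        return require("OBS_ACCESS_KEY_ID");
    }

    @Override
    public String getOBSSecretKey() {
        return require("OBS_SECRET_KEY");
    }

    @Override
    public String getSessionToken() {
        return require("OBS_SESSION_TOKEN");
    }

    /** Fails fast with a clear message instead of returning a null credential. */
    private String require(String name) {
        String value = source.get(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("missing credential: " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        // Placeholder values only; real keys must never be hard-coded.
        SessionCredential cred = new MapSessionCredential(Map.of(
                "OBS_ACCESS_KEY_ID", "demo-access-key",
                "OBS_SECRET_KEY", "demo-secret-key",
                "OBS_SESSION_TOKEN", "demo-session-token"));
        System.out.println("access key id: " + cred.getOBSAccessKeyId());
    }
}

/**
 * Local stand-in (assumption) for org.apache.hadoop.fs.obs.BasicSessionCredential,
 * declaring the three methods listed in the property description.
 */
interface SessionCredential {
    String getOBSAccessKeyId();
    String getOBSSecretKey();
    String getSessionToken();
}
```

A real provider would implement org.apache.hadoop.fs.obs.BasicSessionCredential instead of the stand-in, and its fully qualified class name would be the value of fs.obs.credentials.provider. Passing System.getenv() as the map is one way to keep secrets out of configuration files.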
All OBSA client options are configured through properties whose names start with the prefix fs.obs. (trailing dot included).
<property>
<name>fs.obs.connection.ssl.enabled</name>
<value>false</value>
<description>Enable or disable SSL connections to OBS.</description>
</property>
<property>
<name>fs.obs.connection.maximum</name>
<value>1000</value>
<description>Maximum number of simultaneous connections to OBS.</description>
</property>
<property>
<name>fs.obs.connection.establish.timeout</name>
<value>120000</value>
<description>Socket connection setup timeout in milliseconds.</description>
</property>
<property>
<name>fs.obs.connection.timeout</name>
<value>120000</value>
<description>Socket connection timeout in milliseconds.</description>
</property>
<property>
<name>fs.obs.idle.connection.time</name>
<value>30000</value>
<description>Socket idle connection time.</description>
</property>
<property>
<name>fs.obs.max.idle.connections</name>
<value>1000</value>
<description>Maximum number of socket idle connections.</description>
</property>
<property>
<name>fs.obs.socket.send.buffer</name>
<value>256 * 1024</value>
<description>Socket send buffer to be used in OBS SDK. Represented in bytes.</description>
</property>
<property>
<name>fs.obs.socket.recv.buffer</name>
<value>256 * 1024</value>
<description>Socket receive buffer to be used in OBS SDK. Represented in bytes.</description>
</property>
<property>
<name>fs.obs.threads.keepalivetime</name>
<value>60</value>
<description>Number of seconds a thread can be idle before being
terminated in thread pool.</description>
</property>
<property>
<name>fs.obs.threads.max</name>
<value>20</value>
<description> Maximum number of concurrent active (part)uploads,
which each use a thread from thread pool.</description>
</property>
<property>
<name>fs.obs.max.total.tasks</name>
<value>20</value>
<description>Number of (part)uploads allowed to the queue before
blocking additional uploads.</description>
</property>
<property>
<name>fs.obs.delete.threads.max</name>
<value>20</value>
<description>Max number of delete threads.</description>
</property>
<property>
<name>fs.obs.multipart.size</name>
<value>104857600</value>
<description>Part size for multipart upload.
</description>
</property>
<property>
<name>fs.obs.multiobjectdelete.maximum</name>
<value>1000</value>
<description>Max number of objects in one multi-object delete call.
</description>
</property>
<property>
<name>fs.obs.fast.upload.buffer</name>
<value>disk</value>
<description>Which buffer to use. Default is `disk`, value may be
`disk` | `array` | `bytebuffer`.
</description>
</property>
<property>
<name>fs.obs.buffer.dir</name>
<value>dir1,dir2,dir3</value>
<description>Comma separated list of directories that will be used to buffer file
uploads to. This option takes effect only when the option 'fs.obs.fast.upload.buffer'
is set to 'disk'.
</description>
</property>
<property>
<name>fs.obs.fast.upload.active.blocks</name>
<value>4</value>
<description>Maximum number of blocks a single output stream can have active
(uploading, or queued to the central FileSystem instance's pool of queued
operations).
</description>
</property>
<property>
<name>fs.obs.readahead.range</name>
<value>1024 * 1024</value>
<description>Bytes to read ahead during a seek() before closing and
re-opening the OBS HTTP connection. </description>
</property>
<property>
<name>fs.obs.read.transform.enable</name>
<value>true</value>
<description>Flag indicating if socket connections can be reused by
position read. Set `false` only for HBase.</description>
</property>
<property>
<name>fs.obs.list.threads.core</name>
<value>30</value>
<description>Number of core list threads.</description>
</property>
<property>
<name>fs.obs.list.threads.max</name>
<value>60</value>
<description>Maximum number of list threads.</description>
</property>
<property>
<name>fs.obs.list.workqueue.capacity</name>
<value>1024</value>
<description>Capacity of list work queue.</description>
</property>
<property>
<name>fs.obs.list.parallel.factor</name>
<value>30</value>
<description>List parallel factor.</description>
</property>
<property>
<name>fs.obs.trash.enable</name>
<value>false</value>
<description>Switch for the fast delete.</description>
</property>
<property>
<name>fs.obs.trash.dir</name>
<description>The fast delete recycle directory.</description>
</property>
<property>
<name>fs.obs.block.size</name>
<value>128 * 1024 * 1024</value>
<description>Default block size for OBS FileSystem.
</description>
</property>
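Several defaults in the list above are written as arithmetic expressions (256 * 1024, 1024 * 1024, 128 * 1024 * 1024) rather than as literal integers, while fs.obs.multipart.size is given as a literal. The sketch below simply evaluates them so the byte counts are easy to compare. Whether a configuration file may contain the expression form is not stated here, so using the literal byte count is the safer assumption. The class and constant names are hypothetical.

```java
/** Hypothetical helper: the byte-valued defaults from the option list as literal numbers. */
final class ObsByteDefaults {
    // fs.obs.socket.send.buffer and fs.obs.socket.recv.buffer: 256 * 1024
    static final long SOCKET_BUFFER_BYTES = 256L * 1024;
    // fs.obs.readahead.range: 1024 * 1024
    static final long READAHEAD_RANGE_BYTES = 1024L * 1024;
    // fs.obs.block.size: 128 * 1024 * 1024
    static final long BLOCK_SIZE_BYTES = 128L * 1024 * 1024;
    // fs.obs.multipart.size: already a literal in the list
    static final long MULTIPART_SIZE_BYTES = 104857600L;

    /** Converts a byte count to whole mebibytes (1 MiB = 1024 * 1024 bytes). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024);
    }

    public static void main(String[] args) {
        System.out.println(SOCKET_BUFFER_BYTES);         // prints 262144
        System.out.println(READAHEAD_RANGE_BYTES);       // prints 1048576
        System.out.println(BLOCK_SIZE_BYTES);            // prints 134217728
        System.out.println(toMiB(MULTIPART_SIZE_BYTES)); // prints 100
    }
}
```

Read this way, the default part size is 100 MiB and the default block size is 128 MiB.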
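The hadoop-huaweicloud module includes a full suite of unit tests. Most of the tests run against HuaweiCloud OBS. To run them, create the file src/test/resources/auth-keys.xml containing the OBS account information described in the sections above and the following property.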
<property>
<name>fs.contract.test.fs.obs</name>
<value>obs://obsfilesystem-bucket</value>
</property>
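Before starting a test run it can help to confirm that auth-keys.xml actually contains the expected property names. The following is a minimal sketch that uses only the JDK's DOM parser. It assumes the file uses the usual Hadoop layout of property elements inside a configuration root, and it assumes key-based authentication, so the list of required names would differ for provider-based setups. The class AuthKeysCheck is a hypothetical helper, not part of the module.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical pre-flight check for the contents of auth-keys.xml. */
final class AuthKeysCheck {
    // Assumption: access key and secret key authentication is used.
    static final List<String> REQUIRED = List.of(
            "fs.contract.test.fs.obs",
            "fs.obs.endpoint",
            "fs.obs.access.key",
            "fs.obs.secret.key");

    /** Returns the required property names that do not appear in the XML text. */
    static List<String> missingKeys(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("configuration is not well-formed XML", e);
        }
        // Collect the text of every <name> element in the document.
        Set<String> present = new HashSet<>();
        NodeList names = doc.getElementsByTagName("name");
        for (int i = 0; i < names.getLength(); i++) {
            present.add(names.item(i).getTextContent().trim());
        }
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            if (!present.contains(key)) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>fs.contract.test.fs.obs</name>"
                + "<value>obs://obsfilesystem-bucket</value></property>"
                + "</configuration>";
        System.out.println("missing: " + missingKeys(xml));
    }
}
```

In practice the XML text would be read from src/test/resources/auth-keys.xml, for example with java.nio.file.Files.readString, and an empty result would mean all expected names are present.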