OBSA: HuaweiCloud OBS Adapter for Hadoop Support

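Introduction

The hadoop-huaweicloud module provides support for integration with the HuaweiCloud Object Storage Service (OBS). This support comes via the JAR file hadoop-huaweicloud.jar.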

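Features

  • Read and write data stored in a HuaweiCloud OBS account.
  • Reference file system paths through URLs that use the obs scheme.
  • Present a hierarchical file system view by implementing the standard Hadoop FileSystem interface.
  • Support multipart upload for large files.
  • Can act as a source or a sink of data in a MapReduce job.
  • Uses the HuaweiCloud OBS Java SDK, with support for the latest OBS features and authentication schemes.
  • Tested at scale.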

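Limitations

The following operations are partially supported or not supported at all:

  • Symbolic link operations.
  • Proxy users.
  • File truncate.
  • File concat.
  • File checksum.
  • File replication factor.
  • Extended Attributes (XAttrs) operations.
  • Snapshot operations.
  • Storage policy.
  • Quota.
  • POSIX ACL (access control lists).
  • Delegation token operations.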

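Getting Started

Packages

OBSA depends on two JARs, alongside hadoop-common and its dependencies.

  • The hadoop-huaweicloud JAR.
  • The esdk-obs-java JAR.

The versions of hadoop-common and hadoop-huaweicloud must be identical.

To import the libraries into a Maven build, add the hadoop-huaweicloud JAR to the build dependencies; it will pull in a compatible esdk-obs-java JAR.

The hadoop-huaweicloud JAR does not declare any dependencies other than the one unique to it, the OBS SDK JAR. This simplifies excluding or tuning Hadoop dependency JARs in downstream applications. The hadoop-client or hadoop-common dependency must be declared.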

<properties>
 <!-- Your exact Hadoop version here-->
  <hadoop.version>3.4.0</hadoop.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-huaweicloud</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
</dependencies>

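Accessing OBS URLs

Before accessing a URL, configure the OBS implementation classes of Filesystem/AbstractFileSystem and the region endpoint where the bucket is located, as follows: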

<property>
  <name>fs.obs.impl</name>
  <value>org.apache.hadoop.fs.obs.OBSFileSystem</value>
  <description>The OBS implementation class of the Filesystem.</description>
</property>

<property>
  <name>fs.AbstractFileSystem.obs.impl</name>
  <value>org.apache.hadoop.fs.obs.OBS</value>
  <description>The OBS implementation class of the AbstractFileSystem.</description>
</property>

<property>
  <name>fs.obs.endpoint</name>
  <value>obs.region.myhuaweicloud.com</value>
  <description>OBS region endpoint where a bucket is located.</description>
</property>

OBS URLs can then be accessed as follows:

obs://<bucket_name>/path

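The scheme obs identifies a URL on OBSFileSystem, a Hadoop-compatible file system backed by HuaweiCloud OBS. For example, the following FileSystem Shell commands demonstrate access to a bucket named mybucket.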

hadoop fs -mkdir obs://mybucket/testDir

hadoop fs -put testFile obs://mybucket/testDir/testFile

hadoop fs -cat obs://mybucket/testDir/testFile
test file content

For details about how to create a bucket, see Help Center > Object Storage Service > Getting Started > Basic Operation Procedure.

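Authenticating with OBS

Except when interacting with public OBS buckets, the OBSA client needs credentials to interact with buckets. The client supports multiple authentication mechanisms. The simplest mechanism is to provide the OBS access key and secret key, as shown below.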

<property>
  <name>fs.obs.access.key</name>
  <description>OBS access key.
   Omit for provider-based authentication.</description>
</property>

<property>
  <name>fs.obs.secret.key</name>
  <description>OBS secret key.
   Omit for provider-based authentication.</description>
</property>

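Do not share the access key, secret key, or session token. They must be kept strictly confidential.

Custom implementations of com.obs.services.IObsCredentialsProvider (see Creating an Instance of ObsClient) or org.apache.hadoop.fs.obs.BasicSessionCredential can also be used for authentication.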

<property>
  <name>fs.obs.security.provider</name>
  <description>
    Class name of security provider class which implements
    com.obs.services.IObsCredentialsProvider, which will
    be used to construct an OBS client instance as an input parameter.
  </description>
</property>

<property>
  <name>fs.obs.credentials.provider</name>
  <description>
    Class name of credential provider class which implements
    org.apache.hadoop.fs.obs.BasicSessionCredential,
    which must override three APIs: getOBSAccessKeyId(),
    getOBSSecretKey(), and getSessionToken().
  </description>
</property>
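
As an illustration of provider-based authentication, the sketch below sets fs.obs.security.provider to a custom class. The class name com.example.obs.MyObsCredentialsProvider is hypothetical and stands in for your own implementation of com.obs.services.IObsCredentialsProvider; that class would need to be available on the classpath of the Hadoop client. In this setup fs.obs.access.key and fs.obs.secret.key are left out, as their descriptions above indicate.

```xml
<property>
  <name>fs.obs.security.provider</name>
  <!-- Hypothetical class name: replace with your own implementation
       of com.obs.services.IObsCredentialsProvider. -->
  <value>com.example.obs.MyObsCredentialsProvider</value>
</property>
```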

General OBSA Client Configuration

All OBSA client options are set through properties whose names start with the prefix fs.obs.

<property>
  <name>fs.obs.connection.ssl.enabled</name>
  <value>false</value>
  <description>Enable or disable SSL connections to OBS.</description>
</property>

<property>
  <name>fs.obs.connection.maximum</name>
  <value>1000</value>
  <description>Maximum number of simultaneous connections to OBS.</description>
</property>

<property>
  <name>fs.obs.connection.establish.timeout</name>
  <value>120000</value>
  <description>Socket connection setup timeout in milliseconds.</description>
</property>

<property>
  <name>fs.obs.connection.timeout</name>
  <value>120000</value>
  <description>Socket connection timeout in milliseconds.</description>
</property>

<property>
  <name>fs.obs.idle.connection.time</name>
  <value>30000</value>
  <description>Socket idle connection time.</description>
</property>

<property>
  <name>fs.obs.max.idle.connections</name>
  <value>1000</value>
  <description>Maximum number of socket idle connections.</description>
</property>

<property>
  <name>fs.obs.socket.send.buffer</name>
  <value>262144</value>
  <description>Socket send buffer to be used in OBS SDK. Represented in bytes.</description>
</property>

<property>
  <name>fs.obs.socket.recv.buffer</name>
  <value>262144</value>
  <description>Socket receive buffer to be used in OBS SDK. Represented in bytes.</description>
</property>

<property>
  <name>fs.obs.threads.keepalivetime</name>
  <value>60</value>
  <description>Number of seconds a thread can be idle before being
    terminated in thread pool.</description>
</property>

<property>
  <name>fs.obs.threads.max</name>
  <value>20</value>
  <description> Maximum number of concurrent active (part)uploads,
    which each use a thread from thread pool.</description>
</property>

<property>
  <name>fs.obs.max.total.tasks</name>
  <value>20</value>
  <description>Number of (part)uploads allowed to the queue before
    blocking additional uploads.</description>
</property>

<property>
  <name>fs.obs.delete.threads.max</name>
  <value>20</value>
  <description>Max number of delete threads.</description>
</property>

<property>
  <name>fs.obs.multipart.size</name>
  <value>104857600</value>
  <description>Part size for multipart upload.
  </description>
</property>

<property>
  <name>fs.obs.multiobjectdelete.maximum</name>
  <value>1000</value>
  <description>Max number of objects in one multi-object delete call.
  </description>
</property>

<property>
  <name>fs.obs.fast.upload.buffer</name>
  <value>disk</value>
  <description>Which buffer to use. Default is `disk`, value may be
    `disk` | `array` | `bytebuffer`.
  </description>
</property>

<property>
  <name>fs.obs.buffer.dir</name>
  <value>dir1,dir2,dir3</value>
  <description>Comma separated list of directories that will be used to buffer file
    uploads to. This option takes effect only when the option 'fs.obs.fast.upload.buffer'
    is set to 'disk'.
  </description>
</property>

<property>
  <name>fs.obs.fast.upload.active.blocks</name>
  <value>4</value>
  <description>Maximum number of blocks a single output stream can have active
    (uploading, or queued to the central FileSystem instance's pool of queued
    operations).
  </description>
</property>

<property>
  <name>fs.obs.readahead.range</name>
  <value>1048576</value>
  <description>Bytes to read ahead during a seek() before closing and
  re-opening the OBS HTTP connection. </description>
</property>

<property>
  <name>fs.obs.read.transform.enable</name>
  <value>true</value>
  <description>Flag indicating if socket connections can be reused by
    position read. Set `false` only for HBase.</description>
</property>

<property>
  <name>fs.obs.list.threads.core</name>
  <value>30</value>
  <description>Number of core list threads.</description>
</property>

<property>
  <name>fs.obs.list.threads.max</name>
  <value>60</value>
  <description>Maximum number of list threads.</description>
</property>

<property>
  <name>fs.obs.list.workqueue.capacity</name>
  <value>1024</value>
  <description>Capacity of list work queue.</description>
</property>

<property>
  <name>fs.obs.list.parallel.factor</name>
  <value>30</value>
  <description>List parallel factor.</description>
</property>

<property>
  <name>fs.obs.trash.enable</name>
  <value>false</value>
  <description>Switch for the fast delete.</description>
</property>

<property>
  <name>fs.obs.trash.dir</name>
  <description>The fast delete recycle directory.</description>
</property>

<property>
  <name>fs.obs.block.size</name>
  <value>134217728</value>
  <description>Default block size for OBS FileSystem.
  </description>
</property>
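
The options above are overridden like any other Hadoop property. As a rough sketch, the fragment below shows two independent overrides: disabling fs.obs.read.transform.enable, which its description recommends only for HBase, and switching on fast delete. The trash directory value is a hypothetical placeholder, and pairing fs.obs.trash.dir with fs.obs.trash.enable is an assumption drawn from the two property descriptions, not confirmed behavior.

```xml
<!-- Override 1: only for HBase deployments, per the description above. -->
<property>
  <name>fs.obs.read.transform.enable</name>
  <value>false</value>
</property>

<!-- Override 2: enable fast delete. -->
<property>
  <name>fs.obs.trash.enable</name>
  <value>true</value>
</property>

<property>
  <name>fs.obs.trash.dir</name>
  <!-- Hypothetical recycle directory: choose a path suited to your bucket. -->
  <value>/user/hadoop/.FastDelete</value>
</property>
```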

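Testing the hadoop-huaweicloud Module

The hadoop-huaweicloud module includes a full suite of unit tests. Most of the tests run against HuaweiCloud OBS. To run them, create the file src/test/resources/auth-keys.xml containing the OBS account information described in the sections above, together with the following property.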

<property>
    <name>fs.contract.test.fs.obs</name>
    <value>obs://obsfilesystem-bucket</value>
</property>
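
Put together, a complete auth-keys.xml might look like the sketch below. The bucket name and endpoint are the sample values used earlier in this document, and YOUR_ACCESS_KEY and YOUR_SECRET_KEY are placeholders to replace with real credentials. Whether the fs.obs.impl and fs.AbstractFileSystem.obs.impl properties must also appear in this file is not stated here; add them if they are not configured elsewhere.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Test file system used by the contract tests. -->
  <property>
    <name>fs.contract.test.fs.obs</name>
    <value>obs://obsfilesystem-bucket</value>
  </property>

  <!-- Region endpoint of the test bucket (sample value). -->
  <property>
    <name>fs.obs.endpoint</name>
    <value>obs.region.myhuaweicloud.com</value>
  </property>

  <!-- Placeholder credentials: replace before running the tests. -->
  <property>
    <name>fs.obs.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>

  <property>
    <name>fs.obs.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

Because this file holds secrets, it is best kept out of version control.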