Rumen is a data extraction and analysis tool built for Apache Hadoop. Rumen mines JobHistory logs to extract meaningful data and stores it in an easily-parsed, condensed format or digest. The raw trace data from MapReduce logs are often insufficient for simulation, emulation, and benchmarking, as these tools often attempt to measure conditions that did not occur in the source data. For example, if a task ran locally in the raw trace data but a simulation of the scheduler elects to run that task on a remote rack, the simulator requires a runtime its input cannot provide. To fill in these gaps, Rumen performs a statistical analysis of the digest to estimate the variables the trace doesn’t supply. Rumen traces drive both Gridmix (a benchmark of Hadoop MapReduce clusters) and SLS (a simulator for the resource manager scheduler).
Extracting meaningful data from JobHistory logs is a common task for any tool built on MapReduce. Writing a custom tool that is tightly coupled with the MapReduce framework is tedious. Hence there is a need for a built-in tool that performs the framework-level task of log parsing and analysis. Such a tool would also insulate external systems that depend on job history from changes to the job history format.
Performing a statistical analysis of various attributes of a MapReduce job (such as task runtimes and task failures) is another common task that benchmarking and simulation tools may need to perform. Rumen generates Cumulative Distribution Functions (CDFs) for the map/reduce task runtimes. Runtime CDFs can be used to extrapolate the task runtimes of incomplete, missing, and synthetic tasks. Similarly, a CDF is also computed for the total number of successful tasks for every attempt.
Rumen consists of two components:

* *TraceBuilder*: Converts JobHistory logs into an easily-parsed format. Currently TraceBuilder outputs the trace in JSON format.

* *Folder*: A utility to scale the input trace. A trace obtained from TraceBuilder simply summarizes the jobs in the input folders and files. The time span within which all the jobs in a given trace finish can be considered as the trace runtime. Folder can be used to scale the runtime of a trace. Decreasing the trace runtime might involve dropping some jobs from the input trace and scaling down the runtime of the remaining jobs. Increasing the trace runtime might involve adding some dummy jobs to the resulting trace and scaling up the runtime of individual jobs.
Converting JobHistory logs into the desired job trace consists of two steps:

1. Extracting the information into an intermediate format.
2. Adjusting the job trace obtained from the intermediate trace to have the desired properties.

Extracting information from JobHistory logs is a one-time operation. The resulting data, called the gold trace, can be reused to generate traces with desired values of properties such as `output-duration`, `concentration`, etc.
Rumen provides two basic commands:

* `TraceBuilder`
* `Folder`

Firstly, we need to generate the gold trace. Hence the first step is to run TraceBuilder on a job-history folder. The output of TraceBuilder is a job trace file (and an optional cluster-topology file). If we want to scale the output, we can use the Folder utility to fold the current trace to the desired length, as sketched below. The rest of this section explains these utilities in detail.
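
As a quick orientation, the two commands are typically run back to back as in the following sketch. All paths and option values here are illustrative placeholders; the exact syntax of each command is described in the remainder of this section.

    # Step 1 (one-time): build the gold trace and the topology from a job-history folder
    hadoop rumentrace \
      file:///tmp/gold-trace.json \
      file:///tmp/topology.json \
      hdfs:///path/to/job/history/folder

    # Step 2 (as needed): fold the gold trace down to the desired runtime
    hadoop rumenfolder \
      -output-duration 1h \
      -input-cycle 20m \
      file:///tmp/gold-trace.json \
      file:///tmp/folded-trace-1hr.json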

    hadoop rumentrace [options] <jobtrace-output> <topology-output> <inputs>
This command invokes the TraceBuilder utility of Rumen.

TraceBuilder converts the JobHistory files into a series of JSON objects and writes them into the `<jobtrace-output>` file. It also extracts the cluster layout (topology) and writes it into the `<topology-output>` file. `<inputs>` represents a space-separated list of JobHistory files and folders.
Note:

1) The input and output to TraceBuilder are expected to be fully qualified FileSystem paths. So use `file://` to specify files on the local FileSystem and `hdfs://` to specify files on HDFS. Since the input files or folders are FileSystem paths, they can be globbed. This is useful when multiple file paths need to be specified using regular expressions.

2) By default, TraceBuilder does not recursively scan the input folders for job history files. Only the files that are placed directly under an input folder are considered for generating the trace. To add all the files under an input directory by recursively scanning it, use the `-recursive` option (illustrative commands follow this list).
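
A minimal sketch of both globbing and recursive scanning follows, assuming a staging-directory layout similar to the example later in this section; the glob pattern and the `done` path are illustrative placeholders only.

    # Illustrative: glob all user folders under done_intermediate (quoted so the
    # pattern is passed through to Hadoop's FileSystem rather than the local shell)
    hadoop rumentrace \
      file:///tmp/job-trace.json \
      file:///tmp/job-topology.json \
      'hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/*'

    # Illustrative: recursively scan a nested history directory
    hadoop rumentrace -recursive \
      file:///tmp/job-trace.json \
      file:///tmp/job-topology.json \
      hdfs:///tmp/hadoop-yarn/staging/history/done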
The cluster topology is used as follows:

* To reconstruct the splits and make sure that the distances/latencies seen in the actual run are modeled correctly.
* To extrapolate the split information for tasks with missing split details or for synthetically generated tasks.

| Parameter | Description | Notes |
|---|---|---|
| `-demuxer` | Used to read the jobhistory files. The default is `DefaultInputDemuxer`. | Demuxer decides how the input file maps to jobhistory file(s). Job history logs and job configuration files are typically small files, and can be stored more effectively when embedded in a container file format like SequenceFile or TFile. To support such usage cases, one can specify a customized Demuxer class that can extract individual job history logs and job configuration files from the source files. An illustrative invocation follows the table. |
| `-recursive` | Recursively traverse input paths for job history logs. | This option informs the TraceBuilder to recursively scan the input paths and process all the files under them. Note that, by default, only the history logs that are directly under the input folder are considered for generating the trace. |
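
For example, the `-demuxer` option takes a demuxer class name on the command line. The sketch below simply passes the default demuxer explicitly; the fully qualified class name is assumed to live in the `org.apache.hadoop.tools.rumen` package, so verify it against your Hadoop distribution before relying on it.

    # Illustrative: pass a demuxer class explicitly (here the assumed default)
    hadoop rumentrace \
      -demuxer org.apache.hadoop.tools.rumen.DefaultInputDemuxer \
      file:///tmp/job-trace.json \
      file:///tmp/job-topology.json \
      hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser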

    hadoop rumentrace \
      file:///tmp/job-trace.json \
      file:///tmp/job-topology.json \
      hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
This will analyze all the jobs in `/tmp/hadoop-yarn/staging/history/done_intermediate/testuser` stored on the HDFS FileSystem and output the job trace to `/tmp/job-trace.json` along with the topology information in `/tmp/job-topology.json`, both stored on the local FileSystem.

    hadoop rumenfolder [options] [input] [output]
This command invokes the Folder utility of Rumen. Folding essentially means that the output duration of the resulting trace is fixed, and the job timelines are adjusted to fit within this final output duration.
The input and output to Folder are expected to be fully qualified FileSystem paths. So use `file://` to specify files on the local FileSystem and `hdfs://` to specify files on HDFS.

| Parameter | Description | Notes |
|---|---|---|
| `-input-cycle` | Defines the basic unit of time for the folding operation. There is no default value for `input-cycle`. Input cycle must be provided. | `-input-cycle 10m` implies that the whole trace run will now be sliced at a 10min interval. Basic operations will be done on the 10m chunks. Note that *Rumen* understands various time units like m(min), h(hour), d(days) etc. |
| `-output-duration` | This parameter defines the final runtime of the trace. Default value is 1 hour. | `-output-duration 30m` implies that the resulting trace will have a max runtime of 30mins. All the jobs in the input trace file will be folded and scaled to fit this window. |
| `-concentration` | Set the concentration of the resulting trace. Default value is 1. | If the total runtime of the resulting trace is less than the total runtime of the input trace, then the resulting trace would contain fewer jobs than the input trace. This essentially means that the output is diluted. To increase the density of jobs, set the concentration to a higher value. |
| `-debug` | Run the Folder in debug mode. By default it is set to false. | In debug mode, the Folder will print additional statements for debugging. Also, the intermediate files generated in the scratch directory will not be cleaned up. |
| `-seed` | Initial seed to the Random Number Generator. By default, a Random Number Generator is used to generate a seed and the seed value is reported back to the user for future use. | If an initial seed is passed, then the Random Number Generator will generate the random numbers in the same sequence, i.e. the sequence of random numbers remains the same if the same seed is used. Folder uses the Random Number Generator to decide whether or not to emit a job. |
| `-temp-directory` | Temporary directory for the Folder. By default the output folder's parent directory is used as the scratch space. | This is the scratch space used by Folder. All the temporary files are cleaned up in the end unless the Folder is run in debug mode. |
| `-skew-buffer-length` | Enables Folder to tolerate skewed jobs. The default buffer length is 0. | `-skew-buffer-length 100` indicates that if jobs appear out of order within a window of size 100, they will be emitted in order by the Folder. If a job appears out of order outside this window, the Folder will bail out provided `-allow-missorting` is not set. Folder reports the maximum skew size seen in the input trace for future use. |
| `-allow-missorting` | Enables Folder to tolerate out-of-order jobs. By default mis-sorting is not allowed. | If mis-sorting is allowed, then the Folder will ignore out-of-order jobs that cannot be deskewed using a skew buffer of the size specified using `-skew-buffer-length`. If mis-sorting is not allowed, then the Folder will bail out if the skew buffer is incapable of tolerating the skew. |

    hadoop rumenfolder \
      -output-duration 1h \
      -input-cycle 20m \
      file:///tmp/job-trace.json \
      file:///tmp/job-trace-1hr.json
If the folded jobs are out of order, this command will bail out.

    hadoop rumenfolder \
      -output-duration 1h \
      -input-cycle 20m \
      -allow-missorting \
      -skew-buffer-length 100 \
      file:///tmp/job-trace.json \
      file:///tmp/job-trace-1hr.json
If the folded jobs are out of order, then at most 100 jobs will be deskewed. If the 101st job is still out of order, the command will bail out.

    hadoop rumenfolder \
      -output-duration 1h \
      -input-cycle 20m \
      -debug -temp-directory file:///tmp/debug \
      file:///tmp/job-trace.json \
      file:///tmp/job-trace-1hr.json
This will fold the 10-hour job trace file `file:///tmp/job-trace.json` so that it finishes within 1 hour, using `file:///tmp/debug` as the temporary directory. The intermediate files in the temporary directory will not be cleaned up.

    hadoop rumenfolder \
      -output-duration 1h \
      -input-cycle 20m \
      -concentration 2 \
      file:///tmp/job-trace.json \
      file:///tmp/job-trace-1hr.json
This will fold the 10-hour job trace file `file:///tmp/job-trace.json` so that it finishes within 1 hour, with a concentration of 2. Folding a 10-hour trace into 1 hour retains 10% of the jobs by default. With a concentration of 2, 20% of the total input jobs will be retained.
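
The `-seed` option from the table above is not exercised in the examples here; the sketch below shows how a folding run could be made repeatable. The seed value 100 is an arbitrary illustration.

    hadoop rumenfolder \
      -output-duration 1h \
      -input-cycle 20m \
      -seed 100 \
      file:///tmp/job-trace.json \
      file:///tmp/job-trace-1hr.json

With the same seed, the Random Number Generator produces the same sequence, so the Folder makes the same keep-or-drop decision for every job and the folded trace is reproducible.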
MAPREDUCE-751 is the main JIRA that introduced Rumen into MapReduce. See the rumen component of MapReduce for further details.