Dask DataFrame API with Logical Query Planning


DataFrame

`DataFrame`(expr)	DataFrame-like Expr Collection.
`DataFrame.abs`()	Return a Series/DataFrame with absolute numeric value of each element.
`DataFrame.add`(other[, axis, level, fill_value])
`DataFrame.align`(other[, join, axis, fill_value])	Align two objects on their axes with the specified join method.
`DataFrame.all`([axis, skipna, split_every])	Return whether all elements are True, potentially over an axis.
`DataFrame.any`([axis, skipna, split_every])	Return whether any element is True, potentially over an axis.
`DataFrame.apply`(function, *args[, meta, axis])	Parallel version of pandas.DataFrame.apply
`DataFrame.assign`(**pairs)	Assign new columns to a DataFrame.
`DataFrame.astype`(dtypes)	Cast a pandas object to a specified dtype `dtype`.
`DataFrame.bfill`([axis, limit])	Fill NA/NaN values by using the next valid observation to fill the gap.
`DataFrame.categorize`([columns, index, ...])	Convert columns of the DataFrame to category dtype.
`DataFrame.columns`
`DataFrame.compute`([fuse, concatenate])	Compute this DataFrame.
`DataFrame.copy`([deep])	Make a copy of the dataframe
`DataFrame.corr`([method, min_periods, ...])	Compute pairwise correlation of columns, excluding NA/null values.
`DataFrame.count`([axis, numeric_only, ...])	Count non-NA cells for each column or row.
`DataFrame.cov`([min_periods, numeric_only, ...])	Compute pairwise covariance of columns, excluding NA/null values.
`DataFrame.cummax`([axis, skipna])	Return cumulative maximum over a DataFrame or Series axis.
`DataFrame.cummin`([axis, skipna])	Return cumulative minimum over a DataFrame or Series axis.
`DataFrame.cumprod`([axis, skipna])	Return cumulative product over a DataFrame or Series axis.
`DataFrame.cumsum`([axis, skipna])	Return cumulative sum over a DataFrame or Series axis.
`DataFrame.describe`([split_every, ...])	Generate descriptive statistics.
`DataFrame.diff`([periods, axis])	First discrete difference of element.
`DataFrame.div`(other[, axis, level, fill_value])
`DataFrame.divide`(other[, axis, level, ...])
`DataFrame.divisions`	Tuple of `npartitions + 1` values, in ascending order, marking the lower/upper bounds of each partition's index.
`DataFrame.drop`([labels, axis, columns, errors])	Drop specified labels from rows or columns.
`DataFrame.drop_duplicates`([subset, ...])	Return DataFrame with duplicate rows removed.
`DataFrame.dropna`([how, subset, thresh])	Remove missing values.
`DataFrame.dtypes`	Return data types
`DataFrame.eq`(other[, level, axis])
`DataFrame.eval`(expr, **kwargs)	Evaluate a string describing operations on DataFrame columns.
`DataFrame.explode`(column)	Transform each element of a list-like to a row, replicating index values.
`DataFrame.ffill`([axis, limit])	Fill NA/NaN values by propagating the last valid observation to next valid.
`DataFrame.fillna`([value, axis])	Fill NA/NaN values using the specified method.
`DataFrame.floordiv`(other[, axis, level, ...])
`DataFrame.ge`(other[, level, axis])
`DataFrame.get_partition`(n)	Get a dask DataFrame/Series representing the nth partition.
`DataFrame.groupby`(by[, group_keys, sort, ...])	Group DataFrame using a mapper or by a Series of columns.
`DataFrame.gt`(other[, level, axis])
`DataFrame.head`([n, npartitions, compute])	First n rows of the dataset
`DataFrame.idxmax`([axis, skipna, ...])	Return index of first occurrence of maximum over requested axis.
`DataFrame.idxmin`([axis, skipna, ...])	Return index of first occurrence of minimum over requested axis.
`DataFrame.iloc`	Purely integer-location based indexing for selection by position.
`DataFrame.index`	Return dask Index instance
`DataFrame.info`([buf, verbose, memory_usage])	Concise summary of a Dask DataFrame
`DataFrame.isin`(values)	Whether each element in the DataFrame is contained in values.
`DataFrame.isna`()	Detect missing values.
`DataFrame.isnull`()	DataFrame.isnull is an alias for DataFrame.isna.
`DataFrame.items`()	Iterate over (column name, Series) pairs.
`DataFrame.iterrows`()	Iterate over DataFrame rows as (index, Series) pairs.
`DataFrame.itertuples`([index, name])	Iterate over DataFrame rows as namedtuples.
`DataFrame.join`(other[, on, how, lsuffix, ...])	Join columns of another DataFrame.
`DataFrame.known_divisions`	Whether the divisions are known.
`DataFrame.le`(other[, level, axis])
`DataFrame.loc`	Purely label-location based indexer for selection by label.
`DataFrame.lt`(other[, level, axis])
`DataFrame.map_partitions`(func, *args[, ...])	Apply a Python function to each partition
`DataFrame.mask`(cond[, other])	Replace values where the condition is True.
`DataFrame.max`([axis, skipna, numeric_only, ...])	Return the maximum of the values over the requested axis.
`DataFrame.mean`([axis, skipna, numeric_only, ...])	Return the mean of the values over the requested axis.
`DataFrame.median`([axis, numeric_only])	Return the median of the values over the requested axis.
`DataFrame.median_approximate`([axis, method, ...])	Return the approximate median of the values over the requested axis.
`DataFrame.melt`([id_vars, value_vars, ...])	Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
`DataFrame.memory_usage`([deep, index])	Return the memory usage of each column in bytes.
`DataFrame.memory_usage_per_partition`([...])	Return the memory usage of each partition
`DataFrame.merge`(right[, how, on, left_on, ...])	Merge the DataFrame with another DataFrame
`DataFrame.min`([axis, skipna, numeric_only, ...])	Return the minimum of the values over the requested axis.
`DataFrame.mod`(other[, axis, level, fill_value])
`DataFrame.mode`([dropna, split_every, ...])	Get the mode(s) of each element along the selected axis.
`DataFrame.mul`(other[, axis, level, fill_value])
`DataFrame.ndim`	Return dimensionality
`DataFrame.ne`(other[, level, axis])
`DataFrame.nlargest`([n, columns, split_every])	Return the first n rows ordered by columns in descending order.
`DataFrame.npartitions`	Return number of partitions
`DataFrame.nsmallest`([n, columns, split_every])	Return the first n rows ordered by columns in ascending order.
`DataFrame.partitions`	Slice dataframe by partitions
`DataFrame.persist`([fuse])	Persist this dask collection into memory
`DataFrame.pivot_table`(index, columns, values)	Create a spreadsheet-style pivot table as a DataFrame.
`DataFrame.pop`(item)	Return item and drop from frame.
`DataFrame.pow`(other[, axis, level, fill_value])
`DataFrame.prod`([axis, skipna, numeric_only, ...])	Return the product of the values over the requested axis.
`DataFrame.quantile`([q, axis, numeric_only, ...])	Approximate row-wise and precise column-wise quantiles of DataFrame
`DataFrame.query`(expr, **kwargs)	Filter dataframe with complex expression
`DataFrame.radd`(other[, axis, level, fill_value])
`DataFrame.random_split`(frac[, random_state, ...])	Pseudorandomly split dataframe into different pieces row-wise
`DataFrame.rdiv`(other[, axis, level, fill_value])
`DataFrame.rename`([index, columns])	Rename columns or index labels.
`DataFrame.rename_axis`([mapper, index, ...])	Set the name of the axis for the index or columns.
`DataFrame.repartition`([divisions, ...])	Repartition a collection
`DataFrame.replace`([to_replace, value, regex])	Replace values given in to_replace with value.
`DataFrame.resample`(rule[, closed, label])	Resample time-series data.
`DataFrame.reset_index`([drop])	Reset the index to the default index.
`DataFrame.rfloordiv`(other[, axis, level, ...])
`DataFrame.rmod`(other[, axis, level, fill_value])
`DataFrame.rmul`(other[, axis, level, fill_value])
`DataFrame.round`([decimals])	Round a DataFrame to a variable number of decimal places.
`DataFrame.rpow`(other[, axis, level, fill_value])
`DataFrame.rsub`(other[, axis, level, fill_value])
`DataFrame.rtruediv`(other[, axis, level, ...])
`DataFrame.sample`([n, frac, replace, ...])	Random sample of items
`DataFrame.select_dtypes`([include, exclude])	Return a subset of the DataFrame's columns based on the column dtypes.
`DataFrame.sem`([axis, skipna, ddof, ...])	Return unbiased standard error of the mean over requested axis.
`DataFrame.set_index`(other[, drop, sorted, ...])	Set the DataFrame index (row labels) using an existing column.
`DataFrame.shape`
`DataFrame.shuffle`([on, ignore_index, ...])	Rearrange DataFrame into new partitions
`DataFrame.size`	Size of the Series or DataFrame as a Delayed object.
`DataFrame.sort_values`(by[, npartitions, ...])	Sort the dataset by a single column.
`DataFrame.squeeze`([axis])	Squeeze 1 dimensional axis objects into scalars.
`DataFrame.std`([axis, skipna, ddof, ...])	Return sample standard deviation over requested axis.
`DataFrame.sub`(other[, axis, level, fill_value])
`DataFrame.sum`([axis, skipna, numeric_only, ...])	Return the sum of the values over the requested axis.
`DataFrame.tail`([n, compute])	Last n rows of the dataset
`DataFrame.to_backend`([backend])	Move to a new DataFrame backend
`DataFrame.to_bag`([index, format])	Create a Dask Bag from a Series
`DataFrame.to_csv`(filename, **kwargs)	See dd.to_csv docstring for more information
`DataFrame.to_dask_array`([lengths, meta, ...])	Convert a dask DataFrame to a dask array.
`DataFrame.to_dask_dataframe`(*args, **kwargs)	Convert to a legacy dask-dataframe collection
`DataFrame.to_delayed`([optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.
`DataFrame.to_hdf`(path_or_buf, key[, mode, ...])	See dd.to_hdf docstring for more information
`DataFrame.to_html`([max_rows])	Render a DataFrame as an HTML table.
`DataFrame.to_json`(filename, *args, **kwargs)	See dd.to_json docstring for more information
`DataFrame.to_legacy_dataframe`([optimize])	Convert to a legacy dask-dataframe collection
`DataFrame.to_parquet`(path, **kwargs)
`DataFrame.to_records`([index, lengths])
`DataFrame.to_string`([max_rows])	Render a DataFrame to a console-friendly tabular output.
`DataFrame.to_sql`(name, uri[, schema, ...])
`DataFrame.to_timestamp`([freq, how])	Cast to DatetimeIndex of timestamps, at beginning of period.
`DataFrame.truediv`(other[, axis, level, ...])
`DataFrame.values`	Return a dask.array of the values of this dataframe
`DataFrame.var`([axis, skipna, ddof, ...])	Return unbiased variance over requested axis.
`DataFrame.visualize`([tasks])	Visualize the expression or task graph
`DataFrame.where`(cond[, other])	Replace values where the condition is False.

Series

`Series`(expr)	Series-like Expr Collection.
`Series.add`(other[, level, fill_value, axis])
`Series.align`(other[, join, axis, fill_value])	Align two objects on their axes with the specified join method.
`Series.all`([axis, skipna, split_every])	Return whether all elements are True, potentially over an axis.
`Series.any`([axis, skipna, split_every])	Return whether any element is True, potentially over an axis.
`Series.apply`(function, *args[, meta, axis])	Parallel version of pandas.Series.apply
`Series.astype`(dtypes)	Cast a pandas object to a specified dtype `dtype`.
`Series.autocorr`([lag, split_every])	Compute the lag-N autocorrelation.
`Series.between`(left, right[, inclusive])	Return boolean Series equivalent to left <= series <= right.
`Series.bfill`([axis, limit])	Fill NA/NaN values by using the next valid observation to fill the gap.
`Series.clear_divisions`()	Forget division information.
`Series.clip`([lower, upper, axis])	Trim values at input threshold(s).
`Series.compute`([fuse, concatenate])	Compute this DataFrame.
`Series.copy`([deep])	Make a copy of the dataframe
`Series.corr`(other[, method, min_periods, ...])	Compute correlation with other Series, excluding missing values.
`Series.count`([axis, numeric_only, split_every])	Count non-NA cells for each column or row.
`Series.cov`(other[, min_periods, split_every])	Compute covariance with Series, excluding missing values.
`Series.cummax`([axis, skipna])	Return cumulative maximum over a DataFrame or Series axis.
`Series.cummin`([axis, skipna])	Return cumulative minimum over a DataFrame or Series axis.
`Series.cumprod`([axis, skipna])	Return cumulative product over a DataFrame or Series axis.
`Series.cumsum`([axis, skipna])	Return cumulative sum over a DataFrame or Series axis.
`Series.describe`([split_every, percentiles, ...])	Generate descriptive statistics.
`Series.diff`([periods, axis])	First discrete difference of element.
`Series.div`(other[, level, fill_value, axis])
`Series.drop_duplicates`([ignore_index, ...])
`Series.dropna`()	Return a new Series with missing values removed.
`Series.dtype`
`Series.eq`(other[, level, fill_value, axis])
`Series.explode`()	Transform each element of a list-like to a row.
`Series.ffill`([axis, limit])	Fill NA/NaN values by propagating the last valid observation to next valid.
`Series.fillna`([value, axis])	Fill NA/NaN values using the specified method.
`Series.floordiv`(other[, level, fill_value, axis])
`Series.ge`(other[, level, fill_value, axis])
`Series.get_partition`(n)	Get a dask DataFrame/Series representing the nth partition.
`Series.groupby`(by, **kwargs)	Group Series using a mapper or by a Series of columns.
`Series.gt`(other[, level, fill_value, axis])
`Series.head`([n, npartitions, compute])	First n rows of the dataset
`Series.idxmax`([axis, skipna, numeric_only, ...])	Return index of first occurrence of maximum over requested axis.
`Series.idxmin`([axis, skipna, numeric_only, ...])	Return index of first occurrence of minimum over requested axis.
`Series.isin`(values)	Whether each element in the DataFrame is contained in values.
`Series.isna`()	Detect missing values.
`Series.isnull`()	DataFrame.isnull is an alias for DataFrame.isna.
`Series.known_divisions`	Whether the divisions are known.
`Series.le`(other[, level, fill_value, axis])
`Series.loc`	Purely label-location based indexer for selection by label.
`Series.lt`(other[, level, fill_value, axis])
`Series.map`(arg[, na_action, meta])	Map values of Series according to an input mapping or function.
`Series.map_overlap`(func, before, after, *args)	Apply a function to each partition, sharing rows with adjacent partitions.
`Series.map_partitions`(func, *args[, meta, ...])	Apply a Python function to each partition
`Series.mask`(cond[, other])	Replace values where the condition is True.
`Series.max`([axis, skipna, numeric_only, ...])	Return the maximum of the values over the requested axis.
`Series.mean`([axis, skipna, numeric_only, ...])	Return the mean of the values over the requested axis.
`Series.median`()	Return the median of the values over the requested axis.
`Series.median_approximate`([method])	Return the approximate median of the values over the requested axis.
`Series.memory_usage`([deep, index])	Return the memory usage of the Series.
`Series.memory_usage_per_partition`([index, deep])	Return the memory usage of each partition
`Series.min`([axis, skipna, numeric_only, ...])	Return the minimum of the values over the requested axis.
`Series.mod`(other[, level, fill_value, axis])
`Series.mul`(other[, level, fill_value, axis])
`Series.nbytes`	Number of bytes
`Series.ndim`	Return dimensionality
`Series.ne`(other[, level, fill_value, axis])
`Series.nlargest`([n, split_every])	Return the largest n elements.
`Series.notnull`()	DataFrame.notnull is an alias for DataFrame.notna.
`Series.nsmallest`([n, split_every])	Return the smallest n elements.
`Series.nunique`([dropna, split_every, split_out])	Return number of unique elements in the object.
`Series.nunique_approx`([split_every])	Approximate number of unique rows.
`Series.persist`([fuse])	Persist this dask collection into memory
`Series.pipe`(func, *args, **kwargs)	Apply chainable functions that expect Series or DataFrames.
`Series.pow`(other[, level, fill_value, axis])
`Series.prod`([axis, skipna, numeric_only, ...])	Return the product of the values over the requested axis.
`Series.quantile`([q, method])	Approximate quantiles of Series
`Series.radd`(other[, level, fill_value, axis])
`Series.random_split`(frac[, random_state, ...])	Pseudorandomly split dataframe into different pieces row-wise
`Series.rdiv`(other[, level, fill_value, axis])
`Series.repartition`([divisions, npartitions, ...])	Repartition a collection
`Series.replace`([to_replace, value, regex])	Replace values given in to_replace with value.
`Series.rename`(index[, sorted_index])	Alter Series index labels or name
`Series.resample`(rule[, closed, label])	Resample time-series data.
`Series.reset_index`([drop])	Reset the index to the default index.
`Series.rolling`(window, **kwargs)	Provides rolling transformations.
`Series.round`([decimals])	Round a DataFrame to a variable number of decimal places.
`Series.sample`([n, frac, replace, random_state])	Random sample of items
`Series.sem`([axis, skipna, ddof, ...])	Return unbiased standard error of the mean over requested axis.
`Series.shape`	Return a tuple representing the dimensionality of the DataFrame.
`Series.shift`([periods, freq, axis])	Shift index by desired number of periods with an optional time freq.
`Series.size`	Size of the Series or DataFrame as a Delayed object.
`Series.std`([axis, skipna, ddof, ...])	Return sample standard deviation over requested axis.
`Series.sub`(other[, level, fill_value, axis])
`Series.sum`([axis, skipna, numeric_only, ...])	Return the sum of the values over the requested axis.
`Series.to_backend`([backend])	Move to a new DataFrame backend
`Series.to_bag`([index, format])	Create a Dask Bag from a Series
`Series.to_csv`(filename, **kwargs)	See dd.to_csv docstring for more information
`Series.to_dask_array`([lengths, meta, optimize])	Convert a dask DataFrame to a dask array.
`Series.to_delayed`([optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.
`Series.to_frame`([name])	Convert Series to DataFrame.
`Series.to_hdf`(path_or_buf, key[, mode, append])	See dd.to_hdf docstring for more information
`Series.to_string`([max_rows])	Render a string representation of the Series.
`Series.to_timestamp`([freq, how])	Cast to DatetimeIndex of timestamps, at beginning of period.
`Series.truediv`(other[, level, fill_value, axis])
`Series.unique`([split_every, split_out, ...])	Return Series of unique values in the object.
`Series.value_counts`([sort, ascending, ...])	Return a Series containing counts of unique values.
`Series.values`	Return a dask.array of the values of this dataframe
`Series.var`([axis, skipna, ddof, ...])	Return unbiased variance over requested axis.
`Series.visualize`([tasks])	Visualize the expression or task graph
`Series.where`(cond[, other])	Replace values where the condition is False.

Index

`Index`(expr)	Index-like Expr Collection.
`Index.add`(other[, level, fill_value, axis])
`Index.align`(other[, join, axis, fill_value])	Align two objects on their axes with the specified join method.
`Index.all`([axis, skipna, split_every])	Return whether all elements are True, potentially over an axis.
`Index.any`([axis, skipna, split_every])	Return whether any element is True, potentially over an axis.
`Index.apply`(function, *args[, meta, axis])	Parallel version of pandas.Series.apply
`Index.astype`(dtypes)	Cast a pandas object to a specified dtype `dtype`.
`Index.autocorr`([lag, split_every])	Compute the lag-N autocorrelation.
`Index.between`(left, right[, inclusive])	Return boolean Series equivalent to left <= series <= right.
`Index.bfill`([axis, limit])	Fill NA/NaN values by using the next valid observation to fill the gap.
`Index.clear_divisions`()	Forget division information.
`Index.clip`([lower, upper, axis])	Trim values at input threshold(s).
`Index.compute`([fuse, concatenate])	Compute this DataFrame.
`Index.copy`([deep])	Make a copy of the dataframe
`Index.corr`(other[, method, min_periods, ...])	Compute correlation with other Series, excluding missing values.
`Index.count`([split_every])	Count non-NA cells for each column or row.
`Index.cov`(other[, min_periods, split_every])	Compute covariance with Series, excluding missing values.
`Index.cummax`([axis, skipna])	Return cumulative maximum over a DataFrame or Series axis.
`Index.cummin`([axis, skipna])	Return cumulative minimum over a DataFrame or Series axis.
`Index.cumprod`([axis, skipna])	Return cumulative product over a DataFrame or Series axis.
`Index.cumsum`([axis, skipna])	Return cumulative sum over a DataFrame or Series axis.
`Index.describe`([split_every, percentiles, ...])	Generate descriptive statistics.
`Index.diff`([periods, axis])	First discrete difference of element.
`Index.div`(other[, level, fill_value, axis])
`Index.drop_duplicates`([ignore_index, ...])
`Index.dropna`()	Return a new Series with missing values removed.
`Index.dtype`
`Index.eq`(other[, level, fill_value, axis])
`Index.explode`()	Transform each element of a list-like to a row.
`Index.ffill`([axis, limit])	Fill NA/NaN values by propagating the last valid observation to next valid.
`Index.fillna`([value, axis])	Fill NA/NaN values using the specified method.
`Index.floordiv`(other[, level, fill_value, axis])
`Index.ge`(other[, level, fill_value, axis])
`Index.get_partition`(n)	Get a dask DataFrame/Series representing the nth partition.
`Index.groupby`(by, **kwargs)	Group Series using a mapper or by a Series of columns.
`Index.gt`(other[, level, fill_value, axis])
`Index.head`([n, npartitions, compute])	First n rows of the dataset
`Index.is_monotonic_decreasing`	Return boolean if values in the object are monotonically decreasing.
`Index.is_monotonic_increasing`	Return boolean if values in the object are monotonically increasing.
`Index.isin`(values)	Whether each element in the DataFrame is contained in values.
`Index.isna`()	Detect missing values.
`Index.isnull`()	DataFrame.isnull is an alias for DataFrame.isna.
`Index.known_divisions`	Whether the divisions are known.
`Index.le`(other[, level, fill_value, axis])
`Index.loc`	Purely label-location based indexer for selection by label.
`Index.lt`(other[, level, fill_value, axis])
`Index.map`(arg[, na_action, meta, is_monotonic])	Map values using an input mapping or function.
`Index.map_overlap`(func, before, after, *args)	Apply a function to each partition, sharing rows with adjacent partitions.
`Index.map_partitions`(func, *args[, meta, ...])	Apply a Python function to each partition
`Index.mask`(cond[, other])	Replace values where the condition is True.
`Index.max`([axis, skipna, numeric_only, ...])	Return the maximum of the values over the requested axis.
`Index.median`()	Return the median of the values over the requested axis.
`Index.median_approximate`([method])	Return the approximate median of the values over the requested axis.
`Index.memory_usage`([deep])	Memory usage of the values.
`Index.memory_usage_per_partition`([index, deep])	Return the memory usage of each partition
`Index.min`([axis, skipna, numeric_only, ...])	Return the minimum of the values over the requested axis.
`Index.mod`(other[, level, fill_value, axis])
`Index.mul`(other[, level, fill_value, axis])
`Index.nbytes`	Number of bytes
`Index.ndim`	Return dimensionality
`Index.ne`(other[, level, fill_value, axis])
`Index.nlargest`([n, split_every])	Return the largest n elements.
`Index.notnull`()	DataFrame.notnull is an alias for DataFrame.notna.
`Index.nsmallest`([n, split_every])	Return the smallest n elements.
`Index.nunique`([dropna, split_every, split_out])	Return number of unique elements in the object.
`Index.nunique_approx`([split_every])	Approximate number of unique rows.
`Index.persist`([fuse])	Persist this dask collection into memory
`Index.pipe`(func, *args, **kwargs)	Apply chainable functions that expect Series or DataFrames.
`Index.pow`(other[, level, fill_value, axis])
`Index.quantile`([q, method])	Approximate quantiles of Series
`Index.radd`(other[, level, fill_value, axis])
`Index.random_split`(frac[, random_state, shuffle])	Pseudorandomly split dataframe into different pieces row-wise
`Index.rdiv`(other[, level, fill_value, axis])
`Index.rename`(index[, sorted_index])	Alter Series index labels or name
`Index.repartition`([divisions, npartitions, ...])	Repartition a collection
`Index.replace`([to_replace, value, regex])	Replace values given in to_replace with value.
`Index.resample`(rule[, closed, label])	Resample time-series data.
`Index.reset_index`([drop])	Reset the index to the default index.
`Index.rolling`(window, **kwargs)	Provides rolling transformations.
`Index.round`([decimals])	Round a DataFrame to a variable number of decimal places.
`Index.sample`([n, frac, replace, random_state])	Random sample of items
`Index.sem`([axis, skipna, ddof, split_every, ...])	Return unbiased standard error of the mean over requested axis.
`Index.shape`	Return a tuple representing the dimensionality of the DataFrame.
`Index.shift`([periods, freq])	Shift index by desired number of periods with an optional time freq.
`Index.size`	Size of the Series or DataFrame as a Delayed object.
`Index.sub`(other[, level, fill_value, axis])
`Index.to_backend`([backend])	Move to a new DataFrame backend
`Index.to_bag`([index, format])	Create a Dask Bag from a Series
`Index.to_csv`(filename, **kwargs)	See dd.to_csv docstring for more information
`Index.to_dask_array`([lengths, meta, optimize])	Convert a dask DataFrame to a dask array.
`Index.to_delayed`([optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.
`Index.to_frame`([index, name])	Create a DataFrame with a column containing the Index.
`Index.to_hdf`(path_or_buf, key[, mode, append])	See dd.to_hdf docstring for more information
`Index.to_series`([index, name])	Create a Series with both index and values equal to the index keys.
`Index.to_string`([max_rows])	Render a string representation of the Series.
`Index.to_timestamp`([freq, how])	Cast to DatetimeIndex of timestamps, at beginning of period.
`Index.truediv`(other[, level, fill_value, axis])
`Index.unique`([split_every, split_out, ...])	Return Series of unique values in the object.
`Index.value_counts`([sort, ascending, ...])	Return a Series containing counts of unique values.
`Index.values`	Return a dask.array of the values of this dataframe
`Index.visualize`([tasks])	Visualize the expression or task graph
`Index.where`(cond[, other])	Replace values where the condition is False.

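Accessors

Similar to pandas, Dask provides dtype-specific methods under various accessors. These are separate namespaces within Series that only apply to specific data types.

The accessor implementation is consistent with the current Dask DataFrame implementation.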

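Groupby Operations

DataFrame Groupby

`GroupBy.aggregate`([arg, split_every, ...])	Aggregate using one or more specified operations
`GroupBy.apply`(func, *args[, meta, ...])	Parallel version of pandas GroupBy.apply
`GroupBy.bfill`([limit, shuffle_method])	Backward fill the values.
`GroupBy.count`(**kwargs)	Compute count of group, excluding missing values.
`GroupBy.cumcount`()	Number each item in each group from 0 to the length of that group - 1.
`GroupBy.cumprod`([numeric_only])	Cumulative product for each group.
`GroupBy.cumsum`([numeric_only])	Cumulative sum for each group.
`GroupBy.ffill`([limit, shuffle_method])	Forward fill the values.
`GroupBy.get_group`(key)	Construct DataFrame from group with provided name.
`GroupBy.max`([numeric_only])	Compute max of group values.
`GroupBy.mean`([numeric_only, split_out])	Compute mean of groups, excluding missing values.
`GroupBy.min`([numeric_only])	Compute min of group values.
`GroupBy.size`(**kwargs)	Compute group sizes.
`GroupBy.std`([ddof, split_every, split_out, ...])	Compute standard deviation of groups, excluding missing values.
`GroupBy.sum`([numeric_only, min_count])	Compute sum of group values.
`GroupBy.var`([ddof, split_every, split_out, ...])	Compute variance of groups, excluding missing values.
`GroupBy.cov`([ddof, split_every, split_out, ...])	Compute pairwise covariance of columns, excluding NA/null values.
`GroupBy.corr`([split_every, split_out, ...])	Compute pairwise correlation of columns, excluding NA/null values.
`GroupBy.first`([numeric_only, sort])	Compute the first entry of each column within each group.
`GroupBy.last`([numeric_only, sort])	Compute the last entry of each column within each group.
`GroupBy.idxmin`([split_every, split_out, ...])	Return index of first occurrence of minimum over requested axis.
`GroupBy.idxmax`([split_every, split_out, ...])	Return index of first occurrence of maximum over requested axis.
`GroupBy.rolling`(window[, min_periods, ...])	Provides rolling transformations.
`GroupBy.transform`(func[, meta, shuffle_method])	Parallel version of pandas GroupBy.transform

Series Groupby

`SeriesGroupBy.aggregate`([arg, split_every, ...])	Aggregate using one or more specified operations
`SeriesGroupBy.apply`(func, *args[, meta, ...])	Parallel version of pandas GroupBy.apply
`SeriesGroupBy.bfill`([limit, shuffle_method])	Backward fill the values.
`SeriesGroupBy.count`(**kwargs)	Compute count of group, excluding missing values.
`SeriesGroupBy.cumcount`()	Number each item in each group from 0 to the length of that group - 1.
`SeriesGroupBy.cumprod`([numeric_only])	Cumulative product for each group.
`SeriesGroupBy.cumsum`([numeric_only])	Cumulative sum for each group.
`SeriesGroupBy.ffill`([limit, shuffle_method])	Forward fill the values.
`SeriesGroupBy.get_group`(key)	Construct DataFrame from group with provided name.
`SeriesGroupBy.max`([numeric_only])	Compute max of group values.
`SeriesGroupBy.mean`([numeric_only, split_out])	Compute mean of groups, excluding missing values.
`SeriesGroupBy.min`([numeric_only])	Compute min of group values.
`SeriesGroupBy.nunique`([split_every, ...])	Return number of unique elements in the group.
`SeriesGroupBy.size`(**kwargs)	Compute group sizes.
`SeriesGroupBy.std`([ddof, split_every, ...])	Compute standard deviation of groups, excluding missing values.
`SeriesGroupBy.sum`([numeric_only, min_count])	Compute sum of group values.
`SeriesGroupBy.var`([ddof, split_every, ...])	Compute variance of groups, excluding missing values.
`SeriesGroupBy.first`([numeric_only, sort])	Compute the first entry of each column within each group.
`SeriesGroupBy.last`([numeric_only, sort])	Compute the last entry of each column within each group.
`SeriesGroupBy.idxmin`([split_every, ...])	Return index of first occurrence of minimum over requested axis.
`SeriesGroupBy.idxmax`([split_every, ...])	Return index of first occurrence of maximum over requested axis.
`SeriesGroupBy.rolling`(window[, min_periods, ...])	Provides rolling transformations.
`SeriesGroupBy.transform`(func[, meta, ...])	Parallel version of pandas GroupBy.transform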

Custom Aggregation

`Aggregation`(name, chunk, agg[, finalize])	User defined groupby-aggregation.

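Rolling Operations

`Series.rolling`(window, **kwargs)	Provides rolling transformations.
`DataFrame.rolling`(window, **kwargs)	Provides rolling transformations.

`Rolling.apply`(func, *args, **kwargs)	Calculate the rolling custom aggregation function.
`Rolling.count`()	Calculate the rolling count of non NaN observations.
`Rolling.kurt`()	Calculate the rolling Fisher's definition of kurtosis without bias.
`Rolling.max`()	Calculate the rolling maximum.
`Rolling.mean`()	Calculate the rolling mean.
`Rolling.median`()	Calculate the rolling median.
`Rolling.min`()	Calculate the rolling minimum.
`Rolling.quantile`(q)	Calculate the rolling quantile.
`Rolling.skew`()	Calculate the rolling unbiased skewness.
`Rolling.std`()	Calculate the rolling standard deviation.
`Rolling.sum`()	Calculate the rolling sum.
`Rolling.var`()	Calculate the rolling variance.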

Create DataFrames

`read_csv`(path, *args[, header, ...])
`read_table`(path, *args[, header, usecols, ...])
`read_fwf`(path, *args[, header, usecols, ...])
`read_parquet`([path, columns, filters, ...])	Read a Parquet file into a Dask DataFrame
`read_hdf`(pattern, key[, start, stop, ...])
`read_json`(url_path[, orient, lines, ...])	Create a dataframe from a set of JSON files
`read_orc`(path[, engine, columns, index, ...])	Read dataframe from ORC file(s)
`read_sql_table`(table_name, con, index_col[, ...])	Read SQL database table into a DataFrame.
`read_sql_query`(sql, con, index_col[, ...])	Read SQL query into a DataFrame.
`read_sql`(sql, con, index_col, **kwargs)	Read SQL query or database table into a DataFrame.
`from_array`(arr[, chunksize, columns, meta])	Read any sliceable array into a Dask DataFrame
`from_dask_array`(x[, columns, index, meta])	Create a Dask DataFrame from a Dask Array.
`from_delayed`(dfs[, meta, divisions, prefix, ...])	Create Dask DataFrame from many Dask Delayed objects
`from_map`(func, *iterables[, args, meta, ...])	Create a DataFrame collection from a custom function map.
`from_pandas`(data[, npartitions, sort, chunksize])	Construct a Dask DataFrame from a Pandas DataFrame
`DataFrame.from_dict`(data, *[, npartitions, ...])	Construct a Dask DataFrame from a Python Dictionary

Store DataFrames

`to_csv`(df, filename[, single_file, ...])	Store Dask DataFrame to CSV files
`to_parquet`(df, path[, compression, ...])	Store Dask.dataframe to Parquet files
`to_hdf`(df, path, key[, mode, append, ...])	Store Dask DataFrame to Hierarchical Data Format (HDF) files
`to_records`(df)	Create Dask Array from a Dask DataFrame
`to_sql`(df, name, uri[, schema, if_exists, ...])	Store Dask DataFrame to a SQL table
`to_json`(df, url_path[, orient, lines, ...])	Write dataframe into JSON text files

Convert DataFrames

`DataFrame.to_bag`([index, format])	Create a Dask Bag from a Series
`DataFrame.to_dask_array`([lengths, meta, ...])	Convert a dask DataFrame to a dask array.
`DataFrame.to_delayed`([optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.

Convert from/to Legacy DataFrames

`DataFrame.to_legacy_dataframe`([optimize])	Convert to a legacy dask-dataframe collection
`from_legacy_dataframe`(ddf[, optimize])	Create a dask-expr collection from a legacy dask-dataframe collection

Reshape DataFrames

`get_dummies`(data[, prefix, prefix_sep, ...])	Convert categorical variable into dummy/indicator variables.
`pivot_table`(df, index, columns, values[, ...])	Create a spreadsheet-style pivot table as a DataFrame.
`melt`(frame[, id_vars, value_vars, var_name, ...])

Concatenate DataFrames

`DataFrame.merge`(right[, how, on, left_on, ...])	Merge the DataFrame with another DataFrame
`concat`(dfs[, axis, join, ...])	Concatenate DataFrames along rows.
`merge`(left, right[, how, on, left_on, ...])	Merge DataFrame or named Series objects with a database-style join.
`merge_asof`(left, right[, on, left_on, ...])	Perform a merge by key distance.

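Resampling

`Resampler`(obj, rule, **kwargs)	Aggregate using one or more operations
`Resampler.agg`(func, *args, **kwargs)	Aggregate using one or more operations over the specified axis.
`Resampler.count`()	Compute count of group, excluding missing values.
`Resampler.first`()	Compute the first entry of each column within each group.
`Resampler.last`()	Compute the last entry of each column within each group.
`Resampler.max`()	Compute max value of group.
`Resampler.mean`()	Compute mean of groups, excluding missing values.
`Resampler.median`()	Compute median of groups, excluding missing values.
`Resampler.min`()	Compute min value of group.
`Resampler.nunique`()	Return number of unique elements in the group.
`Resampler.ohlc`()	Compute open, high, low and close values of a group, excluding missing values.
`Resampler.prod`()	Compute prod of group values.
`Resampler.quantile`()	Return value at the given quantile.
`Resampler.sem`()	Compute standard error of the mean of groups, excluding missing values.
`Resampler.size`()	Compute group sizes.
`Resampler.std`()	Compute standard deviation of groups, excluding missing values.
`Resampler.sum`()	Compute sum of group values.
`Resampler.var`()	Compute variance of groups, excluding missing values.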

Dask Metadata

`make_meta`(x[, index, parent_meta])	This method creates metadata based on the type of `x`, and on `parent_meta` if supplied.

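Query Planning and Optimization

`DataFrame.explain`([stage, format])	Create a graph representation of the Expression.
`DataFrame.visualize`([tasks])	Visualize the expression or task graph
`DataFrame.analyze`([filename, format])	Outputs statistics about every node in the expression.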

Other functions

`compute`(*args[, traverse, optimize_graph, ...])	Compute several dask collections at once.
`map_partitions`(func, *args[, meta, ...])	Apply Python function on each DataFrame partition.
`map_overlap`(func, df, before, after, *args)	Apply a function to each partition, sharing rows with adjacent partitions.
`to_datetime`()	Convert argument to datetime.
`to_numeric`(arg[, errors, downcast, meta])	Convert argument to a numeric type.
`to_timedelta`()	Convert argument to timedelta.
