mars.dataframe.read_parquet

mars.dataframe.read_parquet(path, engine: str = 'auto', columns: Optional[list] = None, groups_as_chunks: bool = False, use_arrow_dtype: Optional[bool] = None, incremental_index: bool = False, storage_options: Optional[dict] = None, memory_scale: Optional[int] = None, merge_small_files: bool = True, merge_small_file_options: Optional[dict] = None, gpu: Optional[bool] = None, **kwargs)

Load a Parquet object from the given file path and return a DataFrame.

Parameters

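path (str, path object or file-like object) – Any valid string path is acceptable. The string can be a URL. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.parquet. A file URL can also point to a directory that contains multiple partitioned Parquet files. Both pyarrow and fastparquet support directory paths as well as file URLs. A directory path could be: file://localhost/path/to/tables. A file-like object is any object with a read() method, such as a file handle (e.g. one returned by the builtin open function) or StringIO.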
engine ({'auto', 'pyarrow', 'fastparquet'}, default 'auto') – Parquet library to use. The default behavior is to try 'pyarrow' and fall back to 'fastparquet' if 'pyarrow' is unavailable.
columns (list, default None) – If not None, only these columns are read from the file.
groups_as_chunks (bool, default False) – If True, each row group corresponds to one chunk. If False, each file corresponds to one chunk. Only available with the 'pyarrow' engine.
incremental_index (bool, default False) – If index_col is not specified, ensure that the range index is incremental. Setting this to False gives slightly better performance.
use_arrow_dtype (bool, default None) – If True, use Arrow dtypes to store columns.
storage_options (dict, optional) – Options for the storage connection.
memory_scale (int, optional) – Ratio of the actual memory footprint to the raw file size.
merge_small_files (bool, default True) – Whether to merge small files.
merge_small_file_options (dict) – Options for merging small files.
**kwargs – Any additional kwargs are passed to the engine.
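
The fallback described for engine='auto' can be sketched roughly as follows. This is an illustrative sketch only, not Mars's actual implementation: resolve_engine and its is_available argument are hypothetical names introduced here, and availability is approximated by checking whether the module can be found on the import path.

```python
from importlib.util import find_spec


def resolve_engine(engine="auto", is_available=None):
    """Hypothetical helper: choose a Parquet library as described for 'auto'."""
    if is_available is None:
        # Assumption: a library counts as available if it can be located.
        def is_available(name):
            return find_spec(name) is not None

    if engine != "auto":
        # An explicit choice is returned as-is after a basic sanity check.
        if engine not in ("pyarrow", "fastparquet"):
            raise ValueError(f"unsupported engine: {engine!r}")
        return engine

    # 'auto': prefer pyarrow, fall back to fastparquet.
    for candidate in ("pyarrow", "fastparquet"):
        if is_available(candidate):
            return candidate
    raise ImportError("neither pyarrow nor fastparquet is installed")
```

Passing a custom is_available callable makes the fallback easy to exercise without installing or uninstalling either library.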

Return type

Mars DataFrame
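
A minimal usage sketch follows, assuming Mars and pyarrow are installed and that a default local session is acceptable. The helper name read_selected_columns is hypothetical and exists only for illustration; the path and column names passed to it are placeholders to be supplied by the caller.

```python
def read_selected_columns(path, columns):
    """Hypothetical helper: read only `columns` from the Parquet data at `path`."""
    # Imported lazily so the helper can be defined without Mars installed.
    import mars
    import mars.dataframe as md

    # Assumption: a default local session is sufficient for this example.
    mars.new_session()

    # One chunk per row group; this option requires the 'pyarrow' engine.
    df = md.read_parquet(
        path,
        engine="pyarrow",
        columns=columns,
        groups_as_chunks=True,
    )

    # Trigger execution and fetch the result as a pandas DataFrame.
    return df.execute().fetch()
```

For example, read_selected_columns("file://localhost/path/to/table.parquet", ["a", "b"]) would read two columns, where "a" and "b" stand in for column names present in the file.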