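GraphAr: A Standard File Format for Graph Data

GraphAr is a file format for storing graph data. It provides a standardized format for efficient interchange between different applications and systems. With GraphAr, graph data can be imported and exported, persisted to storage, and used directly as a data source for graph processing applications. GraphAr works with a range of data processing frameworks, including Apache Spark, Apache Hive, and Neo4j.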

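GraphAr File Format

Features

GraphAr supports the property graph data model and several representations of graph structure, such as COO, CSR, and CSC. It is also compatible with widely used file types, including CSV, ORC, and Parquet. GraphAr files can be generated, loaded, and transformed efficiently with Apache Spark. The format is also flexible enough to modify a graph's topology or properties, or to construct a new graph from a selected set of vertices and edges.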
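To make the difference between COO and CSR concrete, here is a minimal sketch in plain Python. It does not use GraphAr itself; the function name `coo_to_csr` and the toy edge list are assumptions chosen for illustration. COO stores each edge as a (source, destination) pair, while CSR groups destinations by source and keeps an offset array pointing at the start of each group.

```python
def coo_to_csr(num_vertices, edges):
    """Convert a COO edge list [(src, dst), ...] into CSR (offsets, neighbors)."""
    # Count the out-degree of every source vertex, shifted by one slot.
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    # Prefix-sum the degrees so offsets[v] is where v's neighbors start.
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    # Place each destination into the slice that belongs to its source.
    neighbors = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors


# Toy graph with 3 vertices, given in COO form (hypothetical data).
coo_edges = [(0, 1), (2, 0), (0, 2), (1, 2)]
offsets, neighbors = coo_to_csr(3, coo_edges)

# The out-neighbors of vertex v are neighbors[offsets[v]:offsets[v + 1]].
out_of_0 = neighbors[offsets[0]:offsets[1]]
```

CSC is the same construction with the roles of source and destination swapped, which makes in-neighbor lookups cheap instead of out-neighbor lookups.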

File Format

GraphAr consists of two types of files:

YAML files that describe the meta-information;
data files that store the vertex and edge data.

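Information Files

The meta-information is kept in YAML files. The graph information file, whose name ends in ".graph.yml", describes the meta-information of a graph. Its contents include:

the graph name;
the root directory path of the data files;
the vertex information files and edge information files it includes;
the GraphAr version.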
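The four items above can be pictured as a small record. The sketch below models them with a dataclass and renders them as YAML-like text. The field names (`name`, `prefix`, `vertices`, `edges`, `version`), the file names, and the version string are all assumptions made for illustration; consult the GraphAr file format documentation for the actual keys.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GraphInfo:
    """Toy model of a graph information file (field names are hypothetical)."""
    name: str                                          # graph name
    prefix: str                                        # root directory of the data files
    vertices: List[str] = field(default_factory=list)  # vertex information files
    edges: List[str] = field(default_factory=list)     # edge information files
    version: str = ""                                  # GraphAr version


def render(info: GraphInfo) -> str:
    """Render the record as simple YAML-like text, one key per line."""
    lines = [f"name: {info.name}", f"prefix: {info.prefix}", "vertices:"]
    lines += [f"  - {v}" for v in info.vertices]
    lines.append("edges:")
    lines += [f"  - {e}" for e in info.edges]
    lines.append(f"version: {info.version}")
    return "\n".join(lines)


# Hypothetical values, not taken from a real GraphAr dataset.
info = GraphInfo(
    name="demo",
    prefix="./",
    vertices=["person.vertex.yml"],
    edges=["person_knows_person.edge.yml"],
    version="v1",
)
text = render(info)
print(text)
```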

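A vertex information file, whose name ends in ".vertex.yml", defines a single group of vertices that share the same vertex label. All vertices in the group follow the same schema.

An edge information file, whose name ends in ".edge.yml", defines a single group of edges that share specific labels for the source vertex, the destination vertex, and the edge. It describes the meta-information of these edges.

Note that GraphAr can store several types of adjacency list for a given edge group. For example, when two copies of the data exist in GraphAr (one ordered_by_source, the other ordered_by_dest), the same group of edges can be accessed in both CSR and CSC fashion.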
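The benefit of keeping two orderings of the same edges can be sketched as follows. This is plain Python rather than GraphAr code; the variable names mirror the ordered_by_source and ordered_by_dest labels from the text, and the edge list is a hypothetical toy example. Each ordering makes one direction of neighbor lookup a contiguous range that binary search can find.

```python
from bisect import bisect_left, bisect_right

# Toy edge group as (source, destination) pairs (hypothetical data).
edges = [(0, 1), (2, 0), (0, 2), (1, 2)]

# Two copies of the same edges, each sorted on a different endpoint.
ordered_by_source = sorted(edges, key=lambda e: (e[0], e[1]))
ordered_by_dest = sorted(edges, key=lambda e: (e[1], e[0]))

# Sort keys extracted once so lookups can use binary search.
source_keys = [src for src, _ in ordered_by_source]
dest_keys = [dst for _, dst in ordered_by_dest]


def out_neighbors(v):
    """CSR-style access: destinations of all edges leaving v."""
    lo, hi = bisect_left(source_keys, v), bisect_right(source_keys, v)
    return [dst for _, dst in ordered_by_source[lo:hi]]


def in_neighbors(v):
    """CSC-style access: sources of all edges arriving at v."""
    lo, hi = bisect_left(dest_keys, v), bisect_right(dest_keys, v)
    return [src for src, _ in ordered_by_dest[lo:hi]]
```

With only one ordering, the lookup in the other direction would need a full scan of the edge list.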

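Data Files

Each logical vertex or edge table is divided into multiple physical tables, each stored in one of the supported file formats (CSV, ORC, or Parquet).

See the examples of information files and data files.

For more details about the GraphAr file format, see GraphAr File Format.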
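The split of a logical table into physical chunks comes down to integer arithmetic. The sketch below assumes fixed-size chunks filled in row order, which is an illustrative simplification and not a statement of GraphAr's exact layout. The helper names are hypothetical, and the chunk size of 2^18 is the default vertex chunk size quoted later for save_to.

```python
def chunk_count(num_rows, chunk_size):
    """Number of chunks needed to hold num_rows rows (ceiling division)."""
    return (num_rows + chunk_size - 1) // chunk_size


def locate(row_index, chunk_size):
    """Return (chunk index, offset inside that chunk) for a row."""
    return divmod(row_index, chunk_size)


vertex_chunk_size = 2 ** 18  # 262144 rows per chunk

chunks = chunk_count(1_000_000, vertex_chunk_size)
position = locate(300_000, vertex_chunk_size)
```

Under these assumptions, one million vertices occupy four chunks, and vertex 300000 is row 37856 of the second chunk (index 1).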

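Data Types

Property Data Types

GraphAr supports a set of built-in property data types that are common in real-world use cases and supported by most file formats (CSV, ORC, Parquet), including: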

- Boolean 
- Int32: Integer with 32 bits
- Int64: Integer with 64 bits
- Float: 32-bit floating point values
- Double: 64-bit floating point values
- String: Textual data
- Date: days since the Unix epoch
- Timestamp: milliseconds since the Unix epoch
- List: A list of values of the same type
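The Date and Timestamp definitions above are both integer counts from the Unix epoch. A minimal sketch of that encoding with the Python standard library follows; the helper names are assumptions, and the timestamp is treated as UTC, which the text does not state.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_to_days(d):
    """Encode a date as whole days since the Unix epoch."""
    return (d - EPOCH_DATE).days


def timestamp_to_millis(ts):
    """Encode a timezone-aware datetime as milliseconds since the Unix epoch."""
    # Integer division of timedeltas avoids floating-point rounding.
    return (ts - EPOCH_TS) // timedelta(milliseconds=1)


days = date_to_days(date(2020, 1, 1))
millis = timestamp_to_millis(datetime(2020, 1, 1, tzinfo=timezone.utc))
```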

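GraphAr in GraphScope

GraphScope provides a set of APIs for loading and archiving graph data in the GraphAr format. The GraphScope Python client does this through the save_to and load_from functions.

Saving Graph Data in GraphAr

You can use the save_to function to save a graph in the GraphAr format.

save_to supports the following GraphAr-related parameters:

graphar_graph_name: the name of the graph; defaults to "graph".
graphar_file_type: the file type of the graph data, one of "csv", "orc", or "parquet"; defaults to "parquet".
graphar_vertex_chunk_size: the chunk size of the vertex data in the GraphAr format; defaults to 2^18.
graphar_edge_chunk_size: the chunk size of the edge data in the GraphAr format; defaults to 2^22.
graphar_store_in_local: whether each worker stores its part of the graph data in its local file system; defaults to False.
selector: a filter that selects the subgraph to save; if not specified, the whole graph is saved.

Here is an example: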

import graphscope
from graphscope.dataset import load_ldbc

# initialize a session
sess = graphscope.session(cluster_type="hosts")
# load ldbc graph
g = load_ldbc(sess)

# save the ldbc graph to GraphAr format
r = g.save_to(
    "/tmp/ldbc_graphar/",
    format="graphar",
    graphar_graph_name="ldbc",  # the name of the graph
    graphar_file_type="parquet",  # the file type of the graph data
    graphar_vertex_chunk_size=1024,  # the chunk size of the vertex data
    graphar_edge_chunk_size=4096,  # the chunk size of the edge data
)
# the result is a dictionary that contains the format and the URI path of the saved graph
print(r)
# {"format": "graphar", "uri": "graphar+file:///tmp/ldbc_graphar/ldbc.graph.yaml"}
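The returned URI combines a "graphar+file" scheme with the path of the graph information file. If you need the directory or file name back, the standard library can split it. The sketch below works on a copy of the result dictionary shown above and does not call GraphScope; the variable names are chosen for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Copy of the dictionary returned by save_to in the example above.
result = {
    "format": "graphar",
    "uri": "graphar+file:///tmp/ldbc_graphar/ldbc.graph.yaml",
}

parsed = urlparse(result["uri"])
scheme_parts = parsed.scheme.split("+")   # format name and storage scheme
info_path = PurePosixPath(parsed.path)    # path of the graph information file

data_dir = str(info_path.parent)
info_file = info_path.name
```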

You can also pass the selector parameter to save_to to save a subgraph in the GraphAr format. For example:

import graphscope
from graphscope.dataset import load_ldbc

# initialize a session
sess = graphscope.session(cluster_type="hosts")
# load ldbc graph
g = load_ldbc(sess)

# define the selector
# we only want to save the "person" and "comment" vertices and the "knows" and "likes" edges
# with the specified properties
selector = {
    "vertices": {
        "person": ["id", "firstName", "lastName"],
        "comment": None,  # None means all properties
    },
    "edges": {
        "knows": ["creationDate"],
        "likes": ["creationDate"],
    },
}

# save the subgraph to GraphAr format
r = g.save_to(
    "/tmp/ldbc_subgraph_graphar/",
    format="graphar",
    selector=selector,
    graphar_graph_name="ldbc_subgraph",  # the name of the graph
    graphar_file_type="parquet",  # the file type of the graph data
    graphar_vertex_chunk_size=1024,  # the chunk size of the vertex data
    graphar_edge_chunk_size=4096,  # the chunk size of the edge data
)
# the result is a dictionary that contains the format and the URI path of the saved graph
print(r)
# {"format": "graphar", "uri": "graphar+file:///tmp/ldbc_subgraph_graphar/ldbc_subgraph.graph.yaml"}
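The selector's meaning (keep only the listed labels, with None standing for all properties) can be illustrated with a toy model. The function below is not GraphScope's implementation. `apply_selector` and the schema dictionary are hypothetical, and the behavior for properties that do not exist in the schema (silently dropped here) is an assumption.

```python
def apply_selector(schema, selector=None):
    """Return the part of a toy schema that a selector would keep."""
    if selector is None:
        return schema  # no selector: keep the whole graph
    selected = {}
    for kind in ("vertices", "edges"):
        selected[kind] = {}
        for label, props in selector.get(kind, {}).items():
            available = schema[kind][label]
            if props is None:
                # None means every property of this label.
                selected[kind][label] = list(available)
            else:
                selected[kind][label] = [p for p in props if p in available]
    return selected


# Hypothetical schema: label -> list of property names.
schema = {
    "vertices": {
        "person": ["id", "firstName", "lastName", "gender"],
        "comment": ["id", "content"],
        "post": ["id"],
    },
    "edges": {
        "knows": ["creationDate"],
        "likes": ["creationDate"],
        "replyOf": [],
    },
}

selector = {
    "vertices": {"person": ["id", "firstName", "lastName"], "comment": None},
    "edges": {"knows": ["creationDate"], "likes": ["creationDate"]},
}

subgraph = apply_selector(schema, selector)
```

Labels that the selector does not mention, such as "post" and "replyOf" in this toy schema, are left out of the result.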

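Loading GraphAr Data into GraphScope

You can use the load_from function to load a graph from data in the GraphAr format.

load_from supports the following GraphAr-related parameters:

graphar_store_in_local: whether the graph data is stored in the local file system of each worker; defaults to False.
selector: a filter that selects the subgraph to load; if not specified, the whole graph is loaded.

Here is an example: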

import graphscope
from graphscope import pagerank
from graphscope.framework.graph import Graph

# initialize a session
sess = graphscope.session(cluster_type="hosts")

# assume the graph data was saved in the "/tmp/ldbc_graphar/" directory and its graph
# information file is "ldbc.graph.yaml", so the URI is "graphar+file:///tmp/ldbc_graphar/ldbc.graph.yaml"
uri = "graphar+file:///tmp/ldbc_graphar/ldbc.graph.yaml"

# load the graph from GraphAr format
g = Graph.load_from(uri, sess)
print(g.schema)

# do some graph processing
pg = g.project(vertices={"person": ["id"]}, edges={"knows": []})
ctx = pagerank(pg, max_round=10)
df = ctx.to_dataframe(selector={"id": "v.data", "r": "r"})
print(df)

You can also pass the selector parameter to load_from to load a subgraph of the full ldbc dataset stored in the GraphAr format. For example:

import graphscope
from graphscope import pagerank
from graphscope.framework.graph import Graph

# initialize a session
sess = graphscope.session(cluster_type="hosts")

# assume the ldbc data was saved in the "/tmp/ldbc_graphar/" directory and its graph
# information file is "ldbc.graph.yaml", so the URI is "graphar+file:///tmp/ldbc_graphar/ldbc.graph.yaml"
uri = "graphar+file:///tmp/ldbc_graphar/ldbc.graph.yaml"

# define the selector: load only the "person" and "comment" vertices and the "knows" and "likes" edges
selector = {
    "vertices": {
        "person": None,
        "comment": None,  # None means all properties
    },
    "edges": {
        "knows": None,
        "likes": None,
    },
}
g = Graph.load_from(uri, sess, selector=selector)
print(g.schema)

# do some graph processing
pg = g.project(vertices={"person": ["id"]}, edges={"knows": []})
ctx = pagerank(pg, max_round=10)
df = ctx.to_dataframe(selector={"id": "v.data", "r": "r"})
print(df)

For more examples of using GraphAr in GraphScope, see test_graphar.