Immutable In-Memory Graphs on Vineyard

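Vineyard is a distributed, immutable in-memory data manager that serves as the storage backend for immutable graphs in GraphScope. It provides zero-copy data sharing through memory mapping, so the different compute engines in GraphScope can run on the same vineyard cluster and share graph data efficiently.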

Graphs in Vineyard

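Vineyard supports immutable property graphs and abstracts them as the vineyard::ArrowFragment class, which consists of a CSR (compressed sparse row) structure for the edges plus tables that store the edge and vertex properties. On top of ArrowFragment, vineyard abstracts a distributed graph as vineyard::ArrowFragmentGroup, which consists of a set of fragments distributed across the cluster.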
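
To make the CSR-plus-table layout concrete, here is a minimal, self-contained sketch. It is not vineyard's implementation: TinyCsr, BuildCsr, SumOutWeights, and the weight column are hypothetical names invented for illustration, and plain std::vector values stand in for the arrays and tables that ArrowFragment actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical toy structure, not a vineyard type.
struct TinyCsr {
  // offsets[v] .. offsets[v + 1] delimits the out-edges of vertex v.
  std::vector<std::size_t> offsets;
  // Destination vertex of each edge, grouped by source vertex.
  std::vector<std::int64_t> neighbors;
  // Row index into the edge property table for each edge.
  std::vector<std::size_t> edge_rows;
};

// Builds a CSR from (src, dst) pairs; input edge i keeps property row i.
TinyCsr BuildCsr(
    std::size_t vertex_num,
    const std::vector<std::pair<std::int64_t, std::int64_t>>& edges) {
  TinyCsr csr;
  csr.offsets.assign(vertex_num + 1, 0);
  // Pass 1: count the out-degree of every source vertex.
  for (const auto& e : edges) {
    assert(e.first >= 0 && static_cast<std::size_t>(e.first) < vertex_num);
    ++csr.offsets[static_cast<std::size_t>(e.first) + 1];
  }
  // A prefix sum turns the counts into start offsets.
  for (std::size_t v = 0; v < vertex_num; ++v) {
    csr.offsets[v + 1] += csr.offsets[v];
  }
  // Pass 2: place every edge into the next free slot of its source vertex.
  csr.neighbors.resize(edges.size());
  csr.edge_rows.resize(edges.size());
  std::vector<std::size_t> cursor(csr.offsets.begin(), csr.offsets.end() - 1);
  for (std::size_t i = 0; i < edges.size(); ++i) {
    const std::size_t pos = cursor[static_cast<std::size_t>(edges[i].first)]++;
    csr.neighbors[pos] = edges[i].second;
    csr.edge_rows[pos] = i;
  }
  return csr;
}

// Sums one property column over the out-edges of vertex v.
double SumOutWeights(const TinyCsr& csr, const std::vector<double>& weight,
                     std::size_t v) {
  double total = 0.0;
  for (std::size_t pos = csr.offsets[v]; pos < csr.offsets[v + 1]; ++pos) {
    total += weight[csr.edge_rows[pos]];
  }
  return total;
}
```

The adjacency arrays hold only topology, and each edge slot points back to a row of the property table. Keeping the two apart is what lets the properties live in a columnar table.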

Loading Graph Data into Vineyard

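Vineyard can be deployed as a standalone service or launched together with GraphScope. The command-line tool vineyard-graph-loader loads fragments into vineyard. Its first argument is the optional --socket, which specifies the IPC socket the loader connects to; if it is omitted, the value is read from the environment variable VINEYARD_IPC_SOCKET. The remaining configuration is given either as a series of command-line arguments or as a JSON file.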

$ vineyard-graph-loader --help
Usage: loading vertices and edges as vineyard graph.

    -     ./vineyard-graph-loader [--socket <vineyard-ipc-socket>] \
                                   <e_label_num> <efiles...> <v_label_num> <vfiles...> \
                                   [directed] [generate_eid] [retain_oid] [string_oid]

    - or: ./vineyard-graph-loader [--socket <vineyard-ipc-socket>] --config <config.json>

          The config is a json file and should look like

          {
              "vertices": [
                  {
                      "data_path": "....",
                      "label": "...",
                      "options": "...."
                  },
                  ...
              ],
              "edges": [
                  {
                      "data_path": "",
                      "label": "",
                      "src_label": "",
                      "dst_label": "",
                      "options": ""
                  },
                  ...
              ],
              "directed": 1, # 0 or 1
              "generate_eid": 1, # 0 or 1
              "retain_oid": 1, # 0 or 1
              "string_oid": 0, # 0 or 1
              "local_vertex_map": 0 # 0 or 1
          }
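
The socket lookup described earlier (an explicit --socket in the leading position, otherwise VINEYARD_IPC_SOCKET) can be sketched as follows. This is only an illustration of the precedence rule, not the loader's actual code: ResolveIpcSocket and ResolveIpcSocketFromEnv are hypothetical helpers, and returning an empty string for "not configured" is an assumption made here.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper, not part of vineyard. `args` holds the arguments
// after the program name; `env_value` is the value of VINEYARD_IPC_SOCKET,
// or nullptr when the variable is unset.
std::string ResolveIpcSocket(const std::vector<std::string>& args,
                             const char* env_value) {
  // An explicit "--socket <path>" in the leading position wins.
  if (args.size() >= 2 && args[0] == "--socket") {
    return args[1];
  }
  // Otherwise fall back to the environment; empty means "not configured".
  return env_value != nullptr ? std::string(env_value) : std::string();
}

// Convenience wrapper that reads the real environment.
std::string ResolveIpcSocketFromEnv(const std::vector<std::string>& args) {
  return ResolveIpcSocket(args, std::getenv("VINEYARD_IPC_SOCKET"));
}
```

Passing the environment value in as a parameter keeps the precedence logic easy to exercise without touching the process environment.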

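Several options control how the graph is built:

  • directed: whether the graph is directed or undirected.

  • generate_eid: whether to generate a globally unique edge ID for each edge.

  • retain_oid: whether to retain the original vertex IDs in the final vertex property tables.

  • string_oid: whether the vertex IDs are strings.

  • local_vertex_map: whether to use a local vertex map during graph construction, typically to reduce memory usage.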

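The "modern graph" example dataset can be loaded with vineyard-graph-loader in either of two ways:

  • Using command-line arguments

    vineyard-graph-loader accepts a sequence of command-line arguments that specify the edge files and vertex files, for example: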

    $ ./vineyard-graph-loader 2 "modern_graph/knows.csv#header_row=true&src_label=person&dst_label=person&label=knows&delimiter=|" \
                                "modern_graph/created.csv#header_row=true&src_label=person&dst_label=software&label=created&delimiter=|" \
                              2 "modern_graph/person.csv#header_row=true&label=person&delimiter=|" \
                                "modern_graph/software.csv#header_row=true&label=software&delimiter=|"
    
  • Using a JSON configuration file

    $ ./vineyard-graph-loader --config config.json
    

    An example JSON configuration file for the "modern graph" looks like this:

       {
           "vertices": [
               {
                   "data_path": "modern_graph/person.csv",
                   "label": "person",
                   "options": "header_row=true&delimiter=|"
               },
               {
                   "data_path": "modern_graph/software.csv",
                   "label": "software",
                   "options": "header_row=true&delimiter=|"
               }
           ],
           "edges": [
               {
                   "data_path": "modern_graph/knows.csv",
                   "label": "knows",
                   "src_label": "person",
                   "dst_label": "person",
                   "options": "header_row=true&delimiter=|"
               },
               {
                   "data_path": "modern_graph/created.csv",
                   "label": "created",
                   "src_label": "person",
                   "dst_label": "software",
                   "options": "header_row=true&delimiter=|"
               }
           ],
           "directed": 1,
           "generate_eid": 1,
           "string_oid": 0,
           "local_vertex_map": 0
       }
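
In the command-line form, each file argument bundles the path and its options into one string (path#key=value&key=value), while the JSON form keeps data_path and options in separate fields. The sketch below shows one way such a string could be split. SplitLocation is a hypothetical helper written for this illustration; it is not vineyard's own parser, and the handling of keys without a value is an assumption.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper, not vineyard's parser: splits "path#k1=v1&k2=v2"
// into the file path and a key/value map of loader options.
std::pair<std::string, std::map<std::string, std::string>> SplitLocation(
    const std::string& location) {
  std::map<std::string, std::string> options;
  const std::size_t hash = location.find('#');
  if (hash == std::string::npos) {
    return {location, options};  // No options attached to the path.
  }
  const std::string query = location.substr(hash + 1);
  std::size_t begin = 0;
  while (begin < query.size()) {
    std::size_t end = query.find('&', begin);
    if (end == std::string::npos) {
      end = query.size();
    }
    const std::string item = query.substr(begin, end - begin);
    const std::size_t eq = item.find('=');
    if (eq != std::string::npos) {
      options[item.substr(0, eq)] = item.substr(eq + 1);
    } else if (!item.empty()) {
      options[item] = "";  // Assumption: a bare key maps to an empty value.
    }
    begin = end + 1;
  }
  return {location.substr(0, hash), options};
}
```

Applied to the knows.csv argument from the command-line example, this yields the path modern_graph/knows.csv and five options, including delimiter mapped to the | character.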
    

Using the Loaded Graph

Once the graph has been loaded into vineyard, the loaded fragments can be accessed with vineyard's IPCClient:

void WriteOut(vineyard::Client& client, const grape::CommSpec& comm_spec,
              vineyard::ObjectID fragment_group_id) {
  LOG(INFO) << "Loaded graph to vineyard: " << fragment_group_id;
  std::shared_ptr<vineyard::ArrowFragmentGroup> fg =
      std::dynamic_pointer_cast<vineyard::ArrowFragmentGroup>(
          client.GetObject(fragment_group_id));

  for (const auto& pair : fg->Fragments()) {
    LOG(INFO) << "[frag-" << pair.first << "]: " << pair.second;
  }

  // NB: only retrieve local fragments.
  auto locations = fg->FragmentLocations();
  for (const auto& pair : fg->Fragments()) {
    if (locations.at(pair.first) != client.instance_id()) {
      continue;
    }
    auto frag_id = pair.second;
    Traverse(client, frag_id);
  }
}
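
The "only retrieve local fragments" step above boils down to filtering one map by another. The following standalone sketch isolates that logic with plain standard containers. FragmentId, ObjectId, InstanceId, and LocalFragments are stand-in names chosen here, not vineyard types, and the sketch skips fragments with no recorded location instead of throwing as the .at() call above would.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Stand-in aliases for this sketch; the real code uses vineyard's ID types.
using FragmentId = std::uint32_t;
using ObjectId = std::uint64_t;
using InstanceId = std::uint64_t;

// Returns the object IDs of the fragments placed on `local_instance`,
// in ascending fragment-ID order.
std::vector<ObjectId> LocalFragments(
    const std::map<FragmentId, ObjectId>& fragments,
    const std::map<FragmentId, InstanceId>& locations,
    InstanceId local_instance) {
  std::vector<ObjectId> local;
  for (const auto& pair : fragments) {
    const auto it = locations.find(pair.first);
    // Skip fragments on other instances or without a known location.
    if (it != locations.end() && it->second == local_instance) {
      local.push_back(pair.second);
    }
  }
  return local;
}
```

Each worker would call such a filter with its own instance ID, so that every fragment is traversed exactly once across the cluster.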

The local fragments can then be traversed with the vineyard::ArrowFragment API:

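// GraphType and LabelType are not defined in this snippet; they are assumed to be
// an instantiation of vineyard::ArrowFragment and its label ID type, respectively.
void Traverse(vineyard::Client& client, vineyard::ObjectID frag_id) {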
  auto frag = std::dynamic_pointer_cast<GraphType>(client.GetObject(frag_id));
  LOG(INFO) << "graph total node number: " << frag->GetTotalNodesNum();
  LOG(INFO) << "fragment edge number: " << frag->GetEdgeNum();
  LOG(INFO) << "fragment in edge number: " << frag->GetInEdgeNum();
  LOG(INFO) << "fragment out edge number: " << frag->GetOutEdgeNum();

  for (LabelType vlabel = 0; vlabel < frag->vertex_label_num(); ++vlabel) {
    LOG(INFO) << "vertex table: " << vlabel << " -> "
              << frag->vertex_data_table(vlabel)->schema()->ToString();
  }
  for (LabelType elabel = 0; elabel < frag->edge_label_num(); ++elabel) {
    LOG(INFO) << "edge table: " << elabel << " -> "
              << frag->edge_data_table(elabel)->schema()->ToString();
  }

  LOG(INFO) << "--------------- consolidate vertex/edge table columns ...";

  if (frag->vertex_data_table(0)->columns().size() >= 4) {
    for (LabelType vlabel = 0; vlabel < frag->vertex_label_num(); ++vlabel) {
      LOG(INFO) << "vertex table: " << vlabel << " -> "
                << frag->vertex_data_table(vlabel)->schema()->ToString();
    }
  }

  if (frag->edge_data_table(0)->columns().size() >= 4) {
    for (LabelType elabel = 0; elabel < frag->edge_label_num(); ++elabel) {
      LOG(INFO) << "edge table: " << elabel << " -> "
                << frag->edge_data_table(elabel)->schema()->ToString();
    }
  }
}