GraphScope在图处理领域的旅程
图处理

背景

阿里巴巴的指数级数据增长

大约10年前,业务呈指数级扩展,因此数据也在迅速增长。 其中许多自然可以被建模为图,图计算的需求在许多领域中出现, 例如

  • E-commerce transactions
  • Navigation routes and POIs
  • Video creator ratings

早期采用

最初的图问题被单独解决。

2013 ~ 2020

获取图数据的洞察

图遍历探索

图遍历 - 访问图中节点的过程 - 是许多在线和交互式图应用程序中的关键原语,例如,

事实上的

Gremlin-基于图的遍历

Gremlin 是事实上的标准语言,允许对各种图操作进行高级和声明式编程。

Example

循环检测:用于在图中发现一个循环,即路径回环到起始顶点。


g.V().has('name','tom').as('a').repeat(out().simplePath())
 .times(LENGTH).where(out().as('a')).path()

MaxGraphGAIA,
扩展Gremlin查询的系统

MaxGraph的架构。其中,GAIA是一个基于MaxGraph的编译和执行的优化组件。

Example

Gremlin 查询编译

如循环检测示例所示,Gremlin查询可以是迭代和嵌套操作的任意组合。

g.V().has('firstname','Tom').as('a')
  .repeat(out().simplePath()).times(k)
  .where(out().eq('a')).path()

The gremlin query is compiled to a dataflow plan, some operators are assigned SCOPE with context for parallel processing

图分析,
全图计算

Example

实体解析:识别并链接同一现实世界实体的不同表示。这并非易事,面临的挑战有:

  • Iterative and intensive computation
  • Large scale and heterogeneous data, dynamic changes (PBs of data with billions of records across hundreds data sources with TBs of update daily)
  • Multi-domain connections/relations between possible entities
  • Noises and errors, incomplete/missing data

之前的解决方案?

顶点中心模型的局限性

Paradigm of Pregel and GAS, representatives of "Think-like-a-vertex" model.

我们曾经在一个内部的以顶点为中心的图系统ODPSGraph上工作,以并行化实体解析。然而,多年来出现了越来越多的挑战。

  • 编程困难,有时需要完全重新设计算法。
  • 不断在临时权衡中挣扎。 对于许多算法,在以顶点为中心的模型中有效地并行化它们通常意味着失去精度或质量。
  • 耗时且成本巨大。 通常需要几位高级工程师数月的努力才能将一个算法上线。
  • 性能不佳。

以顶点为中心的替代方案?

我们回答是!

PIE 模型和 GRAPE 系统

我们在SIGMOD'2017上介绍了PIE和GRAPE,并在 https://github.com/alibaba/libgrape-lite 开源了它。

给定一个查询 Q 和一个图 G,要计算 Q(G),用户只需要提供 3 个函数。

  • PEval: 一个用于 Q 的(顺序)算法,用于部分评估
  • IncEval: 一个用于 Q 的(顺序)增量算法
  • Assemble: 一个通常只是取部分结果并集的函数。

SIGMOD'2017

最佳论文奖

VLDB'2017

最佳演示奖

SIGMOD'2018

研究亮点

2018年

GNN 崛起

Example

基于GNN的推荐

Graph-Learn, 一个GNN框架

在VLDB'2019上展示,并在
https://github.com/alibaba/graph-learn
开源。它已经在阿里巴巴内外的许多场景中成功应用。

专门的图模式匹配、挖掘...

专门的图形应用程序也被广泛采用。我们列出了我们的一些研究...

  • [VLDB'2018] Real-time Constrained Cycle Detection in Large Dynamic Graphs.
  • [VLDB'2020] Maximum Biclique Search at Billion Scale.
  • ...

VLDB'2020

最佳论文(亚军)

一个问题.. 图系统变成了孤岛

凝聚我们的努力

GraphScope的出现

2018 ~ 2024

现实世界的图应用通常涉及多种类型的图计算

Example

电子商务平台中欺诈检测的简化工作流程:

  • 使用SQL从原始数据构建属性图;
  • 使用Gremlin提取子图;
  • 用于识别欺诈实体的标签传播算法;
  • 通过权重进行k跳采样的图采样;
  • 使用TensorFlow训练GNN模型

GraphScope: 大图处理的统一引擎

我们在VLDB'2021上展示了GraphScope,并在
https://github.com/alibaba/graphscope

  • 一个简单且统一的编程接口;
  • 一个分布式数据流运行时,允许在一个精心设计的连贯框架中对每个图操作进行单独优化。
  • 一个内存数据存储,自动管理中间数据的表示、转换和移动。
  • 拥抱Python,使我们能够无缝地将GraphScope与其他数据处理系统结合。

为什么选择Python?

交互式计算,所见即所得
  • Jupyter Notebook
  • Analysis, model, extract, visualize…
丰富的生态系统,端到端,一站式
Ability to process various data and tasks, e.g.,
  • Json, text, tensor, SQL, image, video…
  • scientific/graph computing, ML…
高级数据和 操作抽象
  • Tensor、Dataframe、Graph、Scalar、Objects …
  • Operations defined on high-level

易用性:与NetworkX兼容

pip install graphscope

兼容的图操作和算法API与NetworkX

示例

如何与其他PyData系统互操作?

Vineyard: 内存中不可变数据管理

我们在SIGMOD'2023上介绍了Vineyard,并在
https://github.com/v6d-io/v6d上开源了它,Vineyard是一个CNCF沙盒项目。

为什么我们需要 vineyard?

  • many big data systems work with (partitioned) immutable in-memory data, however,
  • it takes huge effort just to implement I/O adaptors, data partition/chunking strategies, fault-tolerance mechanisms, data access RPC, scale in/out.
  • sharing intermediate results between systems relies on external FS, although I/O cost is usually very high.
  • no efficient way to dynamically sync immutable data with external mutable data sources.

重新审视Vineyard的案例研究

在PyData生态系统中共享数据

Vineyard 提供:

  • Out-of-box high-level abstraction (e.g. Edge/Vertex Iterator over graphs instead of k/v lookup) and mid-to low-level APIs for advanced developers to tailor special needs.
  • zero-copy local data access through shared memory
  • Transparent (or very similar to accessing local data) remote data access support wide range of data sources/formats, data partition strategies
  • (fast) sync with mutable data sources
  • Fault-tolerant
  • ...

我们完成了吗?

下一次迭代

GraphScope Flex
GraphScope演进的下一阶段

2023 ~ 当前

GraphScope的实际挑战

一刀切的方法是不够的

Example

该图展示了现实世界中图系统的简化视图。它具有以下特点

  • Various business scenarios
  • Diverse graph workloads
  • Many graphs in different formats

多样性来源于

1. 各种图模型和组织

即使是单个数据集也可以根据其特定需求以不同的方式进行建模。

Cited from SIGMOD'2024 Panel: The Future of Graph Analytics

2. 多样化的图计算工作负载

和编程接口

用于图形查询

  • Cypher
  • Gremlin
  • ISO/GQL

用于图分析

  • Think-like-a-vertex: Pregel, Gather-Apply-Scatter(GAS) …
  • Think-like-a-graph: Pregel++, Blogel, PIE…
  • Other DSL: FLASH, GraphIt…

用于图学习

  • Mostly in Python
  • PyG, DGL, …

3. 性能考虑

  • High request throughput? vs. low latency with data-parallel execution?
  • In-memory vs. out-of-core?
  • Efficient processing vs. resources utilization
  • High availability?

图计算的多样化景观激发了GraphScope的诞生

演进至GraphScope Flex

  • Employs a LEGO-like modularity;
  • Comprises many components, each like a LEGO brick;
  • A CLI utility flexbuild, to select components and build artifacts for deployments.
Architecture of GraphScope Flex

如何处理图存储的多样性?

理解图存储抽象的复杂性

图存储可以多种多样。计算引擎访问数据的要求也各不相同。

Diversed feature requests of graph storage.

GRIN: 统一的图检索接口

开源在 https://github.com/graphscope/GRIN

GRIN 是 GraphScope 中提出的标准图检索接口。其目标是简化不同计算引擎和存储引擎之间的集成,从 M * N 减少到 M + N。

Storages without/with GRIN.

其中一个存储后端支持GRIN

GraphAr: 用于归档和交换图数据的开源文件格式

作为Apache孵化项目开源
https://github.com/apache/GraphAr

GraphAr(“Graph Archive”的缩写)是一个旨在使各种应用和系统(内存和外部存储、数据库、图计算系统以及交互式图查询框架)更方便和高效地构建和访问图数据的项目。

Data sharing without/with GraphAr.

用于图形查询

在GraphScope Flex中查询堆栈

  • Supports both Gremlin and Cypher in front-ends;
  • Provides a unified Intermediate Representation(IR) Abstraction, parsing Gremlin/Cypher… to unified IR;
  • And IR based Optimizer, and two engines,
  • Gaia, using a dataflow model, for OLAP-like jobs;
  • Hiactor, using the actor model, for OLTP-like jobs.
Example

实际应用:实时欺诈检测

问题: 通过检查每个订单与已知欺诈行为的对比,识别电子商务中的可疑交易。

Fraud detection and its Cypher solution.

这个问题可以通过部署GraphScope Flex与这些组件来解决。

性能

No.1 在LDBC SNB基准测试中

用于图分析

GraphScope Flex中的分析堆栈

  • Supports multiple interfaces
    • Python, NetworkX compatible APIs
    • Java SDK, Pregel(Giraph)/GraphX compatible APIs
    • C++ SDK, GRAPE API
  • 100+ Built-in algorithms, out-of-the-box ready to use.
    • PageRank, Centralities…
    • Reachability and shortest paths…
    • Community detection, LPA, Louvain…
  • At its core, is the GRAPE engine.
    • Supports distributed graph computing engine with auto-parallelization
    • Integrated Ingress for auto-incrementalization
    • Integrated FLASH model with great expressive capability
    • Supports GPU acceleration
Example

实际应用:股票分析

问题: 识别负责引导公司的主要股东,即持有超过51%股份的股东。

Person C holds more than 51% of Company 1, via Company 2, (0.8*0.6 = 0.48) and Company 3, (0.8*0.3*0.7 = 0.168).

这个问题由GraphScope Flex分析堆栈解决,其实现了一个基于标签传播的分析算法。

想了解更多详情吗?

GraphScope Flex
技术预览版现已推出!

https://github.com/alibaba/graphscope

未来工作

  • Enhancing the core capabilities
    • GQL support
    • New storage backends
    • GraphAr with Data Lakes
    • HTAP processing on graphs
    • Applications in real-life scenarios
    • ...
  • Inter-operations with other systems
    • Graph-specific ETL to streamline the integration of applications across different graph models derived from the same dataset.
    • In scenarios blending graph tasks with SQL-like operations, a unified compiler across multiple engines can significantly enhance work- flow interoperability and expand the scope of graph computations.
    • ...

我们的终极目标

让图计算变得简单且对每个人可用!

欢迎加入我们的团队!

阅读更多和有用的链接

Github

博客

Papers

游乐场

参考文献

  1. GraphScope Flex: LEGO-like Graph Computing Stack, SIGMOD2024, Tao He, Shuxian Hu, Longbin Lai, Dongze Li, Neng Li, Xue Li, Lexiao Liu, Xiaojian Luo, Binqing Lyu, Ke Meng, Sijie Shen, Li Su, Lei Wang, Jingbo Xu, Wenyuan Yu, Weibin Zeng, Lei Zhang, Siyuan Zhang, Jingren Zhou, Xiaoli Zhou, Diwen Zhu.
  2. Dynamic Graph Sampling Service for Real-time GNN Inference at Scale, EuroSys 2023. Jie Sun, Li Su, Wenting Shen, Zichao Zhang, Zuocheng Shi, Jingbo Xu, Yong Li, Wenyuan Yu, Zeke Wang, Fei Wu, Jingren Zhou.
  3. Bridging the Gap between Relational OLTP and Graph-based OLAP, USENIX ATC 2023. Sijie Shen, Zihang Yao, Lin Shi, Lei Wang, Longbin Lai, Qian Tao, Li Su, Rong Chen, Wenyuan Yu, Haibo Chen, Binyu Zang, Jingren Zhou.
  4. GLogS: Interactive Graph Pattern Matching Query At Large Scale, USENIX ATC 2023. Longbin Lai, Yufan Yang, Zhibin Wang, Yuxuan Liu, Haotian Ma, Sijie Shen, Bingqing Lyu, Xiaoli Zhou, Wenyuan Yu, Zhengping Qian, Chen Tian, Sheng Zhong, Yeh-Ching Chung, Jingren Zhou.
  5. Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training, USENIX ATC 2023. Jie Sun, Li Su, Zuocheng Shi, Wenting Shen, Zeke Wang, Lei Wang, Jie Zhang, Wenyuan Yu, Yong Li, Jingren Zhou, Fei Wu.
  6. Vineyard: Optimizing Data Sharing in Data-Intensive Analytics, SIGMOD 2023. Wenyuan Yu, Tao He, Lei Wang, Ke Meng, Ye Cao, Diwen Zhu, Sanhong Li, Jingren Zhou.
  7. Efficient Multi-GPU Graph Processing with Remote Work Stealing, ICDE 2023. Ke Meng, Liang Geng, Xue Li, Qian Tao, Wenyuan Yu, Jingren Zhou.
  8. FLASH: A Framework for Programming Distributed Graph Processing Algorithms, ICDE 2023. Xue Li, Ke Meng, Lu Qin, Longbin Lai, Wenyuan Yu, Zhengping Qian, Xuemin Lin, Jingren Zhou.
  9. GNNLab: A Factored System for Sample-based GNN Training over GPUs, EuroSys 2022. Jianbang Yang, Dahai Tang, Xiaoniu Song, Lei Wang, Qiang Yin, Rong Chen, Wenyuan Yu, and Jingren Zhou.
  10. GraphScope: A Unified Engine For Big Graph Processing, VLDB 2021. Wenfei Fan, Tao He, Longbin Lai, Xue Li, Yong Li, Zhao Li, Zhengping Qian, Chao Tian, Lei Wang, Jingbo Xu, Youyang Yao, Qiang Yin, Wenyuan Yu, Jingren Zhou, Diwen Zhu, and Rong Zhu.
  11. GraphScope: A One-Stop Large Graph Processing System. VLDB, demo, 2021. Jingbo Xu, Zhanning Bai, Wenfei Fan, Longbin Lai, Xue Li, Zhao Li, Zhengping Qian, Lei Wang, Yanyan Wang, Wenyuan Yu, and Jingren Zhou.
  12. Automating Incremental Graph Processing with Flexible Memoization. VLDB 2021. Shufeng Gong, Chao Tian, Qiang Yin, Wenyuan Yu, Yanfeng Zhang, Liang Geng, Song Yu, Ge Yu, and Jingren Zhou.
  13. GAIA: A System for Interactive Analysis on Distributed Graphs Using a High-Level Language, NSDI 2021. Zhengping Qian, Chenqiang Min, Longbin Lai, Yong Fang, Gaofeng Li, Youyang Yao, Bingqing Lyu, Xiaoli Zhou, Zhimin Chen, and Jingren Zhou.
  14. FlexGraph: A Flexible and Efficient Distributed Framework for GNN Training. EuroSys 2021. Lei Wang, Qiang Yin, Chao Tian, Jianbang Yang, Rong Chen, Wenyuan Yu, Zihang Yao, and Jingren Zhou.
  15. Maximum Biclique Search at Billion Scale, VLDB 2020. Best Paper Runner-up Bingqing Lyu, Lu Qin, Xuemin Lin, Ying Zhang, Zhengping Qian, and Jingren Zhou.
  16. Adaptive Asynchronous Parallelization of Graph Algorithms. TODS. 45(2): 6:1-6:45, 2020. Wenfei Fan, Ping Lu, Wenyuan Yu, Jingbo Xu, Qiang Yin, Xiaojian Luo, Jingren Zhou, and Ruochun Jin.
  17. Parallelizing Sequential Graph Computations TODS 43(4): 18:1-18:39, 2018. Wenfei Fan, Wenyuan Yu, Jingbo Xu, Jingren Zhou, Xiaojian Luo, Qiang Yin, Ping Lu, Yang Cao and Ruiqi Xu
  18. Real-time Constrained Cycle Detection in Large Dynamic Graphs. VLDB, 2018. Xiafei Qiu, Wubin Cen, Zhengping Qian, You Peng, Ying Zhang, Xuemin Lin, and Jingren Zhou.
  19. GRAPE: Parallelizing Sequential Graph Computations. VLDB, demo, 2017. Wenfei Fan, Jingbo Xu , Yinghui Wu, Wenyuan Yu, Jiaxin Jiang

版权

  1. Some images are designed by Freepik (https://freepik.com).
  2. Some fonts and icons from FontAwesome (https://fontawesome.com).
  3. Other copyrights are reserved in GraphScope Team © 2020-2025.