Tutorial: Running Giraph Applications on GraphScope¶
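Apache Giraph is one of the best-known graph computing frameworks and is built on top of Apache Hadoop. Through its Pregel interface, users can write vertex-centric graph algorithms.
GraphScope aims to provide a one-stop graph processing framework, which includes integration with popular open-source graph computing frameworks. In practice, Giraph algorithms can run on GraphScope without any modification.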
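To make the vertex-centric (Pregel) model mentioned above more concrete, here is a minimal single-process sketch of SSSP written as a superstep loop in plain Python. It is only an illustration of the programming model: the function name `pregel_sssp` and the toy edge list are made up for this example, it does not use Giraph or GraphScope, and it assumes non-negative edge weights.

```python
import math
from collections import defaultdict


def pregel_sssp(edges, source):
    """Toy Pregel-style SSSP over (src, dst, weight) triples (illustrative only)."""
    out_edges = defaultdict(list)
    vertices = {source}
    for src, dst, weight in edges:
        out_edges[src].append((dst, weight))
        vertices.update((src, dst))

    dist = {v: math.inf for v in vertices}
    # Superstep 0: only the source vertex receives a message (distance 0).
    inbox = {source: [0.0]}

    while inbox:  # the computation halts when no messages are in flight
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            # Per-vertex compute step: keep the smallest distance seen so far.
            candidate = min(messages)
            if candidate < dist[vertex]:
                dist[vertex] = candidate
                # The value improved, so notify the out-neighbours.
                for neighbour, weight in out_edges[vertex]:
                    outbox[neighbour].append(candidate + weight)
            # Otherwise the vertex sends nothing, i.e. it votes to halt.
        inbox = outbox  # messages are delivered in the next superstep
    return dist


if __name__ == "__main__":
    toy_edges = [(1, 2, 1.0), (2, 3, 2.0), (1, 3, 5.0), (3, 4, 1.0)]
    print(pregel_sssp(toy_edges, source=1))
```

A real Giraph computation expresses the same per-vertex step in Java, while the framework takes care of message delivery and superstep synchronization.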
Try Some Example Giraph Applications¶
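We provide some example Giraph algorithms, such as SSSP and PageRank, in grape-demo.jar. You can try running these Giraph algorithms on GraphScope.
Because Giraph lets users load graph data with custom loaders, the session.load_from method supports the Giraph VertexInputFormat and Giraph EdgeInputFormat formats.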
vformat = "giraph:com.alibaba.graphscope.example.giraph.format.P2PVertexInputFormat"
eformat = "giraph:com.alibaba.graphscope.example.giraph.format.P2PEdgeInputFormat"
# Clone https://github.com/GraphScope/gstest to GS_TEST_DIR
# graphscope_session is an already-created GraphScope session
graph = graphscope_session.load_from(
    vertices="/path/to/vertex-input",
    vformat=vformat,
    edges="/path/to/edge-input",
    eformat=eformat,
)
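The format strings above are fully qualified Java class names with a `giraph:` prefix. Since forgetting the prefix is an easy mistake, one option is a small helper that adds it when missing. The helper name `giraph_class` is hypothetical and not part of the GraphScope API; this is just a convenience sketch.

```python
GIRAPH_PREFIX = "giraph:"


def giraph_class(class_name: str) -> str:
    """Return class_name with the 'giraph:' prefix, adding it only if missing.

    Hypothetical convenience helper, not part of GraphScope.
    """
    name = class_name.strip()
    if not name or name == GIRAPH_PREFIX:
        raise ValueError("class name must not be empty")
    return name if name.startswith(GIRAPH_PREFIX) else GIRAPH_PREFIX + name


vformat = giraph_class("com.alibaba.graphscope.example.giraph.format.P2PVertexInputFormat")
eformat = giraph_class("com.alibaba.graphscope.example.giraph.format.P2PEdgeInputFormat")
```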
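The vertices and edges arguments should point to the vertex input file and the edge input file, respectively. Some sample datasets are also available in the gstest repository at GraphScope/gstest.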
This tutorial only needs the p2p dataset, which you can download as follows:
wget https://raw.githubusercontent.com/GraphScope/gstest/master/p2p-31.e -O /home/graphscope/p2p-31.e
wget https://raw.githubusercontent.com/GraphScope/gstest/master/p2p-31.v -O /home/graphscope/p2p-31.v
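If wget is not available, the same two files can be fetched from Python with the standard library. The sketch below is one possible way to do it; the function names `gstest_url` and `fetch_p2p` are made up for this example, and the default destination directory simply mirrors the path used in this tutorial.

```python
import os
import urllib.request

GSTEST_RAW = "https://raw.githubusercontent.com/GraphScope/gstest/master"


def gstest_url(file_name: str) -> str:
    """Build the raw download URL for a file in the gstest repository."""
    return f"{GSTEST_RAW}/{file_name}"


def fetch_p2p(dest_dir: str = "/home/graphscope", names=("p2p-31.v", "p2p-31.e")):
    """Download the p2p files into dest_dir, skipping files that already exist."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for name in names:
        target = os.path.join(dest_dir, name)
        if not os.path.exists(target):
            urllib.request.urlretrieve(gstest_url(name), target)
        paths.append(target)
    return paths
```

Calling `fetch_p2p()` returns the local paths of the vertex and edge files, which can then be passed to load_from.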
You can then load the graph through the GraphScope Python client and query it with a Giraph application.
import os

import graphscope
from graphscope.framework.app import load_app

# Launch a local session (or launch the session in a k8s cluster instead).
sess = graphscope.session(cluster_type='hosts')
# Register the jar that contains the example Giraph applications.
sess.add_lib("/home/graphscope/grape-demo-0.19.0-shaded.jar")

# Remember to put giraph: before the class name.
vformat = "giraph:com.alibaba.graphscope.example.giraph.format.P2PVertexInputFormat"
eformat = "giraph:com.alibaba.graphscope.example.giraph.format.P2PEdgeInputFormat"

# Replace the p2p-31.v and p2p-31.e paths with your own paths.
graph = sess.load_from(
    vertices=os.path.expandvars("/home/graphscope/p2p-31.v"),
    vformat=vformat,
    edges=os.path.expandvars("/home/graphscope/p2p-31.e"),
    eformat=eformat,
)
# Project to a simple graph with one vertex property and one edge property.
graph = graph._project_to_simple(v_prop="vdata", e_prop="data")

# Load the example SSSP application and run it from source vertex 6.
giraph_sssp = load_app(algo="giraph:com.alibaba.graphscope.example.giraph.SSSP")
ctx = giraph_sssp(graph, sourceId=6)
ctx.to_numpy('r')
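The array returned by `ctx.to_numpy('r')` can be inspected like any other sequence of numbers. The sketch below assumes the result is a one-dimensional sequence of per-vertex distances and that unreachable vertices are encoded as infinity or as a very large sentinel value. Both points are assumptions for illustration, and the function name `summarize_distances` and the `unreachable_at` threshold are hypothetical.

```python
import math


def summarize_distances(distances, unreachable_at=1e300):
    """Summarise SSSP output; values >= unreachable_at count as unreachable.

    The threshold is an assumed convention for this sketch, not a GraphScope
    constant; adjust it to match how your application marks unreachable vertices.
    """
    values = [float(d) for d in distances]
    reachable = [d for d in values if math.isfinite(d) and d < unreachable_at]
    return {
        "vertices": len(values),
        "reachable": len(reachable),
        "max_distance": max(reachable) if reachable else None,
    }
```

For example, `summarize_distances(ctx.to_numpy('r'))` would report how many vertices the source can reach and the largest finite distance among them.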
Run Your Own Giraph Application¶
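After running the example Giraph SSSP algorithm successfully, you may want to try your own Giraph algorithms on GraphScope, where they run much faster than on Giraph itself.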
Develop a Giraph Algorithm¶
You can implement your algorithm against Giraph's native API. For example, you can use the example applications officially provided by Giraph.
git clone https://github.com/apache/giraph.git
cd giraph/
mvn package -pl :giraph-examples
You can then find giraph-examples-1.4.0-SNAPSHOT-for-hadoop-1.2.1-jar-with-dependencies.jar in the giraph-examples/target directory.
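Although almost all of the APIs are supported, Giraph-on-GraphScope does have some limitations:
Graph mutation APIs are not currently supported.
Using complex Writable types degrades performance.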
Submit to GraphScope¶
The steps are almost the same as above; you only need to replace the submitted jar and choose the correct InputFormat classes.
import os

import graphscope
from graphscope.framework.app import load_app

# Launch a local session (or launch the session in a k8s cluster instead).
sess = graphscope.session(cluster_type='hosts')
# Path to the local jar file (replace with your own jar); it will be distributed over the cluster.
sess.add_lib("path/to/grape-demo.jar")

vformat = "giraph:${vertex-input-format-class-full-name}"
eformat = "giraph:${edge-input-format-class-full-name}"

# Clone https://github.com/GraphScope/gstest to GS_TEST_DIR
graph = sess.load_from(
    vertices=os.path.expandvars("${path-to-vertex-file}"),  # path to local vertex file, will be distributed over cluster
    vformat=vformat,
    edges=os.path.expandvars("${path-to-edge-file}"),  # path to local edge file, will be distributed over cluster
    eformat=eformat,
)
graph = graph._project_to_simple(v_prop="vdata", e_prop="data")

giraph_sssp = load_app(algo="giraph:${giraph-computation-class-full-name}")
# Replace the placeholder with your application's parameters (the SSSP example above passes sourceId=6).
ctx = giraph_sssp(graph, "${a=1,b=2...}")
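Because the submission steps are the same for every jar, they can be collected into one function. The sketch below is a hypothetical wrapper: `run_giraph_app` is not a GraphScope API. It only chains the calls shown in this tutorial (`add_lib`, `load_from`, `_project_to_simple`, `load_app`), and it assumes the input formats expose the same `vdata` and `data` properties as the examples above. The session and the `load_app` function are passed in as arguments, so the function itself has no import-time dependency on GraphScope.

```python
def run_giraph_app(sess, load_app, jar_path, vertex_file, edge_file,
                   vformat_class, eformat_class, app_class, **app_params):
    """Run the tutorial's submission steps for an arbitrary Giraph jar.

    Hypothetical helper: `sess` is a GraphScope session and `load_app` is
    graphscope.framework.app.load_app; class names are given WITHOUT the
    'giraph:' prefix, which is added here.
    """
    sess.add_lib(jar_path)
    graph = sess.load_from(
        vertices=vertex_file,
        vformat=f"giraph:{vformat_class}",
        edges=edge_file,
        eformat=f"giraph:{eformat_class}",
    )
    # Property names follow the examples in this tutorial (an assumption
    # for other input formats).
    simple_graph = graph._project_to_simple(v_prop="vdata", e_prop="data")
    app = load_app(algo=f"giraph:{app_class}")
    # Application parameters are forwarded as keyword arguments.
    return app(simple_graph, **app_params)
```

With the example jar, a call would look like `run_giraph_app(sess, load_app, "/home/graphscope/grape-demo-0.19.0-shaded.jar", "/home/graphscope/p2p-31.v", "/home/graphscope/p2p-31.e", "com.alibaba.graphscope.example.giraph.format.P2PVertexInputFormat", "com.alibaba.graphscope.example.giraph.format.P2PEdgeInputFormat", "com.alibaba.graphscope.example.giraph.SSSP", sourceId=6)`.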