Quick Start¶
This tutorial gives a quick overview of GraphScope's features. To begin, we will install GraphScope on your local machine with Python. Although most examples in this guide are based on a local Python environment, they also work on a Kubernetes cluster.
You can easily install GraphScope with pip:
python3 -m pip install graphscope -U
Note
We recommend installing GraphScope with Python 3.9 in a clean Python virtual environment, created with either miniconda or venv.
Taking venv as an example, here is a step-by-step guide to creating a virtual environment, activating it, and installing GraphScope:
# Create a new virtual environment
python3.9 -m venv tutorial-env
# Activate the virtual environment
source tutorial-env/bin/activate
# Install GraphScope
python3.9 -m pip install graphscope
# Use GraphScope
python3.9
>>> import graphscope as gs
>>> ......
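If you prefer miniconda, an equivalent setup might look like the sketch below (assuming conda is already installed; the environment name tutorial-env is arbitrary):
# Create a new conda environment with Python 3.9
conda create -n tutorial-env python=3.9
# Activate the conda environment
conda activate tutorial-env
# Install GraphScope
python3 -m pip install graphscope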
One-stop Graph Processing¶
We will walk through an example, step by step, to show how GraphScope processes a variety of graph computation tasks in a one-stop manner.
The example targets a node classification task on a citation network.
ogbn-mag is a heterogeneous graph composed of a subset of the Microsoft Academic Graph. It contains 4 types of entities (i.e., papers, authors, institutions, and fields of study), as well as four types of directed relations connecting two entities.
Given the heterogeneous ogbn-mag data, the task is to predict the class of each paper. Node classification can identify the venues of papers, which represent groups of scientific work on different topics. We use both attribute and structural information to classify papers. In the graph, each paper node carries a 128-dimensional word2vec vector representing its content, obtained by averaging the embeddings of the words in its title and abstract. The embeddings of individual words are pre-trained, while the structural information is computed on the fly.
GraphScope models graph data as property graphs, in which edges/vertices are labeled and carry properties. Taking ogbn-mag as an example, the figure below shows the model of the property graph.
Sample of the property graph¶
The graph has four kinds of vertices, labeled paper, author, institution, and field_of_study. The vertices are connected by four kinds of edges, each carrying a label and specifying the labels of the vertices at its two ends. For example, a cites edge connects two vertices labeled paper; another example is the writes edge, which requires its source vertex to be labeled author and its destination vertex to be a paper vertex. All the vertices and edges may have properties; e.g., paper vertices have properties such as features, publish year, and subject label.
Import GraphScope and load a graph
To load this graph into GraphScope with our retrieval module, use the following code.
import graphscope
from graphscope.dataset import load_ogbn_mag
g = load_ogbn_mag()
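After loading, you can check that the schema of the property graph matches the model described above. A minimal sketch, assuming the graph object exposes a printable schema attribute as in recent GraphScope releases:
# inspect the vertex/edge labels and their properties of the loaded graph (schema attribute assumed)
print(g.schema)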
Interactive queries allow users to explore, examine, and present graph data in a flexible and in-depth manner, so that specific information can be located quickly. GraphScope makes interactive querying easy by supporting the popular query languages Gremlin and Cypher, and ensures that such queries are executed efficiently at scale.
Run interactive queries with Gremlin and Cypher
In this example, we use graph traversals to count the number of papers co-authored by two given authors. To simplify the query, we assume the two authors can be uniquely identified by the IDs 2 and 4307, respectively.
# get the endpoint for submitting interactive queries on graph g.
interactive = graphscope.interactive(g, with_cypher=True)
# Gremlin query for counting the number of papers two authors (with id 2 and 4307) have co-authored
papers = interactive.execute("g.V().has('author', 'id', 2).out('writes').where(__.in('writes').has('id', 4307)).count()").one()
# Cypher query for counting the number of papers two authors (with id 2 and 4307) have co-authored
# Note that for Cypher query, the parameter of lang="cypher" is mandatory
papers = interactive.execute( \
"MATCH (n1:author)-[:writes]->(p:paper)<-[:writes]-(n2:author) \
WHERE n1.id = 2 AND n2.id = 4307 \
RETURN count(DISTINCT p)", \
lang="cypher")
Graph analytics is widely used in the real world. Many algorithms, such as community detection, paths and connectivity, and centrality, have proven to be very useful in various business scenarios. GraphScope ships with a set of built-in algorithms, enabling users to easily analyze their graph data.
Run analytical algorithms on the graph
Continuing our example, we first derive a subgraph by extracting publications within a specific time range from the entire graph (with Gremlin!). We then run k-core decomposition and triangle counting to generate the structural features of each paper vertex.
Please note that many algorithms may only work on homogeneous graphs. Hence, to evaluate these algorithms over a property graph, we need to project it into a simple graph first.
# extract a subgraph of publication within a time range
sub_graph = interactive.subgraph("g.V().has('year', gte(2014).and(lte(2020))).outE('cites')")
# project the subgraph to a simple graph.
simple_g = sub_graph.project(vertices={"paper": []}, edges={"cites": []})
ret1 = graphscope.k_core(simple_g, k=5)
ret2 = graphscope.triangles(simple_g)
# add the results as new columns to the citation graph
sub_graph = sub_graph.add_column(ret1, {"kcore": "r"})
sub_graph = sub_graph.add_column(ret2, {"tc": "r"})
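Before or after attaching the results as new columns, you can inspect them directly. A minimal sketch using the same selector style as the analytical example later in this tutorial:
# peek at the k-core value computed for each paper vertex
print(ret1.to_dataframe(selector={'id': 'v.id', 'kcore': 'r'}).head())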
Graph neural networks (GNNs) combine the advantages of graph analytics and machine learning. GNN algorithms can compress both the structural and attribute information of a graph into low-dimensional embedding vectors for each node. These embeddings can then be fed into downstream machine learning tasks.
Prepare data and engine for learning
In our example, we train a supervised GraphSAGE model to classify the nodes (papers) into 349 categories, each of which represents a venue (e.g., a preprint server or a conference). To do so, we first launch a learning engine and build a graph with features, following the previous step.
# define the features for learning
paper_features = [f"feat_{i}" for i in range(128)]
paper_features.append("kcore")
paper_features.append("tc")
# launch a learning engine.
lg = graphscope.graphlearn(sub_graph, nodes=[("paper", paper_features)],
                           edges=[("paper", "cites", "paper")],
                           # gen_labels splits the "paper" vertices into 100 buckets:
                           # 75% for training, 10% for validation, and 15% for test.
                           gen_labels=[
                               ("train", "paper", 100, (0, 75)),
                               ("val", "paper", 100, (75, 85)),
                               ("test", "paper", 100, (85, 100))
                           ])
Then we define the training process and run it.
Define the training process and run it
try:
    # https://www.tensorflow.org/guide/migrate
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()
except ImportError:
    import tensorflow as tf
import graphscope.learning
from graphscope.learning.examples import EgoGraphSAGE
from graphscope.learning.examples import EgoSAGESupervisedDataLoader
from graphscope.learning.examples.tf.trainer import LocalTrainer
# supervised GraphSAGE.
def train_sage(graph, node_type, edge_type, class_num, features_num,
               hops_num=2, nbrs_num=[25, 10], epochs=2,
               hidden_dim=256, in_drop_rate=0.5, learning_rate=0.01,
               ):
    graphscope.learning.reset_default_tf_graph()
    dimensions = [features_num] + [hidden_dim] * (hops_num - 1) + [class_num]
    model = EgoGraphSAGE(dimensions, act_func=tf.nn.relu, dropout=in_drop_rate)
    # prepare train dataset
    train_data = EgoSAGESupervisedDataLoader(
        graph, graphscope.learning.Mask.TRAIN,
        node_type=node_type, edge_type=edge_type, nbrs_num=nbrs_num, hops_num=hops_num,
    )
    train_embedding = model.forward(train_data.src_ego)
    train_labels = train_data.src_ego.src.labels
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=train_labels, logits=train_embedding,
        )
    )
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    # prepare test dataset
    test_data = EgoSAGESupervisedDataLoader(
        graph, graphscope.learning.Mask.TEST,
        node_type=node_type, edge_type=edge_type, nbrs_num=nbrs_num, hops_num=hops_num,
    )
    test_embedding = model.forward(test_data.src_ego)
    test_labels = test_data.src_ego.src.labels
    test_indices = tf.math.argmax(test_embedding, 1, output_type=tf.int32)
    test_acc = tf.div(
        tf.reduce_sum(tf.cast(tf.math.equal(test_indices, test_labels), tf.float32)),
        tf.cast(tf.shape(test_labels)[0], tf.float32),
    )
    # train and test
    trainer = LocalTrainer()
    trainer.train(train_data.iterator, loss, optimizer, epochs=epochs)
    trainer.test(test_data.iterator, test_acc)

train_sage(lg, node_type="paper", edge_type="cites",
           class_num=349,     # output dimension
           features_num=130,  # input dimension, 128 + kcore + triangle count
           )
Quick Start for Graph Analytical Tasks¶
The installed graphscope package includes everything you need to analyze a graph on your local machine. If you have a graph analytical task that requires running iterative algorithms, it works well with graphscope.
Example: Running iterative algorithm (SSSP) in GraphScope
import graphscope as gs
from graphscope.dataset.modern_graph import load_modern_graph
gs.set_option(show_log=True)
# load the modern graph as example.
#(modern graph is an example property graph given by Apache at https://tinkerpop.apache.org/docs/current/tutorials/getting-started/)
graph = load_modern_graph()
# triggers label propagation algorithm(LPA)
# on the modern graph(property graph) and print the result.
ret = gs.lpa(graph)
print(ret.to_dataframe(selector={'id': 'v.id', 'label': 'r'}))
# project a modern graph (property graph) to a homogeneous graph
# and run single source shortest path(SSSP) algorithm on it, with assigned source=1.
pg = graph.project(vertices={'person': None}, edges={'knows': ['weight']})
ret = gs.sssp(pg, src=1)
print(ret.to_dataframe(selector={'id': 'v.id', 'distance': 'r'}))
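The same projected graph can be fed to other built-in algorithms mentioned above, such as centrality measures. A hedged sketch, assuming the built-in PageRank app is applicable here with its default parameters (the output column name 'rank' is only illustrative):
# run PageRank on the projected homogeneous graph and print the result.
ret = gs.pagerank(pg)
print(ret.to_dataframe(selector={'id': 'v.id', 'rank': 'r'}))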
Quick Start for Interactive Queries in GraphScope¶
With the graphscope package installed, you can easily interact with graph data on your local machine.
You only need to create an interactive instance, which serves as the channel for submitting Gremlin or Cypher queries.
Example: Run Interactive Queries in GraphScope
import graphscope as gs
from graphscope.dataset.modern_graph import load_modern_graph
gs.set_option(show_log=True)
# load the modern graph as example.
#(modern graph is an example property graph given by Apache at https://tinkerpop.apache.org/docs/current/tutorials/getting-started/)
graph = load_modern_graph()
# Hereafter, you can use the `graph` object to create an `interactive` query session, which will start one Gremlin service and one Cypher service simultaneously on the backend.
g = gs.interactive(graph, with_cypher=True)
# then `execute` any supported gremlin query.
q1 = g.execute('g.V().count()')
print(q1.all().result()) # should print [6]
q2 = g.execute('g.V().hasLabel(\'person\')')
print(q2.all().result()) # should print [[v[2], v[3], v[0], v[1]]]
# or `execute` any supported Cypher query
q3 = g.execute("MATCH (n:person) RETURN count(n)", lang="cypher")
print(q3.records[0][0]) # should print 6
Quick Start for Graph Learning¶
Training a GNN model with GraphScope is simple and intuitive. You can use the graphscope package to train a GNN model on your local machine. Please note that tensorflow is required to run the following example.
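If TensorFlow is not yet installed in your environment, it can typically be installed with pip; the exact version to pin may depend on your GraphScope release, so treat this as a sketch:
python3 -m pip install tensorflow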
Example: Training GraphSAGE Model in GraphScope
import graphscope
from graphscope.dataset import load_ogbn_mag
g = load_ogbn_mag()
# define the features for learning
paper_features = [f"feat_{i}" for i in range(128)]
# launch a learning engine.
lg = graphscope.graphlearn(g, nodes=[("paper", paper_features)],
                           edges=[("paper", "cites", "paper")],
                           gen_labels=[
                               ("train", "paper", 100, (0, 75)),
                               ("val", "paper", 100, (75, 85)),
                               ("test", "paper", 100, (85, 100))
                           ])
try:
    # https://www.tensorflow.org/guide/migrate
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()
except ImportError:
    import tensorflow as tf
import graphscope.learning
from graphscope.learning.examples import EgoGraphSAGE
from graphscope.learning.examples import EgoSAGESupervisedDataLoader
from graphscope.learning.examples.tf.trainer import LocalTrainer
# supervised GraphSAGE
def train_sage(graph, node_type, edge_type, class_num, features_num,
               hops_num=2, nbrs_num=[25, 10], epochs=2,
               hidden_dim=256, in_drop_rate=0.5, learning_rate=0.01,
               ):
    graphscope.learning.reset_default_tf_graph()
    dimensions = [features_num] + [hidden_dim] * (hops_num - 1) + [class_num]
    model = EgoGraphSAGE(dimensions, act_func=tf.nn.relu, dropout=in_drop_rate)
    # prepare train dataset
    train_data = EgoSAGESupervisedDataLoader(
        graph, graphscope.learning.Mask.TRAIN,
        node_type=node_type, edge_type=edge_type, nbrs_num=nbrs_num, hops_num=hops_num,
    )
    train_embedding = model.forward(train_data.src_ego)
    train_labels = train_data.src_ego.src.labels
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=train_labels, logits=train_embedding,
        )
    )
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    # prepare test dataset
    test_data = EgoSAGESupervisedDataLoader(
        graph, graphscope.learning.Mask.TEST,
        node_type=node_type, edge_type=edge_type, nbrs_num=nbrs_num, hops_num=hops_num,
    )
    test_embedding = model.forward(test_data.src_ego)
    test_labels = test_data.src_ego.src.labels
    test_indices = tf.math.argmax(test_embedding, 1, output_type=tf.int32)
    test_acc = tf.div(
        tf.reduce_sum(tf.cast(tf.math.equal(test_indices, test_labels), tf.float32)),
        tf.cast(tf.shape(test_labels)[0], tf.float32),
    )
    # train and test
    trainer = LocalTrainer()
    trainer.train(train_data.iterator, loss, optimizer, epochs=epochs)
    trainer.test(test_data.iterator, test_acc)

train_sage(lg, node_type="paper", edge_type="cites",
           class_num=349,     # output dimension
           features_num=128,  # input dimension
           )