GFQL: Hop chains - Cypher-style graph pattern matching on PyGraphistry dataframes#
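PyGraphistry supports a rich subset of the popular Cypher graph query language, which you can run on dataframes without installing a database or native libraries. It integrates natively with dataframes, so its syntax is Python-native rather than the traditional string-based syntax.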
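PyGraphistry graph pattern matching has key similarities with Cypher:
Multi-hop search
Predicates on node and edge attributes
The ability to identify matched nodes and edges
It differs in several key ways:
Pure PyData (Python/C++/Fortran): no need to install a database, Java, etc.; pip install graphistry is enough
It is set-oriented rather than path-oriented: all operations are guaranteed to translate efficiently into vectorized dataframe operations, rather than the row-at-a-time path operations typical of traditional graph query engines, which are asymptotically slower
Power users can plug in custom predicates as native Python dataframe code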
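To make the set-oriented point concrete, here is a minimal pandas sketch, not PyGraphistry's actual implementation: a one-hop expansion from a whole set of seed nodes is expressed as a single vectorized merge instead of a per-path loop. The toy edge list, the column names `src` and `dst`, and the helper `hop_once` are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy edge list; column names are illustrative only
edges = pd.DataFrame({
    'src': ['a', 'a', 'b', 'c', 'd'],
    'dst': ['b', 'c', 'd', 'd', 'e'],
})

def hop_once(edges: pd.DataFrame, frontier: pd.Series) -> pd.DataFrame:
    """Return every edge whose source is in the frontier, via one merge."""
    seeds = frontier.drop_duplicates().to_frame('src')
    return edges.merge(seeds, on='src', how='inner')

# Expand the whole frontier at once, twice, rather than walking path by path
hop1 = hop_once(edges, pd.Series(['a']))
hop2 = hop_once(edges, hop1['dst'])

print(sorted(hop1['dst'].unique()))
print(sorted(hop2['dst'].unique()))
```

Each hop costs one join over the edge table regardless of how many individual paths pass through the frontier, which is the intuition behind preferring sets of nodes and edges over enumerated paths.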
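Tutorial
Install & configure
Load and enrich a US congress Twitter interaction dataset
Simple graph filtering: g.hop() and g.chain([...])
Multi-hop and paths-between-nodes pattern mining
Advanced filter predicates
Result labeling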
1. Install & configure#
[ ]:
# ! pip install graphistry[igraph]
Imports#
[141]:
import pandas as pd
import graphistry
from graphistry import (
    # graph operators
    n, e_undirected, e_forward, e_reverse,
    # attribute predicates
    is_in, ge, startswith, contains, match as match_re
)
[ ]:
graphistry.register(api=3, username='...', password='...')
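2. Load and enrich a US congress Twitter interaction dataset#
Data#
Download
Convert the JSON to a Pandas edges dataframe
Turn the edges dataframe into a PyGraphistry graph
Enrich nodes and edges with some useful graph metrics
Visualize the full graph to test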
[9]:
# ! wget -q https://snap.stanford.edu/data/congress_network.zip
# ! unzip congress_network.zip
total 1.2M
drwxr-xr-x 1 root root 4.0K Dec 4 03:56 .
drwxr-xr-x 1 root root 4.0K Dec 4 03:33 ..
-rw-r--r-- 1 root root 150K May 9 2017 Attribute
-rw-r--r-- 1 root root 14K May 9 2017 Class_info
drwxr-xr-x 4 root root 4.0K Nov 30 14:24 .config
-rw-r--r-- 1 root root 190K Aug 5 05:26 congress_network.zip
-rw-r--r-- 1 root root 320K May 9 2017 edgelist
drwxr-xr-x 1 root root 4.0K Nov 30 14:27 sample_data
-rw-r--r-- 1 root root 16 May 9 2017 Statistics
-rw-r--r-- 1 root root 221K Dec 4 03:53 twitter.zip
-rw-r--r-- 1 root root 299K May 9 2017 vertex2aid
[40]:
import json
with open('congress_network/congress_network_data.json', 'r') as file:
    data = json.load(file)
# Flatten the per-user adjacency lists into one row per directed edge
edges = []
for i, name in enumerate(data[0]['usernameList']):
    # outList[i] holds target indices; outWeight[i] holds the matching weights
    for ii, j in enumerate(data[0]['outList'][i]):
        edges.append({
            'from': name,
            'to': data[0]['usernameList'][j],
            'weight': data[0]['outWeight'][i][ii]
        })
edges_df = pd.DataFrame(edges)
print(edges_df.shape)
edges_df.sample(5)
(13289, 3)
[40]:
| | from | to | weight |
---|---|---|---|
11112 | RepBobbyRush | janschakowsky | 0.034364 |
3836 | RepCori | Ilhan | 0.015936 |
5282 | RepTedDeutch | RepDWStweets | 0.003268 |
12352 | BennieGThompson | RepStricklandWA | 0.006849 |
9358 | RepCarolMiller | RepTroyNehls | 0.005291 |
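The conversion above relies on the JSON storing parallel lists: `usernameList[i]` names node `i`, `outList[i]` holds the indices of the nodes it points to, and `outWeight[i]` holds the matching weights. A small self-contained sketch on made-up data of the same shape may help; the three usernames and the weights below are hypothetical, not taken from the dataset.

```python
# Hypothetical miniature of the dataset's layout (all values are invented)
data = [{
    'usernameList': ['alice', 'bob', 'carol'],
    'outList': [[1, 2], [2], []],           # indices into usernameList
    'outWeight': [[0.5, 0.25], [0.1], []],  # weights aligned with outList
}]

record = data[0]
names = record['usernameList']

# zip() pairs each target index with its weight, avoiding manual indexing
edges = [
    {'from': name, 'to': names[j], 'weight': w}
    for name, targets, weights in zip(names, record['outList'], record['outWeight'])
    for j, w in zip(targets, weights)
]

for edge in edges:
    print(edge)
```

The resulting list of dicts can be passed to `pd.DataFrame(...)` exactly as in the cell above.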
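Load the dataframe as a PyGraphistry graph#
Turn it into a graph and precompute some useful graph metrics.
Recall that a g object is essentially just two dataframes under the hood, g._edges and g._nodes, plus many useful graph methods: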
[77]:
# Shape
g = graphistry.edges(edges_df, 'from', 'to')
# Enrich & style
# Tip: Switch from compute_igraph to compute_cugraph when GPUs are available
g2 = (g
    .materialize_nodes()
    .nodes(lambda g: g._nodes.assign(title=g._nodes.id))
    .edges(lambda g: g._edges.assign(weight2=g._edges.weight))
    .bind(point_title='title')
    .compute_igraph('community_infomap')
    .compute_igraph('pagerank')
    .get_degrees()
    .encode_point_color(
        'community_infomap',
        as_categorical=True,
        categorical_mapping={
            0: '#32a9a2',  # vibrant teal
            1: '#ff6b6b',  # soft coral
            2: '#f9d342',  # muted yellow
        }
    )
)
g2._nodes
WARNING:root:edge index g._edge not set so using edge index as ID; set g._edge via g.edges(), or change merge_if_existing to False
WARNING:root:edge index g._edge __edge_index__ missing as attribute in ig; using ig edge order for IDs
WARNING:root:edge index g._edge not set so using edge index as ID; set g._edge via g.edges(), or change merge_if_existing to False
WARNING:root:edge index g._edge __edge_index__ missing as attribute in ig; using ig edge order for IDs
[77]:
| | id | title | community_infomap | pagerank | degree_in | degree_out | degree |
---|---|---|---|---|---|---|---|
0 | SenatorBaldwin | SenatorBaldwin | 0 | 0.001422 | 26 | 20 | 46 |
1 | SenJohnBarrasso | SenJohnBarrasso | 0 | 0.001179 | 22 | 19 | 41 |
2 | SenatorBennet | SenatorBennet | 0 | 0.001995 | 33 | 22 | 55 |
3 | MarshaBlackburn | MarshaBlackburn | 0 | 0.001331 | 18 | 38 | 56 |
4 | SenBlumenthal | SenBlumenthal | 0 | 0.001672 | 30 | 35 | 65 |
... | ... | ... | ... | ... | ... | ... | ... |
470 | RepJoeWilson | RepJoeWilson | 1 | 0.001780 | 21 | 38 | 59 |
471 | RobWittman | RobWittman | 1 | 0.001017 | 13 | 19 | 32 |
472 | rep_stevewomack | rep_stevewomack | 1 | 0.002637 | 35 | 19 | 54 |
473 | RepJohnYarmuth | RepJohnYarmuth | 2 | 0.000555 | 5 | 20 | 25 |
474 | RepLeeZeldin | RepLeeZeldin | 1 | 0.000511 | 3 | 25 | 28 |
475 rows × 7 columns
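Because a graph here is just an edges dataframe plus a nodes dataframe, the node table and the degree columns shown above can be approximated directly in pandas. The sketch below is an assumption about what such enrichment amounts to, not the code behind `materialize_nodes()` or `get_degrees()`, and the toy edges are invented.

```python
import pandas as pd

# Invented toy edges using the same 'from'/'to' column names as the tutorial
edges_df = pd.DataFrame({
    'from': ['a', 'a', 'b', 'c'],
    'to':   ['b', 'c', 'c', 'a'],
})

# Node table: every distinct endpoint appearing in either column
ids = pd.concat([edges_df['from'], edges_df['to']]).drop_duplicates()
nodes_df = pd.DataFrame({'id': ids.to_numpy()})

# Degrees: count occurrences as destination (in) and as source (out)
degree_in = edges_df['to'].value_counts()
degree_out = edges_df['from'].value_counts()
nodes_df['degree_in'] = nodes_df['id'].map(degree_in).fillna(0).astype(int)
nodes_df['degree_out'] = nodes_df['id'].map(degree_out).fillna(0).astype(int)
nodes_df['degree'] = nodes_df['degree_in'] + nodes_df['degree_out']

print(nodes_df)
```

As in the table above, `degree` is the sum of `degree_in` and `degree_out`.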
[79]:
g2.plot()
[79]:
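3. Simple filtering: g.hop() & g.chain([...])#
We can filter by nodes, edges, and combinations of them.
The result is a graph, so we can inspect its node and edge tables or run further graph operations on it, such as visualization or additional searches.
Key concepts
There are 2 key methods:
* g.hop(...): filter triples of source node, edge, and destination node
* g.chain([....]): arbitrarily long sequences of node and edge predicates
They reuse the core column operations of dataframe libraries, such as comparison operators on strings, numbers, and dates.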
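As a rough mental model (an illustrative pandas sketch, not the library's implementation), filtering source/edge/destination triples comes down to boolean masks and membership tests on the two dataframes. The toy tables, the `group` column, and the weight threshold are assumptions.

```python
import pandas as pd

# Invented toy graph: a nodes table and an edges table
nodes = pd.DataFrame({'id': ['a', 'b', 'c', 'd'], 'group': [2, 2, 1, 2]})
edges = pd.DataFrame({
    'from': ['a', 'a', 'b', 'c'],
    'to': ['b', 'c', 'd', 'd'],
    'weight': [0.9, 0.8, 0.1, 0.7],
})

# Node predicate: which ids belong to group 2
in_group = nodes.loc[nodes['group'] == 2, 'id']

# Triple filter: source in group 2, edge weight >= 0.5, destination in group 2
mask = (
    edges['from'].isin(in_group)
    & (edges['weight'] >= 0.5)
    & edges['to'].isin(in_group)
)
kept_edges = edges[mask]

# The result is again a graph: the surviving edges plus the nodes they touch
kept_ids = pd.concat([kept_edges['from'], kept_edges['to']]).unique()
kept_nodes = nodes[nodes['id'].isin(kept_ids)]

print(kept_edges)
print(kept_nodes)
```

Only the edge from `a` to `b` satisfies all three conditions, so the filtered graph has one edge and two nodes.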
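Sample tasks
This section shows how to:
Find SenSchumer and his immediate community (infomap metric)
Look at his entire community
Find everyone with high edge weight to or from SenSchumer, 2 hops in either direction
Find everyone in his community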
[80]:
g2.chain([n({'title': 'SenSchumer'})])._nodes
[80]:
| | id | title | community_infomap | pagerank | degree_in | degree_out | degree |
---|---|---|---|---|---|---|---|
0 | SenSchumer | SenSchumer | 2 | 0.001296 | 25 | 97 | 122 |
You can also pass chain() a sequence of node and edge expressions
[81]:
g_immediate_community2 = g2.chain([n({'title': 'SenSchumer'}), e_undirected(), n({'community_infomap': 2})])
print(len(g_immediate_community2._nodes), 'senators', len(g_immediate_community2._edges), 'relns')
g_immediate_community2._edges[['from', 'to', 'weight2']].sort_values(by=['weight2']).head(10)
58 senators 69 relns
[81]:
| | from | to | weight2 |
---|---|---|---|
22 | SenSchumer | JacksonLeeTX18 | 0.001546 |
46 | SenSchumer | RepSarbanes | 0.001546 |
23 | SenSchumer | RepJayapal | 0.001546 |
53 | SenSchumer | PeterWelch | 0.001546 |
25 | SenSchumer | RepDaveJoyce | 0.001546 |
26 | SenSchumer | RepRobinKelly | 0.001546 |
28 | SenSchumer | RepAndyKimNJ | 0.001546 |
29 | SenSchumer | RepBarbaraLee | 0.001546 |
50 | SenSchumer | RepPaulTonko | 0.001546 |
32 | SenSchumer | RepMeijer | 0.001546 |
[82]:
g_immediate_community2.plot()
[82]:
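Often we filter on just a single source node/edge/destination node triple, so hop() is a shorthand for that operation. All hop() parameters can also be passed to edge expressions.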
[83]:
g_community2 = g2.hop(source_node_match={'community_infomap': 2}, destination_node_match={'community_infomap': 2})
print(len(g_community2._nodes), 'senators', len(g_community2._edges), 'relns')
g_community2._edges.sort_values(by=['weight2']).head(10)
214 senators 4993 relns
[83]:
| | from | to | weight | weight2 |
---|---|---|---|---|
378 | RepDonBeyer | RepSpeier | 0.000658 | 0.000658 |
354 | RepDonBeyer | repcleaver | 0.000658 | 0.000658 |
353 | RepDonBeyer | RepYvetteClarke | 0.000658 | 0.000658 |
352 | RepDonBeyer | RepCasten | 0.000658 | 0.000658 |
349 | RepDonBeyer | RepBeatty | 0.000658 | 0.000658 |
360 | RepDonBeyer | RepGaramendi | 0.000658 | 0.000658 |
361 | RepDonBeyer | RepChuyGarcia | 0.000658 | 0.000658 |
362 | RepDonBeyer | RepRaulGrijalva | 0.000658 | 0.000658 |
365 | RepDonBeyer | USRepKeating | 0.000658 | 0.000658 |
366 | RepDonBeyer | RepRickLarsen | 0.000658 | 0.000658 |
[86]:
g_community2.encode_point_color('pagerank', ['blue', 'yellow', 'red'], as_continuous=True).plot()
[86]:
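4. Multi-hop and paths-between-nodes pattern mining#
The chain([...]) method can be used to look across multiple hops, and even to find paths between nodes.
For example: everyone who connects SenSchumer and SpeakerPelosi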
[94]:
g_shumer_pelosi_bridges = g2.chain([
    n({'title': 'SenSchumer'}),
    e_undirected(),
    n(),
    e_undirected(),
    n({'title': 'SpeakerPelosi'})
])
print(len(g_shumer_pelosi_bridges._nodes), 'senators')
g_shumer_pelosi_bridges._edges.sort_values(by='weight').head(5)
66 senators
[94]:
| | from | to | weight | weight2 |
---|---|---|---|---|
86 | RepJayapal | SpeakerPelosi | 0.000871 | 0.000871 |
47 | SenSchumer | RepMeijer | 0.001546 | 0.001546 |
23 | SenSchumer | RepBuddyCarter | 0.001546 | 0.001546 |
24 | SenSchumer | RepJudyChu | 0.001546 | 0.001546 |
26 | SenSchumer | repcleaver | 0.001546 | 0.001546 |
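Read as a set operation, the middle `n()` in this pattern matches nodes adjacent to both endpoints. A small pandas sketch of that idea on an invented toy graph follows; the node names and the helper `neighbors` are hypothetical, and this is not how `chain()` is implemented.

```python
import pandas as pd

# Invented toy edges; direction is ignored below, as with e_undirected()
edges = pd.DataFrame({
    'from': ['s', 's', 'x', 'y', 'z'],
    'to':   ['x', 'y', 'p', 'q', 'p'],
})

def neighbors(edges: pd.DataFrame, node: str) -> set:
    """Nodes one undirected hop away from `node`."""
    out_nbrs = edges.loc[edges['from'] == node, 'to']
    in_nbrs = edges.loc[edges['to'] == node, 'from']
    return set(out_nbrs) | set(in_nbrs)

# Bridges: adjacent to both endpoints
bridges = neighbors(edges, 's') & neighbors(edges, 'p')

# Edges on a 2-hop path touch one endpoint and one bridge
ends = {'s', 'p'}
on_path = (
    (edges['from'].isin(ends) & edges['to'].isin(bridges))
    | (edges['to'].isin(ends) & edges['from'].isin(bridges))
)
path_edges = edges[on_path]

print(sorted(bridges))
print(path_edges)
```

Here only `x` neighbors both `s` and `p`, so the matched subgraph keeps the two edges `s`-`x` and `x`-`p`, along with the two endpoints.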
[92]:
g_shumer_pelosi_bridges.plot()
[92]:
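5. Advanced filter predicates#
We can filter nodes and edges using many kinds of predicates, not just equality on attribute values.
Common tasks include comparing attributes using:
* Set inclusion: is_in([...])
* Numeric comparison: gt(...), lt(...), ge(...), le(...)
* String comparison: startswith(...), endswith(...), contains(...)
* Regular expression matching: match(...), imported above as match_re
* Duplicate checking: duplicated()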
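Each predicate family has a close analogue among pandas column operations, which is a helpful way to reason about what a filter will match. The sketch below uses only pandas on an invented node table; the mapping to the predicates listed above is an informal assumption, not a statement about how they are implemented.

```python
import pandas as pd

# Invented node table for illustration
nodes = pd.DataFrame({
    'title': ['SenAlpha', 'RepBeta', 'LeaderGamma', 'SenAlpha'],
    'pagerank': [0.004, 0.001, 0.006, 0.004],
})

title, pr = nodes['title'], nodes['pagerank']

# One boolean mask per predicate family
masks = {
    'set inclusion': title.isin(['SenAlpha', 'RepBeta']),
    'numeric >=': pr.ge(0.004),
    'startswith': title.str.startswith('Sen'),
    'contains': title.str.contains('Leader'),
    'regex match': title.str.match(r'Sen|Leader'),
    'duplicated': title.duplicated(),
}

for label, mask in masks.items():
    print(f'{label}: {mask.tolist()}')
```

Note that pandas `str.match` anchors the pattern at the start of the string, while `str.contains` searches anywhere in it.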
Graph where nodes are in the top 20 by PageRank:
[134]:
top_20_pr = g2._nodes.pagerank.sort_values(ascending=False, ignore_index=True)[19]
top_20_pr
[134]:
0.005888600097034367
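The cutoff is simply the 20th-largest PageRank value, so a `ge(...)` filter on it keeps every node at or above that value. A minimal pandas sketch on made-up scores, using a top-3 cutoff for brevity, also shows one caveat: ties at the cutoff can let more than k nodes through.

```python
import pandas as pd

# Made-up scores; 0.3 appears twice so the top-3 cutoff has a tie
scores = pd.Series([0.1, 0.5, 0.3, 0.3, 0.9, 0.2])
k = 3

# Same idea as the tutorial: sort descending, take the k-th value (index k-1)
cutoff = scores.sort_values(ascending=False, ignore_index=True)[k - 1]

# Equivalent cutoff without sorting the whole column
cutoff_alt = scores.nlargest(k).iloc[-1]

# Keep everything at or above the cutoff
kept = scores[scores >= cutoff]
print(cutoff, cutoff_alt, len(kept))
```

With these scores the cutoff is 0.3 and four values pass the filter, one more than k, because of the tie.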
[128]:
g_high_pr = g2.chain([
    n({'pagerank': ge(top_20_pr)}),
    e_undirected(),
    n({'pagerank': ge(top_20_pr)}),
])
len(g_high_pr._nodes)
[128]:
20
[129]:
g_high_pr.plot()
[129]:
Graph where the name contains Leader
[136]:
g_leaders = g2.hop(
    source_node_match={'title': contains('Leader')},
    destination_node_match={'title': contains('Leader')}
)
print(len(g_leaders._nodes), 'leaders')
g_leaders.plot()
2 leaders
[136]:
Graph of leaders and senators
[139]:
g_leaders_and_senators = g2.hop(
    source_node_match={'title': match_re(r'Sen|Leader')},
    destination_node_match={'title': match_re(r'Sen|Leader')}
)
print(len(g_leaders_and_senators._nodes), 'leaders and senators')
g_leaders_and_senators.plot()
67 leaders and senators
[139]:
6. Result labeling#
Naming nodes and edges in a path query can be useful for downstream reasoning:
[156]:
g_bridges2 = g2.chain([
    n({'title': 'SenSchumer'}),
    e_undirected(name='from_schumer'),
    n(name='found_bridge'),
    e_undirected(name='from_pelosi'),
    n({'title': 'SpeakerPelosi'})
])
print(len(g_bridges2._nodes), 'senators in full graph')
named = g_bridges2._nodes[g_bridges2._nodes.found_bridge]
print(len(named), 'bridging senators')
edges = g_bridges2._edges
print(len(edges[edges.from_schumer]), 'relns from_schumer', len(edges[edges.from_pelosi]), 'relns from_pelosi')
g_bridges2.encode_point_color(
    'found_bridge',
    as_categorical=True,
    categorical_mapping={
        True: 'orange',
        False: 'silver'
    }
).plot()
66 senators in full graph
64 bridging senators
75 relns from_schumer 83 relns from_pelosi
[156]:
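Since each `name=` shows up as a boolean column on the result tables, as used in the cell above, downstream analysis is ordinary dataframe work. A hedged pandas sketch on invented stand-in tables follows; the column names mirror the example, while the rows and weights are made up.

```python
import pandas as pd

# Invented stand-ins for g_bridges2._nodes and g_bridges2._edges
nodes = pd.DataFrame({
    'id': ['SenSchumer', 'RepA', 'RepB', 'SpeakerPelosi'],
    'found_bridge': [False, True, True, False],
})
edges = pd.DataFrame({
    'from': ['SenSchumer', 'SenSchumer', 'RepA', 'RepB'],
    'to': ['RepA', 'RepB', 'SpeakerPelosi', 'SpeakerPelosi'],
    'weight': [0.02, 0.01, 0.03, 0.005],
    'from_schumer': [True, True, False, False],
    'from_pelosi': [False, False, True, True],
})

# Boolean label columns work directly as masks
bridge_ids = nodes.loc[nodes['found_bridge'], 'id'].tolist()

# Summarize edge weights per labeled step of the pattern
summary = {
    label: edges.loc[edges[label], 'weight'].sum()
    for label in ['from_schumer', 'from_pelosi']
}

print(bridge_ids)
print(summary)
```

The same masks can feed group-bys, joins back to other tables, or visual encodings such as the point coloring shown above.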