GFQL: Hop & Chain - Cypher-style graph pattern matching on PyGraphistry dataframes#

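PyGraphistry supports a rich subset of the popular Cypher graph query language, which you can run on dataframes without installing a database or native libraries. It is natively integrated with dataframes, so its syntax is Python-native rather than the traditional string syntax.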

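PyGraphistry graph pattern matching shares key features with Cypher:

  • Multi-hop search

  • Predicates on node and edge attributes

  • Ability to identify matched nodes and edges

It differs in several key ways:

  • Pure PyData (Python/C++/Fortran): no database, Java, etc. to install; pip install graphistry is enough

  • It is set-oriented rather than path-oriented: every operation is guaranteed to translate efficiently into vectorized dataframe operations, rather than the row-at-a-time path operations typical of traditional graph query engines, which are asymptotically slower.

  • Advanced users can plug in custom predicates as native Python dataframe code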
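
To make the set-oriented idea concrete, here is a minimal pandas-only sketch. It is not PyGraphistry's implementation: the `src`/`dst` column names and the `hop_forward` helper are hypothetical, chosen only for illustration. Each hop maps a whole set of frontier nodes to the next set with one vectorized filter, instead of walking paths one row at a time.

```python
import pandas as pd

# Toy directed edge table; the 'src'/'dst' column names are illustrative.
edges = pd.DataFrame({
    'src': ['a', 'a', 'b', 'c', 'd'],
    'dst': ['b', 'c', 'd', 'd', 'e'],
})

def hop_forward(edges: pd.DataFrame, frontier: set) -> pd.DataFrame:
    """Return every edge whose source is in the frontier (one vectorized filter)."""
    return edges[edges['src'].isin(frontier)]

# Two forward hops from 'a': each step maps a node set to a node set.
frontier = {'a'}
visited_edges = []
for _ in range(2):
    step = hop_forward(edges, frontier)
    visited_edges.append(step)
    frontier = set(step['dst'])

two_hop_edges = pd.concat(visited_edges).drop_duplicates()
print(sorted(frontier))    # nodes reached after exactly two hops
print(len(two_hop_edges))  # edges touched along the way
```

The cost of each step depends on the size of the edge table and the frontier, not on the number of distinct paths, which is where a path-at-a-time traversal can blow up.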


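Tutorial

  1. Install & configure

  2. Load & enrich a US Congress Twitter interaction dataset

  3. Simple graph filtering: g.hop() and g.chain([...])

  4. Multi-hop and paths-between-nodes pattern mining

  5. Advanced filter predicates

  6. Result labeling

1. Install & configure#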

[ ]:
# ! pip install graphistry[igraph]

Imports#

[141]:
import pandas as pd

import graphistry

from graphistry import (

    # graph operators
    n, e_undirected, e_forward, e_reverse,

    # attribute predicates
    is_in, ge, startswith, contains, match as match_re
)
[ ]:
graphistry.register(api=3, username='...', password='...')

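2. Load & enrich a US Congress Twitter interaction dataset#

Data#

  • Download

  • Convert the JSON to a Pandas edge dataframe

  • Turn the edge dataframe into a PyGraphistry graph

  • Enrich nodes and edges with some useful graph metrics

  • Visualize the full graph as a sanity check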

[9]:
# ! wget -q https://snap.stanford.edu/data/congress_network.zip
# ! unzip congress_network.zip

total 1.2M
drwxr-xr-x 1 root root 4.0K Dec  4 03:56 .
drwxr-xr-x 1 root root 4.0K Dec  4 03:33 ..
-rw-r--r-- 1 root root 150K May  9  2017 Attribute
-rw-r--r-- 1 root root  14K May  9  2017 Class_info
drwxr-xr-x 4 root root 4.0K Nov 30 14:24 .config
-rw-r--r-- 1 root root 190K Aug  5 05:26 congress_network.zip
-rw-r--r-- 1 root root 320K May  9  2017 edgelist
drwxr-xr-x 1 root root 4.0K Nov 30 14:27 sample_data
-rw-r--r-- 1 root root   16 May  9  2017 Statistics
-rw-r--r-- 1 root root 221K Dec  4 03:53 twitter.zip
-rw-r--r-- 1 root root 299K May  9  2017 vertex2aid
[40]:
import json

with open('congress_network/congress_network_data.json', 'r') as file:
    data = json.load(file)

edges = []
for i, name in enumerate(data[0]['usernameList']):
  for ii, j in enumerate(data[0]['outList'][i]):
    edges.append({
        'from': name,
        'to': data[0]['usernameList'][j],
        'weight': data[0]['outWeight'][i][ii]
    })
edges_df = pd.DataFrame(edges)

print(edges_df.shape)
edges_df.sample(5)
(13289, 3)
[40]:
from to weight
11112 RepBobbyRush janschakowsky 0.034364
3836 RepCori Ilhan 0.015936
5282 RepTedDeutch RepDWStweets 0.003268
12352 BennieGThompson RepStricklandWA 0.006849
9358 RepCarolMiller RepTroyNehls 0.005291
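
The conversion in the cell above relies on the JSON storing an adjacency list by index. The following sketch replays the same logic on a tiny hand-made record. The key names (`usernameList`, `outList`, `outWeight`) come from the cell above; the usernames and weights are made up for illustration.

```python
# Toy stand-in for the parsed JSON; values are invented for illustration.
data = [{
    'usernameList': ['alice', 'bob', 'carol'],
    'outList': [[1, 2], [2], []],           # outList[i] = indices that user i points to
    'outWeight': [[0.5, 0.25], [0.1], []],  # outWeight[i][k] = weight of user i's k-th out-edge
}]

record = data[0]
names = record['usernameList']

# Pair each target index with its weight, and resolve indices to usernames.
edges = [
    {'from': names[i], 'to': names[j], 'weight': w}
    for i, (targets, weights) in enumerate(zip(record['outList'], record['outWeight']))
    for j, w in zip(targets, weights)
]

for edge in edges:
    print(edge)
```

Passing the resulting list of dicts to `pd.DataFrame(...)` yields the same three-column shape as `edges_df`.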

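Load the dataframe as a PyGraphistry graph#

Convert to a graph and precompute some useful graph metrics.

Recall that a g object is, under the hood, essentially just two dataframes, g._edges and g._nodes, plus many useful graph methods: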

[77]:
# Shape
g = graphistry.edges(edges_df, 'from', 'to')

# Enrich & style
# Tip: Switch from compute_igraph to compute_cugraph when GPUs are available
g2 = (g
      .materialize_nodes()
      .nodes(lambda g: g._nodes.assign(title=g._nodes.id))
      .edges(lambda g: g._edges.assign(weight2=g._edges.weight))
      .bind(point_title='title')
      .compute_igraph('community_infomap')
      .compute_igraph('pagerank')
      .get_degrees()
      .encode_point_color(
          'community_infomap',
          as_categorical=True,
          categorical_mapping={
              0: '#32a9a2', # vibrant teal
              1: '#ff6b6b', # soft coral
              2: '#f9d342', # muted yellow
          }
      )
)

g2._nodes
WARNING:root:edge index g._edge not set so using edge index as ID; set g._edge via g.edges(), or change merge_if_existing to False
WARNING:root:edge index g._edge __edge_index__ missing as attribute in ig; using ig edge order for IDs
WARNING:root:edge index g._edge not set so using edge index as ID; set g._edge via g.edges(), or change merge_if_existing to False
WARNING:root:edge index g._edge __edge_index__ missing as attribute in ig; using ig edge order for IDs
[77]:
id title community_infomap pagerank degree_in degree_out degree
0 SenatorBaldwin SenatorBaldwin 0 0.001422 26 20 46
1 SenJohnBarrasso SenJohnBarrasso 0 0.001179 22 19 41
2 SenatorBennet SenatorBennet 0 0.001995 33 22 55
3 MarshaBlackburn MarshaBlackburn 0 0.001331 18 38 56
4 SenBlumenthal SenBlumenthal 0 0.001672 30 35 65
... ... ... ... ... ... ... ...
470 RepJoeWilson RepJoeWilson 1 0.001780 21 38 59
471 RobWittman RobWittman 1 0.001017 13 19 32
472 rep_stevewomack rep_stevewomack 1 0.002637 35 19 54
473 RepJohnYarmuth RepJohnYarmuth 2 0.000555 5 20 25
474 RepLeeZeldin RepLeeZeldin 1 0.000511 3 25 28

475 rows × 7 columns
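
As a rough illustration of what the node table and degree columns above contain, the sketch below rebuilds them from a toy edge table with plain pandas. It only mirrors the shape of the result; it is not how `materialize_nodes()` or `get_degrees()` are implemented, and the toy edges are invented.

```python
import pandas as pd

edges = pd.DataFrame({
    'from': ['a', 'a', 'b', 'c'],
    'to':   ['b', 'c', 'c', 'a'],
})

# Node table: every distinct endpoint appearing in either edge column.
nodes = pd.DataFrame({
    'id': pd.unique(pd.concat([edges['from'], edges['to']], ignore_index=True))
})

# Degrees: how often each id appears as a destination / as a source.
degree_in = edges['to'].value_counts()
degree_out = edges['from'].value_counts()
nodes['degree_in'] = nodes['id'].map(degree_in).fillna(0).astype(int)
nodes['degree_out'] = nodes['id'].map(degree_out).fillna(0).astype(int)
nodes['degree'] = nodes['degree_in'] + nodes['degree_out']

print(nodes)
```

Note that `degree` is the sum of `degree_in` and `degree_out`, which matches the rows shown above (for example 26 + 20 = 46 for SenatorBaldwin).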

[79]:
g2.plot()
[79]:

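3. Simple filtering: g.hop() & g.chain([...])#

We can filter by nodes, edges, and combinations of them.

The result is a graph, whose node and edge tables we can inspect, or on which we can run further graph operations such as visualization or additional searches.

Key concepts

There are 2 key methods:

  • g.hop(...): filter triples of source node, edge, destination node

  • g.chain([....]): arbitrary-length sequence of node and edge predicates

They reuse the column operations at the core of dataframe libraries, such as comparison operators on strings, numbers, and dates.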
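
The triple filter can be pictured as ordinary dataframe masks. The sketch below is a simplified pandas stand-in, not the library's code: `filter_triples` and the toy `community` column are hypothetical, and only equality matches are handled. It keeps an edge when both of its endpoints satisfy their attribute dictionaries, then keeps the nodes those edges touch.

```python
import pandas as pd

nodes = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'community': [2, 2, 1, 2],
})
edges = pd.DataFrame({
    'from': ['a', 'a', 'b', 'c'],
    'to':   ['b', 'c', 'd', 'd'],
})

def filter_triples(nodes, edges, source_match, destination_match):
    """Keep edges whose source and destination nodes satisfy equality dicts."""
    def matching_ids(match):
        mask = pd.Series(True, index=nodes.index)
        for column, value in match.items():
            mask &= nodes[column] == value
        return nodes.loc[mask, 'id']

    keep = (edges['from'].isin(matching_ids(source_match))
            & edges['to'].isin(matching_ids(destination_match)))
    kept_edges = edges[keep]
    kept_ids = pd.unique(pd.concat([kept_edges['from'], kept_edges['to']]))
    kept_nodes = nodes[nodes['id'].isin(kept_ids)]
    return kept_nodes, kept_edges

kept_nodes, kept_edges = filter_triples(
    nodes, edges, {'community': 2}, {'community': 2})
print(kept_edges)
```

Because the result is again a node table plus an edge table, the same kind of filter can be applied to it repeatedly, which is the intuition behind chaining.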

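Sample tasks

This section shows how to:

  • Find SenSchumer and his immediate community (infomap metric)

  • Look at his entire community

  • Find everyone with high edge weight to or from SenSchumer, within 2 hops in either direction

  • Find everyone in his community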

[80]:
g2.chain([n({'title': 'SenSchumer'})])._nodes
[80]:
id title community_infomap pagerank degree_in degree_out degree
0 SenSchumer SenSchumer 2 0.001296 25 97 122

You can also pass chain() a sequence of node and edge expressions:

[81]:
g_immediate_community2 = g2.chain([n({'title': 'SenSchumer'}), e_undirected(), n({'community_infomap': 2})])

print(len(g_immediate_community2._nodes), 'senators', len(g_immediate_community2._edges), 'relns')
g_immediate_community2._edges[['from', 'to', 'weight2']].sort_values(by=['weight2']).head(10)
58 senators 69 relns
[81]:
from to weight2
22 SenSchumer JacksonLeeTX18 0.001546
46 SenSchumer RepSarbanes 0.001546
23 SenSchumer RepJayapal 0.001546
53 SenSchumer PeterWelch 0.001546
25 SenSchumer RepDaveJoyce 0.001546
26 SenSchumer RepRobinKelly 0.001546
28 SenSchumer RepAndyKimNJ 0.001546
29 SenSchumer RepBarbaraLee 0.001546
50 SenSchumer RepPaulTonko 0.001546
32 SenSchumer RepMeijer 0.001546
[82]:
g_immediate_community2.plot()
[82]:

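Often we are just filtering on a single source node/edge/destination node triple, so hop() is a shorthand for that. All hop() parameters can also be passed to edge expressions.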

[83]:
g_community2 = g2.hop(source_node_match={'community_infomap': 2}, destination_node_match={'community_infomap': 2})

print(len(g_community2._nodes), 'senators', len(g_community2._edges), 'relns')
g_community2._edges.sort_values(by=['weight2']).head(10)
214 senators 4993 relns
[83]:
from to weight weight2
378 RepDonBeyer RepSpeier 0.000658 0.000658
354 RepDonBeyer repcleaver 0.000658 0.000658
353 RepDonBeyer RepYvetteClarke 0.000658 0.000658
352 RepDonBeyer RepCasten 0.000658 0.000658
349 RepDonBeyer RepBeatty 0.000658 0.000658
360 RepDonBeyer RepGaramendi 0.000658 0.000658
361 RepDonBeyer RepChuyGarcia 0.000658 0.000658
362 RepDonBeyer RepRaulGrijalva 0.000658 0.000658
365 RepDonBeyer USRepKeating 0.000658 0.000658
366 RepDonBeyer RepRickLarsen 0.000658 0.000658
[86]:
g_community2.encode_point_color('pagerank', ['blue', 'yellow', 'red'], as_continuous=True).plot()
[86]:

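4. Multi-hop and paths-between-nodes pattern mining#

The chain([...]) method can look across multiple hops, and can even find the paths between nodes.

For example: everyone who connects SenSchumer and SpeakerPelosi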

[94]:
g_shumer_pelosi_bridges = g2.chain([
    n({'title': 'SenSchumer'}),
    e_undirected(),
    n(),
    e_undirected(),
    n({'title': 'SpeakerPelosi'})
])

print(len(g_shumer_pelosi_bridges._nodes), 'senators')
g_shumer_pelosi_bridges._edges.sort_values(by='weight').head(5)
66 senators
[94]:
from to weight weight2
86 RepJayapal SpeakerPelosi 0.000871 0.000871
47 SenSchumer RepMeijer 0.001546 0.001546
23 SenSchumer RepBuddyCarter 0.001546 0.001546
24 SenSchumer RepJudyChu 0.001546 0.001546
26 SenSchumer repcleaver 0.001546 0.001546
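
The two-hop bridge query above can be reasoned about with sets: a node bridges the pair when it is an undirected neighbor of both endpoints. The sketch below shows that reasoning in plain pandas on invented data. The `neighbors` helper and the node names are hypothetical, and this is an illustration of the idea rather than how chain() is evaluated.

```python
import pandas as pd

edges = pd.DataFrame({
    'from': ['start', 'start', 'm1', 'end', 'start'],
    'to':   ['m1', 'm2', 'end', 'm2', 'x'],
})

def neighbors(edges, node):
    """Undirected neighbors: follow edges in either direction."""
    out_nbrs = edges.loc[edges['from'] == node, 'to']
    in_nbrs = edges.loc[edges['to'] == node, 'from']
    return set(out_nbrs) | set(in_nbrs)

# A node bridges the pair if it is adjacent to both endpoints.
bridges = (neighbors(edges, 'start') & neighbors(edges, 'end')) - {'start', 'end'}

# Keep only edges that join an endpoint to a bridge.
endpoint = edges['from'].isin(['start', 'end']) | edges['to'].isin(['start', 'end'])
touches_bridge = edges['from'].isin(bridges) | edges['to'].isin(bridges)
bridge_edges = edges[endpoint & touches_bridge]

print(sorted(bridges))
print(len(bridge_edges))
```

The dead-end neighbor `x` is dropped because it is adjacent to only one endpoint, which is why the query result contains just the endpoints and the nodes between them.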
[92]:
g_shumer_pelosi_bridges.plot()
[92]:

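5. Advanced filter predicates#

We can filter nodes and edges with a variety of predicates, not just equality on attribute values.

Common tasks include comparing attributes using:

  • Set inclusion: is_in([...])

  • Numeric comparison: gt(...), lt(...), ge(...), le(...)

  • String comparison: startswith(...), endswith(...), contains(...)

  • Regular expression matching: match(...) (imported above as match_re)

  • Duplicate checking: duplicated()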
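
Since these predicates reuse dataframe column operations, it can help to see the pandas masks they are named after. The sketch below uses only pandas on an invented node table; treating each predicate as the like-named pandas operation is an assumption made for illustration, not a statement about the library's internals.

```python
import pandas as pd

nodes = pd.DataFrame({
    'title': ['SenA', 'RepB', 'LeaderC', 'RepD', 'SenA'],
    'pagerank': [0.004, 0.001, 0.006, 0.002, 0.004],
})

# One boolean mask per predicate family, keyed by the predicate it resembles.
masks = {
    'is_in':      nodes['title'].isin(['SenA', 'RepD']),
    'ge':         nodes['pagerank'] >= 0.004,
    'startswith': nodes['title'].str.startswith('Rep'),
    'endswith':   nodes['title'].str.endswith('C'),
    'contains':   nodes['title'].str.contains('Leader'),
    'duplicated': nodes['title'].duplicated(),
}

for name, mask in masks.items():
    print(name, nodes.loc[mask, 'title'].tolist())
```

Each mask is computed over the whole column at once, so combining several conditions is a matter of combining boolean Series with `&` and `|`.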

Graph of nodes in the top 20 by PageRank:

[134]:
top_20_pr = g2._nodes.pagerank.sort_values(ascending=False, ignore_index=True)[19]
top_20_pr
[134]:
0.005888600097034367
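
The threshold above is the 20th-largest PageRank value. A small pandas sketch on invented scores shows the same technique for a smaller k, and one caveat worth keeping in mind: filtering with a greater-or-equal comparison keeps ties at the threshold, so more than k rows can survive when values repeat.

```python
import pandas as pd

scores = pd.Series([0.9, 0.5, 0.7, 0.5, 0.1], name='pagerank')
k = 2

# k-th largest value, computed like the cell above (0-based position k - 1).
threshold = scores.sort_values(ascending=False, ignore_index=True)[k - 1]

# nlargest picks the top k rows directly; its minimum equals the threshold.
assert threshold == scores.nlargest(k).min()

# With k = 3 the threshold value is tied, so >= keeps four rows, not three.
k3_threshold = scores.sort_values(ascending=False, ignore_index=True)[2]
print(threshold, int((scores >= threshold).sum()))
print(k3_threshold, int((scores >= k3_threshold).sum()))
```

In the tutorial's data the filter returns exactly 20 nodes, so no tie occurs at the cutoff there.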
[128]:
g_high_pr = g2.chain([
    n({'pagerank': ge(top_20_pr)}),
    e_undirected(),
    n({'pagerank': ge(top_20_pr)}),
])

len(g_high_pr._nodes)
[128]:
20
[129]:
g_high_pr.plot()
[129]:

Graph of nodes whose name contains Leader

[136]:
g_leaders = g2.hop(
    source_node_match={'title': contains('Leader')},
    destination_node_match = {'title': contains('Leader')}
)

print(len(g_leaders._nodes), 'leaders')

g_leaders.plot()
2 leaders
[136]:

Graph of leaders and senators

[139]:
g_leaders_and_senators = g2.hop(
    source_node_match={'title': match_re(r'Sen|Leader')},
    destination_node_match = {'title': match_re(r'Sen|Leader')}
)

print(len(g_leaders_and_senators._nodes), 'leaders and senators')

g_leaders_and_senators.plot()
67 leaders and senators
[139]:
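
One subtlety when choosing between a regular expression match and contains(): assuming match_re follows the semantics of pandas `Series.str.match` (an assumption about the library, not something shown in this tutorial), the pattern is anchored at the start of the string, whereas contains searches anywhere. The pandas-only sketch below uses illustrative handles that are not taken from the dataset.

```python
import pandas as pd

# Illustrative handles; only the pattern comes from the cell above.
titles = pd.Series(['SenA', 'LeaderB', 'GOPLeader', 'RepC'])

starts_with_pattern = titles[titles.str.match(r'Sen|Leader')]
anywhere = titles[titles.str.contains(r'Sen|Leader')]

print(starts_with_pattern.tolist())  # pattern must match at the start of the string
print(anywhere.tolist())             # pattern may match anywhere in the string
```

If handles with the keyword in the middle should be included, a pattern such as `r'.*(Sen|Leader)'` or the contains() predicate would be the safer choice under that assumption.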

6. Result labeling#

Naming nodes and edges in a path query can be useful for downstream reasoning:

[156]:
g_bridges2 = g2.chain([
    n({'title': 'SenSchumer'}),
    e_undirected(name='from_schumer'),
    n(name='found_bridge'),
    e_undirected(name='from_pelosi'),
    n({'title': 'SpeakerPelosi'})
])

print(len(g_bridges2._nodes), 'senators in full graph')

named = g_bridges2._nodes[ g_bridges2._nodes.found_bridge ]
print(len(named), 'bridging senators')
edges = g_bridges2._edges
print(len(edges[edges.from_schumer]), 'relns from_schumer', len(edges[edges.from_pelosi]), 'relns from_pelosi')

g_bridges2.encode_point_color(
    'found_bridge',
    as_categorical=True,
    categorical_mapping={
        True: 'orange',
        False: 'silver'
    }
).plot()
66 senators in full graph
64 bridging senators
75 relns from_schumer 83 relns from_pelosi
[156]:
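
In the cell above, each name shows up as a boolean column on the result tables, which is why `g_bridges2._nodes.found_bridge` can be used directly as a mask. The pandas-only sketch below imitates that shape on invented data. The `bridge_ids` set is a hypothetical stand-in for whatever the middle pattern step matched.

```python
import pandas as pd

nodes = pd.DataFrame({'id': ['start', 'm1', 'm2', 'end']})
bridge_ids = {'m1', 'm2'}  # hypothetical result of the middle pattern step

# A name becomes a boolean column: True where the node matched that step.
nodes['found_bridge'] = nodes['id'].isin(bridge_ids)

# Downstream reasoning: filter on the flag, or map it to display colors.
named = nodes[nodes['found_bridge']]
colors = nodes['found_bridge'].map({True: 'orange', False: 'silver'})

print(len(named), 'of', len(nodes), 'nodes tagged')
print(colors.tolist())
```

This matches the counts printed above: 64 of the 66 result nodes are tagged as bridges, and the remaining 2 are the endpoints.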