GFQL: Hop chains - Cypher-style graph pattern matching on dataframes with PyGraphistry#

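PyGraphistry supports a rich subset of the popular Cypher graph query language that you can run directly on dataframes, without installing a database or native libraries. Because it is integrated with dataframes, queries use Python-native syntax rather than the traditional string syntax.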

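PyGraphistry graph pattern matching shares key features with Cypher:

Multi-hop search
Predicates on node and edge attributes
Ability to name the matched nodes and edges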

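It differs in a few key ways:

Pure PyData (Python/C++/Fortran): no need to install a database, Java, etc.; pip install graphistry is enough
Set-oriented rather than path-oriented: all operations are guaranteed to translate into efficient vectorized dataframe operations, instead of the row-at-a-time path operations typical of traditional graph query engines, which are asymptotically slower
Advanced users can plug in custom predicates as native Python dataframe code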
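To make the set-oriented point concrete, here is a minimal sketch in plain pandas. It is not PyGraphistry's actual implementation: the toy edge list, the column names `src`/`dst`, and the helper `hop_forward` are all hypothetical. The idea it illustrates is that each hop can be written as one vectorized `isin` filter over the whole edge table, with the search state kept as a set of nodes rather than a list of paths.

```python
import pandas as pd

# Hypothetical toy edge list; 'src'/'dst' are illustrative column names.
edges = pd.DataFrame({
    'src': ['a', 'a', 'b', 'c', 'd'],
    'dst': ['b', 'c', 'd', 'd', 'e'],
})

def hop_forward(edges: pd.DataFrame, frontier: set) -> tuple:
    """One forward hop: keep every edge whose source is in the frontier."""
    hit = edges[edges['src'].isin(frontier)]  # single vectorized filter
    return hit, set(hit['dst'])

# Two hops from 'a': the frontier is a set of nodes, not a list of paths.
hop1_edges, frontier1 = hop_forward(edges, {'a'})
hop2_edges, frontier2 = hop_forward(edges, frontier1)

print(sorted(frontier1))  # ['b', 'c']
print(sorted(frontier2))  # ['d']
```

Both paths a-b-d and a-c-d collapse into the single frontier node `d`, which is why the work per hop scales with the edge table rather than with the number of paths.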

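Tutorial

Install & configure
Load & enrich a US Congress Twitter interaction dataset
Simple graph filtering: g.hop() and g.chain([...])
Multi-hop and paths-between-nodes pattern mining
Advanced filter predicates
Result labeling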

1. Install & configure#

[ ]:

# ! pip install graphistry[igraph]

Imports#

[141]:

import pandas as pd

import graphistry

from graphistry import (

    # graph operators
    n, e_undirected, e_forward, e_reverse,

    # attribute predicates
    is_in, ge, startswith, contains, match as match_re
)

[ ]:

graphistry.register(api=3, username='...', password='...')

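2. Load & enrich a US Congress Twitter interaction dataset#

Data#

Download
Convert the JSON to a Pandas edges dataframe
Turn the edges dataframe into a PyGraphistry graph
Enrich nodes and edges with some useful graph metrics
Visualize the full graph as a check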

[9]:

# ! wget -q https://snap.stanford.edu/data/congress_network.zip
# ! unzip congress_network.zip

total 1.2M
drwxr-xr-x 1 root root 4.0K Dec  4 03:56 .
drwxr-xr-x 1 root root 4.0K Dec  4 03:33 ..
-rw-r--r-- 1 root root 150K May  9  2017 Attribute
-rw-r--r-- 1 root root  14K May  9  2017 Class_info
drwxr-xr-x 4 root root 4.0K Nov 30 14:24 .config
-rw-r--r-- 1 root root 190K Aug  5 05:26 congress_network.zip
-rw-r--r-- 1 root root 320K May  9  2017 edgelist
drwxr-xr-x 1 root root 4.0K Nov 30 14:27 sample_data
-rw-r--r-- 1 root root   16 May  9  2017 Statistics
-rw-r--r-- 1 root root 221K Dec  4 03:53 twitter.zip
-rw-r--r-- 1 root root 299K May  9  2017 vertex2aid

[40]:

import json

with open('congress_network/congress_network_data.json', 'r') as file:
    data = json.load(file)

edges = []
for i, name in enumerate(data[0]['usernameList']):
  for ii, j in enumerate(data[0]['outList'][i]):
    edges.append({
        'from': name,
        'to': data[0]['usernameList'][j],
        'weight': data[0]['outWeight'][i][ii]
    })
edges_df = pd.DataFrame(edges)

print(edges_df.shape)
edges_df.sample(5)

(13289, 3)

[40]:

	from	to	weight
11112	RepBobbyRush	janschakowsky	0.034364
3836	RepCori	Ilhan	0.015936
5282	RepTedDeutch	RepDWStweets	0.003268
12352	BennieGThompson	RepStricklandWA	0.006849
9358	RepCarolMiller	RepTroyNehls	0.005291
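The loader above depends on the JSON layout: `usernameList` holds the account names, `outList[i]` holds the indices of the accounts that user `i` links to, and `outWeight[i]` holds the matching weights. The following sketch runs the same conversion on a made-up record so the shape is easy to see; the names and weights are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Made-up record mimicking the layout the loader above expects.
toy = [{
    'usernameList': ['alice', 'bob', 'carol'],
    'outList': [[1, 2], [2], []],           # indices into usernameList
    'outWeight': [[0.5, 0.25], [0.1], []],  # parallel to outList
}]

names = toy[0]['usernameList']
toy_edges = pd.DataFrame([
    {'from': names[i], 'to': names[j], 'weight': w}
    for i, (targets, weights) in enumerate(zip(toy[0]['outList'], toy[0]['outWeight']))
    for j, w in zip(targets, weights)
])

print(toy_edges)
```

Zipping `outList[i]` with `outWeight[i]` replaces the inner `enumerate` index lookup and yields one row per outgoing link.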

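Load the dataframe as a PyGraphistry graph#

Convert to a graph and precompute some useful graph metrics.

Recall that a g object is, under the hood, essentially just two dataframes, g._edges and g._nodes, plus many useful graph methods: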

[77]:

# Shape
g = graphistry.edges(edges_df, 'from', 'to')

# Enrich & style
# Tip: Switch from compute_igraph to compute_cugraph when GPUs are available
g2 = (g
      .materialize_nodes()
      .nodes(lambda g: g._nodes.assign(title=g._nodes.id))
      .edges(lambda g: g._edges.assign(weight2=g._edges.weight))
      .bind(point_title='title')
      .compute_igraph('community_infomap')
      .compute_igraph('pagerank')
      .get_degrees()
      .encode_point_color(
          'community_infomap',
          as_categorical=True,
          categorical_mapping={
              0: '#32a9a2', # vibrant teal
              1: '#ff6b6b', # soft coral
              2: '#f9d342', # muted yellow
          }
      )
)

g2._nodes

WARNING:root:edge index g._edge not set so using edge index as ID; set g._edge via g.edges(), or change merge_if_existing to False
WARNING:root:edge index g._edge __edge_index__ missing as attribute in ig; using ig edge order for IDs
WARNING:root:edge index g._edge not set so using edge index as ID; set g._edge via g.edges(), or change merge_if_existing to False
WARNING:root:edge index g._edge __edge_index__ missing as attribute in ig; using ig edge order for IDs

[77]:

	id	title	community_infomap	pagerank	degree_in	degree_out	degree
0	SenatorBaldwin	SenatorBaldwin	0	0.001422	26	20	46
1	SenJohnBarrasso	SenJohnBarrasso	0	0.001179	22	19	41
2	SenatorBennet	SenatorBennet	0	0.001995	33	22	55
3	MarshaBlackburn	MarshaBlackburn	0	0.001331	18	38	56
4	SenBlumenthal	SenBlumenthal	0	0.001672	30	35	65
...	...	...	...	...	...	...	...
470	RepJoeWilson	RepJoeWilson	1	0.001780	21	38	59
471	RobWittman	RobWittman	1	0.001017	13	19	32
472	rep_stevewomack	rep_stevewomack	1	0.002637	35	19	54
473	RepJohnYarmuth	RepJohnYarmuth	2	0.000555	5	20	25
474	RepLeeZeldin	RepLeeZeldin	1	0.000511	3	25	28

475 rows × 7 columns
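As a rough picture of what the node table and degree columns contain, the same columns can be derived by hand from an edge table. This is a pandas sketch on a hypothetical toy edge list and is not the library's code; only the column names `id`, `degree_in`, `degree_out`, and `degree` are taken from the output above.

```python
import pandas as pd

# Hypothetical toy edge list using the tutorial's 'from'/'to' column names.
toy_edges = pd.DataFrame({
    'from': ['a', 'a', 'b', 'c'],
    'to':   ['b', 'c', 'c', 'a'],
})

# Node table: every distinct endpoint, in first-seen order.
ids = pd.concat([toy_edges['from'], toy_edges['to']]).drop_duplicates()
nodes = pd.DataFrame({'id': ids.to_numpy()})

# Degrees: count how often each id appears as a destination / as a source.
nodes['degree_in'] = nodes['id'].map(toy_edges['to'].value_counts()).fillna(0).astype(int)
nodes['degree_out'] = nodes['id'].map(toy_edges['from'].value_counts()).fillna(0).astype(int)
nodes['degree'] = nodes['degree_in'] + nodes['degree_out']

print(nodes)
```

As in the real output, `degree` is simply the sum of `degree_in` and `degree_out`.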

[79]:

g2.plot()

[79]:

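3. Simple filtering: `g.hop()` & `g.chain([...])`#

We can filter by nodes, edges, and combinations of them.

The result is a graph whose node and edge tables we can inspect, or on which we can run further graph operations such as visualization or additional searches.

Key concepts

There are 2 key methods:

g.hop(...): filter triples of source node, edge, destination node
g.chain([...]): arbitrary-length sequence of node and edge predicates

They reuse column-operation primitives from dataframe libraries, such as comparison operators on strings, numbers, and dates.

Sample tasks

This section shows how to:

Find SenSchumer and his immediate community (infomap metric)
Look at his entire community
Find everyone with a high edge weight from/to SenSchumer, 2 hops in either direction
Find everyone in his community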

[80]:

g2.chain([n({'title': 'SenSchumer'})])._nodes

[80]:

	id	title	community_infomap	pagerank	degree_in	degree_out	degree
0	SenSchumer	SenSchumer	2	0.001296	25	97	122

You can also pass chain() a sequence of node and edge expressions:

[81]:

g_immediate_community2 = g2.chain([n({'title': 'SenSchumer'}), e_undirected(), n({'community_infomap': 2})])

print(len(g_immediate_community2._nodes), 'senators', len(g_immediate_community2._edges), 'relns')
g_immediate_community2._edges[['from', 'to', 'weight2']].sort_values(by=['weight2']).head(10)

58 senators 69 relns

[81]:

	from	to	weight2
22	SenSchumer	JacksonLeeTX18	0.001546
46	SenSchumer	RepSarbanes	0.001546
23	SenSchumer	RepJayapal	0.001546
53	SenSchumer	PeterWelch	0.001546
25	SenSchumer	RepDaveJoyce	0.001546
26	SenSchumer	RepRobinKelly	0.001546
28	SenSchumer	RepAndyKimNJ	0.001546
29	SenSchumer	RepBarbaraLee	0.001546
50	SenSchumer	RepPaulTonko	0.001546
32	SenSchumer	RepMeijer	0.001546

[82]:

g_immediate_community2.plot()

[82]:

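Often we are just filtering on a single source node / edge / destination node triple, so hop() is a shorthand for that. All hop() parameters can also be passed to edge expressions.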

[83]:

g_community2 = g2.hop(source_node_match={'community_infomap': 2}, destination_node_match={'community_infomap': 2})

print(len(g_community2._nodes), 'senators', len(g_community2._edges), 'relns')
g_community2._edges.sort_values(by=['weight2']).head(10)

214 senators 4993 relns

[83]:

	from	to	weight	weight2
378	RepDonBeyer	RepSpeier	0.000658	0.000658
354	RepDonBeyer	repcleaver	0.000658	0.000658
353	RepDonBeyer	RepYvetteClarke	0.000658	0.000658
352	RepDonBeyer	RepCasten	0.000658	0.000658
349	RepDonBeyer	RepBeatty	0.000658	0.000658
360	RepDonBeyer	RepGaramendi	0.000658	0.000658
361	RepDonBeyer	RepChuyGarcia	0.000658	0.000658
362	RepDonBeyer	RepRaulGrijalva	0.000658	0.000658
365	RepDonBeyer	USRepKeating	0.000658	0.000658
366	RepDonBeyer	RepRickLarsen	0.000658	0.000658

[86]:

g_community2.encode_point_color('pagerank', ['blue', 'yellow', 'red'], as_continuous=True).plot()

[86]:

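4. Multi-hop and paths-between-nodes pattern mining#

The chain([...]) method can be used to look multiple hops out, and even to find paths between nodes.

Example: everyone who connects SenSchumer and SpeakerPelosi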

[94]:

g_shumer_pelosi_bridges = g2.chain([
    n({'title': 'SenSchumer'}),
    e_undirected(),
    n(),
    e_undirected(),
    n({'title': 'SpeakerPelosi'})
])

print(len(g_shumer_pelosi_bridges._nodes), 'senators')
g_shumer_pelosi_bridges._edges.sort_values(by='weight').head(5)

66 senators

[94]:

	from	to	weight	weight2
86	RepJayapal	SpeakerPelosi	0.000871	0.000871
47	SenSchumer	RepMeijer	0.001546	0.001546
23	SenSchumer	RepBuddyCarter	0.001546	0.001546
24	SenSchumer	RepJudyChu	0.001546	0.001546
26	SenSchumer	repcleaver	0.001546	0.001546
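For intuition, the middle nodes of an undirected two-hop pattern between two fixed endpoints are the nodes adjacent to both endpoints. The sketch below computes that with plain pandas on toy data; the helper `neighbors` and the node names are hypothetical, and it is not the library's algorithm.

```python
import pandas as pd

# Hypothetical toy edge list; 's' and 'p' stand in for the two endpoints.
edges = pd.DataFrame({
    'from': ['s', 'x', 's', 'p', 'z'],
    'to':   ['x', 'p', 'y', 'y', 's'],
})

def neighbors(edges: pd.DataFrame, node: str) -> set:
    """Undirected neighbors: other endpoint of every edge touching `node`."""
    out_nbrs = edges.loc[edges['from'] == node, 'to']
    in_nbrs = edges.loc[edges['to'] == node, 'from']
    return set(out_nbrs) | set(in_nbrs)

# Nodes adjacent to both endpoints sit in the middle of some s - ? - p path.
bridges = neighbors(edges, 's') & neighbors(edges, 'p')
print(sorted(bridges))  # ['x', 'y']
```

Node `z` touches `s` but not `p`, so it is excluded, mirroring how the chain result only keeps nodes that complete the full pattern.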

[92]:

g_shumer_pelosi_bridges.plot()

[92]:

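5. Advanced filter predicates#

We can filter nodes and edges with a variety of predicates, beyond equality on attribute values.

Common tasks include comparing attributes using:

Set inclusion: is_in([...])
Numeric comparison: gt(...), lt(...), ge(...), le(...)
String comparison: startswith(...), endswith(...), contains(...)
Regular expression matching: match(...), imported above as match_re
Duplicate check: duplicated()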
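Each of these predicates has a familiar column-wise counterpart in pandas. The sketch below lists plain pandas expressions that play the same role on a toy node table. It illustrates the semantics one would expect and is not PyGraphistry's implementation; the titles and PageRank values are made up.

```python
import pandas as pd

# Hypothetical node table.
nodes = pd.DataFrame({
    'title': ['SenAlpha', 'RepBeta', 'LeaderGamma', 'RepDelta'],
    'pagerank': [0.0013, 0.0060, 0.0059, 0.0020],
})

# Predicate name -> equivalent pandas boolean mask.
masks = {
    'is_in':      nodes['title'].isin(['SenAlpha', 'RepDelta']),
    'ge':         nodes['pagerank'] >= 0.0059,
    'startswith': nodes['title'].str.startswith('Rep'),
    'contains':   nodes['title'].str.contains('Leader'),
    'match':      nodes['title'].str.match(r'Sen|Leader'),  # anchored at string start
    'duplicated': nodes['title'].duplicated(),
}

for name, mask in masks.items():
    print(name, nodes.loc[mask, 'title'].tolist())
```

Note that pandas' `str.match` only matches at the start of the string, whereas `str.contains` searches anywhere in it.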

Graph where nodes are in the top 20 by PageRank:

[134]:

top_20_pr = g2._nodes.pagerank.sort_values(ascending=False, ignore_index=True)[19]
top_20_pr

[134]:

0.005888600097034367
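The index `[19]` above picks the 20th-largest value, which then acts as an inclusive cutoff for `ge(...)`. The toy sketch below shows the same pattern with a hypothetical `k = 3` and made-up scores, and also shows that ties at the cutoff can let more than `k` rows through.

```python
import pandas as pd

# Made-up scores with a tie at the cutoff value.
scores = pd.Series([0.5, 0.1, 0.4, 0.3, 0.3])
k = 3

# k-th largest value: sort descending, reset the index, take position k - 1.
threshold = scores.sort_values(ascending=False, ignore_index=True)[k - 1]

# Inclusive filter, analogous to ge(threshold).
top = scores[scores >= threshold]

print(threshold)  # 0.3
print(len(top))   # 4, because two rows tie at the cutoff
```

If exactly `k` rows are required, `scores.nlargest(k)` returns that many and breaks ties by position.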

[128]:

g_high_pr = g2.chain([
    n({'pagerank': ge(top_20_pr)}),
    e_undirected(),
    n({'pagerank': ge(top_20_pr)}),
])

len(g_high_pr._nodes)

[128]:

[129]:

g_high_pr.plot()

[129]:

Graph where the name includes Leader

[136]:

g_leaders = g2.hop(
    source_node_match={'title': contains('Leader')},
    destination_node_match = {'title': contains('Leader')}
)

print(len(g_leaders._nodes), 'leaders')

g_leaders.plot()

2 leaders

[136]:

Graph of leaders and senators

[139]:

g_leaders_and_senators = g2.hop(
    source_node_match={'title': match_re(r'Sen|Leader')},
    destination_node_match = {'title': match_re(r'Sen|Leader')}
)

print(len(g_leaders_and_senators._nodes), 'leaders and senators')

g_leaders_and_senators.plot()

67 leaders and senators

[139]:

6. Result labeling#

Naming nodes and edges in a path query can be useful for downstream reasoning:

[156]:

g_bridges2 = g2.chain([
    n({'title': 'SenSchumer'}),
    e_undirected(name='from_schumer'),
    n(name='found_bridge'),
    e_undirected(name='from_pelosi'),
    n({'title': 'SpeakerPelosi'})
])

print(len(g_bridges2._nodes), 'senators in full graph')

named = g_bridges2._nodes[ g_bridges2._nodes.found_bridge ]
print(len(named), 'bridging senators')
edges = g_bridges2._edges
print(len(edges[edges.from_schumer]), 'relns from_schumer', len(edges[edges.from_pelosi]), 'relns from_pelosi')

g_bridges2.encode_point_color(
    'found_bridge',
    as_categorical=True,
    categorical_mapping={
        True: 'orange',
        False: 'silver'
    }
).plot()

66 senators in full graph
64 bridging senators
75 relns from_schumer 83 relns from_pelosi
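Because each `name=` shows up as a boolean column on the result tables (the `found_bridge`, `from_schumer`, and `from_pelosi` filters above rely on this), downstream analysis is ordinary dataframe work. The sketch below uses a hand-built stand-in for a result node table; the rows and values are made up.

```python
import pandas as pd

# Stand-in for a chain() result's node table with a named-step flag column.
result_nodes = pd.DataFrame({
    'id': ['s', 'x', 'y', 'p'],
    'pagerank': [0.0013, 0.0020, 0.0040, 0.0060],
    'found_bridge': [False, True, True, False],
})

# Select the labeled nodes and rank them by another attribute.
bridges = (result_nodes[result_nodes['found_bridge']]
           .sort_values('pagerank', ascending=False))
print(bridges['id'].tolist())  # ['y', 'x']

# Boolean flags also sum directly into counts.
print(int(result_nodes['found_bridge'].sum()))  # 2
```

The two endpoints carry `False` in this stand-in, in line with the output above where 64 of the 66 nodes are flagged as bridges.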

[156]:

[ ]:
