Graphistry Neptune Gremlin Identity Graph Demo

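PyGraphistry helps you connect to graph data sources, wrangle them with Python dataframe tools, and visualize them with Graphistry. It is commonly used in notebooks, data apps, and dashboards.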

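This notebook uses PyGraphistry to quickly:

  • Connect to Neptune

  • Run Gremlin queries through the built-in bindings on top of gremlin_python

  • Convert results to dataframes for wrangling: on CPU with Pandas, on GPU with RAPIDS cuDF

  • Visualize: automatically generate a rich, interactive, GPU-accelerated Graphistry visualization session

  • Share and embed your results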

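For any API used below, run help(graphistry.the_method) to quickly view its documentation.

The demo uses the AWS Neptune identity graph data sample from our joint graph-app-kit tutorial. If you have your own dataset, including non-identity data, the sample queries should still work.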

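Setup

Optional: quick launch via graph-app-kit for Neptune:

  • Neptune: tested on Neptune's identity graph database sample kit; you can swap in your own

  • Graphistry: use your own, get a free Hub account, or launch in AWS within Neptune's VPC and a public subnet

  • Notebook: use your own, or launch in AWS within Neptune's VPC and a public subnet

If you hit gremlinpython event loop runtime errors, try this gist to resolve them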

Install

Already provided in Graphistry environments

[1]:
# ! pip install -U gremlinpython graphistry
# ! pip install -U pandas
# see https://rapids.ai/ if trying GPU dataframes

Imports

[2]:
! pip show gremlinpython graphistry | grep 'Name\|Version'
Name: gremlinpython
Version: 3.4.10
Name: graphistry
Version: 0.19.0+5.g5ce1d3fb0
[3]:
import graphistry
graphistry.__version__
[3]:
'0.19.0+5.g5ce1d3fb0'

Configure

[4]:
# To specify Graphistry account & server, use:
# graphistry.register(api=3, username='...', password='...', protocol='https', server='hub.graphistry.com')
# For more options, see https://github.com/graphistry/pygraphistry#configure
[29]:
NEPTUNE_READER_PROTOCOL='wss'
NEPTUNE_READER_HOST='neptunedbcluster-abc.cluster-ro-xyz.us-east-1.neptune.amazonaws.com'
NEPTUNE_READER_PORT='8182'

endpoint = f'{NEPTUNE_READER_PROTOCOL}://{NEPTUNE_READER_HOST}:{NEPTUNE_READER_PORT}/gremlin'
endpoint
[29]:
'wss://neptunedbcluster-abc.cluster-ro-xyz.us-east-1.neptune.amazonaws.com:8182/gremlin'
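
The endpoint string above is assembled by hand, so a small check before connecting can catch a wrong protocol or port early. The following is a minimal sketch using only the standard library. The helper name `build_neptune_endpoint` is hypothetical and not part of PyGraphistry, and the host is the placeholder from the cell above.

```python
from urllib.parse import urlsplit


def build_neptune_endpoint(protocol: str, host: str, port: str) -> str:
    """Hypothetical helper: assemble and sanity-check a Gremlin websocket URL."""
    if protocol not in ('ws', 'wss'):
        raise ValueError(f'expected ws or wss, got {protocol!r}')
    if not port.isdigit():
        raise ValueError(f'port must be numeric, got {port!r}')
    url = f'{protocol}://{host}:{port}/gremlin'
    # Round-trip through urlsplit to confirm the pieces parse back as intended
    parts = urlsplit(url)
    if parts.hostname != host.lower() or parts.port != int(port):
        raise ValueError(f'host or port did not parse cleanly from {url!r}')
    return url


endpoint = build_neptune_endpoint(
    'wss',
    'neptunedbcluster-abc.cluster-ro-xyz.us-east-1.neptune.amazonaws.com',
    '8182',
)
print(endpoint)
```

The resulting string has the same shape as the hand-built one, so it can be passed on as the endpoint in the connection step.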
[6]:
#import logging
#logging.basicConfig(level=logging.DEBUG)

Connect

[7]:
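# GRAPHISTRY_CFG is not defined in the cells shown; set it to a dict of register() options first, e.g.:
# GRAPHISTRY_CFG = dict(api=3, username='...', password='...', protocol='https', server='hub.graphistry.com')
graphistry.register(**GRAPHISTRY_CFG)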

g = graphistry.neptune(endpoint=endpoint)

g._gremlin_client
[7]:
<gremlin_python.driver.client.Client at 0x7fdfc230e3d0>

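Query & plot

  • PyGraphistry automatically converts Gremlin results into node/edge dataframes

  • Edge queries typically return only node IDs; call fetch_nodes() to enrich your g._nodes dataframe

  • PyGraphistry plots dataframes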
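
To see why an edge query yields only bare node references, it helps to picture how a node table can be derived from an edge list: collect every distinct `src` and `dst` value. The plain-Python sketch below illustrates the idea. It is not PyGraphistry's internal implementation, and the sample edges and the helper `nodes_from_edges` are made up for illustration.

```python
def nodes_from_edges(edges):
    """Hypothetical helper: list distinct endpoint IDs in first-seen order."""
    seen = {}
    for edge in edges:
        for endpoint_id in (edge['src'], edge['dst']):
            # dict preserves insertion order, so duplicates are dropped stably
            seen.setdefault(endpoint_id, {'id': endpoint_id})
    return list(seen.values())


# Made-up edges shaped like the id/label/src/dst columns of an edge result
edges = [
    {'id': 'e1', 'label': 'visited', 'src': 'a', 'dst': 'site1'},
    {'id': 'e2', 'label': 'visited', 'src': 'b', 'dst': 'site1'},
    {'id': 'e3', 'label': 'visited', 'src': 'a', 'dst': 'site2'},
]

nodes = nodes_from_edges(edges)
print(nodes)
```

Three edges over four distinct endpoints give four node rows, which is why the node count in a run is usually smaller than twice the edge count. Any further node properties have to come from a follow-up lookup, which is the role fetch_nodes() plays here.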

[25]:
%%time

g2 = g.gremlin('g.E().limit(10000)')

CPU times: user 4.96 s, sys: 27.9 ms, total: 4.99 s
Wall time: 4.95 s
[26]:
print('NODES:')
g2._nodes.info()
g2._nodes.sample(3)
NODES:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8106 entries, 0 to 8105
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      8106 non-null   object
 1   label   8106 non-null   object
dtypes: object(2)
memory usage: 126.8+ KB
[26]:
      id                                                 label
4102  ed95a9a5be30e4c8/e212d4b4d4a865a/7e3e41e09dfe6...  website
6496  6ea77fc3ea42bd5b/87be29bd5615083/d4392e74543e413   website
7540  4c980617e02858a4/7de2f069da3a3655/30591f4d8c71...  website
[27]:
print('EDGES:')
print(g2._edges.info())

g2._edges.sample(3)
EDGES:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      10000 non-null  object
 1   label   10000 non-null  object
 2   src     10000 non-null  object
 3   dst     10000 non-null  object
dtypes: object(4)
memory usage: 312.6+ KB
None
[27]:
      id                                        label    src                                                dst
2814  f7803bf0ac187592421c0695792b698f43b596ce  visited  556de63e26686d50/95263499b67bbda1?f300c39f4f33...  48e740025e70e4e38dc87928cd45357c
8081  fe80cddfec97a7dd802cf93cf277da01d9b5fb65  visited  3ccec85ce35ea661?fa76e6024017220f                  23c31ea91be100fd224dff1499939851
2046  4e5290971de41c1e1bcb7433e53ffc6321e410cf  visited  6ea77fc3ea42bd5b/9c280de73bf0fb32/bb555a4d63de...  9e77c2a52fdf9f9b7416e85cabaf7c76
[28]:
%%time

# Enrich nodes dataframe with any available server property data

g3 = g2.fetch_nodes()

print(g3._nodes.info())

g3._nodes.sample(3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8106 entries, 0 to 8105
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      8106 non-null   object
 1   label   8106 non-null   object
dtypes: object(2)
memory usage: 126.8+ KB
None
CPU times: user 4.32 s, sys: 43.9 ms, total: 4.37 s
Wall time: 4.33 s
[28]:
      id                                                 label
1242  4c980617e02858a4/7de2f069da3a3655/30591f4d8c71...  website
3190  493a46bbfd2029ae/4a0cad2f071a71ce/f9ba18598922...  website
6782  4c980617e02858a4/7de2f069da3a3655/30591f4d8c71...  website
[19]:
%%time

g3.plot()
CPU times: user 59.8 ms, sys: 4 ms, total: 63.8 ms
Wall time: 1.68 s
[19]:

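Customize your visuals & embed

Graphistry visualizes data with smart defaults: community-based coloring, degree-based sizing, force-directed layout, automatic zoom, and built-in visual analytics. It often helps, however, to configure the visuals ahead of time.

Examples:

  • Enable the legend on a new column "type"

  • Color nodes by the node column "type"

  • Pick icons by node type

  • Set the background color to match the notebook

  • Use a more compact layout

For more examples, see the PyGraphistry GitHub repo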

[24]:
%%time

g4 = (g3

      # Add node column 'type' based on gremlin-provided column 'label'
      # The legend auto-detects this column and appears
      .nodes(lambda g: g._nodes.assign(type=g._nodes['label']))

      .encode_point_color('type', categorical_mapping={
          'website': 'blue',
          'transientId': 'green'
      })

      .encode_point_icon('type', categorical_mapping={
          'website': 'link',
          'transientId': 'barcode'
      })

      .addStyle(bg={'color': '#eee'}, page={'title': 'My Graph'})

      # More: https://hub.graphistry.com/docs/api/1/rest/url/
      .settings(url_params={'play': 2000})
)

g4.plot()
CPU times: user 63.5 ms, sys: 3.88 ms, total: 67.3 ms
Wall time: 1.62 s
[24]:
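
Conceptually, a categorical mapping is a lookup from a column value to a visual attribute, with some fallback for values that are not listed. The sketch below mimics that lookup in plain Python. The helper `apply_categorical_mapping`, the `'other'` label, and the `'gray'` fallback are assumptions for illustration, not PyGraphistry's implementation or its actual default.

```python
def apply_categorical_mapping(values, mapping, default):
    """Hypothetical helper: look up each value, falling back to `default`."""
    return [mapping.get(value, default) for value in values]


# Same color mapping as in the styling cell; 'other' stands in for any
# label that the mapping does not list
color_mapping = {'website': 'blue', 'transientId': 'green'}
labels = ['website', 'transientId', 'other']

colors = apply_categorical_mapping(labels, color_mapping, default='gray')
print(colors)
```

Thinking of the encoding this way makes it easier to decide which labels deserve an explicit entry and which can share a fallback.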

Generate URLs for other systems

[23]:
%%time

url = g4.plot(render=False)

url
CPU times: user 64.8 ms, sys: 0 ns, total: 64.8 ms
Wall time: 1.67 s
[23]:
'https://hub.graphistry.com/graph/graph.html?dataset=7405d0ac396a47ea9ee84acab7b0b31d&type=arrow&viztoken=c5e68946-e922-487e-9484-ef8fc9e2c8f9&usertag=5bf3845f-pygraphistry-0.19.0+5.g5ce1d3fb0&splashAfter=1625879227&info=true&strongGravity=False&play=2000'
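
One way to use such a URL in another system is to wrap it in an HTML iframe, optionally adjusting query parameters first. The sketch below uses only the standard library. The helpers `with_url_params` and `iframe_html`, the iframe dimensions, the `play=0` override, and the shortened example URL are all assumptions for illustration.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_url_params(url, **overrides):
    """Hypothetical helper: return `url` with query parameters added or replaced."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update({key: str(value) for key, value in overrides.items()})
    return urlunsplit(parts._replace(query=urlencode(params)))


def iframe_html(url, width='100%', height='600'):
    """Hypothetical helper: wrap a visualization URL in an iframe tag."""
    # escape() keeps the & separators valid inside the HTML attribute
    src = escape(url, quote=True)
    return f'<iframe src="{src}" width="{width}" height="{height}"></iframe>'


# Shortened stand-in for the URL returned by plot(render=False)
url = 'https://hub.graphistry.com/graph/graph.html?dataset=abc123&play=2000'

# play=0 is only an example override of an existing parameter
embed = iframe_html(with_url_params(url, play=0))
print(embed)
```

The resulting string can be pasted into any page or dashboard that accepts raw HTML, provided the viewer is allowed to access the visualization.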

Next steps
