Databricks <> Graphistry Tutorial: Notebooks & Dashboards on IoT Data#

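This tutorial visualizes a set of sensors by clustering them on latitude/longitude and overlaying summary statistics.

We show how to load the interactive plots in both Databricks notebook and dashboard modes. The same general flow should also work in other PySpark environments.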

Steps:

  • Install Graphistry

  • Prepare IoT data

  • Plot in a notebook

  • Plot in a dashboard

  • Plot as a shareable URL

Install & authenticate with a Graphistry server#

[ ]:
# Uncomment and run first time or
#  have databricks admin install graphistry python library:
#  https://docs.databricks.com/en/libraries/package-repositories.html#pypi-package

#%pip install graphistry
[ ]:
# Required to run after pip install to pick up new python package:
dbutils.library.restartPython()
[ ]:
import graphistry  # if not yet available, install pygraphistry and/or restart Python kernel using the cells above
graphistry.__version__
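
If it is unclear whether the install cell still needs to run, a quick check along the following lines may help. This is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration and is not part of Graphistry or Databricks.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    # find_spec returns None when the top-level package cannot be located
    return importlib.util.find_spec(package_name) is not None


# Hypothetical usage: decide whether the %pip install cell is still needed
needs_install = not is_installed('graphistry')
print('run %pip install graphistry first' if needs_install else 'graphistry is available')
```
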

Use Databricks secrets to retrieve Graphistry credentials and pass them to register#

[ ]:

# As a best practice, use databricks secrets to store graphistry personal key (access token)
# create databricks secrets: https://docs.databricks.com/en/security/secrets/index.html
# create graphistry personal key: https://hub.graphistry.com/account/tokens

graphistry.register(api=3,
                    personal_key_id=dbutils.secrets.get(scope="my-secret-scope", key="graphistry-personal_key_id"),
                    personal_key_secret=dbutils.secrets.get(scope="my-secret-scope", key="graphistry-personal_key_secret"),
                    protocol='https',
                    server='hub.graphistry.com')

# Alternatively, use username and password:
# graphistry.register(api=3, username='...', password='...', protocol='https', server='hub.graphistry.com')

# For more options, see https://github.com/graphistry/pygraphistry#configure
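
One way to keep the registration cell tidy is to collect the arguments in a small helper and unpack them into `graphistry.register`. The sketch below is only an illustration: `build_register_kwargs` and the fake secret store are hypothetical names introduced here, and the fake getter stands in for `dbutils.secrets.get` so the snippet runs outside Databricks.

```python
def build_register_kwargs(get_secret, scope, server='hub.graphistry.com'):
    """Assemble the keyword arguments used by the registration cell above.

    get_secret is any callable accepting scope= and key= keywords,
    for example dbutils.secrets.get inside Databricks.
    """
    return {
        'api': 3,
        'personal_key_id': get_secret(scope=scope, key='graphistry-personal_key_id'),
        'personal_key_secret': get_secret(scope=scope, key='graphistry-personal_key_secret'),
        'protocol': 'https',
        'server': server,
    }


# Made-up stand-in for the Databricks secret store (illustration only)
fake_store = {
    ('my-secret-scope', 'graphistry-personal_key_id'): 'fake-id',
    ('my-secret-scope', 'graphistry-personal_key_secret'): 'fake-secret',
}


def fake_get(scope, key):
    return fake_store[(scope, key)]


kwargs = build_register_kwargs(fake_get, 'my-secret-scope')

# In Databricks, the equivalent call would presumably be:
# graphistry.register(**build_register_kwargs(dbutils.secrets.get, 'my-secret-scope'))
```
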

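Prepare IoT data#

Sample data provided by Databricks

We create tables for the different plots:

  • Raw table of device sensor reads

  • Summarized table:

    • rounded latitude/longitude

    • min/max/avg summaries of battery level, CO2 level, humidity, and timestamp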
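
To make the rounding step concrete, here is a plain-Python sketch of how a latitude/longitude pair could collapse into a coarse cluster key at two resolutions. It assumes HALF_UP rounding, which is what Spark's `round` uses, and joins the parts with `_` in the spirit of `concat_ws`. The helper names and the sample coordinates are made up for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, scale):
    """Round to `scale` decimal places with HALF_UP; scale may be negative."""
    quantum = Decimal(1).scaleb(-scale)  # scale=0 -> 1, scale=-1 -> 1E+1
    return int(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))


def location_key(latitude, longitude, scale):
    """Build a 'lat_lon' bucket key from rounded coordinates."""
    return f'{round_half_up(latitude, scale)}_{round_half_up(longitude, scale)}'


# Made-up coordinates, not taken from the sample dataset
key_1_degree = location_key(37.77, -122.42, 0)    # nearest whole degree
key_10_degrees = location_key(37.77, -122.42, -1)  # nearest ten degrees
print(key_1_degree, key_10_degrees)
```

Sensors that share a key end up attached to the same location node, so the coarser key yields fewer, larger clusters.
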

[ ]:
# Load the data from its source.
devices = spark.read \
  .format('json') \
  .load('/databricks-datasets/iot/iot_devices.json')

# Show the results.
print('type: ', str(type(devices)))
display(devices.take(10))
[ ]:
from pyspark.sql import functions as F
from pyspark.sql.functions import concat_ws, col, round

devices_with_rounded_locations = (
    devices
    .withColumn(
        'location_rounded1',
        concat_ws(
            '_',
            round(col('latitude'), 0).cast('integer'),
            round(col('longitude'), 0).cast('integer')))
    .withColumn(
        'location_rounded2',
        concat_ws(
            '_',
            round(col('latitude'), -1).cast('integer'),
            round(col('longitude'), -1).cast('integer')))
)

cols = ['battery_level', 'c02_level', 'humidity', 'timestamp']
id_cols = ['cca2', 'cca3', 'cn', 'device_name', 'ip', 'location_rounded1', 'location_rounded2']
devices_summarized = (
    devices_with_rounded_locations.groupby('device_id').agg(
        *[F.min(col) for col in cols],
        *[F.max(col) for col in cols],
        *[F.avg(col) for col in cols],
        *[F.first(col) for col in id_cols]
    )
)

# [(from1, to1), ...]
renames = (
    [('device_id', 'device_id')]
    + [(f'first({col})', f'{col}') for col in id_cols]
    + [(f'min({col})', f'{col}_min') for col in cols]
    + [(f'max({col})', f'{col}_max') for col in cols]
    + [(f'avg({col})', f'{col}_avg') for col in cols]
)
devices_summarized = devices_summarized.select(
    [F.col(old).alias(new) for old, new in renames]
)

display(devices_summarized.take(10))
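
For readers who want to see the shape of the summarized table without a Spark cluster, the following plain-Python sketch groups a few made-up readings by `device_id` and produces the same `{col}_min`, `{col}_max`, and `{col}_avg` column names. It is an illustration of the idea, not a replacement for the Spark cell above, and the readings are invented.

```python
from collections import defaultdict
from statistics import mean

# Made-up readings for illustration; the real data comes from iot_devices.json
readings = [
    {'device_id': 1, 'battery_level': 8, 'humidity': 51},
    {'device_id': 1, 'battery_level': 6, 'humidity': 70},
    {'device_id': 2, 'battery_level': 3, 'humidity': 44},
]
cols = ['battery_level', 'humidity']

# Group rows by device
by_device = defaultdict(list)
for reading in readings:
    by_device[reading['device_id']].append(reading)

# Compute min/max/avg per device and per metric, using the renamed column style
summary = {}
for device_id, rows in by_device.items():
    out = {'device_id': device_id}
    for c in cols:
        values = [row[c] for row in rows]
        out[f'{c}_min'] = min(values)
        out[f'{c}_max'] = max(values)
        out[f'{c}_avg'] = mean(values)
    summary[device_id] = out

print(summary[1])
```
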

Notebook plot#

  • Simple: Graph connections between device_name and cca3 (country code)

  • Advanced: Graph multiple connections, like ip -> device_name and location_rounded1 -> ip

[ ]:
(
    graphistry
        .edges(devices.sample(fraction=0.1).toPandas(), 'device_name', 'cca3')
        .settings(url_params={'strongGravity': 'true'})
        .plot()
)
[ ]:
hg = graphistry.hypergraph(
    devices_with_rounded_locations.sample(fraction=0.1).toPandas(),
    ['ip', 'device_name', 'location_rounded1', 'location_rounded2', 'cca3'],
    direct=True,
    opts={
        'EDGES': {
            'ip': ['device_name'],
            'location_rounded1': ['ip'],
            'location_rounded2': ['ip'],
            'cca3': ['location_rounded2']
        }
    })
g = hg['graph']
g = g.settings(url_params={'strongGravity': 'true'})  # this setting is great!

g.plot()
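
As a rough mental model of the `'EDGES'` option used with `direct=True`, each row contributes one edge per listed column pair, linking the values found in that row. The plain-Python sketch below illustrates that idea only; it is a simplification and not PyGraphistry's implementation, and `row_to_edges` and the sample row are made up.

```python
# Same column-pair spec as the hypergraph cell above
edge_spec = {
    'ip': ['device_name'],
    'location_rounded1': ['ip'],
    'location_rounded2': ['ip'],
    'cca3': ['location_rounded2'],
}


def row_to_edges(row, spec):
    """Return (src_col, src_val, dst_col, dst_val) tuples for one row."""
    edges = []
    for src_col, dst_cols in spec.items():
        for dst_col in dst_cols:
            edges.append((src_col, row[src_col], dst_col, row[dst_col]))
    return edges


# Made-up row for illustration
row = {
    'ip': '203.0.113.7',
    'device_name': 'sensor-pad-1',
    'location_rounded1': '38_-122',
    'location_rounded2': '40_-120',
    'cca3': 'USA',
}

edges = row_to_edges(row, edge_spec)
for edge in edges:
    print(edge)
```

Under this model, the country node connects to the coarse location, both location nodes connect to the IP, and the IP connects to the device, which produces a country -> region -> ip -> device hierarchy in the plot.
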

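Dashboard plot#

  • Make a graphistry object as usual…

  • … then disable the splash screen and optionally set custom dimensions

The visualization will now load in a dashboard without requiring any interaction (view -> + New Dashboard).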

[ ]:
(
    g
        .settings(url_params={'splashAfter': 'false'})  # extends existing setting
        .plot(override_html_style="""
            border: 1px #DDD dotted;
            width: 50em; height: 50em;
        """)
)
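
The comment "extends existing setting" suggests that the new URL parameter is merged with the ones set earlier rather than replacing them. A small sketch of that merge, and of how such parameters would look as a URL query string, is shown below. This is an assumption-based illustration using only the standard library, not PyGraphistry's internals.

```python
from urllib.parse import urlencode

# Parameters set earlier on g, and the one added for dashboard mode
existing_params = {'strongGravity': 'true'}
dashboard_params = {'splashAfter': 'false'}

# Later keys win on conflict; earlier keys are otherwise preserved
merged_params = {**existing_params, **dashboard_params}

query_string = urlencode(merged_params)
print(query_string)
```
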

Plot as a shareable URL#

[ ]:
url = g.plot(render=False)
url
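
Once the URL is available as a plain string, it can be shared directly or embedded elsewhere. The sketch below builds an iframe snippet around a URL, reusing the style from the dashboard cell. The helper `iframe_html` and the example URL are hypothetical; in a Databricks notebook the resulting string could presumably be rendered with `displayHTML`.

```python
import html


def iframe_html(url, style='border: 1px #DDD dotted; width: 50em; height: 50em;'):
    """Wrap a visualization URL in an iframe tag, escaping attribute values."""
    return '<iframe src="{}" style="{}"></iframe>'.format(
        html.escape(url, quote=True),
        html.escape(style, quote=True),
    )


# Placeholder URL for illustration; use the value returned by g.plot(render=False)
snippet = iframe_html('https://example.com/graph?a=1&b=2')
print(snippet)
```
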