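Databricks <> Graphistry Tutorial: Notebooks & Dashboards on IoT data#
This tutorial visualizes a set of sensors by clustering them on latitude/longitude and overlaying summary statistics.
We show how to load the interactive plots both in Databricks notebooks and in dashboard mode. The same general flow should also work in other PySpark environments.
Steps:
Install Graphistry
Prepare IoT data
Plot in a notebook
Plot in a dashboard
Plot as a shareable URL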
Install & authenticate with the Graphistry server#
[ ]:
# Uncomment and run first time or
# have databricks admin install graphistry python library:
# https://docs.databricks.com/en/libraries/package-repositories.html#pypi-package
#%pip install graphistry
[ ]:
# Required to run after pip install to pick up new python package:
dbutils.library.restartPython()
[ ]:
import graphistry # if not yet available, install pygraphistry and/or restart Python kernel using the cells above
graphistry.__version__
Use Databricks secrets to retrieve Graphistry credentials and pass them to register#
[ ]:
# As a best practice, use databricks secrets to store graphistry personal key (access token)
# create databricks secrets: https://docs.databricks.com/en/security/secrets/index.html
# create graphistry personal key: https://hub.graphistry.com/account/tokens
graphistry.register(
    api=3,
    personal_key_id=dbutils.secrets.get(scope="my-secret-scope", key="graphistry-personal_key_id"),
    personal_key_secret=dbutils.secrets.get(scope="my-secret-scope", key="graphistry-personal_key_secret"),
    protocol='https',
    server='hub.graphistry.com')
# Alternatively, use username and password:
# graphistry.register(api=3, username='...', password='...', protocol='https', server='hub.graphistry.com')
# For more options, see https://github.com/graphistry/pygraphistry#configure
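Prepare IoT data#
Sample data provided by Databricks
We create tables for different plots:
Raw table of device sensor reads
Summarized table:
rounded latitude/longitude
min/max/avg of battery_level, c02_level, humidity, and timestamp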
[ ]:
# Load the data from its source.
devices = spark.read \
    .format('json') \
    .load('/databricks-datasets/iot/iot_devices.json')
# Show the results.
print('type: ', str(type(devices)))
display(devices.take(10))
[ ]:
from pyspark.sql import functions as F
from pyspark.sql.functions import concat_ws, col, round
devices_with_rounded_locations = (
    devices
    .withColumn(
        'location_rounded1',
        concat_ws(
            '_',
            round(col('latitude'), 0).cast('integer'),
            round(col('longitude'), 0).cast('integer')))
    .withColumn(
        'location_rounded2',
        concat_ws(
            '_',
            round(col('latitude'), -1).cast('integer'),
            round(col('longitude'), -1).cast('integer')))
)
cols = ['battery_level', 'c02_level', 'humidity', 'timestamp']
id_cols = ['cca2', 'cca3', 'cn', 'device_name', 'ip', 'location_rounded1', 'location_rounded2']
# Aggregate per device; the loop variable is named `c` so it does not
# shadow the imported `col` function.
devices_summarized = (
    devices_with_rounded_locations.groupby('device_id').agg(
        *[F.min(c) for c in cols],
        *[F.max(c) for c in cols],
        *[F.avg(c) for c in cols],
        *[F.first(c) for c in id_cols]
    )
)
# [(from1, to1), ...]
renames = (
    [('device_id', 'device_id')]
    + [(f'first({c})', f'{c}') for c in id_cols]
    + [(f'min({c})', f'{c}_min') for c in cols]
    + [(f'max({c})', f'{c}_max') for c in cols]
    + [(f'avg({c})', f'{c}_avg') for c in cols]
)
# Replace the auto-generated aggregate column names with the readable ones.
devices_summarized = devices_summarized.select(
    [F.col(old).alias(new) for old, new in renames]
)
display(devices_summarized.take(10))
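To make the two location keys above concrete, here is a minimal pure-Python sketch of the same idea, runnable without Spark. The helper names `round_half_up` and `location_key` are hypothetical and not part of the notebook. The sketch assumes Spark's `round` rounds half up (Python's built-in `round` rounds half to even, so `decimal` is used instead) and that a negative scale such as `-1` rounds to the nearest ten.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, digits):
    """Round half up to `digits` decimal places and return an int.

    digits=0 rounds to whole degrees, digits=-1 rounds to tens of degrees.
    """
    quantum = Decimal(1).scaleb(-digits)  # digits=0 -> 1, digits=-1 -> 1E+1
    return int(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))


def location_key(latitude, longitude, digits):
    """Build a 'lat_lon' cluster key, analogous to location_rounded1/2."""
    return f"{round_half_up(latitude, digits)}_{round_half_up(longitude, digits)}"


fine_key = location_key(37.5, -122.4, 0)     # analogous to location_rounded1
coarse_key = location_key(37.5, -122.4, -1)  # analogous to location_rounded2
print(fine_key, coarse_key)
```

Devices that share a key land on the same location node in the plots, so the coarser key groups sensors over a wider area than the finer one.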
Notebook plotting#
Simple: Graph connections between device_name and cca3 (country code)
Advanced: Graph multiple connections, like ip -> device_name and location_rounded1 -> ip
[ ]:
(
    graphistry
    .edges(devices.sample(fraction=0.1).toPandas(), 'device_name', 'cca3')
    .settings(url_params={'strongGravity': 'true'})
    .plot()
)
[ ]:
hg = graphistry.hypergraph(
    devices_with_rounded_locations.sample(fraction=0.1).toPandas(),
    ['ip', 'device_name', 'location_rounded1', 'location_rounded2', 'cca3'],
    direct=True,
    opts={
        'EDGES': {
            'ip': ['device_name'],
            'location_rounded1': ['ip'],
            'location_rounded2': ['ip'],
            'cca3': ['location_rounded2']
        }
    })
g = hg['graph']
g = g.settings(url_params={'strongGravity': 'true'})  # this setting is great!
g.plot()
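The `EDGES` option in the hypergraph cell above maps each source column to the columns it should link to. As a rough illustration of that mapping, not PyGraphistry's actual implementation, the pure-Python sketch below expands rows into direct edges. The function name `direct_edges` and the `column::value` node ID format are assumptions made for this example.

```python
def direct_edges(rows, edge_spec):
    """Expand rows into (source, destination) node ID pairs.

    rows: list of dicts, one per record.
    edge_spec: dict mapping a source column to a list of destination columns,
        shaped like the 'EDGES' option in the cell above.
    """
    edges = []
    for row in rows:
        for src_col, dst_cols in edge_spec.items():
            for dst_col in dst_cols:
                # Prefix values with their column name so equal values from
                # different columns stay distinct nodes (illustrative format).
                edges.append((f"{src_col}::{row[src_col]}",
                              f"{dst_col}::{row[dst_col]}"))
    return edges


sample_rows = [
    {"ip": "1.2.3.4", "device_name": "meter-1", "location_rounded1": "38_-122"},
]
sample_spec = {"ip": ["device_name"], "location_rounded1": ["ip"]}
sample_edges = direct_edges(sample_rows, sample_spec)
print(sample_edges)
```

Each row contributes one edge per source/destination column pair, which is why adding columns to `EDGES` grows the graph quickly and why the notebook samples the data first.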
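Dashboard plotting#
Make a graphistry object as usual...
... then disable the splash screen and optionally set custom dimensions.
The visualization will now load without needing any interaction in the dashboard (View -> + New Dashboard).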
[ ]:
(
    g
    .settings(url_params={'splashAfter': 'false'})  # extends existing setting
    .plot(override_html_style="""
        border: 1px #DDD dotted;
        width: 50em; height: 50em;
    """)
)
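The comment in the cell above notes that the new `url_params` extend the earlier `strongGravity` setting instead of replacing it. The sketch below illustrates that merge and how such parameters could end up as a URL query string. It is a stand-alone illustration under stated assumptions: `merge_url_params` and `viz_url` are hypothetical helpers, and the base URL and dataset ID are placeholders, not values returned by a Graphistry server.

```python
from urllib.parse import urlencode


def merge_url_params(existing, extra):
    """Return a new dict in which `extra` extends (and may override) `existing`."""
    merged = dict(existing)
    merged.update(extra)
    return merged


def viz_url(base_url, dataset_id, url_params):
    """Append the dataset ID and URL parameters as a query string."""
    query = urlencode({"dataset": dataset_id, **url_params})
    return f"{base_url}?{query}"


params = merge_url_params({"strongGravity": "true"}, {"splashAfter": "false"})
url = viz_url(
    "https://hub.graphistry.com/graph/graph.html",  # placeholder base URL
    "HYPOTHETICAL_DATASET_ID",                      # placeholder dataset ID
    params,
)
print(url)
```

Keeping both settings matters in dashboards: `strongGravity` still shapes the layout, while `splashAfter` set to `false` lets the plot load without a click.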