Visual GPU Log Analytics Part I: CPU Baseline in Python Pandas


Graphistry is great; Graphistry with RAPIDS/BlazingDB is better!

This tutorial series visually analyzes Zeek/Bro network connection logs using different compute engines:

Part I: CPU Baseline in Python Pandas
Part II: GPU Dataframe with RAPIDS Python cudf bindings
Part III: GPU SQL - deprecated, as Dask-SQL replaced BlazingSQL in the RAPIDS ecosystem
Part IV: GPU ML with RAPIDS cuML UMAP and PyGraphistry
Graphistry cuGraph bindings

Part I contents:

Time the full ETL and visual analysis flow on CPU-based Python Pandas and Graphistry:

Load data
Analyze data
Visualize data
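The notebook cells below rely on the Jupyter `%%time` magic. Outside Jupyter, the same three stages could be timed with a small helper like the following. This is a minimal sketch: the `timed` helper and `timings` dict are hypothetical names, and the stage bodies are trivial stand-ins for the real load, analyze, and visualize steps.

```python
import time
from contextlib import contextmanager

# Hypothetical helper, not part of the original notebook:
# maps a stage name to its elapsed wall-clock seconds.
timings = {}

@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

with timed("load"):
    data = list(range(100_000))   # stand-in for pd.read_csv(...)
with timed("analyze"):
    total = sum(data)             # stand-in for the groupby/agg step
with timed("visualize"):
    label = str(total)            # stand-in for hg['graph'].plot()

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.4f}s")
```

Keeping one entry per stage makes it easy to compare this CPU baseline against the GPU runs in the later parts.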

[ ]:

#!pip install graphistry -q

import pandas as pd

import graphistry
graphistry.__version__

# To specify Graphistry account & server, use:
# graphistry.register(api=3, username='...', password='...', protocol='https', server='hub.graphistry.com')

# For more options, see https://github.com/graphistry/pygraphistry#configure

1. Load data

[ ]:

%%time
# download data
#!if [ ! -f conn.log ]; then \
#    curl https://www.secrepo.com/maccdc2012/conn.log.gz | gzip -d > conn.log; \
#fi

[ ]:

#!head -n 3 conn.log

[ ]:

# OPTIONAL: For slow or limited devices, work on a subset:
LIMIT = 1200000

[ ]:

%%time
df = pd.read_csv("./conn.log", sep="\t", header=None,
                 names=["time", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto", "service",
                        "duration", "orig_bytes", "resp_bytes", "conn_state", "local_orig", "missed_bytes",
                        "history", "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes", "tunnel_parents"],
                 na_values=['-'], index_col=False, nrows=LIMIT)
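To see what these `read_csv` options do without downloading the full log, the same call can be run on a few in-memory rows. This is a sketch only: the three rows below are made up for illustration (they are not taken from the dataset), and `COLUMNS` is just a local name for the column list used above.

```python
import io
import pandas as pd

COLUMNS = ["time", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto", "service",
           "duration", "orig_bytes", "resp_bytes", "conn_state", "local_orig", "missed_bytes",
           "history", "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes", "tunnel_parents"]

# Made-up rows in the 20-column conn.log layout; "-" marks an unset field.
rows = [
    ["1331901000.00", "CaaaaA1", "192.168.202.79", "50463", "192.168.229.251", "80", "tcp", "http",
     "0.01", "133", "1056", "SF", "-", "0", "ShADadfF", "5", "401", "4", "1272", "(empty)"],
    ["1331901000.01", "CbbbbB2", "192.168.202.79", "46117", "192.168.229.254", "443", "tcp", "-",
     "-", "-", "-", "S0", "-", "0", "S", "1", "60", "0", "0", "(empty)"],
    ["1331901000.02", "CccccC3", "192.168.202.80", "50465", "192.168.229.251", "80", "tcp", "http",
     "0.02", "10", "20", "SF", "-", "0", "ShADadfF", "3", "150", "2", "120", "(empty)"],
]
text = "\n".join("\t".join(r) for r in rows) + "\n"

# Same options as the notebook call, with nrows=2 to show the row limit in action.
sample_df = pd.read_csv(io.StringIO(text), sep="\t", header=None, names=COLUMNS,
                        na_values=["-"], index_col=False, nrows=2)

print(sample_df[["duration", "orig_bytes", "resp_bytes", "service"]])
```

The `-` placeholders become `NaN`, so byte and duration columns that contain any unset value are read as floats rather than integers.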

[ ]:

df.sample(3)

2. Analyze data

Summarize network activity between every communicating source/destination IP pair, split by connection state

[ ]:

df_summary = df\
.assign(
    sum_bytes=df.apply(lambda row: row['orig_bytes'] + row['resp_bytes'], axis=1))\
.groupby(['id.orig_h', 'id.resp_h', 'conn_state'])\
.agg({
    'time': ['min', 'max', 'size'],
    'id.resp_p':  ['nunique'],
    'uid': ['nunique'],
    'duration':   ['min', 'max', 'mean'],
    'orig_bytes': ['min', 'max', 'sum', 'mean'],
    'resp_bytes': ['min', 'max', 'sum', 'mean'],
    'sum_bytes':  ['min', 'max', 'sum', 'mean']
}).reset_index()
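The row-wise `df.apply(..., axis=1)` used for `sum_bytes` calls a Python function once per row. A column-wise addition gives the same values and is usually much faster in Pandas. The sketch below checks the equivalence on a made-up three-row frame; it is offered as an alternative to consider, not as the notebook's method, and switching to it would change the baseline timings.

```python
import numpy as np
import pandas as pd

# Made-up values, including missing ones as produced by na_values=['-'].
toy = pd.DataFrame({
    "orig_bytes": [100.0, np.nan, 5.0],
    "resp_bytes": [200.0, 50.0, np.nan],
})

# Row-wise version, as in the cell above.
row_wise = toy.apply(lambda row: row["orig_bytes"] + row["resp_bytes"], axis=1)

# Column-wise version: one vectorized addition over whole columns.
vectorized = toy["orig_bytes"] + toy["resp_bytes"]

# Both propagate NaN when either side is missing.
print(row_wise.equals(vectorized))

# Optional variant (an assumption about intent): treat missing byte counts as 0.
filled = toy["orig_bytes"].fillna(0) + toy["resp_bytes"].fillna(0)
print(filled.tolist())
```

Note that with either form, a connection with a missing `orig_bytes` or `resp_bytes` gets a missing `sum_bytes`, which the later `min`/`max`/`sum`/`mean` aggregations skip.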

[ ]:

df_summary.columns = [' '.join(col).strip() for col in df_summary.columns.values]
df_summary = df_summary\
.rename(columns={'time size': 'count'})\
.assign(
    conn_state_uid=df_summary.apply(lambda row: row['id.orig_h'] + '_' + row['id.resp_h'] + '_' + row['conn_state'], axis=1))
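The cell above flattens the two-level column index produced by `.agg({...})` into single strings such as `time min`, then builds a per-edge identifier. The following sketch replays those steps on a made-up four-column frame so the intermediate column names are easy to inspect; the string concatenation is written column-wise here, which is an alternative to the row-wise `apply`.

```python
import pandas as pd

# Made-up connections: two between the same pair and state, one other.
toy = pd.DataFrame({
    "id.orig_h": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "id.resp_h": ["10.0.0.9", "10.0.0.9", "10.0.0.9"],
    "conn_state": ["SF", "SF", "S0"],
    "time": [1.0, 2.0, 3.0],
})

summary = (
    toy.groupby(["id.orig_h", "id.resp_h", "conn_state"])
       .agg({"time": ["min", "max", "size"]})
       .reset_index()
)

# Columns are now tuples like ('id.orig_h', '') and ('time', 'min');
# joining with a space and stripping yields 'id.orig_h' and 'time min'.
summary.columns = [" ".join(col).strip() for col in summary.columns.values]
summary = summary.rename(columns={"time size": "count"})

# Column-wise string concatenation for the per-edge identifier.
summary["conn_state_uid"] = (
    summary["id.orig_h"] + "_" + summary["id.resp_h"] + "_" + summary["conn_state"]
)

print(summary)
```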

[ ]:

print('# rows', len(df_summary))
df_summary.sample(3)

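3. Visualize data

Nodes:
- IP addresses
- Bigger when more sessions (split by connection state) involve them
Edges:
- Source IP -> destination IP, split by connection state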
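The node and edge description above can be approximated in plain Pandas to sanity-check what the plot should show. This is a rough sketch on made-up data: each summary row is treated as one edge, and an IP's "size" is taken to be the number of edges touching it. It illustrates the idea only and is not Graphistry's internal computation.

```python
import pandas as pd

# Made-up summary rows: one row per (source IP, destination IP, connection state).
toy_summary = pd.DataFrame({
    "id.orig_h": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "id.resp_h": ["10.0.0.9", "10.0.0.9", "10.0.0.9"],
    "conn_state": ["SF", "S0", "SF"],
})

# Edges: source IP -> destination IP, one per connection state.
edges = toy_summary.rename(columns={"id.orig_h": "src", "id.resp_h": "dst"})

# Nodes: every IP seen on either end, weighted by how many edges touch it.
degree = pd.concat([edges["src"], edges["dst"]]).value_counts()

print(degree)
```

Here `10.0.0.9` is involved in all three edges, so it would be expected to appear as the largest node.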

[ ]:

hg = graphistry.hypergraph(
    df_summary,
    ['id.orig_h', 'id.resp_h'],
    direct=True,
    opts={
        'CATEGORIES': {
            'ip': ['id.orig_h', 'id.resp_h']
        }
    })

[ ]:

hg['graph'].plot()

Next Steps

Part I: CPU Baseline in Python Pandas
Part II: GPU Dataframe with RAPIDS Python cudf bindings
Part III: GPU SQL - deprecated, as Dask-SQL replaced BlazingSQL in the RAPIDS ecosystem
Part IV: GPU ML with RAPIDS cuML UMAP and PyGraphistry
Graphistry cuGraph bindings
