How much GPU memory do you need, and how much data fits in one GPU task?

GPU memory sizing & data ratios for Parquet, Arrow, RAPIDS/cuDF, and Graphistry/GFQL

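Putting too much data on a GPU, or using a GPU with too little memory, causes problems. Whichever GPU you choose, you will probably need to partition your data so that it fits, but if the partitions are too small, you risk getting only a fraction of the available GPU speedup.

High GPU performance usually starts with solving these problems.

Once you understand a few common data ratios that show up at the basic stages of a data pipeline, staying within your GPU memory budget is actually quite simple.

Using a representative activity-log dataset, we will walk through a typical GPU ETL and analytics pipeline, starting from disk: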

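Parquet (disk, compressed): 0.1-0.5X
Arrow (CPU, in-memory): 0.2-1X
Pandas (CPU, in-memory): 1X <- baseline
cuDF (GPU, in-memory): 0.2-1X
GPU compute operations (GPU): 0.2-1X <- includes cuDF tabular queries and GFQL graph queries
Overall peak usage: 1-2X
Variants: multi-GPU, multi-node, and AI+ML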

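Before we begin, note that the ratios above already show that GPU libraries often consume only a fraction of the memory needed by popular CPU-based libraries such as Pandas: they are generally built for better performance, and the savings do not come from GPU processing alone.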
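To make the ratios concrete, here is a minimal sketch that turns a Pandas footprint into rough size ranges for the other stages. It covers the per-stage lines only, and it assumes every ratio is read relative to the Pandas baseline. The names `PLANNING_RATIOS` and `estimate_ranges_mb`, and the 10,000 MB input, are hypothetical and used only for illustration.

```python
# Planning ratios quoted above, relative to a Pandas in-memory baseline of 1X.
# Each entry is (low, high); these are rules of thumb, not guarantees.
PLANNING_RATIOS = {
    "parquet_disk": (0.1, 0.5),
    "arrow_cpu": (0.2, 1.0),
    "cudf_gpu": (0.2, 1.0),
    "gpu_compute_extra": (0.2, 1.0),
}


def estimate_ranges_mb(pandas_mb):
    """Return {stage: (low_mb, high_mb)} for a given Pandas footprint in MB."""
    return {
        stage: (round(low * pandas_mb, 2), round(high * pandas_mb, 2))
        for stage, (low, high) in PLANNING_RATIOS.items()
    }


# Hypothetical example: a table that takes 10,000 MB as a Pandas dataframe.
ranges = estimate_ranges_mb(10_000)
for stage, (low_mb, high_mb) in ranges.items():
    print(f"{stage}: {low_mb:,.0f}-{high_mb:,.0f} MB")
```

The ranges are wide on purpose: where a real table lands depends mostly on its mix of string and numeric columns.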

Stage 1: Setup and data creation

(If you are just skimming, you can skip straight to the data section.)

Install & imports

Pandas (CPU), RAPIDS cuDF (GPU), PyGraphistry

[9]:

! pip install -q graphistry

[ ]:

# For freely testing on colab.research.google.com:

# RAPIDS for Google Colab
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab Instance is not RAPIDS compatible, it will warn you and give you remediation steps.
! git clone -q https://github.com/rapidsai/rapidsai-csp-utils.git
! python rapidsai-csp-utils/colab/pip-install.py > /dev/null 2>&1

[11]:

import cudf
cudf.__version__

[11]:

'24.10.01'

[12]:

# Initialize RMM with a managed memory pool; this will automatically apply to cuDF allocations.
import cudf
import rmm
import rmm.statistics
rmm.reinitialize(pool_allocator=True, managed_memory=True)
rmm.statistics.enable_statistics()

# Initialize NVML for direct GPU memory measurement
import pynvml
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

[13]:

import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import cudf
import graphistry
import matplotlib.pyplot as plt
import os
from graphistry import e, n

Data

One million simulated network-traffic connection events. Each timestamped event is a (src_ip, dst_ip) pair that represents a graph edge.

[14]:

rows = 1_000_000
data = {
    # Lowercase "s" (seconds): the uppercase "S" alias is deprecated in recent pandas
    "timestamp": pd.date_range(start="2023-01-01", periods=rows, freq="s"),
    "src_ip": np.random.choice([f"192.168.1.{i}" for i in range(1, 256)], rows),
    "dst_ip": np.random.choice([f"10.0.0.{i}" for i in range(1, 256)], rows),
    "event_type": np.random.choice(["connect", "disconnect", "data_transfer"], rows),
    "bytes_transferred": np.random.randint(0, 1000, rows),
}
df = pd.DataFrame(data)
df.head()

[14]:

	timestamp	src_ip	dst_ip	event_type	bytes_transferred
0	2023-01-01 00:00:00	192.168.1.6	10.0.0.216	disconnect	595
1	2023-01-01 00:00:01	192.168.1.247	10.0.0.73	connect	754
2	2023-01-01 00:00:02	192.168.1.244	10.0.0.32	connect	630
3	2023-01-01 00:00:03	192.168.1.204	10.0.0.207	disconnect	348
4	2023-01-01 00:00:04	192.168.1.121	10.0.0.219	connect	710

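Stage 2: Measure on-disk size with Parquet and in-memory CPU size with Pandas

4X CPU memory compression with Arrow

The Apache Arrow in-memory columnar table format makes analytics faster by packing data into typed columns, in contrast to typical row-oriented SQL, key-value, graph, and log databases. A typical side benefit is that the data also gets smaller.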
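A back-of-the-envelope sketch of where the savings come from for this tutorial's dataset. It assumes Arrow's standard variable-length string layout (one int32 offset per value plus the raw UTF-8 bytes), 8-byte timestamp and integer columns, and Pandas object columns that hold an 8-byte pointer plus a full Python `str` per value. It works from expected string lengths rather than the actual random sample, so the results are estimates.

```python
import sys

ROWS = 1_000_000
MIB = 1024 ** 2


def mean_len(strings):
    """Average length of equally likely ASCII strings."""
    return sum(len(s) for s in strings) / len(strings)


# The three string columns of the tutorial dataset, sampled uniformly.
string_columns = {
    "src_ip": [f"192.168.1.{i}" for i in range(1, 256)],
    "dst_ip": [f"10.0.0.{i}" for i in range(1, 256)],
    "event_type": ["connect", "disconnect", "data_transfer"],
}
FIXED_WIDTH_COLUMNS = 2  # timestamp and bytes_transferred, assumed 8 bytes each

# Both formats store the fixed-width columns as flat 8-byte arrays.
arrow_bytes = FIXED_WIDTH_COLUMNS * 8 * ROWS
pandas_bytes = FIXED_WIDTH_COLUMNS * 8 * ROWS
str_overhead = sys.getsizeof("")  # per-string header; varies by Python version

for values in string_columns.values():
    avg = mean_len(values)
    # Arrow: int32 offsets (one extra at the end) plus the character bytes.
    arrow_bytes += 4 * (ROWS + 1) + avg * ROWS
    # Pandas object dtype: pointer + Python str header + character bytes.
    pandas_bytes += (8 + str_overhead + avg) * ROWS

arrow_mib = arrow_bytes / MIB
pandas_mib = pandas_bytes / MIB
print(f"Estimated Arrow size:  {arrow_mib:.1f} MiB")
print(f"Estimated Pandas size: {pandas_mib:.1f} MiB")
print(f"Estimated ratio: {pandas_mib / arrow_mib:.1f}X")
```

The Arrow estimate lands near 57.4 MiB, close to the 57.36 MB that the measurement cell reports. The Pandas estimate depends on the interpreter's per-string overhead, which is why the roughly 4X gap is dominated by the string columns rather than the numeric ones.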

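20X disk compression with Parquet

Parquet adds a compression algorithm to each column, for roughly another 5X over Arrow.

Parquet and Arrow each have their place. Arrow avoids in-memory compression in favor of faster memory access. Parquet prioritizes compression for better disk storage.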

[15]:

# Pandas Dataframe size
pandas_memory = df.memory_usage(index=True, deep=True).sum() / (1024**2)

# Arrow Table size
arrow_table = pa.Table.from_pandas(df)
arrow_size = arrow_table.nbytes / (1024**2)

# Parquet compressed size
pq_file_path = "compressed_data.parquet"
pq.write_table(arrow_table, pq_file_path, compression="SNAPPY")
parquet_size = os.path.getsize(pq_file_path) / (1024**2)

print(f"Pandas in-memory size: {pandas_memory:.2f} MB")
print(f"Arrow in-memory size: {arrow_size:.2f} MB")
print(f"Parquet compressed size on disk: {parquet_size:.2f} MB")

Pandas in-memory size: 209.00 MB
Arrow in-memory size: 57.36 MB
Parquet compressed size on disk: 9.57 MB
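As a quick check on the "4X", "20X", and "5X" figures in the headings, the arithmetic on the printed sizes can be sketched as follows. The three constants are copied from the output above; the variable names are illustrative.

```python
# Sizes printed by the measurement cell above, in MB.
pandas_mb = 209.00
arrow_mb = 57.36
parquet_mb = 9.57

arrow_vs_pandas = pandas_mb / arrow_mb      # in-memory savings from typed columns
parquet_vs_pandas = pandas_mb / parquet_mb  # total savings on disk
parquet_vs_arrow = arrow_mb / parquet_mb    # extra savings from per-column compression

print(f"Pandas / Arrow:   {arrow_vs_pandas:.1f}X")
print(f"Pandas / Parquet: {parquet_vs_pandas:.1f}X")
print(f"Arrow / Parquet:  {parquet_vs_arrow:.1f}X")
```

For this run the ratios come out to about 3.6X, 21.8X, and 6.0X, so the rounded headline numbers are in the right range. Because the data is generated randomly and is highly repetitive, other datasets may compress quite differently.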

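Stage 3: Load data onto the GPU with cuDF and PyGraphistry

4X GPU compression with cuDF

cuDF is an open-source GPU dataframe library that matches the Pandas API. Note that cuDF is Arrow-native, so its estimated GPU memory consumption exactly matches Apache Arrow. Even without any computation, it keeps the 4X improvement over Pandas.

4X GPU compression with PyGraphistry

Graph users get the same advantage over Pandas-based approaches by automatically moving a graph's tables to the GPU with g2 = g1.to_cudf().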

[16]:

# Convert DataFrame to cuDF for operations
gdf = cudf.from_pandas(df)

# Calculate the size of the gdf in memory
gdf_size_bytes = gdf.memory_usage(deep=True).sum()
gdf_size_mb = gdf_size_bytes / (1024**2)  # Convert bytes to MB
print(f"Total gdf size in memory: {gdf_size_mb:.2f} MB")

Total gdf size in memory: 57.36 MB

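Pack 10X+ more data for real workloads with GPU projection and more CPU RAM

When there is plenty of room, moving the entire dataframe to the GPU is convenient, so we recommend doing that while prototyping.

However, by being careful about which columns you use from the start, you can often comfortably handle 10X+ more work on the same GPU: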

# Only transfer 2 columns from df to the GPU
df2 = cudf.from_pandas(df[['src_ip', 'dst_ip']])

CPU memory is generally cheaper than GPU memory, so you may want 1-4X more CPU memory than GPU memory.

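Non-GPU IO speeds

When working with larger-than-memory datasets, it helps to remember that data passes through devices of different speeds on its way from disk to the GPU:

It helps to pair your GPU RAM with more (and cheaper) CPU RAM or disk:

* A single SSD can reach 1-5 GB/s, and arrays of them can reach 100GB+/s
* Over PCIe 4.0, consumer-grade disk-to-CPU and CPU-to-GPU speeds are about 32 GB/s per 1-2 GPUs
* Server-grade is typically PCIe 5.0, at 64 GB/s per 1-2 GPUs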
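The bandwidth figures above translate into best-case transfer times with simple division. The sketch below assumes peak rates with no overlap or overhead, and a hypothetical 100 GB dataset; the function and constant names are illustrative.

```python
def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Best-case time to move size_gb over one link at the given bandwidth."""
    return size_gb / bandwidth_gb_per_s


def pipeline_seconds(size_gb, link_bandwidths_gb_per_s):
    """A streaming pipeline is limited by its slowest link."""
    return size_gb / min(link_bandwidths_gb_per_s)


# Peak figures quoted in the list above; sustained rates are usually lower.
SINGLE_SSD = 5.0  # GB/s, upper end of the 1-5 GB/s range
PCIE4 = 32.0      # GB/s
PCIE5 = 64.0      # GB/s

dataset_gb = 100.0  # hypothetical dataset size
print(f"One SSD:  {transfer_seconds(dataset_gb, SINGLE_SSD):.1f} s")
print(f"PCIe 4.0: {transfer_seconds(dataset_gb, PCIE4):.2f} s")
print(f"PCIe 5.0: {transfer_seconds(dataset_gb, PCIE5):.2f} s")
print(f"SSD -> PCIe 4.0: {pipeline_seconds(dataset_gb, [SINGLE_SSD, PCIE4]):.1f} s")
```

With a single SSD feeding the GPU, the disk is the bottleneck rather than the PCIe link, which is one reason to keep hot data in CPU RAM or to stripe across several disks.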

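For advanced setups, such as reaching 100 GB/s on 1-2 GPUs, see our Dask Summit talk on 100GB/s GPU log analytics at Graphistry. It reviews a broad range of concepts, architectures, and tricks, such as bypassing complex CPU paths with GPU Direct.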

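Stage 4: GPU compute - simple tasks and GFQL traversals

Beyond the memory for their input data structures, CPU and GPU programs need extra memory to build intermediate data structures. This is typically 1-5X the size of the input data.

Step A: Simple GPU computations as a memory baseline

We see that simple cuDF dataframe methods such as filtering and joining are optimized, so each stays under 1X the original input size (the join here runs on a 10,000-row subset to keep its output small). However, as we will see later, the overall peak is much higher.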

[20]:

# Synchronize to ensure a clean memory state before starting
cudf.cuda.current_context().synchronize()

with rmm.statistics.profiler(name="Filter and Sum Operation"):
    filtered = gdf[gdf["event_type"] == "data_transfer"]
    total_bytes = filtered["bytes_transferred"].sum()

subset_gdf = gdf.head(10000)  # Smaller subset to avoid large memory requirements
with rmm.statistics.profiler(name="Join Operation on Subset"):
    joined = subset_gdf.merge(subset_gdf, on="src_ip", how="inner")


filter_sum_stats = rmm.statistics.default_profiler_records.records["Filter and Sum Operation"]
filter_sum_peak_mb = filter_sum_stats.memory_peak / (1024**2)
print(f"Filter & Sum Operation Memory Peak: {filter_sum_peak_mb: .2f} MB")

join_stats = rmm.statistics.default_profiler_records.records["Join Operation on Subset"]
join_peak_mb = join_stats.memory_peak / (1024**2)
print(f"Join Operation on Subset Memory Peak: {join_peak_mb: .2f} MB")

Filter & Sum Operation Memory Peak:  31.16 MB
Join Operation on Subset Memory Peak:  47.92 MB

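Step B: GPU graph analytics with GFQL - 2-hop traversal

The example below is a 2-hop traversal in PyGraphistry's GFQL using the cuDF GPU engine mode, filtering for "data_transfer" events that carry more than 500 bytes.

A graph query is more like a sequence of database operators, so we see not only cuDF's speed benefits but also its memory benefits. Memory use is essentially the sum of the optimized operators used.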

[21]:

# Step 4: Profile the GFQL 2-hop Traversal
g1 = graphistry.edges(gdf, 'src_ip', 'dst_ip')  # Example edge specification for Graphistry
with rmm.statistics.profiler(name="GFQL 2-hop Traversal"):
    g2 = g1.chain([
        n(),
        e(edge_match={'event_type': 'data_transfer'},
          edge_query="bytes_transferred > 500"),
        n()
    ])

gfql_stats = rmm.statistics.default_profiler_records.records["GFQL 2-hop Traversal"]
gfql_peak_mb = gfql_stats.memory_peak / (1024**2)
print(f"GFQL 2-hop Traversal Memory Peak: {gfql_peak_mb: .2f} MB")

GFQL 2-hop Traversal Memory Peak:  80.58 MB
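The chart cell further down plots an `overall_peak_mb` value that is not computed in any of the cells shown here. One hedged way to approximate it from the printed numbers is the resident cuDF dataframe plus the largest single-operation peak. This is an assumption, not the notebook's own method: it supposes the operations run one after another, that their intermediates are freed in between, and that each profiled peak excludes memory allocated before its block. The helper name `estimate_overall_peak_mb` is hypothetical.

```python
def estimate_overall_peak_mb(resident_mb, operation_peaks_mb):
    """Resident dataframe plus the largest single-operation peak.

    Assumes operations run sequentially and release their intermediates,
    and that each peak is measured relative to the start of its block.
    """
    return resident_mb + max(operation_peaks_mb)


# Values printed by the cells above, in MB.
resident_gdf_mb = 57.36
printed_peaks_mb = {
    "filter_sum": 31.16,
    "join_subset": 47.92,
    "gfql_2hop": 80.58,
}

estimated_peak_mb = estimate_overall_peak_mb(resident_gdf_mb, printed_peaks_mb.values())
print(f"Estimated overall GPU peak: {estimated_peak_mb:.2f} MB")
for name, peak_mb in printed_peaks_mb.items():
    print(f"{name}: {peak_mb / resident_gdf_mb:.2f}X of the resident cuDF dataframe")
```

In the notebook itself, the same helper could be called with `gdf_size_mb` and the three measured peak variables to define `overall_peak_mb`. A direct measurement, for example from RMM's statistics counters or NVML, would be preferable where available.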

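Comparison chart

Let's put everything together to view each stage side by side: dataset sizes and the additional intermediate memory.

Strikingly, the GPU version can store the data and compute on it while using less memory than Pandas needs just to create the initial data structure without doing anything with it.

In large-scale production scenarios, we might aim for another 10X+ gain by being deliberate about which columns go on the GPU and when to evict intermediate structures.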

[23]:

import matplotlib.pyplot as plt


# Labels and sizes for the bar chart
labels = [
    'Parquet (disk)',
    'Arrow (in-mem)',
    'Pandas (in-mem)',
    'cuDF (GPU)',
    '+ Filter & Sum (GPU)',
    '+ Join (GPU)',
    '+ GFQL 2-hop (GPU)',
    'Overall Peak (GPU)'
]
sizes = [
    parquet_size,
    arrow_size,
    pandas_memory,
    gdf_size_mb,
    filter_sum_peak_mb,
    join_peak_mb,
    gfql_peak_mb,
    overall_peak_mb
]

colors = ['#1f77b4', '#6baed6', '#9ecae1', '#2ca02c', '#ff7f0e', '#9467bd', '#8c564b', '#000000']

plt.figure(figsize=(14, 8))
bars = plt.bar(labels, sizes, color=colors, edgecolor='black')

# Add labels and title with a modern font size
plt.ylabel('Memory Usage (MB)', fontsize=14)
plt.title('Memory Usage Comparison', fontsize=16, fontweight='bold')
plt.xticks(rotation=45, ha="right", fontsize=12)  # Rotate and size labels for readability
plt.yticks(fontsize=12)  # Increase y-axis label font size for consistency
plt.tight_layout()

# Add value labels on top of each bar with a cleaner font style
for bar, size in zip(bars, sizes):
    plt.text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height(),
        f'{size:.2f} MB',
        ha='center',
        va='bottom',
        fontsize=11,
        fontweight='medium',
        color='darkblue'  # Softer label color for contrast
    )

# Display the plot
plt.show()

[Figure: bar chart "Memory Usage Comparison" showing memory usage in MB for Parquet (disk), Arrow, Pandas, cuDF, each GPU operation, and the overall GPU peak]

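Takeaways, multi-GPU/multi-node, and ML+AI

The chart reveals several key insights:

Parquet is a great on-disk format

It excels at compressing large tables, taking only a fraction of the size of the dataset's in-memory representation.

RAPIDS (cuDF) is a great in-memory format

Its initial GPU memory allocation size matches the Apache Arrow CPU memory size.

Memory consumption is a multiple of the input data size

Computation needs extra room. Compared with just storing the dataframe, computing takes 1X-5X additional GPU RAM.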

Single GPU

We recommend assuming that the memory you need is 10X+ the size of the compressed Parquet file on disk.
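The rule of thumb can be written as a pair of small helpers. This is a sketch only: the factor of 10 is the floor suggested above, and the 4 GB file and 16 GB GPU are hypothetical examples. The function names are illustrative.

```python
PARQUET_TO_GPU_FACTOR = 10  # the "10X+" rule of thumb; treat it as a floor


def min_gpu_memory_gb(parquet_gb, factor=PARQUET_TO_GPU_FACTOR):
    """Smallest GPU memory to plan for, given a compressed Parquet size."""
    return parquet_gb * factor


def max_parquet_gb(gpu_memory_gb, factor=PARQUET_TO_GPU_FACTOR):
    """Largest compressed Parquet size to plan for on a given GPU."""
    return gpu_memory_gb / factor


# Hypothetical examples: a 4 GB Parquet file, and a GPU with 16 GB of memory.
print(f"4 GB of Parquet -> plan for at least {min_gpu_memory_gb(4):.0f} GB of GPU memory")
print(f"16 GB GPU -> plan for at most {max_parquet_gb(16):.1f} GB of Parquet per task")
```

Files that exceed the limit for your GPU are candidates for partitioning or for loading only the columns a task needs.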

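Multi-GPU, larger-than-memory, and dask-cudf

When chunking large datasets by hand, for example for larger-than-memory computation or to spread data across multiple GPUs, or when chunking automatically through Dask, we generally recommend chunks of 1GB+. That is roughly 10X larger than CPU Dask tasks, because GPUs tend to be more throughput-oriented. See our Dask Distributed Summit talk, 100GB/s GPU log analytics at Graphistry, for more approaches.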
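A minimal sketch of the chunking guidance: pick the largest number of chunks that keeps each one at or above the target size, but never fewer chunks than GPUs. The helper `plan_partitions` and its parameters are hypothetical, and the sizes are in-memory sizes in GB.

```python
def plan_partitions(total_gb, n_gpus=1, target_chunk_gb=1.0):
    """Return (n_chunks, chunk_gb) for near-equal chunks.

    Chunks are kept at or above target_chunk_gb when the data is large
    enough, but every GPU gets at least one chunk, so small datasets
    spread over many GPUs end up with chunks below the target.
    """
    n_chunks = max(1, int(total_gb // target_chunk_gb))
    n_chunks = max(n_chunks, n_gpus)
    return n_chunks, total_gb / n_chunks


for total_gb, n_gpus in [(250, 4), (10.5, 1), (2.5, 4)]:
    n_chunks, chunk_gb = plan_partitions(total_gb, n_gpus)
    print(f"{total_gb} GB on {n_gpus} GPU(s): {n_chunks} chunks of {chunk_gb:.3f} GB")
```

In the last case the chunks fall below 1 GB, which suggests the dataset is too small to benefit from four GPUs and would likely run better on fewer.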

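AI/ML workloads

Modern data science libraries, such as PyGraphistry's g.umap(), use GPUs and learning to scale:

Training

Often called fit(). GPU systems can typically process 10X+ more data in the AI/ML training phase within your time budget. Since you usually do not train on all of your data, this means a 10X+ larger sample set, for higher-fidelity and more representative models.

Inference

Often called transform(), inference applies the trained model to the rest of the data. This scales better than fitting on the entire dataset, so it can be sped up substantially. With GPUs it gets faster still, essentially in line with your GPU budget.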

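Next steps

We are preparing follow-up articles that dig deeper into performance intuitions and the techniques discussed here, including how to measure your own workloads more carefully.

In the meantime, you may also find these useful:

Graphistry's 100GB/s GPU log analytics talk, recorded at the Dask Distributed Summit
PyGraphistry GPU-accelerated visual graph analytics
PyGraphistry GPU umap() for visual graph AI
The open-source GFQL dataframe-native graph query language, with optional GPU mode
Try it yourself on Graphistry Hub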
