10 Minutes to GFQL

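Welcome to GFQL (GraphFrame Query Language), the first dataframe-native graph query language. GFQL brings graph queries into your data science workflow without requiring an external graph database or complex infrastructure. It integrates with the PyData, Apache Arrow, and GPU acceleration ecosystems, so you can process large-scale graph data efficiently.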

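In this guide, we explore the basics of GFQL in just 10 minutes. You will learn how to:

Query and filter nodes and edges.
Chain multiple hops and apply predicates.
Take advantage of automatic GPU acceleration.
Integrate GFQL into your existing Python workflows.
Run GFQL and Python on remote GPUs and remote data.

Let's get started!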

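Introduction to GFQL

GFQL fills a critical gap in the data community: it is a high-performance graph query language that runs in-process and operates at the compute tier. Unlike traditional graph databases, which couple storage and compute, GFQL lets you run graph queries directly on dataframes, whether they live in memory or on disk, on CPU or on GPU.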

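Key benefits:

Dataframe-native: Works directly with pandas, cuDF, and other dataframe libraries.
High performance: Optimized for both CPU and GPU execution.
Ease of use: No external database or new infrastructure required.
Interoperability: Integrates with the Python data science ecosystem, including PyGraphistry for visualization.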

Setting up GFQL

GFQL is part of the open-source graphistry library. Install it with pip:

pip install graphistry

Make sure pandas or cudf is installed, depending on whether you want to run on CPU or GPU.
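One way to check which dataframe backend is available before choosing CPU or GPU is sketched here. It uses only the Python standard library; the helper name `available_backends` is hypothetical and not part of graphistry.

```python
import importlib.util

def available_backends():
    """Return the dataframe libraries that can be imported in this environment."""
    candidates = ["pandas", "cudf"]
    # find_spec returns None when a top-level package is not installed
    return [name for name in candidates if importlib.util.find_spec(name) is not None]

backends = available_backends()
print("Installed dataframe backends:", backends)
```

If `cudf` is missing from the list, the GPU examples later in this guide will not run in this environment.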

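Basic concepts

Before we get to the examples, let's cover a few basic concepts:

Nodes and edges: In GFQL, a graph is represented by dataframes of nodes and edges.
Chaining: GFQL queries are built by chaining operations that filter and traverse the graph.
Predicates: Conditions applied to nodes or edges to filter them by attribute.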
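The examples in this guide assume a graph object `g` already exists. A minimal sketch of building one from two pandas dataframes follows. The toy data and the column names (`id`, `type`, `src`, `dst`, `interesting`) are illustrative assumptions; the binding call mirrors the `graphistry.edges(...).nodes(...)` pattern used in the GPU example later in this guide.

```python
import pandas as pd

# Toy data: all values and column names here are made up for illustration
nodes_df = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "type": ["person", "person", "transaction", "company"],
})
edges_df = pd.DataFrame({
    "src": ["a", "b", "c"],
    "dst": ["b", "c", "d"],
    "interesting": [True, True, False],
})

try:
    import graphistry
    # Bind edges by their source/destination columns and nodes by their id column
    g = graphistry.edges(edges_df, "src", "dst").nodes(nodes_df, "id")
except ImportError:
    g = None  # graphistry is not installed; the dataframes above are still valid

# Sanity check: every edge endpoint should appear in the node table
endpoints = set(edges_df["src"]) | set(edges_df["dst"])
print("All endpoints present:", endpoints <= set(nodes_df["id"]))
```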

Examples

1. Find nodes of a specific type

Use the n() function to filter nodes by their attributes.

Example: Find all nodes of type "person"

from graphistry import n

people_nodes_df = g.chain([ n({"type": "person"}) ])._nodes
print('Number of person nodes:', len(people_nodes_df))

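Explanation:

n({"type": "person"}) keeps the nodes whose type attribute is "person".
g.chain([...]) applies the chain of operations to the graph g.
._nodes retrieves the resulting nodes dataframe.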
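For intuition, the node filter in this example selects roughly the same rows as a plain pandas boolean mask. The sketch below uses a made-up nodes table and is only an analogy, not how GFQL is implemented.

```python
import pandas as pd

# Hypothetical nodes table for illustration
nodes_df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "type": ["person", "company", "person"],
})

# Roughly what n({"type": "person"}) selects: rows whose 'type' equals "person"
people_nodes_df = nodes_df[nodes_df["type"] == "person"]
print("Number of person nodes:", len(people_nodes_df))
```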

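2. Find 2-hop edge sequences with an attribute

Use e_forward() to traverse multiple hops and filter edges by their attributes.

Example: Find 2-hop paths along edges marked "interesting"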

from graphistry import e_forward

g_2_hops = g.chain([ e_forward({"interesting": True}, hops=2) ])
print('Number of edges in 2-hop paths:', len(g_2_hops._edges))
g_2_hops.plot()

Explanation:

e_forward({"interesting": True}, hops=2) traverses forward edges with interesting == True for 2 hops.
g_2_hops.plot() visualizes the resulting subgraph.

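3. Find nodes 1-2 hops away and label each hop

Label the hops in a traversal to analyze specific relationships.

Example: Find nodes up to 2 hops from node "a" and label each hop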

from graphistry import n, e_undirected

g_2_hops = g.chain([
    n({g._node: "a"}),
    e_undirected(name="hop1"),
    e_undirected(name="hop2")
])
first_hop_edges = g_2_hops._edges[ g_2_hops._edges.hop1 == True ]
print('Number of first-hop edges:', len(first_hop_edges))

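Explanation:

n({g._node: "a"}) starts the traversal at node "a", where g._node is the name of the node ID column.
e_undirected(name="hop1") traverses undirected edges and labels them hop1.
e_undirected(name="hop2") continues the traversal and labels those edges hop2.
The labels let you filter and analyze the edges from a specific hop.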
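To make the idea of per-hop labels concrete, here is a rough pandas sketch that marks which edges are touched on each of two undirected hops from node "a". It is a simplified analogy under stated assumptions: the toy edge list and the helper `undirected_hop` are hypothetical, and GFQL's actual chain semantics may prune or label edges differently.

```python
import pandas as pd

# Hypothetical edge list for illustration
edges_df = pd.DataFrame({
    "src": ["a", "b", "c", "x"],
    "dst": ["b", "c", "d", "y"],
})

def undirected_hop(edges, frontier):
    """Return (mask of edges touching the frontier, nodes reached on the far side)."""
    frontier = list(frontier)
    mask = edges["src"].isin(frontier) | edges["dst"].isin(frontier)
    touched = set(edges.loc[mask, "src"]) | set(edges.loc[mask, "dst"])
    return mask, touched - set(frontier)

hop1_mask, reached1 = undirected_hop(edges_df, {"a"})
hop2_mask, reached2 = undirected_hop(edges_df, reached1)

# Attach boolean label columns, mirroring name="hop1" / name="hop2"
labeled = edges_df.assign(hop1=hop1_mask, hop2=hop2_mask)
first_hop_edges = labeled[labeled["hop1"]]
print("Number of first-hop edges:", len(first_hop_edges))
```

In this sketch the a-b edge is also flagged on the second hop, because an undirected step from "b" can walk back toward "a".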

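4. Query for transaction nodes between risky nodes

Chain multiple traversals to find patterns between nodes.

Example: Find transaction nodes between two kinds of risky nodes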

from graphistry import n, e_forward, e_reverse

g_risky = g.chain([
    n({"risk1": True}),
    e_forward(to_fixed_point=True),
    n({"type": "transaction"}, name="hit"),
    e_reverse(to_fixed_point=True),
    n({"risk2": True})
])
hits = g_risky._nodes[ g_risky._nodes["hit"] == True ]
print('Number of transaction hits:', len(hits))

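Explanation:

Starts from nodes with risk1 == True.
Traverses forward to transaction nodes and labels them as hit.
Traverses backward to nodes with risk2 == True.
Identifies the transaction nodes that connect the two kinds of risky nodes.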
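One way to read this pattern is as two reachability checks: a hit is a transaction that can be reached by following edges forward from a risk1 node, and also by following edges forward from a risk2 node (which is what walking in reverse from the transaction finds). The pure-Python sketch below illustrates that reading on a made-up graph. It is a conceptual analogy with hypothetical names, not GFQL's implementation.

```python
from collections import deque

# Toy directed graph; node attributes and edges are made up for illustration
nodes = {
    "r1": {"risk1": True},
    "r2": {"risk2": True},
    "t1": {"type": "transaction"},
    "t2": {"type": "transaction"},
    "m": {"type": "account"},
}
edges = [("r1", "m"), ("m", "t1"), ("r2", "t1"), ("r1", "t2")]

def reachable(starts, edge_list):
    """Nodes reachable from `starts` by following edges forward for one or more hops."""
    adjacency = {}
    for src, dst in edge_list:
        adjacency.setdefault(src, []).append(dst)
    seen, queue = set(), deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

risk1 = [k for k, v in nodes.items() if v.get("risk1")]
risk2 = [k for k, v in nodes.items() if v.get("risk2")]
transactions = {k for k, v in nodes.items() if v.get("type") == "transaction"}

# A hit is a transaction downstream of a risk1 node and also downstream of a risk2 node
hits = transactions & reachable(risk1, edges) & reachable(risk2, edges)
print("Transaction hits:", sorted(hits))
```

Here "t2" is not a hit because no risk2 node leads to it.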

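5. Filter by multiple node types using is_in

Use the is_in predicate to filter nodes or edges against multiple values.

Example: Filter nodes and edges by multiple types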

from graphistry import n, e_forward, e_reverse, is_in

g_filtered = g.chain([
    n({"type": is_in(["person", "company"])}),
    e_forward({"e_type": is_in(["owns", "reviews"])}, to_fixed_point=True),
    n({"type": is_in(["transaction", "account"])}, name="hit"),
    e_reverse(to_fixed_point=True),
    n({"risk2": True})
])
hits = g_filtered._nodes[ g_filtered._nodes["hit"] == True ]
print('Number of filtered hits:', len(hits))

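Explanation:

Filters nodes of type "person" or "company".
Traverses forward edges with e_type "owns" or "reviews".
Filters nodes of type "transaction" or "account" and labels them as hit.
Traverses backward to nodes with risk2 == True.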
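The is_in predicate plays a role similar to pandas' `Series.isin`. The sketch below applies the same membership tests to made-up node and edge tables; it only illustrates the filtering step, not the traversal.

```python
import pandas as pd

# Hypothetical tables for illustration
nodes_df = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "type": ["person", "company", "transaction", "device"],
})
edges_df = pd.DataFrame({
    "src": ["a", "b", "a"],
    "dst": ["c", "c", "d"],
    "e_type": ["owns", "reviews", "uses"],
})

# Membership filters comparable to is_in([...]) on a node or edge attribute
start_nodes = nodes_df[nodes_df["type"].isin(["person", "company"])]
matching_edges = edges_df[edges_df["e_type"].isin(["owns", "reviews"])]

print("Start nodes:", list(start_nodes["id"]))
print("Matching edges:", len(matching_edges))
```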

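Leveraging GPU acceleration

GFQL is optimized for GPU acceleration with cudf and RAPIDS. When given GPU dataframes, GFQL automatically executes queries on the GPU, which can yield large speedups.

6. Automatic GPU acceleration

Example: Run a GFQL query with GPU dataframes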

import cudf
import graphistry

# Load data into GPU dataframes
e_gdf = cudf.read_parquet('edges.parquet')
n_gdf = cudf.read_parquet('nodes.parquet')

# Create a graph with GPU dataframes
g_gpu = graphistry.edges(e_gdf, 'src', 'dst').nodes(n_gdf, 'id')

# Run GFQL query (executes on GPU)
g_result = g_gpu.chain([ ... ])
print('Number of resulting edges:', len(g_result._edges))

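Explanation:

cudf.read_parquet() loads the data directly into GPU memory.
GFQL detects the cudf dataframes and runs the query on the GPU.
This can significantly improve performance on large datasets.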
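If your data starts out in pandas, one option is to move it to the GPU only when cudf is usable, and otherwise stay on the CPU. The helper `to_gpu_if_available` below is a hypothetical convenience function, not part of graphistry; it relies on `cudf.from_pandas` for the conversion.

```python
import pandas as pd

def to_gpu_if_available(df):
    """Return a cudf copy of `df` when cudf and a GPU are usable, else `df` unchanged."""
    try:
        import cudf
        return cudf.from_pandas(df)
    except Exception:  # cudf is not installed, or no usable GPU was found
        return df

edges_df = pd.DataFrame({"src": ["a", "b"], "dst": ["b", "c"]})
e_df = to_gpu_if_available(edges_df)
print("Edge dataframe module:", type(e_df).__module__)
```

The resulting dataframe can then be bound to a graph in the same way as in the example above, so the same query code runs on whichever backend was selected.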

7. Forcing GPU mode

You can set the engine explicitly to ensure GPU execution.

Example: Force GFQL to use the GPU engine

g_result = g_gpu.chain([ ... ], engine='cudf')

Explanation:

engine='cudf' forces the GPU-accelerated engine.
This is useful when you want to guarantee that the query runs on the GPU.

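Integration with the PyData ecosystem

GFQL integrates with the PyData ecosystem, so you can combine it with libraries such as pandas, networkx, igraph, and PyTorch.

8. Combining GFQL with graph algorithms

Example: Compute PageRank on the resulting graph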

# Assuming g_result is the result from a GFQL query

# Compute PageRank using cuGraph (GPU)
g_enriched = g_result.compute_cugraph('pagerank')

# View top nodes by PageRank
top_nodes = g_enriched._nodes.sort_values('pagerank', ascending=False).head(5)
print('Top nodes by PageRank:')
print(top_nodes[['id', 'pagerank']])

Explanation:

compute_cugraph('pagerank') computes PageRank for the nodes using GPU acceleration.
The enriched graph now includes a pagerank column in the nodes dataframe.

9. Visualizing the graph

Use PyGraphistry's visualization capabilities to explore your graph.

Example: Visualize high-PageRank nodes

from graphistry import n, e

# Filter nodes with high PageRank
g_high_pagerank = g_enriched.chain([
    n(query='pagerank > 0.1'),
    e(),
    n(query='pagerank > 0.1')
])

# Plot the subgraph
g_high_pagerank.plot()

Explanation:

Filters nodes where pagerank > 0.1.
Visualizes the subgraph made up of high-PageRank nodes.
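The n-e-n pattern in this example keeps edges whose endpoints both pass the node filter. A rough pandas analogy is sketched below, assuming the `query` string follows pandas `DataFrame.query` expression syntax. The toy tables and PageRank values are made up.

```python
import pandas as pd

# Hypothetical enriched tables for illustration
nodes_df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "pagerank": [0.5, 0.3, 0.05],
})
edges_df = pd.DataFrame({"src": ["a", "b"], "dst": ["b", "c"]})

# Node filter: same expression string as in the GFQL example
high = nodes_df.query("pagerank > 0.1")
high_ids = list(high["id"])

# Keep only edges whose source and destination are both high-PageRank nodes
kept_edges = edges_df[edges_df["src"].isin(high_ids) & edges_df["dst"].isin(high_ids)]
print("High-PageRank nodes:", high_ids)
print("Edges between them:", len(kept_edges))
```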

Running remotely

You may want to run GFQL remotely because the data is remote or because a GPU is available remotely:

Example: Run GFQL remotely

from graphistry import n, e

g2 = g1.chain_remote([n(), e(), n()])

Example: Run GFQL remotely, with the upload step decoupled

from graphistry import n, e

g2 = g1.upload()
assert g2._dataset_id is not None, "Uploading sets `dataset_id` for subsequent calls"
g3 = g2.chain_remote([n(), e(), n()])

Additional parameters control options such as the execution engine and what is returned.

Example: Bind to existing remote data and fetch it

import graphistry
from graphistry import n, e

g2 = graphistry.bind(dataset_id='my-dataset-id')

nodes_df = g2.chain_remote([n()])._nodes
edges_df = g2.chain_remote([e()])._edges

Example: Run Python on a remote GPU against remote data

def compute_shape(g):
    g2 = g.materialize_nodes()
    return {
        'nodes': g2._nodes.shape,
        'edges': g2._edges.shape
    }

g = graphistry.bind(dataset_id='my-dataset-id')
print(g.python_remote_json(compute_shape))

Example: Run Python on a remote GPU and return a graph

def compute_shape(g):
    g2 = g.materialize_nodes()
    return g2

g = graphistry.bind(dataset_id='my-dataset-id')
g2 = g.python_remote_g(compute_shape)
print(g2._nodes)

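Conclusion and next steps

Congratulations! You have covered the basics of GFQL in just 10 minutes. You have learned how to:

Query and filter nodes and edges with GFQL.
Chain multiple hops and apply advanced predicates.
Use GPU acceleration for high-performance graph queries.
Integrate GFQL with graph algorithms and visualization tools.

Next steps:

Try GFQL on your own data: Apply what you have learned to your datasets and see the benefits firsthand.
Translate between SQL, Pandas, Cypher, and GFQL
GFQL Quick Reference
10 Minutes to PyGraphistry: Use PyGraphistry for advanced visualization and analysis.
Join the Community: Connect with other users and developers in the GFQL community Slack channel.

GFQL opens up new possibilities for large-scale graph analytics without the overhead of managing external databases or infrastructure. With its integration into the Python ecosystem and its support for GPU acceleration, GFQL is a powerful tool for modern data science workflows.

Happy graph querying!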
