模式发现

图模式查找指的是在图中搜索结构模式。要查看实际应用示例，请查阅图模式查找教程。

GraphFrame 模式查找使用一种简单的领域特定语言（DSL）来表达结构查询。例如，graph.find("(a)-[e1]->(b); (b)-[e2]->(a)") 或 graph.find("(a)<-[e]->(b)") 将搜索通过边双向连接的顶点对 a,b。它将返回图中所有此类结构的 DataFrame，其中包含模式中每个命名元素（顶点或边）的列。在这种情况下，返回的列将是 "a, b, e1, e2."

用于表达结构模式的领域特定语言：

模式的基本单位是一条边。例如，"(a)-[e]->(b)" 表示从顶点 a 指向顶点 b 的边 e。注意顶点用圆括号 (a) 表示，而边用方括号 [e] 表示。
模式表示为边的连接。边模式可以用分号连接。模式 "(a)-[e1]->(b); (b)-[e2]->(c)" 指定了从 a 到 b 再到 c 的两条边。
简单来说，您也可以量化固定长度，例如 "(a)-[e*2]->(c)"。该模式解析器会通过任意插入中间顶点将其分解为多个模式 "(a)-[e1]->(_v1);(_v1)-[e1]->(c)"。它指定了从 a 到 _v1 再到 c 的两条边。
为了搜索可变长度的模式，您可以指定范围 "(a)-[e*1..3]->(c)"。它会将每个可能长度的结果 "(a)-[e*1]->(c)"、"(a)-[e*2]->(c)" 和 "(a)-[e*3]->(c)" 合并到一个DataFrame中。
如果省略方向 "(a)-[e]-(b)"，它表示一个无向模式——即 "(a)-[e]->(b)" 或 "(a)<-[e]-(b)"，这包括传入或传出的边。
在模式中，可以为顶点和边分配名称。例如， "(a)-[e]->(b)" 包含三个命名元素：顶点 a,b 和边 e。这些名称有两个用途：

名称可以识别边之间的共同元素。例如， "(a)-[e1]->(b); (b)-[e2]->(c)" 指定了同一个顶点 b 既是边 e1 的目标节点又是边 e2 的源节点。

这些名称在结果DataFrame中用作列名。如果某个模式包含命名顶点a，则结果DataFrame将包含名为"a"的列，该列为StructType结构类型，其子字段等同于GraphFrame.vertices的架构（列）。类似地，模式中的边e将在结果DataFrame中生成名为"e"的列，其子字段等同于GraphFrame.edges的架构（列）。

请注意，名称不标识不同的元素：具有不同名称的两个元素可能指向同一个图元素。例如，在模式 "(a)-[e1]->(b); (b)-[e2]->(c)"中，名称a和c可能指向同一个顶点。要限制命名元素为不同的顶点或边，请使用后置过滤器例如resultDataframe.filter("a.id != c.id")。
当不需要时，可以省略顶点或边的名称。例如，"(a)-[]->(b)" 表示顶点 a,b 之间的边，但未给该边分配名称。在结果 DataFrame 中将不会有匿名边的列。类似地，"(a)-[e]->()" 表示顶点 a 的出边，但未命名目标顶点。这些被称为匿名顶点和边。
一条边可以被否定以表示该边不应出现在图中。例如，"(a)-[]->(b); !(b)-[]->(a)" 查找从 a 到 b 的边，其中不存在 从 b 到 a 的边。

限制条件：

Motifs 不允许包含没有任何命名元素的边："()-[]->()" 和 "!()-[]->()" 是禁止使用的术语。
Motifs 不允许在否定项中包含命名边（因为这些命名边永远不会出现在结果中）。例如，"!(a)-[ab]->(b)" 是无效的，但 "!(a)-[]->(b)" 是有效的。
对于可变长度模式、双向模式和无向模式，不支持否定："!(a)-[*1..3]->(b)"、"!(a)<-[]->(b)" 和 "!(a)-[]-(b)" 是不允许的。
不支持无边界长度模式："(a)-[*..3]->(b)" 和 "(a)-[*1..]->(b)" 是不允许的。
您无法将额外的边与可变长度模式连接："(a)-[*1..3]-(b);(b)-[]-(c)"是无效的。

更复杂的查询，例如对顶点或边属性进行操作的查询，可以通过对结果DataFrame应用过滤器来表达。

这可能会返回重复的行。例如，查询 "(u)-[]->()" 将为每个匹配的边返回结果，即使这些边共享相同的顶点 u。

Python API

有关API详情，请参阅 graphframes.GraphFrame.find。

from graphframes.examples import Graphs

g = Graphs(spark).friends()  # Get example graph

# Search for pairs of vertices with edges in both directions between them
motifs = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")
motifs.show()

# More complex queries can be expressed by applying filters
motifs.filter("b.age > 30").show()

Scala API

有关API详情，请参阅org.graphframes.GraphFrame。

import org.apache.spark.sql.DataFrame
import org.graphframes.{examples,GraphFrame}

val g: GraphFrame = examples.Graphs.friends  // get example graph

// Search for pairs of vertices with edges in both directions between them.
val motifs: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")
motifs.show()

// More complex queries can be expressed by applying filters.
motifs.filter("b.age > 30").show()

示例

许多模式查询是无状态的且易于表达，正如上述示例所示。接下来的示例展示了更复杂的查询，这些查询沿着模式路径携带状态。这些查询可以通过将GraphFrame模式查找与结果筛选器相结合来表达，其中筛选器使用序列操作来构建一系列DataFrame Column。

例如，假设我们希望识别具有由一系列函数定义的某种属性的4个顶点链。也就是说，在4个顶点链 a->b->c->d 中，识别匹配此复杂筛选条件的链子集：

初始化路径上的状态。
根据顶点 a 更新状态。
基于顶点 b 更新状态。
等等，对于 c 和 d。
如果最终状态满足某些条件，则该链被过滤器接受。

以下代码片段演示了这一过程，我们识别由4个顶点组成的链，其中至少3条边中有2条是"朋友"关系。在此示例中，状态是当前"朋友"边的计数；通常，它可以是任何 DataFrame Column。

Python API

from pyspark.sql.functions import col, lit, when
from pyspark.sql.types import IntegerType
from graphframes.examples import Graphs


g = Graphs(spark).friends()  # Get example graph

chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")

# Query on sequence, with state (cnt)
# (a) Define method for updating state given the next element of the motif
sumFriends =\
  lambda cnt,relationship: when(relationship == "friend", cnt+1).otherwise(cnt)

# (b) Use sequence operation to apply method to sequence of elements in motif
# In this case, the elements are the 3 edges
condition =\
  reduce(lambda cnt,e: sumFriends(cnt, col(e).relationship), ["ab", "bc", "cd"], lit(0))

# (c) Apply filter to DataFrame
chainWith2Friends2 = chain4.where(condition >= 2)
chainWith2Friends2.show()

Scala API

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, when}
import org.graphframes.{examples,GraphFrame}

val g: GraphFrame = examples.Graphs.friends  // get example graph

// Find chains of 4 vertices.
val chain4: DataFrame = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")
chain4.show()

// Query on sequence, with state (cnt)
//  (a) Define method for updating state given the next element of the motif.
def sumFriends(cnt: Column, relationship: Column): Column = {
  when(relationship === "friend", cnt + 1).otherwise(cnt)
}
//  (b) Use sequence operation to apply method to sequence of elements in motif.
//      In this case, the elements are the 3 edges.
val condition = { Seq("ab", "bc", "cd")
  .foldLeft(lit(0))((cnt, e) => sumFriends(cnt, col(e)("relationship"))) }
//  (c) Apply filter to DataFrame.
val chainWith2Friends2 = chain4.where(condition >= 2)
chainWith2Friends2.show()