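Knowledge Graphs for Complex Topics¶

Introduction¶

What is a knowledge graph?

A knowledge graph, also known as a semantic network, represents real-world entities and the relationships between them. It consists of nodes, edges, and labels. A node can represent any entity, while an edge defines a connection between two nodes. For example, a node representing an author such as "J.K. Rowling" can be connected to another node representing one of her books, such as "Harry Potter", by an edge labeled "author".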
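
To make the node/edge/label vocabulary concrete, here is a minimal sketch that stores the author example as labeled triples using only the standard library. The `Triple` class and the second fact are illustrative assumptions, not part of the tutorial's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One labeled edge: subject --label--> obj (hypothetical helper type)."""
    subject: str
    label: str
    obj: str


# The author/book example from the text, plus one illustrative extra fact.
triples = [
    Triple("J.K. Rowling", "author of", "Harry Potter"),
    Triple("Harry Potter", "genre", "Fantasy"),
]

# The nodes are every entity that appears at either end of an edge.
nodes = sorted({t.subject for t in triples} | {t.obj for t in triples})
print(nodes)
```

The rest of the tutorial uses the same idea, with integer ids on the nodes and pydantic models instead of a dataclass.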

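Applications of knowledge graphs

Knowledge graphs have a variety of applications, including:

Search engines: they enrich search results with semantic information gathered from multiple sources.
Recommendation systems: they suggest products or services based on user behavior and preferences.
Natural language processing: they help with understanding and generating human language.
Data integration: they make it easier to combine data from different sources by identifying relationships.
Artificial intelligence and machine learning: they provide contextual information that improves decision-making.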

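Setup and Dependencies¶

Today, we will use the instructor library to simplify the interaction between OpenAI and our code, along with the Graphviz library to bring structure to our complex topics and to visualize the graph.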

import instructor 
from openai import OpenAI

client = instructor.patch(OpenAI())

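Install Graphviz for your operating system: https://graphviz.org/download/

Node and Edge Classes¶

We begin by modeling our knowledge graph with Node and Edge objects.

Node objects represent key concepts or entities, while Edge objects represent the relationships between them.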

from pydantic import BaseModel, Field
from typing import Optional

class Node(BaseModel):
    id: int
    label: str
    color: str

class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"

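The `KnowledgeGraph` Class¶

The KnowledgeGraph class combines nodes and edges into a complete graph structure. It holds a list of nodes and a list of edges, where each node represents a key concept or entity and each edge represents a relationship between two nodes.

Later, you will see that we designed this class to match the graph object in the graphviz library, which makes our graph easier to visualize.

The visualize_knowledge_graph method renders the knowledge graph. It works on a KnowledgeGraph object, which contains the nodes and edges, and uses the graphviz library to build a directed graph (Digraph). Every node and edge from the KnowledgeGraph is added to the Digraph with its own attributes (id, label, color). Finally, the graph is rendered and displayed.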

from graphviz import Digraph
from IPython.display import display

class KnowledgeGraph(BaseModel):
    nodes: list[Node] = Field(default_factory=list)  # A list of nodes in the knowledge graph.
    edges: list[Edge] = Field(default_factory=list)  # A list of edges in the knowledge graph.


    def visualize_knowledge_graph(self):
        dot = Digraph(comment="Knowledge Graph")

        for node in self.nodes:
            dot.node(name=str(node.id), label=node.label, color=node.color)
        for edge in self.edges:
            dot.edge(str(edge.source), str(edge.target), label=edge.label, color=edge.color)
        
        return display(dot)
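
For readers who want to see roughly what the Digraph holds, the sketch below builds DOT source by hand from plain dictionaries. The `to_dot` helper is hypothetical and uses only the standard library; the text that graphviz itself generates (available as `dot.source`) may be formatted differently.

```python
def to_dot(nodes: list[dict], edges: list[dict]) -> str:
    """Build DOT source for a directed graph from plain dicts (hypothetical helper)."""
    lines = ["digraph {"]
    for n in nodes:
        # One statement per node: its id, then its display attributes.
        lines.append(f'    {n["id"]} [label="{n["label"]}" color={n["color"]}]')
    for e in edges:
        # One statement per directed edge: source -> target.
        lines.append(
            f'    {e["source"]} -> {e["target"]} [label="{e["label"]}" color={e["color"]}]'
        )
    lines.append("}")
    return "\n".join(lines)


dot_source = to_dot(
    nodes=[
        {"id": 1, "label": "Quantum mechanics", "color": "blue"},
        {"id": 2, "label": "Wave function", "color": "green"},
    ],
    edges=[{"source": 1, "target": 2, "label": "uses", "color": "black"}],
)
print(dot_source)
```

The node ids play the same role as `str(node.id)` in visualize_knowledge_graph: edges refer to nodes by id, and the label is only what gets drawn.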

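Generating Knowledge Graphs¶

The generate_graph Function¶

The generate_graph function uses an OpenAI model to create a KnowledgeGraph object from an input string.

It asks the model to interpret the input as a detailed knowledge graph and uses the response to build the KnowledgeGraph object.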

def generate_graph(input: str) -> KnowledgeGraph:
    # instructor parses the model's reply into the response_model type.
    return client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {
                "role": "user",
                "content": f"Help me understand the following by describing it as a small knowledge graph: {input}",
            }
        ],
        response_model=KnowledgeGraph,
    )

generate_graph("Explain quantum mechanics").visualize_knowledge_graph()

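Advanced: Accumulating Knowledge Graphs¶

When working with larger datasets, or with knowledge that grows over time, processing everything at once can be difficult because of prompt-length limits or the complexity of the content. In such cases, building the knowledge graph iteratively can help. The approach splits the text into smaller, manageable chunks and updates the graph with the new information from each chunk.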
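
As a rough illustration of the splitting step, the sketch below greedily packs sentences into chunks under a character budget. The `chunk_sentences` helper and its `max_chars` parameter are assumptions made for this example; the tutorial itself simply supplies a hand-written list of chunks, and a real pipeline would more likely budget by tokens.

```python
def chunk_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for sentence in text.split(". "):
        sentence = sentence.strip()
        if not sentence:
            continue
        if not sentence.endswith("."):
            sentence += "."  # restore the period removed by split
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # The current chunk is full: close it and start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = (
    "Jason knows a lot about quantum mechanics. He is a physicist. "
    "He is a professor. Professors are smart."
)
chunks = chunk_sentences(text, max_chars=60)
for chunk in chunks:
    print(repr(chunk))
```

Each resulting chunk can then be sent to the model in its own request.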

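What are the benefits of this approach?¶

Scalability: large datasets can be handled by breaking them into smaller, more manageable pieces.
Flexibility: the graph can be updated dynamically as new information arrives.
Efficiency: smaller chunks of text can be processed more efficiently and are less prone to errors or omissions.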

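What has changed?¶

The previous example provided the basic structure; this one adds more functionality. The Node and Edge classes now define a __hash__ method, which allows them to be placed in sets and simplifies duplicate handling.

The KnowledgeGraph class gains an update method and keeps its visualize_knowledge_graph method for drawing.

In the KnowledgeGraph class, the nodes and edges fields are now optional, which offers more flexibility.

The update method merges two graphs and removes duplicates.

The visualize_knowledge_graph method is called after every update, so each iteration displays the current version of the graph.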

class Node(BaseModel):
    id: int
    label: str
    color: str

    def __hash__(self) -> int:
        return hash((self.id, self.label))
    
class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"

    def __hash__(self) -> int:
        return hash((self.source, self.target, self.label))
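
One subtlety is worth spelling out: a set treats two objects as duplicates only when their hashes match and they also compare equal. The sketch below uses a plain-Python stand-in for Node, under the assumption that equality compares every field (which is how pydantic models are generally understood to behave), to show what that means for nodes that differ only in color. The dictionary-based deduplication at the end is one possible workaround, not something the tutorial's code does.

```python
class Node:
    """Plain-Python stand-in for the pydantic Node above."""

    def __init__(self, id: int, label: str, color: str) -> None:
        self.id, self.label, self.color = id, label, color

    def __eq__(self, other: object) -> bool:
        # Assumption: like pydantic models, equality compares every field.
        return isinstance(other, Node) and vars(self) == vars(other)

    def __hash__(self) -> int:
        return hash((self.id, self.label))


a = Node(1, "Jason", "blue")
b = Node(1, "Jason", "blue")  # exact duplicate of a
c = Node(1, "Jason", "red")   # same id and label, different color

print(len({a, b}))  # 1: equal hash and equal fields, so the set keeps one
print(len({a, c}))  # 2: equal hash, but the colors differ, so both are kept

# One way to treat (id, label) as the identity: key a dict on that pair.
unique = list({(n.id, n.label): n for n in [a, b, c]}.values())
print(len(unique))  # 1
```

If the model returns the same node with a different color on a later iteration, the set-based merge will keep both copies, so it can be worth checking the merged graph for repeated ids.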

class KnowledgeGraph(BaseModel):
    # Optional lists of nodes and edges; both default to an empty list.
    nodes: Optional[list[Node]] = Field(default_factory=list)
    edges: Optional[list[Edge]] = Field(default_factory=list)

    def update(self, other: "KnowledgeGraph") -> "KnowledgeGraph":
        # Merge two graphs, deduplicating nodes and edges with a set.
        # `or []` guards against a field that was explicitly set to None.
        return KnowledgeGraph(
            nodes=list(set((self.nodes or []) + (other.nodes or []))),
            edges=list(set((self.edges or []) + (other.edges or []))),
        )
    

    def visualize_knowledge_graph(self):
        dot = Digraph(comment="Knowledge Graph")

        for node in self.nodes:
            dot.node(str(node.id), node.label, color=node.color)
        for edge in self.edges:
            dot.edge(str(edge.source), str(edge.target), label=edge.label, color=edge.color)
        
        return display(dot)

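Iterative Graph Generation¶

The updated generate_graph function is designed to process a list of inputs iteratively, updating the graph with each new piece of information.

Looked at closely, this pattern resembles a common programming technique known as a "reduce" or "fold". A simple example is iterating over a list to compute the sum of the squares of its elements.

Here is an example in Python: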

cur_state = 0
for i in [1, 2, 3, 4, 5]:
    cur_state += i**2
print(cur_state)
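
The same computation can also be written with the standard library's `functools.reduce`, which makes the "accumulator plus step function" shape explicit. This is only a restatement of the loop above, not something the tutorial's later code depends on.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# reduce(step, items, initial) threads an accumulator through the list:
# the step function receives the current state and the next item.
total = reduce(lambda acc, i: acc + i**2, numbers, 0)
print(total)  # 55
```

In the graph version, the accumulator is a KnowledgeGraph and the step is "ask the model for new nodes and edges, then merge them in".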

def generate_graph(input: list[str]) -> KnowledgeGraph:
    # Initialize an empty KnowledgeGraph
    cur_state = KnowledgeGraph()

    # Iterate over the input list
    for i, inp in enumerate(input):
        new_updates = client.chat.completions.create(
            model="gpt-4-1106-preview",
            messages=[
                {
                    "role": "system",
                    "content": """You are an iterative knowledge graph builder.
                    You are given the current state of the graph, and you must append the nodes and edges
                    to it. Do not provide any duplicates and try to reuse nodes as much as possible.""",
                },
                {
                    "role": "user",
                    "content": f"""Extract any new nodes and edges from the following:
                    # Part {i + 1}/{len(input)} of the input:

                    {inp}""",
                },
                {
                    "role": "user",
                    "content": f"""Here is the current state of the graph:
                    {cur_state.model_dump_json(indent=2)}""",
                },
            ],
            response_model=KnowledgeGraph,
        )  # type: ignore

        # Update the current state with the new updates
        cur_state = cur_state.update(new_updates)

        # Draw the current state of the graph
        cur_state.visualize_knowledge_graph() 
        
    # Return the final state of the KnowledgeGraph
    return cur_state
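
To see the accumulation pattern without calling a model, the sketch below replaces the API call with a stub. Everything here is hypothetical: `fake_extract` only parses strings of the form `source -label-> target`, the graph is a plain dictionary of sets, and unlike the real function the stub does not look at the current state.

```python
from functools import reduce

Graph = dict[str, set]  # {"nodes": {...}, "edges": {...}}


def fake_extract(chunk: str) -> Graph:
    """Hypothetical stand-in for the model call: parses 'source -label-> target'."""
    source, rest = chunk.split(" -", 1)
    label, target = rest.split("-> ", 1)
    return {"nodes": {source, target}, "edges": {(source, label, target)}}


def merge(state: Graph, update: Graph) -> Graph:
    """Union the node and edge sets, which removes duplicates automatically."""
    return {
        "nodes": state["nodes"] | update["nodes"],
        "edges": state["edges"] | update["edges"],
    }


chunks = [
    "Jason -is a-> professor",
    "Sarah -student of-> Jason",
    "Sarah -attends-> University of Toronto",
]

# Fold over the chunks, starting from an empty graph.
empty: Graph = {"nodes": set(), "edges": set()}
graph = reduce(lambda state, chunk: merge(state, fake_extract(chunk)), chunks, empty)

print(sorted(graph["nodes"]))
print(len(graph["edges"]))
```

"Jason" and "Sarah" each appear in two chunks but end up as a single node, which is the reuse behavior the system prompt asks the model for.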

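Example Use Case¶

In this approach, we process the text one manageable chunk at a time.

This is particularly helpful when dealing with large amounts of text that may not fit into a single prompt.

It is especially useful in scenarios such as building a knowledge graph for a complex topic, where the information is spread across multiple documents or sections.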

text_chunks = [
    "Jason knows a lot about quantum mechanics. He is a physicist. He is a professor",
    "Professors are smart.",
    "Sarah knows Jason and is a student of his.",
    "Sarah is a student at the University of Toronto. and UofT is in Canada.",
]

graph: KnowledgeGraph = generate_graph(text_chunks)

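Conclusion¶

This tutorial showed how to generate and visualize a knowledge graph for a complex topic. It also demonstrated how to extract graph knowledge from a language model or from provided text. The tutorial emphasized the iterative process of building a knowledge graph by processing smaller chunks of text and updating the graph with the new information.

Using this approach, we can extract a variety of things, including:

People and their relationships in a story.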

class People(BaseModel):
    id: str
    name: str
    description: str

class Relationship(BaseModel):
    id: str
    source: str
    target: str
    label: str
    description: str

class Story(BaseModel):
    people: list[People]
    relationships: list[Relationship]

Task dependencies and action items from a transcript.

class Task(BaseModel):
    id: str
    name: str
    description: str

class Participant(BaseModel):
    id: str
    name: str
    description: str

class Assignment(BaseModel):
    id: str
    source: str
    target: str
    label: str
    description: str

class Transcript(BaseModel):
    tasks: list[Task]
    participants: list[Participant]
    assignments: list[Assignment]

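Key concepts and their relationships in a research paper.
Entities and their relationships in a news article.

As an exercise, try implementing one of the examples above.

All of these follow the same idea: iteratively extract more and more information and accumulate it into some state.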