Document Segmentation
In this guide, we demonstrate how to do document segmentation using an LLM's structured output capabilities. We will use command-r-plus, one of Cohere's latest LLMs with a 128k context length, and test the approach on an article explaining the Transformer architecture. The same document segmentation approach can be applied to any other domain where complex, long documents need to be broken down into smaller chunks.
Motivation
Sometimes we need a way to split a document into meaningful sections, each centered around a single key concept or idea. Simple length- or rule-based text splitters are not reliable enough. Consider the case where a document contains code snippets or math equations: we don't want to split on '\n\n' there, and we don't want to write a long list of rules for every different type of document. It turns out that LLMs with a sufficiently long context length are well suited for this task.
Defining the Data Structures
First, we need to define a Section class for each section of the document. The StructuredDocument class will then encapsulate a list of these sections.
Note that to avoid having the LLM regenerate the content of each section, we can simply enumerate each line of the input document and then ask the LLM to segment it by providing the start and end line numbers for each section.
from pydantic import BaseModel, Field
from typing import List


class Section(BaseModel):
    title: str = Field(description="main topic of this section of the document")
    start_index: int = Field(description="line number where the section begins")
    end_index: int = Field(description="line number where the section ends")


class StructuredDocument(BaseModel):
    """obtains meaningful sections, each centered around a single concept/topic"""
    sections: List[Section] = Field(description="a list of sections of the document")
Document Preprocessing
Preprocess the input document by prepending each of its lines with a line number.
def doc_with_lines(document):
    document_lines = document.split("\n")
    document_with_line_numbers = ""
    line2text = {}
    for i, line in enumerate(document_lines):
        document_with_line_numbers += f"[{i}] {line}\n"
        line2text[i] = line
    return document_with_line_numbers, line2text
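As a quick sanity check, here is what the helper returns for a small made-up snippet (the sample text below is purely illustrative):

# Illustrative only: a tiny three-line document to show the output format.
sample_text = "Transformers use attention.\n\nAttention uses queries, keys, and values."
numbered, mapping = doc_with_lines(sample_text)
print(numbered)
# [0] Transformers use attention.
# [1]
# [2] Attention uses queries, keys, and values.
print(mapping[2])
# Attention uses queries, keys, and values.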
Segmentation
Next, use a Cohere client to extract a StructuredDocument from the preprocessed document.
import instructor
import cohere

# Apply the patch to the cohere client
# enables response_model keyword
client = instructor.from_cohere(cohere.Client())


system_prompt = f"""\
You are a world class educator working on organizing your lecture notes.
Read the document below and extract a StructuredDocument object from it where each section of the document is centered around a single concept/topic that can be taught in one lesson.
Each line of the document is marked with its line number in square brackets (e.g. [1], [2], [3], etc). Use the line numbers to indicate section start and end.
"""


def get_structured_document(document_with_line_numbers) -> StructuredDocument:
    return client.chat.completions.create(
        model="command-r-plus",
        response_model=StructuredDocument,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": document_with_line_numbers,
            },
        ],
    )  # type: ignore
Next, we need to get the text of each section based on its start/end indices and the line2text dictionary from the preprocessing step.
def get_sections_text(structured_doc, line2text):
    segments = []
    for s in structured_doc.sections:
        contents = []
        for line_id in range(s.start_index, s.end_index):
            contents.append(line2text.get(line_id, ''))
        segments.append(
            {
                "title": s.title,
                "content": "\n".join(contents),
                "start": s.start_index,
                "end": s.end_index,
            }
        )
    return segments
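To see what get_sections_text produces without calling the model, here is a small illustrative run with a hand-written StructuredDocument standing in for an LLM response (the text and section boundaries are made up; note that range() treats end_index as exclusive here):

# Illustrative only: a hand-crafted response instead of a real model output.
demo_text = "Intro line\nMore intro\nNew topic\nDetails"
_, demo_line2text = doc_with_lines(demo_text)
demo_doc = StructuredDocument(
    sections=[
        Section(title="Intro", start_index=0, end_index=2),
        Section(title="New topic", start_index=2, end_index=4),
    ]
)
for seg in get_sections_text(demo_doc, demo_line2text):
    print(seg["start"], seg["end"], seg["title"], repr(seg["content"]))
# 0 2 Intro 'Intro line\nMore intro'
# 2 4 New topic 'New topic\nDetails'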
Example
Here's an example of using these classes and functions to segment Sebastian Raschka's tutorial on Transformers. We can use the trafilatura package to scrape the web page content of the article.
from trafilatura import fetch_url, extract
url = 'https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html'
downloaded = fetch_url(url)
document = extract(downloaded)
document_with_line_numbers, line2text = doc_with_lines(document)
structured_doc = get_structured_document(document_with_line_numbers)
segments = get_sections_text(structured_doc, line2text)
print(segments[5]['title'])
"""
Introduction to Multi-Head Attention
"""
print(segments[5]['content'])
"""
Multi-Head Attention
In the very first figure, at the top of this article, we saw that transformers use a module called multi-head attention. How does that relate to the self-attention mechanism (scaled-dot product attention) we walked through above?
In the scaled dot-product attention, the input sequence was transformed using three matrices representing the query, key, and value. These three matrices can be considered as a single attention head in the context of multi-head attention. The figure below summarizes this single attention head we covered previously:
As its name implies, multi-head attention involves multiple such heads, each consisting of query, key, and value matrices. This concept is similar to the use of multiple kernels in convolutional neural networks.
To illustrate this in code, suppose we have 3 attention heads, so we now extend the \(d' \times d\) dimensional weight matrices so \(3 \times d' \times d\):
In:
h = 3
multihead_W_query = torch.nn.Parameter(torch.rand(h, d_q, d))
multihead_W_key = torch.nn.Parameter(torch.rand(h, d_k, d))
multihead_W_value = torch.nn.Parameter(torch.rand(h, d_v, d))
Consequently, each query element is now \(3 \times d_q\) dimensional, where \(d_q=24\) (here, let’s keep the focus on the 3rd element corresponding to index position 2):
In:
multihead_query_2 = multihead_W_query.matmul(x_2)
print(multihead_query_2.shape)
Out:
torch.Size([3, 24])
"""