创建种子数据¶
In [ ]:
Copied!
%pip install llama-index-storage-docstore-redis
%pip install llama-index-storage-docstore-mongodb
%pip install llama-index-embeddings-huggingface
%pip install llama-index-storage-docstore-redis
%pip install llama-index-storage-docstore-mongodb
%pip install llama-index-embeddings-huggingface
In [ ]:
Copied!
# Make some test data
!mkdir -p data
!echo "This is a test file: one!" > data/test1.txt
!echo "This is a test file: two!" > data/test2.txt
# 创建一些测试数据
!mkdir -p data
!echo "这是一个测试文件:第一个!" > data/test1.txt
!echo "这是一个测试文件:第二个!" > data/test2.txt
In [ ]:
Copied!
from llama_index.core import SimpleDirectoryReader
# load documents with deterministic IDs
documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
from llama_index.core import SimpleDirectoryReader
# 加载具有确定性ID的文档
documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
/home/loganm/.cache/pypoetry/virtualenvs/llama-index-4a-wkI5X-py3.11/lib/python3.11/site-packages/deeplake/util/check_latest_version.py:32: UserWarning: A newer version of deeplake (3.8.9) is available. It's recommended that you update to the latest version using `pip install -U deeplake`. warnings.warn(
使用文档存储创建管道¶
In [ ]:
Copied!
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.storage.docstore.redis import RedisDocumentStore
from llama_index.storage.docstore.mongodb import MongoDocumentStore
from llama_index.core.node_parser import SentenceSplitter
pipeline = IngestionPipeline(
transformations=[
SentenceSplitter(),
HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
],
docstore=SimpleDocumentStore(),
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.storage.docstore.redis import RedisDocumentStore
from llama_index.storage.docstore.mongodb import MongoDocumentStore
from llama_index.core.node_parser import SentenceSplitter
pipeline = IngestionPipeline(
transformations=[
SentenceSplitter(),
HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
],
docstore=SimpleDocumentStore(),
)
In [ ]:
Copied!
nodes = pipeline.run(documents=documents)
nodes = pipeline.run(documents=documents)
Docstore strategy set to upserts, but no vector store. Switching to duplicates_only strategy.
In [ ]:
Copied!
print(f"Ingested {len(nodes)} Nodes")
print(f"已处理 {len(nodes)} 个节点")
Ingested 2 Nodes
In [ ]:
Copied!
pipeline.persist("./pipeline_storage")
pipeline.persist("./pipeline_storage")
In [ ]:
Copied!
pipeline = IngestionPipeline(
transformations=[
SentenceSplitter(),
HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
]
)
# restore the pipeline
pipeline.load("./pipeline_storage")
pipeline = IngestionPipeline(
transformations=[
SentenceSplitter(),
HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
]
)
# 恢复管道
pipeline.load("./pipeline_storage")
In [ ]:
Copied!
!echo "This is a test file: three!" > data/test3.txt
!echo "This is a NEW test file: one!" > data/test1.txt
!echo "这是一个测试文件:第三个!" > data/test3.txt
!echo "这是一个新的测试文件:第一个!" > data/test1.txt
In [ ]:
Copied!
documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
In [ ]:
Copied!
nodes = pipeline.run(documents=documents)
nodes = pipeline.run(documents=documents)
Docstore strategy set to upserts, but no vector store. Switching to duplicates_only strategy.
In [ ]:
Copied!
print(f"Ingested {len(nodes)} Nodes")
print(f"已处理 {len(nodes)} 个节点")
Ingested 2 Nodes
让我们确认哪些节点已被摄取:
In [ ]:
Copied!
for node in nodes:
print(f"Node: {node.text}")
for node in nodes:
print(f"Node: {node.text}")
Node: This is a NEW test file: one! Node: This is a test file: three!
我们还可以验证文档存储中仅跟踪了三个文档:
In [ ]:
Copied!
print(len(pipeline.docstore.docs))
print(len(pipeline.docstore.docs))
3