Customizing Storage
By default, LlamaIndex hides away the complexity and lets you query your data in under 5 lines of code:
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the documents.")
```

Under the hood, LlamaIndex also supports a swappable storage layer that allows you to customize where ingested documents (i.e., Node objects), embedding vectors, and index metadata are stored.

To do this, instead of the high-level API,

```python
index = VectorStoreIndex.from_documents(documents)
```

we use a lower-level API that gives more granular control:
```python
from llama_index.core import StorageContext
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.storage.index_store import SimpleIndexStore
from llama_index.core.vector_stores import SimpleVectorStore
from llama_index.core.node_parser import SentenceSplitter

# create parser and parse document into nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

# create storage context using default stores
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(),
    index_store=SimpleIndexStore(),
)

# create (or load) docstore and add nodes
storage_context.docstore.add_documents(nodes)

# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)

# save index
index.storage_context.persist(persist_dir="<persist_dir>")

# can also set index_id to save multiple indexes to the same folder
index.set_index_id("<index_id>")
index.storage_context.persist(persist_dir="<persist_dir>")

# to load an index later, make sure you set up the storage context
# this will load the persisted stores from persist_dir
storage_context = StorageContext.from_defaults(persist_dir="<persist_dir>")

# then load the index object
from llama_index.core import load_index_from_storage

loaded_index = load_index_from_storage(storage_context)

# if loading an index from a persist_dir containing multiple indexes
loaded_index = load_index_from_storage(storage_context, index_id="<index_id>")

# if loading multiple indexes from a persist dir
loaded_indices = load_index_from_storage(
    storage_context, index_ids=["<index_id>", ...]
)
```

You can customize the underlying storage with a one-line change, instantiating different document stores, index stores, and vector stores. See the Document Stores, Vector Stores, and Index Stores guides for more details.
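The pluggable-store idea behind `StorageContext` can be sketched in plain Python. The sketch below is a hypothetical illustration of the pattern only (the class names here are made up, not LlamaIndex's): a storage bundle holds independent docstore and vector store objects, so swapping a backend is a one-line change at construction time.

```python
from dataclasses import dataclass, field

# Hypothetical minimal stand-ins for the stores -- these are NOT the
# LlamaIndex classes, just an illustration of the pluggable-store pattern.
@dataclass
class InMemoryDocStore:
    docs: dict = field(default_factory=dict)

    def add_documents(self, nodes):
        # nodes is a list of (node_id, text) pairs in this toy version
        for node_id, text in nodes:
            self.docs[node_id] = text

@dataclass
class InMemoryVectorStore:
    vectors: dict = field(default_factory=dict)

    def add(self, node_id, vector):
        self.vectors[node_id] = vector

@dataclass
class StorageBundle:
    # any object exposing the same methods can be swapped in here,
    # which is what makes customizing storage a one-line change
    docstore: InMemoryDocStore = field(default_factory=InMemoryDocStore)
    vector_store: InMemoryVectorStore = field(default_factory=InMemoryVectorStore)

ctx = StorageBundle(docstore=InMemoryDocStore(), vector_store=InMemoryVectorStore())
ctx.docstore.add_documents([("n1", "hello world")])
ctx.vector_store.add("n1", [0.1, 0.2])
print(ctx.docstore.docs["n1"])  # -> hello world
```

In the real library, each store implements a shared interface, so a persistent backend can replace an in-memory one without touching the indexing code.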
Most of our vector store integrations store the entire index (vectors + text) in the vector store itself. This comes with a major benefit: there is no need to explicitly persist the index as shown above, because the vector store is already hosting and persisting the data in our index.
The vector stores that support this practice include:
- AzureAISearchVectorStore
- ChatGPTRetrievalPluginClient
- CassandraVectorStore
- ChromaVectorStore
- EpsillaVectorStore
- DocArrayHnswVectorStore
- DocArrayInMemoryVectorStore
- JaguarVectorStore
- LanceDBVectorStore
- MetalVectorStore
- MilvusVectorStore
- MyScaleVectorStore
- OpensearchVectorStore
- PineconeVectorStore
- QdrantVectorStore
- TablestoreVectorStore
- RedisVectorStore
- UpstashVectorStore
- WeaviateVectorStore
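To see why no explicit persist step is needed with these integrations, here is a minimal, self-contained sketch (plain Python with made-up names, not the LlamaIndex API): because the store keeps embeddings *and* source text together, it can serve a query end-to-end on its own.

```python
import math

class TextVectorStore:
    """Toy store that keeps embeddings and source text together, so the
    store alone can answer queries -- no separate docstore or index
    persistence needed. (Illustrative only, not LlamaIndex code.)"""

    def __init__(self):
        self.rows = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self.rows.append((vector, text))

    def query(self, vector, top_k=1):
        # rank stored rows by cosine similarity to the query vector
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)

        ranked = sorted(self.rows, key=lambda r: cosine(vector, r[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]

store = TextVectorStore()
store.add([1.0, 0.0], "doc about cats")
store.add([0.0, 1.0], "doc about dogs")
print(store.query([0.9, 0.1]))  # -> ['doc about cats']
```

A hosted vector store works the same way at a larger scale: the service durably stores both pieces, so reconnecting to it is enough to rebuild a queryable index.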
Below is a small example using Pinecone:
```python
import pinecone
from llama_index.core import StorageContext, VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.pinecone import PineconeVectorStore

# Creating a Pinecone index
api_key = "api_key"
pinecone.init(api_key=api_key, environment="us-west1-gcp")
pinecone.create_index(
    "quickstart", dimension=1536, metric="euclidean", pod_type="p1"
)
index = pinecone.Index("quickstart")

# construct vector store
vector_store = PineconeVectorStore(pinecone_index=index)

# create storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# load documents
documents = SimpleDirectoryReader("./data").load_data()

# create index, which will insert documents/vectors to pinecone
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
```

If you have an existing vector store with data already loaded in, you can connect to it and create a VectorStoreIndex directly as follows:

```python
index = pinecone.Index("quickstart")
vector_store = PineconeVectorStore(pinecone_index=index)
loaded_index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
```