Question answering with LangChain, Deep Lake, & OpenAI

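This notebook shows how to implement a question answering system with LangChain, Deep Lake as a vector store, and OpenAI embeddings. We will take the following steps to achieve this: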

Load a Deep Lake text dataset
Initialize a Deep Lake vector store with LangChain
Add text to the vector store
Run queries on the database
Done!

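You can also follow other tutorials, such as question answering over any type of data (PDF, json, csv, text): chatting with any data stored in Deep Lake, code understanding, question answering over PDFs, or recommending songs.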

Load a Deep Lake text dataset

In this example, we will use a 20,000-sample subset of the cohere-wikipedia-22 dataset.

hub://activeloop/cohere-wikipedia-22-sample loaded successfully.

Dataset(path='hub://activeloop/cohere-wikipedia-22-sample', read_only=True, tensors=['ids', 'metadata', 'text'])

  tensor    htype     shape      dtype  compression
  -------  -------   -------    -------  -------
    ids     text    (20000, 1)    str     None
 metadata   json    (20000, 1)    str     None
   text     text    (20000, 1)    str     None

Let's take a look at a few samples:

['The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time. Also, the international standard notation of time (ISO 8601) is based on this format.', 'A time in the 24-hour clock is written in the form hours:minutes (for example, 01:23), or hours:minutes:seconds (01:23:45). Numbers under 10 have a zero in front (called a leading zero); e.g. 09:07. Under the 24-hour clock system, the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and ends at 24:00, which is identical to 00:00 of the following day. 12:00 can only be mid-day. Midnight is called 24:00 and is used to mean the end of the day and 00:00 is used to mean the beginning of the day. For example, you would say "Tuesday at 24:00" and "Wednesday at 00:00" to mean exactly the same time.', 'However, the US military prefers not to say 24:00 - they do not like to have two names for the same thing, so they always say "23:59", which is one minute before midnight.']

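LangChain's Deep Lake vector store

Let's define a dataset_path, which is where your Deep Lake vector store will keep the text embeddings.

We will set up OpenAI's text-embedding-3-small as our embedding function and initialize a Deep Lake vector store at dataset_path...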

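from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

# Path where the vector store is created (matches the summary output further down)
dataset_path = 'wikipedia-embeddings-deeplake'

embedding = OpenAIEmbeddings(model="text-embedding-3-small")
db = DeepLake(dataset_path, embedding=embedding, overwrite=True)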

...and populate it with samples, one batch at a time, using the add_texts method.

from tqdm.auto import tqdm

batch_size = 100

nsamples = 10  # for testing. Replace with len(ds) to append everything

for i in tqdm(range(0, nsamples, batch_size)):
    # find end of batch
    i_end = min(nsamples, i + batch_size)

    batch = ds[i:i_end]
    id_batch = batch.ids.data()["value"]
    text_batch = batch.text.data()["value"]
    meta_batch = batch.metadata.data()["value"]

    db.add_texts(text_batch, metadatas=meta_batch, ids=id_batch)
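The index arithmetic in the loop above can be checked on its own, without a dataset or an API key. The following is a minimal sketch; batch_bounds is a hypothetical helper introduced here for illustration and is not part of LangChain or Deep Lake.

```python
from typing import Iterator, Tuple


def batch_bounds(n_samples: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs that cover range(n_samples) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_samples, batch_size):
        # Clamp the end so the final batch stops at n_samples.
        yield start, min(n_samples, start + batch_size)


# With the test settings used above there is a single, partial batch.
print(list(batch_bounds(10, 100)))   # [(0, 10)]

# A larger run ends with a short batch when the sizes do not divide evenly.
print(list(batch_bounds(250, 100)))  # [(0, 100), (100, 200), (200, 250)]
```

Each pair corresponds to one ds[i:i_end] slice, and so to one add_texts call.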

creating embeddings: 100%|██████████| 1/1 [00:02<00:00,  2.11s/it]
100%|██████████| 10/10 [00:00<00:00, 462.42it/s]

Dataset(path='wikipedia-embeddings-deeplake', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype       shape      dtype   compression
  -------    -------     -------    -------    -------
   text       text       (10, 1)      str       None
 metadata     json       (10, 1)      str       None
 embedding  embedding   (10, 1536)  float32     None
    id        text       (10, 1)      str       None
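The embedding tensor in this summary has shape (10, 1536) and dtype float32, so each stored vector takes 1536 × 4 bytes. As a rough back-of-the-envelope sketch, the raw size of the embeddings can be estimated as below. The helper name embedding_bytes is hypothetical, and the figure assumes uncompressed float32 values and ignores any storage overhead Deep Lake adds.

```python
def embedding_bytes(n_vectors: int, dim: int = 1536, bytes_per_value: int = 4) -> int:
    """Raw size in bytes of an (n_vectors, dim) array; float32 uses 4 bytes per value."""
    return n_vectors * dim * bytes_per_value


# The 10 test samples embedded above.
print(embedding_bytes(10))      # 61440

# All 20,000 samples of the subset, if nsamples were set to len(ds).
print(embedding_bytes(20_000))  # 122880000
```

At 20,000 samples the raw vectors come to roughly 117 MiB, which is worth keeping in mind before replacing nsamples with len(ds).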

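Run user queries on the database

The underlying Deep Lake dataset object is accessible through db.vectorstore.dataset, and the data structure can be summarized using db.vectorstore.summary(), which shows 4 tensors with 10 samples, as in the summary output above.

We will now set up question answering on the vector store, with GPT-3.5-Turbo as our LLM.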

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Re-load the vector store in case it's no longer initialized
# db = DeepLake(dataset_path = dataset_path, embedding_function=embedding)

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model='gpt-3.5-turbo'),
    chain_type="stuff",
    retriever=db.as_retriever(),
)
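The chain_type="stuff" setting means that all retrieved passages are placed ("stuffed") into a single prompt together with the question. The sketch below illustrates that idea in plain Python. The function build_stuff_prompt and the prompt wording are assumptions made for illustration; they are not LangChain's actual template.

```python
from typing import List


def build_stuff_prompt(question: str, documents: List[str]) -> str:
    """Join all retrieved documents into one context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "Why does the military not say 24:00?",
    [
        "Midnight is called 24:00 and is used to mean the end of the day.",
        "However, the US military prefers not to say 24:00.",
    ],
)
print(prompt)
```

Because every passage goes into one prompt, this approach only works while the combined text fits within the model's context window.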

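Let's try running a prompt and check the output. Internally, this API performs an embedding search to find the most relevant data to feed into the LLM context.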
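To make the embedding search step concrete, here is a toy sketch that ranks stored vectors against a query vector by cosine similarity. The three-dimensional vectors, the passage labels, and the function names are made up for illustration. Cosine similarity is one common metric; the metric and search implementation Deep Lake actually uses may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    stored: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k stored texts most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy store: (text, embedding) pairs standing in for real 1536-dimensional vectors.
store = [
    ("passage about the 24-hour clock", [0.9, 0.1, 0.0]),
    ("passage about midnight notation", [0.7, 0.7, 0.0]),
    ("unrelated passage", [0.0, 0.0, 1.0]),
]
query_vector = [1.0, 0.2, 0.0]

results = top_k(query_vector, store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The texts returned by such a search are what end up in the LLM context alongside the question.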

Et voilà!

Sep 30, 2023
