Isaacus Embeddings
The llama-index-embeddings-isaacus package provides the LlamaIndex integration for building applications with Isaacus legal AI embedding models. It lets you easily connect to and use Kanon 2 Embedder, the most accurate legal embedding model in the world on the Massive Legal Embedding Benchmark (MLEB).
Isaacus embeddings support task-specific optimization:

- task="retrieval/query": optimizes embeddings for search queries
- task="retrieval/document": optimizes embeddings for documents to be indexed
In this notebook, we demonstrate legal document retrieval with Isaacus embeddings.
First, install the necessary integration packages. If you're opening this notebook on Colab, you will probably also need to install LlamaIndex 🦙.
```
%pip install llama-index-embeddings-isaacus
%pip install llama-index-llms-openai
%pip install llama-index
```

Get your Isaacus API key
- Create an account on the Isaacus Platform
- Add a payment method to claim your free credits
- Create an API key
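As an alternative to hardcoding the key in the next cell, you can export ISAACUS_API_KEY in your shell and read it from the environment; a small convenience sketch, not part of the integration itself:

```python
import os

# Optional: read the key from the environment instead of hardcoding it
api_key = os.environ.get("ISAACUS_API_KEY")
if not api_key:
    print("Warning: ISAACUS_API_KEY is not set")
```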
```python
import os

# Set your Isaacus API key
isaacus_api_key = "YOUR_ISAACUS_API_KEY"
os.environ["ISAACUS_API_KEY"] = isaacus_api_key

from llama_index.embeddings.isaacus import IsaacusEmbedding
```
```python
# Initialize the Isaacus embedding model
embed_model = IsaacusEmbedding(
    api_key=isaacus_api_key,
    model="kanon-2-embedder",
)

# Get a single embedding
embedding = embed_model.get_text_embedding(
    "This agreement shall be governed by the laws of Delaware."
)
```
```python
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")

# Get embeddings for multiple legal texts
legal_texts = [
    "The parties agree to binding arbitration.",
    "Confidential information shall not be disclosed.",
    "This contract may be terminated with 30 days notice.",
]

embeddings = embed_model.get_text_embedding_batch(legal_texts)

print(f"Number of embeddings: {len(embeddings)}")
print(f"Each embedding has {len(embeddings[0])} dimensions")
```

Isaacus embeddings support different tasks for optimal performance:
- retrieval/document: for documents to be indexed
- retrieval/query: for search queries

Using the appropriate task improves retrieval accuracy.
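Under both tasks, relevance between a query embedding and document embeddings is typically scored with cosine similarity. A self-contained sketch with made-up toy vectors (not real Kanon 2 output, which has far more dimensions):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Retrieval score: cosine of the angle between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors standing in for real embeddings
query_vec = [0.1, 0.8, 0.3, 0.1]
related_doc = [0.2, 0.7, 0.4, 0.0]
unrelated_doc = [0.9, 0.0, 0.1, 0.8]

# A related document scores higher than an unrelated one
print(cosine_similarity(query_vec, related_doc) > cosine_similarity(query_vec, unrelated_doc))  # True
```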
```python
# For documents (use when indexing)
doc_embed_model = IsaacusEmbedding(
    api_key=isaacus_api_key,
    task="retrieval/document",
)

doc_embedding = doc_embed_model.get_text_embedding(
    "The Company has the right to terminate this agreement."
)

print(f"Document embedding dimension: {len(doc_embedding)}")

# For queries (automatically used by get_query_embedding)
query_embedding = embed_model.get_query_embedding(
    "What are the termination conditions?"
)

print(f"Query embedding dimension: {len(query_embedding)}")
```

You can reduce the embedding dimensionality for faster search and lower storage costs.
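To get a rough sense of the storage savings, here is a back-of-the-envelope sketch assuming float32 vectors and a hypothetical corpus of 100,000 chunks (actual sizes depend on your vector store):

```python
def storage_mb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    # Approximate index size for float32 embeddings, in mebibytes
    return num_vectors * dims * bytes_per_value / 1024 / 1024


num_chunks = 100_000  # hypothetical corpus size
print(f"1792 dims: {storage_mb(num_chunks, 1792):.0f} MB")  # prints "1792 dims: 684 MB"
print(f"512 dims: {storage_mb(num_chunks, 512):.0f} MB")  # prints "512 dims: 195 MB"
```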
```python
# Use reduced dimensions (default is 1792)
embed_model_512 = IsaacusEmbedding(
    api_key=isaacus_api_key,
    dimensions=512,
)

embedding_512 = embed_model_512.get_text_embedding("Legal text example")

print(f"Reduced embedding dimension: {len(embedding_512)}")
```

Now let's build a complete RAG pipeline over a legal document (Uber's 10-K SEC filing) using Isaacus embeddings.
```python
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.core.response.notebook_utils import display_source_node
from IPython.display import Markdown, display
```

We'll use Uber's 10-K SEC filing, which is dense with legal and regulatory language, making it a good showcase for Kanon 2's legal-domain expertise.
```
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
```

```python
documents = SimpleDirectoryReader("./data/10k/").load_data()
print(f"Loaded {len(documents)} document(s)")
```

We use task="retrieval/document" when building the index to optimize the embeddings for document storage.
```python
# Initialize embedding model for documents
embed_model = IsaacusEmbedding(
    api_key=isaacus_api_key,
    model="kanon-2-embedder",
    task="retrieval/document",
)

# Build the index
index = VectorStoreIndex.from_documents(
    documents=documents,
    embed_model=embed_model,
)
```

Now let's query the index with legal questions. Note that get_query_embedding automatically uses task="retrieval/query" for optimal query performance.
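Conceptually, retrieval over the index is top-k nearest-neighbor search across the stored chunk embeddings. A minimal brute-force sketch of the idea with toy 2-D vectors (real vector stores use the same cosine scoring, plus approximate-search structures at scale):

```python
import heapq
import math


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 3):
    # Score every stored vector against the query, keep the k best
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (
            math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        )

    scored = [(cos(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return heapq.nlargest(k, scored)  # (score, doc index) pairs, best first


# Toy 2-D "document embeddings"
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
print(top_k([1.0, 0.05], docs, k=2))
```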
```python
# Create a retriever
retriever = index.as_retriever(similarity_top_k=3)

# Query about risk factors
retrieved_nodes = retriever.retrieve(
    "What are the main risk factors mentioned in the document?"
)

print(f"Retrieved {len(retrieved_nodes)} nodes\n")

for i, node in enumerate(retrieved_nodes):
    print(f"\n--- Node {i + 1} (Score: {node.score:.4f}) ---")
    display_source_node(node, source_length=500)

# Query about legal proceedings
retrieved_nodes = retriever.retrieve(
    "What legal proceedings or litigation is the company involved in?"
)

print(f"Retrieved {len(retrieved_nodes)} nodes\n")

for i, node in enumerate(retrieved_nodes):
    print(f"\n--- Node {i + 1} (Score: {node.score:.4f}) ---")
    display_source_node(node, source_length=500)
```

Combine Isaacus embeddings with an LLM for end-to-end question answering:
```python
import os

# Set your OpenAI API key
openai_api_key = "YOUR_OPENAI_API_KEY"
os.environ["OPENAI_API_KEY"] = openai_api_key

# Set up the LLM
llm = OpenAI(model="gpt-4o-mini", temperature=0)

# Create a query engine
query_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=5,
)

# Ask a legal question
response = query_engine.query(
    "What are the company's main regulatory and legal risks?"
)
display(Markdown(f"**Answer:** {response}"))

response = query_engine.query(
    "What intellectual property does the company rely on?"
)
display(Markdown(f"**Answer:** {response}"))
```

Isaacus embeddings also support async operations for better performance in asynchronous applications:
```python
import asyncio


async def get_embeddings_async():
    embed_model = IsaacusEmbedding(
        api_key=isaacus_api_key,
    )

    # Get a single embedding asynchronously
    embedding = await embed_model.aget_text_embedding(
        "Async legal document text"
    )

    # Get batch embeddings asynchronously
    embeddings = await embed_model.aget_text_embedding_batch(
        ["Text 1", "Text 2", "Text 3"]
    )

    return embedding, embeddings
```
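The benefit of the async methods shows up when many embedding requests run concurrently. A self-contained sketch of the pattern with asyncio.gather, where fake_embed is a hypothetical stand-in that simulates network latency rather than calling the real aget_text_embedding:

```python
import asyncio
import time


async def fake_embed(text: str) -> list[float]:
    # Stand-in for a single embedding API call; the sleep simulates
    # ~0.1 s of network latency (no real API is contacted here)
    await asyncio.sleep(0.1)
    return [0.0] * 4


async def embed_concurrently(texts: list[str]) -> list[list[float]]:
    # Fire all requests at once and await them together
    return await asyncio.gather(*(fake_embed(t) for t in texts))


start = time.perf_counter()
results = asyncio.run(embed_concurrently([f"Doc {i}" for i in range(10)]))
elapsed = time.perf_counter() - start

# Ten 0.1 s calls finish in roughly 0.1 s total, not ~1 s
print(f"{len(results)} embeddings in {elapsed:.2f} s")
```

In a notebook, replace asyncio.run(...) with a top-level await, as in the surrounding cells.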
```python
# Run the async function (notebooks support top-level await)
embedding, embeddings = await get_embeddings_async()

print(f"Async single embedding dimension: {len(embedding)}")
print(
    f"Async batch: {len(embeddings)} embeddings of {len(embeddings[0])} dimensions each"
)
```

In this notebook, we demonstrated:
- Basic usage: getting single and batch embeddings
- Task-specific optimization: using retrieval/document for indexing and retrieval/query for search
- Dimensionality reduction: shrinking embeddings for efficiency
- A legal RAG pipeline: building a complete retrieval system over a legal document (Uber's 10-K filing)
- Async operations: using async methods for better performance
Kanon 2 Embedder excels at legal document understanding and retrieval, making it an ideal choice for legal tech applications, compliance tooling, contract analysis, and more.