
Isaacus Embeddings

The llama-index-embeddings-isaacus package provides a LlamaIndex integration for building applications with Isaacus legal AI embedding models. It lets you easily connect to and use Kanon 2 Embedder, the most accurate legal embedding model on the Massive Legal Embedding Benchmark (MLEB).

Isaacus embeddings support task-specific optimization:

  • task="retrieval/query": optimizes embeddings for search queries
  • task="retrieval/document": optimizes embeddings for documents to be indexed

In this notebook, we demonstrate legal document retrieval with Isaacus embeddings.

Install the required integration packages.

If you're opening this notebook on Colab, you may need to install LlamaIndex 🦙.

%pip install llama-index-embeddings-isaacus
%pip install llama-index-llms-openai
%pip install llama-index

Get your Isaacus API key

  1. Create an account on the Isaacus Platform
  2. Add a payment method to claim your free credits
  3. Create an API key
import os
# Set your Isaacus API key
isaacus_api_key = "YOUR_ISAACUS_API_KEY"
os.environ["ISAACUS_API_KEY"] = isaacus_api_key
from llama_index.embeddings.isaacus import IsaacusEmbedding
# Initialize the Isaacus Embedding model
embed_model = IsaacusEmbedding(
    api_key=isaacus_api_key,
    model="kanon-2-embedder",
)
# Get a single embedding
embedding = embed_model.get_text_embedding(
    "This agreement shall be governed by the laws of Delaware."
)
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
# Get embeddings for multiple legal texts
legal_texts = [
    "The parties agree to binding arbitration.",
    "Confidential information shall not be disclosed.",
    "This contract may be terminated with 30 days notice.",
]
embeddings = embed_model.get_text_embedding_batch(legal_texts)
print(f"Number of embeddings: {len(embeddings)}")
print(f"Each embedding has {len(embeddings[0])} dimensions")
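Once you have embedding vectors, retrieval boils down to comparing them, typically by cosine similarity (higher means more semantically similar). A minimal sketch of that comparison, using toy three-dimensional vectors as stand-ins for real Kanon 2 embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
v1 = [0.1, 0.3, 0.5]
v2 = [0.1, 0.3, 0.5]
v3 = [0.5, -0.2, 0.1]

print(cosine_similarity(v1, v2))  # identical vectors -> 1.0
print(cosine_similarity(v1, v3))  # dissimilar vectors score lower
```

In practice you rarely compute this by hand; LlamaIndex's vector index does it for you, as shown later in this notebook.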

Isaacus embeddings support different tasks for optimal performance:

  • retrieval/document: for documents to be indexed
  • retrieval/query: for search queries

Using the appropriate task improves retrieval accuracy.

# For documents (use when indexing)
doc_embed_model = IsaacusEmbedding(
    api_key=isaacus_api_key,
    task="retrieval/document",
)
doc_embedding = doc_embed_model.get_text_embedding(
    "The Company has the right to terminate this agreement."
)
print(f"Document embedding dimension: {len(doc_embedding)}")
# For queries (automatically used by get_query_embedding)
query_embedding = embed_model.get_query_embedding(
    "What are the termination conditions?"
)
print(f"Query embedding dimension: {len(query_embedding)}")

You can reduce the embedding dimensionality for faster search and lower storage costs:

# Use reduced dimensions (default is 1792)
embed_model_512 = IsaacusEmbedding(
    api_key=isaacus_api_key,
    dimensions=512,
)
embedding_512 = embed_model_512.get_text_embedding("Legal text example")
print(f"Reduced embedding dimension: {len(embedding_512)}")
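To see why dimensionality reduction matters, here is a back-of-the-envelope storage comparison between the default 1792-dimensional vectors and 512-dimensional ones; the 100,000-vector corpus size and float32 (4 bytes per value) storage are illustrative assumptions:

```python
# Back-of-the-envelope storage comparison (assumes float32, i.e. 4 bytes/value)
BYTES_PER_FLOAT = 4
num_vectors = 100_000  # hypothetical corpus size

full = 1792 * BYTES_PER_FLOAT * num_vectors
reduced = 512 * BYTES_PER_FLOAT * num_vectors

print(f"1792-dim: {full / 1e6:.1f} MB")   # 716.8 MB
print(f"512-dim:  {reduced / 1e6:.1f} MB")  # 204.8 MB
```

Reduced dimensions cut storage (and similarity-search cost) proportionally, at some cost in retrieval quality.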

Now let's build a complete RAG pipeline with Isaacus embeddings over a legal document (Uber's 10-K SEC filing).

import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.core.response.notebook_utils import display_source_node
from IPython.display import Markdown, display

We'll use Uber's 10-K SEC filing, which contains legal and regulatory information, making it a good showcase for Kanon 2's legal-domain expertise.

!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
documents = SimpleDirectoryReader("./data/10k/").load_data()
print(f"Loaded {len(documents)} document(s)")

We use task="retrieval/document" when building the index to optimize the embeddings for document storage.

# Initialize embedding model for documents
embed_model = IsaacusEmbedding(
    api_key=isaacus_api_key,
    model="kanon-2-embedder",
    task="retrieval/document",
)
# Build the index
index = VectorStoreIndex.from_documents(
    documents=documents,
    embed_model=embed_model,
)

Now we'll query the index with legal questions. Note that get_query_embedding automatically uses task="retrieval/query" for optimal query performance.

# Create a retriever
retriever = index.as_retriever(similarity_top_k=3)
# Query about risk factors
retrieved_nodes = retriever.retrieve(
    "What are the main risk factors mentioned in the document?"
)
print(f"Retrieved {len(retrieved_nodes)} nodes\n")
for i, node in enumerate(retrieved_nodes):
    print(f"\n--- Node {i+1} (Score: {node.score:.4f}) ---")
    display_source_node(node, source_length=500)
# Query about legal proceedings
retrieved_nodes = retriever.retrieve(
    "What legal proceedings or litigation is the company involved in?"
)
print(f"Retrieved {len(retrieved_nodes)} nodes\n")
for i, node in enumerate(retrieved_nodes):
    print(f"\n--- Node {i+1} (Score: {node.score:.4f}) ---")
    display_source_node(node, source_length=500)

Combine Isaacus embeddings with an LLM for full question answering:

import os
# Set your OpenAI API key
openai_api_key = "YOUR_OPENAI_API_KEY"
os.environ["OPENAI_API_KEY"] = openai_api_key
# Set up LLM
llm = OpenAI(model="gpt-4o-mini", temperature=0)
# Create query engine
query_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=5,
)
# Ask a legal question
response = query_engine.query(
    "What are the company's main regulatory and legal risks?"
)
display(Markdown(f"**Answer:** {response}"))
response = query_engine.query(
    "What intellectual property does the company rely on?"
)
display(Markdown(f"**Answer:** {response}"))

Isaacus embeddings also support async operations for better performance in asynchronous applications:

import asyncio
async def get_embeddings_async():
    embed_model = IsaacusEmbedding(
        api_key=isaacus_api_key,
    )
    # Get async single embedding
    embedding = await embed_model.aget_text_embedding(
        "Async legal document text"
    )
    # Get async batch embeddings
    embeddings = await embed_model.aget_text_embedding_batch(
        ["Text 1", "Text 2", "Text 3"]
    )
    return embedding, embeddings
# Run async function (top-level await works in notebooks)
embedding, embeddings = await get_embeddings_async()
print(f"Async single embedding dimension: {len(embedding)}")
print(
    f"Async batch: {len(embeddings)} embeddings of {len(embeddings[0])} dimensions each"
)
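The async methods pay off when you embed many texts concurrently rather than awaiting them one at a time. A minimal sketch of that fan-out pattern with asyncio.gather; the fake_embed coroutine is a hypothetical stand-in for embed_model.aget_text_embedding so the sketch runs without an API key:

```python
import asyncio

async def fake_embed(text: str) -> list[float]:
    # Hypothetical stand-in for embed_model.aget_text_embedding(text)
    await asyncio.sleep(0.01)  # simulates network latency
    return [float(len(text))] * 3

async def embed_all(texts: list[str]) -> list[list[float]]:
    # Launch all requests concurrently instead of awaiting them one by one
    return await asyncio.gather(*(fake_embed(t) for t in texts))

embeddings = asyncio.run(embed_all(["clause A", "clause B", "clause C"]))
print(len(embeddings))  # 3
```

With real API calls, the same pattern overlaps network latency across requests instead of paying it serially.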

In this notebook, we demonstrated:

  1. Basic usage - getting single and batch embeddings
  2. Task-specific optimization - using retrieval/document for indexing and retrieval/query for search
  3. Dimensionality reduction - shrinking embeddings for efficiency
  4. Legal RAG pipeline - building a complete retrieval system over a legal document (Uber's 10-K filing)
  5. Async operations - using async methods for better performance

Kanon 2 Embedder excels at legal document understanding and retrieval, making it a strong fit for legal tech applications, compliance tooling, contract analysis, and more.