
Pydantic

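Pydantic is a data validation library for Python. LanceDB integrates with Pydantic for schema inference, data ingestion, and query result casting. With LanceModel, users can combine Pydantic models with the rest of the LanceDB APIs.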

import lancedb
from lancedb.pydantic import Vector, LanceModel

class PersonModel(LanceModel):
    name: str
    age: int
    vector: Vector(2)


url = "./example"
db = lancedb.connect(url)
table = db.create_table("person", schema=PersonModel)
table.add(
    [
        PersonModel(name="bob", age=1, vector=[1.0, 2.0]),
        PersonModel(name="alice", age=2, vector=[3.0, 4.0]),
    ]
)
assert table.count_rows() == 2
person = table.search([0.0, 0.0]).limit(1).to_pydantic(PersonModel)
assert person[0].name == "bob"

Vector Field

LanceDB provides the Vector(dim) function to define a vector field in a Pydantic model.

lancedb.pydantic.Vector

Vector(dim: int, value_type: DataType = pa.float32(), nullable: bool = True) -> Type[FixedSizeListMixin]

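Pydantic Vector Type.

Warning

Experimental feature.

Parameters:

  • dim (int) –

    The dimension of the vector.

  • value_type (DataType, default: float32() ) –

    The value type of the vector, by default pa.float32().

  • nullable (bool, default: True ) –

    Whether the vector is nullable, by default True.

Examples: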

>>> import pyarrow as pa
>>> import pydantic
>>> from lancedb.pydantic import Vector, pydantic_to_schema
...
>>> class MyModel(pydantic.BaseModel):
...     id: int
...     url: str
...     embeddings: Vector(768)
>>> schema = pydantic_to_schema(MyModel)
>>> assert schema == pa.schema([
...     pa.field("id", pa.int64(), False),
...     pa.field("url", pa.utf8(), False),
...     pa.field("embeddings", pa.list_(pa.float32(), 768))
... ])
Source code in lancedb/pydantic.py
def Vector(
    dim: int, value_type: pa.DataType = pa.float32(), nullable: bool = True
) -> Type[FixedSizeListMixin]:
    """Pydantic Vector Type.

    !!! warning
        Experimental feature.

    Parameters
    ----------
    dim : int
        The dimension of the vector.
    value_type : pyarrow.DataType, optional
        The value type of the vector, by default pa.float32()
    nullable : bool, optional
        Whether the vector is nullable, by default it is True.

    Examples
    --------

    >>> import pydantic
    >>> from lancedb.pydantic import Vector
    ...
    >>> class MyModel(pydantic.BaseModel):
    ...     id: int
    ...     url: str
    ...     embeddings: Vector(768)
    >>> schema = pydantic_to_schema(MyModel)
    >>> assert schema == pa.schema([
    ...     pa.field("id", pa.int64(), False),
    ...     pa.field("url", pa.utf8(), False),
    ...     pa.field("embeddings", pa.list_(pa.float32(), 768))
    ... ])
    """

    # TODO: make a public parameterized type.
    class FixedSizeList(list, FixedSizeListMixin):
        def __repr__(self):
            return f"FixedSizeList(dim={dim})"

        @staticmethod
        def nullable() -> bool:
            return nullable

        @staticmethod
        def dim() -> int:
            return dim

        @staticmethod
        def value_arrow_type() -> pa.DataType:
            return value_type

        @classmethod
        def __get_pydantic_core_schema__(
            cls, _source_type: Any, _handler: pydantic.GetCoreSchemaHandler
        ) -> CoreSchema:
            return core_schema.no_info_after_validator_function(
                cls,
                core_schema.list_schema(
                    min_length=dim,
                    max_length=dim,
                    items_schema=core_schema.float_schema(),
                ),
            )

        @classmethod
        def __get_validators__(cls) -> Generator[Callable, None, None]:
            yield cls.validate

        # For pydantic v1
        @classmethod
        def validate(cls, v):
            if not isinstance(v, (list, range, np.ndarray)) or len(v) != dim:
                raise TypeError("A list of numbers or numpy.ndarray is needed")
            return cls(v)

        if PYDANTIC_VERSION.major < 2:

            @classmethod
            def __modify_schema__(cls, field_schema: Dict[str, Any]):
                field_schema["items"] = {"type": "number"}
                field_schema["maxItems"] = dim
                field_schema["minItems"] = dim

    return FixedSizeList
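
The listing above builds a new list subclass on every call, and that class captures dim through a closure. The pattern can be sketched with the standard library alone. The name make_fixed_size_list is hypothetical and the sketch leaves out the Pydantic and Arrow hooks, so it is not the library's implementation.

```python
from typing import Sequence, Type


def make_fixed_size_list(dim: int) -> Type[list]:
    """Hypothetical stand-in for Vector(dim): a list subclass bound to `dim`."""

    class FixedSizeList(list):
        def __init__(self, values: Sequence[float]):
            values = list(values)
            # Reject inputs whose length differs from the captured dimension.
            if len(values) != dim:
                raise ValueError(f"expected {dim} values, got {len(values)}")
            super().__init__(float(v) for v in values)

        def __repr__(self) -> str:
            return f"FixedSizeList(dim={dim})"

        @staticmethod
        def dim() -> int:
            # Inside a method body, `dim` resolves to the enclosing function's
            # argument, because method bodies skip the class scope.
            return dim

    return FixedSizeList


Vec2 = make_fixed_size_list(2)
v = Vec2([1, 2])
print(v.dim(), list(v))
```

Each call to the factory yields a distinct class, so make_fixed_size_list(2) and make_fixed_size_list(3) enforce different lengths independently.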

Type Conversion

LanceDB automatically converts Pydantic fields to Apache Arrow data types.

Currently supported type conversions:

Pydantic Field Type    PyArrow Data Type
int                    pyarrow.int64()
float                  pyarrow.float64()
bool                   pyarrow.bool_()
str                    pyarrow.utf8()
list                   pyarrow.list_(item_type)
BaseModel              pyarrow.struct(fields)
Vector(n)              pyarrow.list_(pyarrow.float32(), n)
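
The conversion table is applied recursively: list items and nested models are converted with the same rules. The sketch below illustrates that recursion with the standard library only. It uses dataclasses as a stand-in for Pydantic models and returns Arrow type names as plain strings. The helper arrow_type_name and the classes Address and Person are hypothetical and are not part of LanceDB.

```python
from dataclasses import dataclass, is_dataclass
from typing import List, get_args, get_origin, get_type_hints

# Scalar rows of the table, written as Arrow type names (plain strings here).
SCALARS = {int: "int64", float: "double", bool: "bool", str: "string"}


def arrow_type_name(annotation) -> str:
    """Hypothetical helper: describe the Arrow type for a Python annotation."""
    if annotation in SCALARS:
        return SCALARS[annotation]
    if get_origin(annotation) is list:
        (item,) = get_args(annotation)
        return f"list<{arrow_type_name(item)}>"
    if is_dataclass(annotation):
        # Nested models become structs, one child field per attribute.
        hints = get_type_hints(annotation)
        inner = ", ".join(f"{k}: {arrow_type_name(t)}" for k, t in hints.items())
        return f"struct<{inner}>"
    raise TypeError(f"unsupported annotation: {annotation!r}")


@dataclass
class Address:
    city: str
    zip_code: int


@dataclass
class Person:
    name: str
    scores: List[float]
    address: Address


print(arrow_type_name(Person))
```

Annotations outside the table raise TypeError in this sketch, so unsupported types fail loudly rather than being mapped silently.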

LanceDB can create an Apache Arrow Schema from a Pydantic BaseModel via the pydantic_to_schema() function.

lancedb.pydantic.pydantic_to_schema

pydantic_to_schema(model: Type[BaseModel]) -> Schema

Convert a Pydantic Model to a PyArrow Schema.

Parameters:

  • model (Type[BaseModel]) –

    The Pydantic BaseModel to convert to an Arrow Schema.

Returns:

  • Schema –

    The Arrow Schema.

Examples:

>>> from typing import List, Optional
>>> import pyarrow as pa
>>> import pydantic
>>> from lancedb.pydantic import pydantic_to_schema, Vector
>>> class FooModel(pydantic.BaseModel):
...     id: int
...     s: str
...     vec: Vector(1536)  # fixed_size_list<item: float32>[1536]
...     li: List[int]
...
>>> schema = pydantic_to_schema(FooModel)
>>> assert schema == pa.schema([
...     pa.field("id", pa.int64(), False),
...     pa.field("s", pa.utf8(), False),
...     pa.field("vec", pa.list_(pa.float32(), 1536)),
...     pa.field("li", pa.list_(pa.int64()), False),
... ])
Source code in lancedb/pydantic.py
def pydantic_to_schema(model: Type[pydantic.BaseModel]) -> pa.Schema:
    """Convert a [Pydantic Model][pydantic.BaseModel] to a
       [PyArrow Schema][pyarrow.Schema].

    Parameters
    ----------
    model : Type[pydantic.BaseModel]
        The Pydantic BaseModel to convert to Arrow Schema.

    Returns
    -------
    pyarrow.Schema
        The Arrow Schema

    Examples
    --------

    >>> from typing import List, Optional
    >>> import pydantic
    >>> from lancedb.pydantic import pydantic_to_schema, Vector
    >>> class FooModel(pydantic.BaseModel):
    ...     id: int
    ...     s: str
    ...     vec: Vector(1536)  # fixed_size_list<item: float32>[1536]
    ...     li: List[int]
    ...
    >>> schema = pydantic_to_schema(FooModel)
    >>> assert schema == pa.schema([
    ...     pa.field("id", pa.int64(), False),
    ...     pa.field("s", pa.utf8(), False),
    ...     pa.field("vec", pa.list_(pa.float32(), 1536)),
    ...     pa.field("li", pa.list_(pa.int64()), False),
    ... ])
    """
    fields = _pydantic_model_to_fields(model)
    return pa.schema(fields)