Filtering

Pre-filtering and post-filtering

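LanceDB supports filtering query results on metadata fields. By default, the filter is applied as a post-filter to the top-k results returned by the vector search. Pre-filtering, which applies the filter before the vector search runs, is also available. It is especially useful for narrowing the search space of a very large dataset and reducing query latency.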

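Note that both pre-filtering and post-filtering can produce inaccurate results. With pre-filtering, an overly strict filter may exclude relevant items that the vector search would otherwise have identified as good matches. In that case, increasing the nprobes parameter helps reduce such misses. If you know the filter is highly selective, calling bypass_vector_index() is recommended.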

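Similarly, a highly selective post-filter can lead to inaccurate results. Increasing both nprobes and refine_factor can mitigate this. When in doubt about whether to pre-filter or post-filter, pre-filtering is usually the safer choice.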
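To make the trade-off concrete, here is a minimal, self-contained Python sketch of the two strategies over a toy in-memory table. It uses exact brute-force search rather than LanceDB's vector index, so it does not model nprobes or refine_factor. The rows, the category field, and the helper names (top_k, post_filter_search, pre_filter_search) are all hypothetical and exist only for illustration.

```python
import math

# Toy rows: (id, category, vector). Illustrative data, not a LanceDB table.
rows = [
    (1, "a", [0.0, 0.0]),
    (2, "a", [0.1, 0.0]),
    (3, "a", [0.2, 0.1]),
    (4, "b", [0.9, 0.9]),
    (5, "b", [1.0, 1.0]),
]


def top_k(candidates, query, k):
    """Exact nearest-neighbor search by Euclidean distance."""
    return sorted(candidates, key=lambda r: math.dist(r[2], query))[:k]


def post_filter_search(query, predicate, k):
    # Search first, then drop the hits that fail the predicate.
    return [r for r in top_k(rows, query, k) if predicate(r)]


def pre_filter_search(query, predicate, k):
    # Restrict the candidate set first, then search within it.
    return top_k([r for r in rows if predicate(r)], query, k)


def is_b(row):
    return row[1] == "b"


query = [0.0, 0.0]
post_ids = [r[0] for r in post_filter_search(query, is_b, k=2)]
pre_ids = [r[0] for r in pre_filter_search(query, is_b, k=2)]

print(post_ids)  # []
print(pre_ids)   # [4, 5]
```

With k=2, the post-filter returns nothing because both nearest neighbors fail the predicate, while the pre-filter still finds two matching rows.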

Python

# Synchronous client
result = tbl.search([0.5, 0.2]).where("id = 10", prefilter=True).limit(1).to_arrow()
# Asynchronous client
result = await async_tbl.query().where("id = 10").nearest_to([0.5, 0.2]).limit(1).to_arrow()

TypeScript (@lancedb/lancedb)

const _result = await tbl
  .search(Array(1536).fill(0.5))
  .limit(1)
  .where("id = 10")
  .toArray();

TypeScript (vectordb, deprecated)

let result = await tbl
  .search(Array(1536).fill(0.5))
  .limit(1)
  .filter("id = 10")
  .prefilter(true)
  .execute();

Note

Creating a scalar index speeds up filtering.

SQL filters

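LanceDB is built on top of DataFusion, so it supports standard SQL expressions as predicates for filter operations. SQL can be used during vector search, update, and delete operations.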

LanceDB supports a growing list of SQL expressions:

>, >=, <, <=, =
AND, OR, NOT
IS NULL, IS NOT NULL
IS TRUE, IS NOT TRUE, IS FALSE, IS NOT FALSE
IN
LIKE, NOT LIKE
CAST
regexp_match(column, pattern)
DataFusion functions

For example, the following filter string is acceptable:

Python

# Synchronous client
tbl.search([100, 102]).where(
    "(item IN ('item 0', 'item 2')) AND (id > 10)"
).to_arrow()
# Asynchronous client
await (
    async_tbl.query()
    .where("(item IN ('item 0', 'item 2')) AND (id > 10)")
    .nearest_to([100, 102])
    .to_arrow()
)

TypeScript (@lancedb/lancedb)

const result = await (
  tbl.search(Array(1536).fill(0)) as lancedb.VectorQuery
)
  .where("(item IN ('item 0', 'item 2')) AND (id > 10)")
  .postfilter()
  .toArray();

TypeScript (vectordb, deprecated)

await tbl
  .search(Array(1536).fill(0))
  .where("(item IN ('item 0', 'item 2')) AND (id > 10)")
  .execute();
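When the filter values come from program variables, the filter string has to be assembled by hand. The following is a minimal Python sketch of one way to do that. The helper names sql_string and in_list are hypothetical, and the sketch assumes the standard SQL convention of escaping a single quote inside a string literal by doubling it.

```python
def sql_string(value: str) -> str:
    """Render a Python string as a SQL string literal.

    Assumption: a single quote is escaped by doubling it, as in standard SQL.
    """
    return "'" + value.replace("'", "''") + "'"


def in_list(column: str, values: list[str]) -> str:
    """Build a `column IN (...)` predicate from a non-empty list of strings."""
    if not values:
        raise ValueError("IN requires at least one value")
    rendered = ", ".join(sql_string(v) for v in values)
    return f"{column} IN ({rendered})"


items = ["item 0", "item 2"]
min_id = 10

# int() guarantees that only a number is interpolated into the predicate.
where = f"({in_list('item', items)}) AND (id > {int(min_id)})"

print(where)  # (item IN ('item 0', 'item 2')) AND (id > 10)
```

The resulting string is the same filter used in the examples above and can be passed to where() unchanged.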

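If a column name contains special characters, contains upper-case letters, or is a SQL keyword, you can escape it with backticks (`). For nested fields, every segment of the path must be wrapped in backticks.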

SQL

`CUBE` = 10 AND `UpperCaseName` = '3' AND `column name with space` IS NOT NULL
  AND `nested with space`.`inner with space` < 2

Field names that contain a period (.) are not supported.
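The quoting rule can be sketched as a small Python helper. The name quote_path is hypothetical. Rejecting names that contain a backtick is a conservative choice made for this sketch, not a documented LanceDB rule.

```python
def quote_path(*segments: str) -> str:
    """Wrap each segment of a (possibly nested) field path in backticks."""
    if not segments:
        raise ValueError("at least one path segment is required")
    for seg in segments:
        if "." in seg:
            # Field names that contain a period are not supported.
            raise ValueError(f"field name contains a period: {seg!r}")
        if "`" in seg:
            # Assumption of this sketch: do not try to escape backticks.
            raise ValueError(f"field name contains a backtick: {seg!r}")
    return ".".join(f"`{seg}`" for seg in segments)


print(quote_path("CUBE"))
# `CUBE`
print(quote_path("nested with space", "inner with space"))
# `nested with space`.`inner with space`
```

Quoting every segment unconditionally keeps the helper simple, since backticks around a name that does not need escaping are harmless.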

Literals for dates, timestamps, and decimals are written as the type name followed by the value as a string. For example:

SQL

date_col = date '2021-01-01'
and timestamp_col = timestamp '2021-01-01 00:00:00'
and decimal_col = decimal(8,3) '1.000'
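The same literals can be generated from Python values. This is a minimal sketch using only the standard library. The helper names date_literal, timestamp_literal, and decimal_literal are hypothetical, and the timestamp helper deliberately drops fractional seconds to match the format shown above.

```python
from datetime import date, datetime
from decimal import Decimal


def date_literal(d: date) -> str:
    return f"date '{d.isoformat()}'"


def timestamp_literal(ts: datetime) -> str:
    # Whole seconds only, matching the 'YYYY-MM-DD HH:MM:SS' form above.
    return f"timestamp '{ts.strftime('%Y-%m-%d %H:%M:%S')}'"


def decimal_literal(x: Decimal, precision: int, scale: int) -> str:
    # Render the value with exactly `scale` digits after the decimal point.
    return f"decimal({precision},{scale}) '{x:.{scale}f}'"


where = " and ".join(
    [
        f"date_col = {date_literal(date(2021, 1, 1))}",
        f"timestamp_col = {timestamp_literal(datetime(2021, 1, 1))}",
        f"decimal_col = {decimal_literal(Decimal('1'), 8, 3)}",
    ]
)
print(where)
```

Printing where reproduces the three-clause predicate from the SQL example on a single line.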

For timestamp columns, the precision can be given as a number in the type parameter. Microsecond precision (6) is the default.

SQL	Time unit
`timestamp(0)`	Seconds
`timestamp(3)`	Milliseconds
`timestamp(6)`	Microseconds
`timestamp(9)`	Nanoseconds

LanceDB stores data internally in Apache Arrow format. SQL types map to Arrow types as follows:

SQL type	Arrow type
`boolean`	`Boolean`
`tinyint` / `tinyint unsigned`	`Int8` / `UInt8`
`smallint` / `smallint unsigned`	`Int16` / `UInt16`
`int` or `integer` / `int unsigned` or `integer unsigned`	`Int32` / `UInt32`
`bigint` / `bigint unsigned`	`Int64` / `UInt64`
`float`	`Float32`
`double`	`Float64`
`decimal(precision, scale)`	`Decimal128`
`date`	`Date32`
`timestamp`	`Timestamp` ¹
`string`	`Utf8`
`binary`	`Binary`

Filtering without vector search

You can also filter your data without running a vector search:

Python

# Synchronous client
tbl.search().where("id = 10").limit(10).to_arrow()
# Asynchronous client
await async_tbl.query().where("id = 10").limit(10).to_arrow()

TypeScript (@lancedb/lancedb)

await tbl.query().where("id = 10").limit(10).toArray();

TypeScript (vectordb, deprecated)

await tbl.filter("id = 10").limit(10).execute();

If your table is large, this can return a very large amount of data. Always use a limit clause unless you are sure you want the entire result set.

¹ See the timestamp precision table above.