
Faiss

Faiss is an open-source library for efficient similarity search and clustering of dense vectors. It handles collections of vectors of any size, even ones that do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python (versions 2 and 3). Some of the core algorithms also have GPU implementations. Faiss is developed primarily at Meta AI Research with help from external contributors.

What is similarity search?

Given a set of vectors $\{ x_1, ..., x_n \}$ of dimension $d$, Faiss builds a data structure in RAM. Once the structure is constructed, given a new vector $x$ of dimension $d$, it efficiently performs the operation:

$i = \mathrm{argmin}_i \lVert x - x_i \rVert$

where $\lVert \cdot \rVert$ is the Euclidean (L2) distance.

In Faiss terms, this data structure is an index, an object that has an add method to add $x_i$ vectors. Note that all $x_i$ must have a fixed dimension.

Computing the argmin, i.e. the index $i$ of the vector closest to the query, is the search operation on the index.
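As an illustration, the add/search semantics just described can be sketched in a few lines of NumPy. This is a minimal model of the operation, not the Faiss API; the class name FlatL2Sketch is made up for this example.

```python
import numpy as np

class FlatL2Sketch:
    """Minimal model of an exact L2 index: add vectors, then search."""

    def __init__(self, d):
        self.d = d
        self.xb = np.empty((0, d), dtype="float32")

    def add(self, x):
        # All stored vectors must share the fixed dimension d.
        assert x.shape[1] == self.d
        self.xb = np.vstack([self.xb, x.astype("float32")])

    def search(self, x):
        # argmin_i ||x - x_i|| for a single query vector x.
        dists = np.linalg.norm(self.xb - x, axis=1)
        i = int(np.argmin(dists))
        return i, float(dists[i])

index = FlatL2Sketch(2)
index.add(np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]))
i, dist = index.search(np.array([0.9, 1.2]))
print(i)  # nearest stored vector is [1.0, 1.0], i.e. id 1
```

A real Faiss index exposes the same add/search shape, but with highly optimized implementations and many index types beyond exact search.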

This is the core functionality of Faiss. In addition, Faiss can:

  • return not just the nearest neighbor, but also the 2nd, 3rd, ..., k-th nearest neighbor (top-k search)
  • search several vectors at a time rather than one (batch processing); for many index types this is faster than searching the vectors one after another
  • trade precision for speed: for example, give an incorrect result 10% of the time in exchange for a 10× speedup or 10× less memory
  • perform maximum inner product search, $\mathrm{argmax}_i \langle x, x_i \rangle$, instead of minimum L2 search; there is also limited support for other distances (L1, L∞, etc.)
  • perform range search: return all elements within a given radius of the query point
  • store the index on disk rather than only in RAM
  • index binary vectors rather than floating-point vectors
  • ignore a subset of index vectors according to a predicate on the vector ids
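The first two points above, top-k results and batched queries, can be sketched together in NumPy. Again this is a model of the behavior, not the Faiss API; the function knn_l2 is made up for this example, though it mirrors the (distances, ids) pair that Faiss searches return.

```python
import numpy as np

def knn_l2(xb, xq, k):
    """Brute-force batched k-nearest-neighbor search under L2.

    xb: (n, d) database vectors; xq: (m, d) query vectors.
    Returns (D, I): for each query, the k smallest squared
    distances and the ids of the corresponding database vectors.
    """
    # Squared L2 distances between every query and every database vector:
    # ||q - b||^2 = ||q||^2 - 2 <q, b> + ||b||^2, for the whole batch at once.
    d2 = (
        (xq ** 2).sum(1, keepdims=True)
        - 2 * xq @ xb.T
        + (xb ** 2).sum(1)
    )
    I = np.argsort(d2, axis=1)[:, :k]      # ids of the k nearest per query
    D = np.take_along_axis(d2, I, axis=1)  # their squared distances
    return D, I

xb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
xq = np.array([[0.1, 0.0], [3.0, 2.9]])   # two queries in one batch
D, I = knn_l2(xb, xq, k=2)
print(I)  # [[0 1], [3 2]]
```

Processing the whole query batch as one matrix product is exactly why batching pays off: the distance computation becomes a single dense GEMM instead of m separate vector passes.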
Tip

Many Faiss operations accept batches of query vectors; searching many queries at once is usually far more efficient than issuing them one by one.
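Range search, which returns everything within a radius rather than a fixed k, can likewise be sketched in NumPy. This is a model of the behavior, not the actual API; the function range_search_l2 is made up here. Note that, unlike top-k search, the number of results varies per query.

```python
import numpy as np

def range_search_l2(xb, xq, radius):
    """For each query, return the ids of all database vectors
    whose L2 distance to the query is strictly below `radius`."""
    results = []
    for q in xq:
        dists = np.linalg.norm(xb - q, axis=1)
        results.append(np.flatnonzero(dists < radius).tolist())
    return results

xb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
hits = range_search_l2(xb, np.array([[0.0, 0.0]]), radius=1.5)
print(hits)  # [[0, 1]]: vector 2 is at distance 2.0, outside the radius
```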

The research foundations of Faiss

Faiss is based on years of academic research and implements many of the best-known methods from that literature.

A recommended read on product quantization and related methods is the survey "A Survey of Product Quantization", Yusuke Matsui, Yusuke Uchida, Hervé Jégou, Shin'ichi Satoh, ITE Transactions on MTA, 2018.

The image below, taken from that paper, gives an overview of the PQ variants (click the image to enlarge):

[Overview of PQ-related methods]

Note (2021-08-19): the red boxes in the image mark the methods implemented in Faiss (SIMD options and additive quantization methods have been added since).

References in the image:
André+, “Cache Locality is not Enough: High-performance Nearest Neighbor Search with Product Quantization Fast Scan”, VLDB 15
André+, “Accelerated Nearest Neighbor Search with Quick ADC”, ICMR 17
André+, “Quicker ADC : Unlocking the Hidden Potential of Product Quantization with SIMD”, IEEE TPAMI 20
Babenko and Lempitsky, “The Inverted Multi-index”, CVPR 12
Babenko and Lempitsky, “Additive Quantization for Extreme Vector Compression”, CVPR 14
Babenko and Lempitsky, “The Inverted Multi-index”, IEEE TPAMI 15
Babenko and Lempitsky, “Tree Quantization for Large-scale Similarity Search and Classification”, CVPR 15
Babenko and Lempitsky, “Efficient Indexing of Billion-scale Datasets of Deep Descriptors”, CVPR 16
Babenko and Lempitsky, “Product Split Trees”, CVPR 17
Bagherinezhad+, “LCNN: Lookup-based Convolutional Neural Network”, CVPR 17
Baranchuk+, “Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors”, ECCV 18
Blalock and Guttag, “Bolt: Accelerated Data Mining with Fast Vector Compression”, KDD 17
Eghbali and Tahvildari, “Deep Spherical Quantization for Image Search”, CVPR 19
Douze+, “Polysemous Codes”, ECCV 16
Douze+, “Link and code: Fast Indexing with Graphs and Compact Regression Codes”, CVPR 18
Ge+, “Optimized Product Quantization”, IEEE TPAMI 14
Ge+, “Product Sparse Coding”, CVPR 14
He+, “K-means Hashing: An Affinity-preserving Quantization Method for Learning Binary Compact Codes”, CVPR 13
Heo+, “Short-list Selection with Residual-aware Distance Estimator for K-nearest Neighbor Search”, CVPR 16
Heo+, “Distance Encoded Product Quantization”, CVPR 14
Iwamura+, “What is the Most Efficient Way to Select Nearest Neighbor Candidates for Fast Approximate Nearest Neighbor Search?”, ICCV 13
Jain+, “Approximate Search with Quantized Sparse Representations”, ECCV 16
Jégou+, “Product Quantization for Nearest Neighbor Search”, IEEE TPAMI 11
Jégou+, “Aggregating Local Descriptors into a Compact Image Representation”, CVPR 10
Jégou+, “Searching in One Billion Vectors: Re-rank with Source Coding”, ICASSP 11
Johnson+, “Billion-scale Similarity Search with GPUs”, IEEE TBD 20
Klein and Wolf, “End-to-end Supervised Product Quantization for Image Search and Retrieval”, CVPR 19
Kalantidis and Avrithis, “Locally Optimized Product Quantization for Approximate Nearest Neighbor Search”, CVPR 14
Li+, “Online Variable Coding Length Product Quantization for Fast Nearest Neighbor Search in Mobile Retrieval”, IEEE TMM 17
Martinez+, “Revisiting Additive Quantization”, ECCV 16
Martinez+, “LSQ++: Lower Running Time and Higher Recall in Multi-codebook Quantization”, ECCV 18
Matsui+, “PQTable: Fast Exact Asymmetric Distance Neighbor Search for Product Quantization using Hash Tables”, ICCV 15
Matsui+, “PQk-means: Billion-scale Clustering for Product-quantized Codes”, ACMMM 17
Matsui+, “Reconfigurable Inverted Index”, ACMMM 18
Ning+, “Scalable Image Retrieval by Sparse Product Quantization”, IEEE TMM 17
Norouzi and Fleet, “Cartesian k-means”, CVPR 13
Ozan+, “Competitive Quantization for Approximate Nearest Neighbor Search”, IEEE TKDE 16
Shicong+, “Improved Residual Vector Quantization for High-dimensional Approximate Nearest Neighbor Search”, AAAI 15
Spyromitros-Xioufis+, “A Comprehensive Study over VLAD and Product Quantization in Large-scale Image Retrieval”, IEEE TMM 14
Yu+, “Product Quantization Network for Fast Image Retrieval”, ECCV 18
Yu+, “Generative Adversarial Product Quantization”, ACMMM 18
Wang+, “Optimized Distances for Binary Code Ranking”, ACMMM 14
Wang+, “Optimized Cartesian k-means”, IEEE TKDE 15
Wang+, “Supervised Quantization for Similarity Search”, CVPR 16
Wang and Zhang, “Composite Quantization”, IEEE TPAMI 19
Wieschollek+, “Efficient Large-scale Approximate Nearest Neighbor Search on the GPU”, CVPR 16
Wu+, “Multiscale Quantization for Fast Similarity Search”, NIPS 17
Wu+, “Quantized Convolutional Neural Networks for Mobile Devices”, CVPR 16
Xia+, “Joint Inverted Indexing”, ICCV 13
Zhang+, “Composite Quantization for Approximate Nearest Neighbor Search”, ICML 14
Zhang+, “Sparse Composite Quantization”, CVPR 15
Zhang+, “Collaborative Quantization for Crossmodal Similarity Search”, CVPR 16
Zhang+, “Efficient Large-scale Approximate Nearest Neighbor Search on OpenCL FPGA”, CVPR 18

Image credit: Yusuke Matsui. Many thanks for the permission to reuse it!

About this wiki

This wiki contains high-level information about Faiss and a tutorial. Navigate it using the sidebar.

Most of the examples are in Python for ease of reading, but the C++ interface is nearly identical, so translating between the two is straightforward.

Note

For C++ usage, follow the Python examples and write the equivalent calls in C++; the two APIs are designed to be highly consistent.