-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Implement pruning for neural sparse search #946
Comments
To make the ingest processor work for raw sparse vectors ingestion, one prerequiste is to implement this issue: #793 . We can add a parameter to configure whether call model inference for raw vectors ingestion |
Another tricky part is the combination with 2-phase search. My thought on the proper behavior is we first prune, then split the queries to two phase. Please feel free to put more comments about this. |
@zhichao-aws Can you add some context on what is pruning? |
In the context of sparse vector representations, pruning is a technique used to reduce the size of the sparse vectors by removing or "pruning" tokens that have relatively low semantic weights or importance. In neural sparse search, documents and queries are encoded into sparse vectors, where each entry represents a token and its corresponding semantic weight. For example: By applying pruning strategies, users can achieve a balance between search accuracy and storage costs, as research has shown that even simple pruning strategies can significantly reduce index size while preserving most of the search accuracy. |
I conducted experiments to test the impact of pruning on ingestion of search. POC code: https://github.com/zhichao-aws/neural-search/tree/pruning. Benchmark code: https://github.com/zhichao-aws/neural-search/tree/prune_test. In conclusion, we can save about 60% index size with a trade-off of ~1% search relevance by applying the pruning during ingestion (works for both doc-only and bi-encoder). |
Is there a GH issue for this RFC? |
No, we don't have an issue for it |
Please help me to understand why #793 is prerequisite? In my understanding #793 is about not calling model inference again during document update when there is no changes on the original text for the embedding. |
Background
Neural Sparse is a semantic search method which is built on native Lucene inverted index. The documents and queries are encoded into sparse vectors, where the entry represents the token and their corresponding semantic weight.
Since the model expands the tokens with semantic weights during the encoding process, the number of tokens in the sparse vectors is often greater than the original raw text. Additionally, the token weights in the sparse vectors exhibit a significant long-tail distribution, where tokens with lower semantic importance occupy a large portion of the storage space. In the experiments of this blog, we found that the index sizes produced by the two modes of neural sparse search are 4.7 times and 6.8 times larger than the BM25 inverted index.
Pruning can effectively alleviate this problem. During the process of ingestion and search, we prune the sparse vectors according to different strategies, removing tokens with relatively small weights. Research has shown that even simple pruning strategies can significantly reduce index size while preserving most of the search accuracy[1]. This can help users achieve a better balance between search accuracy and cost.
What are we going to do?
sparse_encoding
ingestion processor. Users can configure the pruning strategy when create the processor, and the processor will prune the sparse vectors before write to index.neural_sparse
query clause. Users can configure the pruning strategy when search with neural_sparse query. The query builder will prune the query before search on index.Pruning strategy
We propose to implent these 4 pruning strategies:
Pruning by weight threshold
For this method, given a threshold T, all tokens whose weight is smaller than T will be pruned.
Pruning by ratio with max weight
For this method, given a sparse vector X, we first find the max weight of X, and calculate the weight ratio of every token and the max weight. If the ratio is smaller than threshold T, it will be pruned.
Pruning by Top-K
For this method, given a sparse vector S, we first sort the tokens based on their weights, from large to small. And we only keep the tokens with Top-K values.
Pruning by alpha-mass[2]
For this method, given a sparse vector S, we first sort the tokens based on their weights, from large to small. We iterate on the vector entries and record the accumulated values, until the ratio of accumulated values and sum of all values is larger than threshold T. And the non-iterated entries are dropped.
API
To create an ingest processor with pruning:
To search with pruning:
References
[1]: A Static Pruning Study on Sparse Neural Retrievers
[2]: Efficient Inverted Indexes for Approximate Retrieval over
Learned Sparse Representations
The text was updated successfully, but these errors were encountered: