[RFC] Implement pruning for neural sparse search #946

zhichao-aws · 2024-10-18T09:04:09Z

Background

Neural Sparse is a semantic search method built on the native Lucene inverted index. Documents and queries are encoded into sparse vectors, where each entry represents a token and its corresponding semantic weight.

Since the model expands the tokens with semantic weights during the encoding process, the number of tokens in the sparse vectors is often greater than the original raw text. Additionally, the token weights in the sparse vectors exhibit a significant long-tail distribution, where tokens with lower semantic importance occupy a large portion of the storage space. In the experiments of this blog, we found that the index sizes produced by the two modes of neural sparse search are 4.7 times and 6.8 times larger than the BM25 inverted index.

Pruning can effectively alleviate this problem. During the process of ingestion and search, we prune the sparse vectors according to different strategies, removing tokens with relatively small weights. Research has shown that even simple pruning strategies can significantly reduce index size while preserving most of the search accuracy[1]. This can help users achieve a better balance between search accuracy and cost.

What are we going to do?

Implement pruning in the sparse_encoding ingest processor. Users can configure the pruning strategy when creating the processor, and the processor will prune the sparse vectors before writing them to the index.
Implement pruning in the neural_sparse query clause. Users can configure the pruning strategy when searching with a neural_sparse query. The query builder will prune the query before searching the index.
Give a quantitative analysis of the pruning strategies. Users can refer to the analysis results to configure the pruning strategy, thereby achieving the desired balance between accuracy and cost.

Pruning strategy

We propose to implement these 4 pruning strategies:

Pruning by weight threshold

For this method, given a threshold T, all tokens whose weight is smaller than T will be pruned.

def prune_by_weight_threshold(sparse_vector, threshold):
    pruned_vector = {}
    for token, weight in sparse_vector.items():
        if weight >= threshold:
            pruned_vector[token] = weight
    return pruned_vector

Pruning by ratio with max weight

For this method, given a sparse vector X, we first find the max weight of X, then compute the ratio of each token's weight to the max weight. If the ratio is smaller than threshold T, the token is pruned.

def prune_by_ratio_with_max_weight(sparse_vector, threshold):
    pruned_vector = {}
    if not sparse_vector:
        # max() raises ValueError on an empty vector, so return early
        return pruned_vector
    max_weight = max(sparse_vector.values())
    for token, weight in sparse_vector.items():
        ratio = weight / max_weight
        if ratio >= threshold:
            pruned_vector[token] = weight
    return pruned_vector

Pruning by Top-K

For this method, given a sparse vector S, we first sort the tokens by weight in descending order, then keep only the K tokens with the largest weights.

def prune_by_top_k(sparse_vector, k):
    sorted_vector = sorted(sparse_vector.items(), key=lambda x: x[1], reverse=True)
    pruned_vector = dict(sorted_vector[:k])
    return pruned_vector

Pruning by alpha-mass[2]

For this method, given a sparse vector S, we first sort the tokens by weight in descending order. We then iterate over the sorted entries and accumulate their weights until the ratio of the accumulated weight to the sum of all weights reaches threshold T. The entries that have not been visited at that point are dropped.

def prune_by_alpha_mass(sparse_vector, threshold):
    sorted_vector = sorted(sparse_vector.items(), key=lambda x: x[1], reverse=True)
    pruned_vector = {}
    accumulated_mass = 0
    total_mass = sum(sparse_vector.values())
    for token, weight in sorted_vector:
        # Keep the entry before checking, so the token that crosses
        # the threshold is retained and the result is never empty.
        pruned_vector[token] = weight
        accumulated_mass += weight
        if accumulated_mass / total_mass >= threshold:
            break
    return pruned_vector
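
To make the differences between the four strategies concrete, here is a small side-by-side sketch. The sample vector and all parameter values are made up for illustration, and the alpha-mass variant follows the prose description (the entry that crosses the threshold is kept). It is not the plugin implementation.

```python
def prune_by_weight_threshold(vec, threshold):
    # Keep tokens whose absolute weight reaches the threshold.
    return {t: w for t, w in vec.items() if w >= threshold}


def prune_by_ratio_with_max_weight(vec, threshold):
    # Keep tokens whose weight relative to the max weight reaches the threshold.
    if not vec:
        return {}
    max_weight = max(vec.values())
    return {t: w for t, w in vec.items() if w / max_weight >= threshold}


def prune_by_top_k(vec, k):
    # Keep the k tokens with the largest weights.
    ranked = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def prune_by_alpha_mass(vec, threshold):
    # Keep the heaviest tokens until they cover `threshold` of the total weight.
    ranked = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(vec.values())
    kept, accumulated = {}, 0.0
    for token, weight in ranked:
        kept[token] = weight
        accumulated += weight
        if total > 0 and accumulated / total >= threshold:
            break
    return kept


# Hypothetical sparse vector; the weights sum to 4.0.
sample = {"search": 2.0, "engine": 1.0, "query": 0.6, "index": 0.3, "the": 0.1}

by_threshold = prune_by_weight_threshold(sample, 0.5)
by_ratio = prune_by_ratio_with_max_weight(sample, 0.4)
by_top_k = prune_by_top_k(sample, 4)
by_alpha = prune_by_alpha_mass(sample, 0.8)

for name, result in [("weight threshold", by_threshold), ("max ratio", by_ratio),
                     ("top-k", by_top_k), ("alpha-mass", by_alpha)]:
    print(name, sorted(result))
```

With these values, the weight threshold and alpha-mass both keep three tokens, the max-weight ratio keeps two, and top-k keeps four. The absolute threshold and top-k do not depend on the shape of the weight distribution, while the ratio and alpha-mass strategies adapt to each vector.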

API

To create an ingest processor with pruning:

PUT /_ingest/pipeline/sparse-pipeline
{
    "description": "Calling sparse model to generate expanded tokens",
    "processors": [
        {
            "sparse_encoding": {
                "model_id": "fousVokBjnSupmOha8aN",
                "field_map": {
                    "body": "body_sparse"
                },
                "pruning_config":{
                    "pruning_type": "alpha_mass",
                    "threshold": 0.8
                }
            }
        }
    ]
}
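
The intended order of operations in the processor (encode, then prune, then write) can be sketched in Python as follows. This is only an illustration of the flow under stated assumptions: `fake_encode` stands in for model inference and returns made-up weights, `run_sparse_encoding` is a hypothetical helper name, and only the `alpha_mass` type shown in the request above is handled.

```python
def prune_by_alpha_mass(sparse_vector, threshold):
    # Keep the heaviest tokens until they cover `threshold` of the total weight.
    ranked = sorted(sparse_vector.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(sparse_vector.values())
    kept, accumulated = {}, 0.0
    for token, weight in ranked:
        kept[token] = weight
        accumulated += weight
        if total > 0 and accumulated / total >= threshold:
            break
    return kept


def fake_encode(text):
    # Hypothetical stand-in for the sparse encoding model; weights are made up.
    return {"hello": 1.1, "world": 1.2, "hi": 0.9,
            "planet": 0.1, "greeting": 0.5, "earth": 0.15}


def run_sparse_encoding(document, field_map, pruning_config, encode):
    # Encode each mapped source field, prune the vector, then store it
    # under the target field, mirroring "prune before writing to the index".
    output = dict(document)
    for source_field, target_field in field_map.items():
        if source_field not in document:
            continue
        vector = encode(document[source_field])
        if pruning_config is not None:
            if pruning_config["pruning_type"] != "alpha_mass":
                raise ValueError("this sketch only handles alpha_mass")
            vector = prune_by_alpha_mass(vector, pruning_config["threshold"])
        output[target_field] = vector
    return output


indexed = run_sparse_encoding(
    {"body": "hello world"},
    {"body": "body_sparse"},
    {"pruning_type": "alpha_mass", "threshold": 0.8},
    fake_encode,
)
print(indexed["body_sparse"])
```

In this sketch the raw text field is left untouched and only the generated sparse field is pruned, so the low-weight tokens never reach the index.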

To search with pruning:

GET /test-index/_search
{
    "query": {
        "neural_sparse": {
            "body_sparse": {
                "query_text": "i be going",
                "model_id": "fousVokBjnSupmOha8aN",
                "pruning_config":{
                    "pruning_type": "alpha_mass",
                    "threshold": 0.8
                }
            }
        }
    }
}

References

[1]: A Static Pruning Study on Sparse Neural Retrievers

[2]: Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations


zhichao-aws · 2024-10-18T09:06:26Z

To make the ingest processor work for raw sparse vector ingestion, one prerequisite is to implement this issue: #793. We can add a parameter to configure whether to call model inference for raw vector ingestion.

zhichao-aws · 2024-10-18T09:08:51Z

Another tricky part is the combination with 2-phase search. I think the proper behavior is to prune first, then split the query into two phases. Please feel free to add more comments on this.
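
A rough sketch of the "prune first, then split" ordering is below. The split rule is an assumption made for illustration: tokens are partitioned by their ratio to the max weight of the already pruned query, and `split_ratio`, `split_two_phase` and `prune_then_split` are hypothetical names, not the actual 2-phase search API. The query weights are also made up.

```python
def prune_by_top_k(sparse_vector, k):
    # Keep the k tokens with the largest weights.
    ranked = sorted(sparse_vector.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def split_two_phase(query_vector, split_ratio):
    # Assumed split rule: tokens at or above split_ratio * max weight go to the
    # first phase, the remaining tokens are left for the second phase.
    if not query_vector:
        return {}, {}
    cutoff = split_ratio * max(query_vector.values())
    high = {t: w for t, w in query_vector.items() if w >= cutoff}
    low = {t: w for t, w in query_vector.items() if t not in high}
    return high, low


def prune_then_split(query_vector, k, split_ratio):
    # Pruning runs first, so dropped tokens appear in neither phase.
    pruned = prune_by_top_k(query_vector, k)
    return split_two_phase(pruned, split_ratio)


# Hypothetical encoded query.
query = {"going": 1.5, "go": 1.0, "be": 0.5, "i": 0.2, "travel": 0.1}
high_tokens, low_tokens = prune_then_split(query, k=4, split_ratio=0.4)
print(sorted(high_tokens), sorted(low_tokens))
```

With this ordering, the split always works on the pruned query, so a pruned token cannot come back in the second phase.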

vibrantvarun · 2024-10-18T17:12:49Z

@zhichao-aws Can you add some context on what is pruning?

zhichao-aws · 2024-10-22T06:46:04Z

Can you add some context on what is pruning?

In the context of sparse vector representations, pruning is a technique used to reduce the size of the sparse vectors by removing or "pruning" tokens that have relatively low semantic weights or importance.

In neural sparse search, documents and queries are encoded into sparse vectors, where each entry represents a token and its corresponding semantic weight. For example: "hello world" -> {"hello": 1.1, "world":1.2, "hi": 0.9, "planet": 0.1, "greeting": 0.5, "earth":0.15} (just an example, not a real encoding result). For pruning purposes, we can remove planet and earth to reduce storage and increase search speed.
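
To illustrate why dropping low-weight tokens costs little relevance, here is a small sketch using the example vector above. It assumes the relevance score is the inner product of the query and document weights over shared tokens. The query vector and the 0.2 threshold are made up for illustration.

```python
def dot_product(query_vector, doc_vector):
    # Sum of weight products over tokens present in both vectors.
    return sum(weight * doc_vector[token]
               for token, weight in query_vector.items()
               if token in doc_vector)


def prune_by_weight_threshold(sparse_vector, threshold):
    # Drop every token whose weight is below the threshold.
    return {t: w for t, w in sparse_vector.items() if w >= threshold}


doc = {"hello": 1.1, "world": 1.2, "hi": 0.9,
       "planet": 0.1, "greeting": 0.5, "earth": 0.15}
query = {"hello": 1.0, "earth": 0.8}  # hypothetical query encoding

pruned_doc = prune_by_weight_threshold(doc, 0.2)  # removes planet and earth
full_score = dot_product(query, doc)
pruned_score = dot_product(query, pruned_doc)

print(len(doc), len(pruned_doc))
print(full_score, pruned_score)
```

Even with a query that hits one of the pruned tokens, the pruned document keeps roughly 90% of its original score here, while a third of its entries are removed.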

By applying pruning strategies, users can achieve a balance between search accuracy and storage costs, as research has shown that even simple pruning strategies can significantly reduce index size while preserving most of the search accuracy.

zhichao-aws · 2024-11-11T06:16:09Z

I conducted experiments to test how pruning at ingestion time affects search. POC code: https://github.com/zhichao-aws/neural-search/tree/pruning. Benchmark code: https://github.com/zhichao-aws/neural-search/tree/prune_test.

In conclusion, we can save about 60% of the index size at a cost of ~1% search relevance by applying pruning during ingestion (this works for both doc-only and bi-encoder).

doc-only:

bi-encoder:

zhichao-aws · 2024-11-11T06:21:49Z

We then tested the impact on search relevance of pruning the sparse vector at retrieval time. We tested on the BEIR datasets that have a sufficient number of documents, using indexes pruned to ~40% of the original index size. We tested these pruning methods and 2-phase search.

In conclusion, search-time pruning can speed up search, but search relevance drops quickly as we prune more tokens. 2-phase search is a more robust method that is both fast and precise (especially with bi-encoder).

zhichao-aws · 2024-11-15T06:50:57Z

Further experiments were conducted to test different prune types with two-phase search. With bi-encoder, the different strategies have similar performance. With doc-only, alpha mass is slightly better. We can add this option to the two-phase search pipeline to give users more choices.

martin-gaievski · 2024-11-16T02:13:02Z

Is there a GH issue for this RFC?

zhichao-aws · 2024-11-16T02:18:49Z

Is there a GH issue for this RFC?

No, we don't have an issue for it

heemin32 · 2024-11-16T06:47:20Z

To make the ingest processor work for raw sparse vector ingestion, one prerequisite is to implement this issue: #793. We can add a parameter to configure whether to call model inference for raw vector ingestion.

Please help me understand why #793 is a prerequisite. In my understanding, #793 is about not calling model inference again during a document update when there are no changes to the original text for the embedding.

github-actions bot added the untriaged label Oct 18, 2024

zhichao-aws removed the untriaged label Oct 18, 2024

zhichao-aws self-assigned this Oct 18, 2024

zhichao-aws mentioned this issue Nov 15, 2024: [Feature] Implement pruning for neural sparse search #988 (Draft)

martin-gaievski mentioned this issue Nov 16, 2024: [FEATURE] Enhanced adaptive token pruning for neural sparse search #989 (Open)
