Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Implement pruning for neural sparse search #946

Open
zhichao-aws opened this issue Oct 18, 2024 · 10 comments
Open

[RFC] Implement pruning for neural sparse search #946

zhichao-aws opened this issue Oct 18, 2024 · 10 comments
Assignees

Comments

@zhichao-aws
Copy link
Member

Background

Neural Sparse is a semantic search method which is built on native Lucene inverted index. The documents and queries are encoded into sparse vectors, where the entry represents the token and their corresponding semantic weight.

Since the model expands the tokens with semantic weights during the encoding process, the number of tokens in the sparse vectors is often greater than the original raw text. Additionally, the token weights in the sparse vectors exhibit a significant long-tail distribution, where tokens with lower semantic importance occupy a large portion of the storage space. In the experiments of this blog, we found that the index sizes produced by the two modes of neural sparse search are 4.7 times and 6.8 times larger than the BM25 inverted index.

Pruning can effectively alleviate this problem. During the process of ingestion and search, we prune the sparse vectors according to different strategies, removing tokens with relatively small weights. Research has shown that even simple pruning strategies can significantly reduce index size while preserving most of the search accuracy[1]. This can help users achieve a better balance between search accuracy and cost.

What are we going to do?

  • Implement pruning at sparse_encoding ingestion processor. Users can configure the pruning strategy when create the processor, and the processor will prune the sparse vectors before write to index.
  • Implement pruning at neural_sparse query clause. Users can configure the pruning strategy when search with neural_sparse query. The query builder will prune the query before search on index.
  • Give a quantitative analysis of the pruning strategies. Users can refer to the analysis results to configure the pruning strategy, thereby achieving the desired balance between accuracy and cost.

Pruning strategy

We propose to implent these 4 pruning strategies:

Pruning by weight threshold

For this method, given a threshold T, all tokens whose weight is smaller than T will be pruned.

def prune_by_weight_threshold(sparse_vector, threshold):
    pruned_vector = {}
    for token, weight in sparse_vector.items():
        if weight >= threshold:
            pruned_vector[token] = weight
    return pruned_vector

Pruning by ratio with max weight

For this method, given a sparse vector X, we first find the max weight of X, and calculate the weight ratio of every token and the max weight. If the ratio is smaller than threshold T, it will be pruned.

def prune_by_ratio_with_max_weight(sparse_vector, threshold):
    pruned_vector = {}
    max_weight = max(sparse_vector.values())
    for token, weight in sparse_vector.items():
        ratio = weight / max_weight
        if ratio >= threshold:
            pruned_vector[token] = weight
    return pruned_vector

Pruning by Top-K

For this method, given a sparse vector S, we first sort the tokens based on their weights, from large to small. And we only keep the tokens with Top-K values.

def prune_by_top_k(sparse_vector, k):
    sorted_vector = sorted(sparse_vector.items(), key=lambda x: x[1], reverse=True)
    pruned_vector = dict(sorted_vector[:k])
    return pruned_vector

Pruning by alpha-mass[2]

For this method, given a sparse vector S, we first sort the tokens based on their weights, from large to small. We iterate on the vector entries and record the accumulated values, until the ratio of accumulated values and sum of all values is larger than threshold T. And the non-iterated entries are dropped.

def prune_by_alpha_mass(sparse_vector, threshold):
    sorted_vector = sorted(sparse_vector.items(), key=lambda x: x[1], reverse=True)
    pruned_vector = {}
    accumulated_mass = 0
    total_mass = sum(sparse_vector.values())
    for token, weight in sorted_vector:
        accumulated_mass += weight
        if accumulated_mass / total_mass >= threshold:
            break
        pruned_vector[token] = weight
    return pruned_vector

API

To create an ingest processor with pruning:

PUT /_ingest/pipeline/sparse-pipeline
{
    "description": "Calling sparse model to generate expanded tokens",
    "processors": [
        {
            "sparse_encoding": {
                "model_id": "fousVokBjnSupmOha8aN",
                "field_map": {
                    "body": "body_sparse"
                },
                "pruning_config":{
                    "pruning_type": "alpha_mass",
                    "threshold": 0.8
                }
            }
        }
    ]
}

To search with pruning:

GET /test-index/_search
{
    "query": {
        "neural_sparse": {
            "body_sparse": {
                "query_text": "i be going",
                "model_id": "fousVokBjnSupmOha8aN",
                "pruning_config":{
                    "pruning_type": "alpha_mass",
                    "threshold": 0.8
                }
            }
        }
    }
}

References

[1]: A Static Pruning Study on Sparse Neural Retrievers

[2]: Efficient Inverted Indexes for Approximate Retrieval over
Learned Sparse Representations

@zhichao-aws
Copy link
Member Author

To make the ingest processor work for raw sparse vectors ingestion, one prerequiste is to implement this issue: #793 . We can add a parameter to configure whether call model inference for raw vectors ingestion

@zhichao-aws
Copy link
Member Author

Another tricky part is the combination with 2-phase search. My thought on the proper behavior is we first prune, then split the queries to two phase. Please feel free to put more comments about this.

@vibrantvarun
Copy link
Member

@zhichao-aws Can you add some context on what is pruning?

@zhichao-aws
Copy link
Member Author

Can you add some context on what is pruning?

In the context of sparse vector representations, pruning is a technique used to reduce the size of the sparse vectors by removing or "pruning" tokens that have relatively low semantic weights or importance.

In neural sparse search, documents and queries are encoded into sparse vectors, where each entry represents a token and its corresponding semantic weight. For example: "hello world" -> {"hello": 1.1, "world":1.2, "hi": 0.9, "planet": 0.1, "greeting": 0.5, "earth":0.15} (just for example, not real encoding result). For pruning purpose, we can remove planet and earth, to reduce the storage and increase search speed.

By applying pruning strategies, users can achieve a balance between search accuracy and storage costs, as research has shown that even simple pruning strategies can significantly reduce index size while preserving most of the search accuracy.

@zhichao-aws
Copy link
Member Author

I conducted experiments to test the impact of pruning on ingestion of search. POC code: https://github.com/zhichao-aws/neural-search/tree/pruning. Benchmark code: https://github.com/zhichao-aws/neural-search/tree/prune_test.

In conclusion, we can save about 60% index size with a trade-off of ~1% search relevance by applying the pruning during ingestion (works for both doc-only and bi-encoder).

doc-only:
doc-v2-distill_ndcg_index-size

bi-encoder:
v2-distill_ndcg_index-size

@zhichao-aws
Copy link
Member Author

We then test the impact on search relevance by pruning the sparse vector at retrieval time. We test on the BEIR datasets that have sufficient amount of documents, and use the index that are pruned to ~40% index size. We test these pruning methods and the 2-phase search.

In conclusion, the search-time pruning can boost the search time, but the search relevance get dropped very fast when we prune more tokens. In conclusion, 2-phase search is a more robust method that is both fast and precise. (especially bi-encoder)
doc-v2-distill_alpha_mass_ndcg_total-took
v2-distill_alpha_mass_ndcg_total-took

@zhichao-aws
Copy link
Member Author

Further experiments are conducted to test different prune types on two-phase search. On bi-encoder different strategies have similar performance. With doc-only, alpha mass is slightly better. We can add the choice to two phase search pipeline to provide more choices to users.

prune-2-phase
prune-2-phase-doc

@martin-gaievski
Copy link
Member

Is there a GH issue for this RFC?

@zhichao-aws
Copy link
Member Author

Is there a GH issue for this RFC?

No, we don't have an issue for it

@heemin32
Copy link
Collaborator

To make the ingest processor work for raw sparse vectors ingestion, one prerequiste is to implement this issue: #793 . We can add a parameter to configure whether call model inference for raw vectors ingestion

Please help me to understand why #793 is prerequisite? In my understanding #793 is about not calling model inference again during document update when there is no changes on the original text for the embedding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants