Explainability in hybrid query #970

martin-gaievski · 2024-10-31T01:13:53Z

Description

Adds support for the explain flag in hybrid query. The implementation follows the design shared in RFC #905.

This PR will be merged into the feature branch because some security-related work still needs to be done.

The search pipeline needs a new response processor in order to format the explanation payload. Example of a search pipeline configuration:

{
    "description": "Post processor for hybrid search",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {
                    "technique": "min_max"
                },
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {
                        "weights": [
                            0.3,
                            0.7
                        ]
                    }
                }
            }
        }
    ],
    "response_processors": [
        {
            "explanation_response_processor": {}
        }
    ]
}
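For anyone scripting the setup, here is a minimal Python sketch that assembles the same pipeline definition and the request that would register it. It assumes the pipeline is registered through the search pipeline endpoint (`PUT /_search/pipeline/<name>`), reuses the `nlp-search-pipeline` name from the search request in this description, and uses a hypothetical `http://localhost:9200` address. The helper names `build_pipeline_body` and `pipeline_request` are illustrative and are not part of this PR.

```python
import json
import urllib.request


def build_pipeline_body(weights):
    """Return the pipeline definition from the example above as a dict."""
    return {
        "description": "Post processor for hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": list(weights)},
                    },
                }
            }
        ],
        # Response processor that formats the explanation payload.
        "response_processors": [{"explanation_response_processor": {}}],
    }


def pipeline_request(base_url, pipeline_name, body):
    """Build (but do not send) the PUT request that registers the pipeline."""
    return urllib.request.Request(
        url=f"{base_url}/_search/pipeline/{pipeline_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical local cluster address; adjust for your environment.
request = pipeline_request(
    "http://localhost:9200",
    "nlp-search-pipeline",
    build_pipeline_body([0.3, 0.7]),
)
print(request.get_method(), request.full_url)
```

The request is only constructed here; passing it to `urllib.request.urlopen(request)` would send it to a running cluster.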

Example of a search request. The request itself is unchanged; the only difference is that the explain flag is now supported:

GET {{base_url}}/index-test/_search?search_pipeline=nlp-search-pipeline&explain=true
{
    "query": {
        "hybrid": {
            "queries": [
                {
                    "range": {
                        "field1": {
                            "gte": 0,
                            "lte": 500
                        }
                    }
                },
                {
                    "knn": {
                        "vector": {
                            "vector": [
                                5.0,
                                4.0,
                                2.1
                            ],
                            "k": 12
                        }
                    }
                }
            ]
        }
    }
}

Example of response:

{
    "took": 32,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_shard": "[index-test][2]",
                "_node": "RXY_lXK9Q9yDsiuBGnDWZQ",
                "_index": "index-test",
                "_id": "bLm3JpMBn-5JX7ZEfRYl",
                "_score": 1.0,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic",
                    "name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
                    "category": "novel",
                    "price": 20
                },
                "_explanation": {
                    "value": 1.0,
                    "description": "arithmetic_mean, weights [0.3, 0.7] combination of:",
                    "details": [
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 1.0,
                                    "description": "field1:[0 TO 500]",
                                    "details": []
                                }
                            ]
                        },
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 0.026301946,
                                    "description": "within top 12",
                                    "details": []
                                }
                            ]
                        }
                    ]
                }
            },
            {
                "_shard": "[index-test][1]",
                "_node": "RXY_lXK9Q9yDsiuBGnDWZQ",
                "_index": "index-test",
                "_id": "cLm3JpMBn-5JX7ZEfRYm",
                "_score": 0.74275696,
                "_source": {
                    "field1": 100,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java",
                    "name": "Does he have a big family?",
                    "category": "biography",
                    "price": 70
                },
                "_explanation": {
                    "value": 0.74275696,
                    "description": "arithmetic_mean, weights [0.3, 0.7] combination of:",
                    "details": [
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 1.0,
                                    "description": "field1:[0 TO 500]",
                                    "details": []
                                }
                            ]
                        },
                        {
                            "value": 0.63287735,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 0.023969319,
                                    "description": "within top 12",
                                    "details": []
                                }
                            ]
                        }
                    ]
                }
            },
            {
                "_shard": "[index-test][2]",
                "_node": "RXY_lXK9Q9yDsiuBGnDWZQ",
                "_index": "index-test",
                "_id": "bbm3JpMBn-5JX7ZEfRYm",
                "_score": 0.74275696,
                "_source": {
                    "name": "I brought home the trophy",
                    "category": "story",
                    "price": 20,
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                },
                "_explanation": {
                    "value": 0.74275696,
                    "description": "arithmetic_mean, weights [0.3, 0.7] combination of:",
                    "details": [
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 1.0,
                                    "description": "field1:[0 TO 500]",
                                    "details": []
                                }
                            ]
                        },
                        {
                            "value": 0.63287735,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 0.023969319,
                                    "description": "within top 12",
                                    "details": []
                                }
                            ]
                        }
                    ]
                }
            },
....

The same query framed as a bool query gives the following response when used with explain:

{
    "took": 45,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": 1.026302,
        "hits": [
            {
                "_shard": "[index-test][0]",
                "_node": "XDmTbExDSFGxNBnzUfm6kQ",
                "_index": "index-test",
                "_id": "DBK6DZMBkau9uH9Rb7pV",
                "_score": 1.026302,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic",
                    "name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
                    "category": "novel",
                    "price": 20
                },
                "_explanation": {
                    "value": 1.026302,
                    "description": "sum of:",
                    "details": [
                        {
                            "value": 1.0,
                            "description": "field1:[0 TO 500]",
                            "details": []
                        },
                        {
                            "value": 0.026301946,
                            "description": "within top 12",
                            "details": []
                        }
                    ]
                }
            },
            {
                "_shard": "[index-test][0]",
                "_node": "XDmTbExDSFGxNBnzUfm6kQ",
                "_index": "index-test",
                "_id": "DRK6DZMBkau9uH9Rb7pW",
                "_score": 1.0239693,
                "_source": {
                    "name": "I brought home the trophy",
                    "category": "story",
                    "price": 20,
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                },
                "_explanation": {
                    "value": 1.0239693,
                    "description": "sum of:",
                    "details": [
                        {
                            "value": 1.0,
                            "description": "field1:[0 TO 500]",
                            "details": []
                        },
                        {
                            "value": 0.023969319,
                            "description": "within top 12",
                            "details": []
                        }
                    ]
                }
            },
...
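To check that the bool explanation carries the same raw sub-query scores as the hybrid one, the leaf entries can be collected and summed. The following Python sketch uses the first hit of the bool response above as input; the helper name `leaf_scores` is illustrative and not part of this PR.

```python
def leaf_scores(explanation):
    """Collect (description, value) pairs for nodes with no nested details."""
    details = explanation.get("details", [])
    if not details:
        return [(explanation["description"], explanation["value"])]
    leaves = []
    for child in details:
        leaves.extend(leaf_scores(child))
    return leaves


# Explanation of the first hit from the bool response above.
bool_explanation = {
    "value": 1.026302,
    "description": "sum of:",
    "details": [
        {"value": 1.0, "description": "field1:[0 TO 500]", "details": []},
        {"value": 0.026301946, "description": "within top 12", "details": []},
    ],
}

leaves = leaf_scores(bool_explanation)
total = sum(value for _, value in leaves)
print(leaves)
print(round(total, 6))
```

The sum of the two leaf values agrees with the reported `_score` of 1.026302 up to float rounding, and the same two leaf values appear under the "min_max normalization of:" nodes in the hybrid response.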

By design, we don't support explain by doc ID, but I checked that we're not breaking anything. Here is the response for such an explain-by-ID request:

GET {{base_url}}/index-test/_explain/2g505JIBpgWgV4s0KKZf
{
    "_index": "index-test",
    "_id": "2g505JIBpgWgV4s0KKZf",
    "matched": true,
    "explanation": {
        "value": 1.0,
        "description": "base scores from subqueries:",
        "details": [
            {
                "value": 1.0,
                "description": "field1:[0 TO 500]",
                "details": []
            },
            {
                "value": 0.026301946,
                "description": "within top 12",
                "details": []
            }
        ]
    }
}

Related Issues

#658
Change request for documentation: opensearch-project/documentation-website#8645

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski · 2024-11-13T18:14:44Z

Pushed a commit that changes the format to a hierarchical structure, similar to other queries. This addresses Heemin's request.

The updated format follows; I have also added it to the PR description:

                "_explanation": {
                    "value": 1.0,
                    "description": "arithmetic_mean, weights [0.3, 0.7] combination of:",
                    "details": [
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 1.0,
                                    "description": "field1:[0 TO 500]",
                                    "details": []
                                }
                            ]
                        },
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 0.026301946,
                                    "description": "within top 12",
                                    "details": []
                                }
                            ]
                        }
                    ]
                }
            },

martin-gaievski · 2024-11-13T18:16:48Z

A few more questions. What will the response look like if they pass explain=true but do not include explanation_response_processor in their pipeline?

I've added an integ test for this scenario.

src/main/java/org/opensearch/neuralsearch/processor/explain/ExplanationDetails.java

src/main/java/org/opensearch/neuralsearch/processor/explain/ExplanationUtils.java

Signed-off-by: Martin Gaievski <[email protected]>

vibrantvarun · 2024-11-15T02:29:25Z

@martin-gaievski Can you help me understand what field1:[0 TO 500] and within top 12 mean in the explanation response?

Overall the code looks good to me. One question: are we not supposed to show the original scores prior to normalization in the search response?

src/main/java/org/opensearch/neuralsearch/processor/ExplanationResponseProcessor.java

src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java

src/main/java/org/opensearch/neuralsearch/processor/combination/ScoreCombiner.java

src/main/java/org/opensearch/neuralsearch/processor/explain/ExplanationUtils.java

src/test/java/org/opensearch/neuralsearch/query/HybridQueryExplainIT.java

martin-gaievski · 2024-11-15T04:26:09Z

@martin-gaievski Can you help me understand what field1:[0 TO 500] and within top 12 mean in the explanation response?

Overall the code looks good to me. One question: are we not supposed to show the original scores prior to normalization in the search response?

The original scores are there. Please check the breakdown of the explanation elements below; I hope it answers all the questions.

"_explanation": {
                    "value": 1.0,
                    "description": "arithmetic_mean, weights [0.3, 0.7] combination of:",  // combination technique used for score combination
                    "details": [
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:",  // normalization technique used for score normalization
                            "details": [
                                {
                                    "value": 1.0,  //raw/source score from sub-query 1
                                    "description": "field1:[0 TO 500]", // this is explanation from sub query 1, in this case this is range query
                                    "details": []
                                }
                            ]
                        },
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:", // normalization technique used for score normalization
                            "details": [
                                {
                                    "value": 0.026301946,  //raw/source score from sub-query 2
                                    "description": "within top 12", // this is explanation from sub query 2, in this case this is knn query
                                    "details": []
                                }
                            ]
                        }
                    ]
                }
            },
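The same breakdown can be read programmatically. Below is a minimal Python sketch that flattens one `_explanation` object into the combination description plus one row per sub-query. It assumes the three-level layout shown above (combination, then normalization, then sub-query explanation); the function name `breakdown` and the output keys are illustrative, not part of this PR.

```python
def breakdown(explanation):
    """Flatten a hybrid _explanation into per-sub-query rows."""
    rows = []
    # Level 2: one node per sub-query, holding the normalized score.
    for normalized in explanation.get("details", []):
        # Level 3: the sub-query's own explanation, holding the raw score.
        for raw in normalized.get("details", []):
            rows.append(
                {
                    "sub_query": raw["description"],
                    "raw_score": raw["value"],
                    "normalization": normalized["description"],
                    "normalized_score": normalized["value"],
                }
            )
    return {
        "combination": explanation["description"],
        "combined_score": explanation["value"],
        "sub_queries": rows,
    }


# Explanation of the top hit, copied from the breakdown above.
top_hit = {
    "value": 1.0,
    "description": "arithmetic_mean, weights [0.3, 0.7] combination of:",
    "details": [
        {
            "value": 1.0,
            "description": "min_max normalization of:",
            "details": [
                {"value": 1.0, "description": "field1:[0 TO 500]", "details": []}
            ],
        },
        {
            "value": 1.0,
            "description": "min_max normalization of:",
            "details": [
                {"value": 0.026301946, "description": "within top 12", "details": []}
            ],
        },
    ],
}

result = breakdown(top_hit)
for row in result["sub_queries"]:
    print(row["sub_query"], row["raw_score"], "->", row["normalized_score"])
```

For the top hit this lists the range sub-query with raw score 1.0 and the knn sub-query with raw score 0.026301946, both normalized to 1.0.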

vibrantvarun · 2024-11-15T05:54:53Z

Can you add one integ test with "multi_match" and "neural" subqueries? Most customers use this combination, and lately we have seen this query combination causing issues.

vibrantvarun · 2024-11-15T05:52:39Z

src/main/java/org/opensearch/neuralsearch/processor/explain/ExplanationDetails.java

+    List<Pair<Float, String>> scoreDetails;
+
+    public ExplanationDetails(List<Pair<Float, String>> scoreDetails) {
+        this(-1, scoreDetails);


Why hardcode this -1 value?

This matches the default docId in SearchHit https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/SearchHit.java#L170; we set it when we don't have access to the real docId value.

Can you add this explanation at line 24? By the look of it, it is very difficult to guess. Also, please add the code ref as well.

Signed-off-by: Martin Gaievski <[email protected]>

vibrantvarun · 2024-11-15T05:56:45Z

...rg/opensearch/neuralsearch/processor/combination/GeometricMeanScoreCombinationTechnique.java

    @ToString.Include
    public static final String TECHNIQUE_NAME = "geometric_mean";
-    public static final String PARAM_NAME_WEIGHTS = "weights";


Why did we remove this?

It is a duplicate; the same constant is in ScoreCombinationUtil, and we reference it from all technique classes https://github.com/martin-gaievski/neural-search/blob/poc/explain_for_hybrid_v2/src/main/java/org/opensearch/neuralsearch/processor/combination/GeometricMeanScoreCombinationTechnique.java#L14

martin-gaievski · 2024-11-15T06:07:05Z

Can you add one integ test with "multi_match" and "neural" subqueries? Most customers use this combination, and lately we have seen this query combination causing issues.

I'm not sure it's necessary in the scope of explain. It doesn't matter which query we pick; all we care about is the format of the newly added sections. We can add it in a later PR if needed; the code goes to the feature branch anyway.

vibrantvarun · 2024-11-15T06:08:17Z

The reason I asked is that it involves multiple scorers behind the scenes, so it would be good to test. You can add it in the actual PR.

vibrantvarun · 2024-11-15T06:10:53Z

One thing: I see in the HLD that when we call the explain API, it triggers explain for the docId before executing the new response processor. Don't you think we can skip that step, since we get raw scores from QueryPhase by default and just have to add them to the explain response? For the normalization explanation we are triggering this new search response processor.
I am referring to the circled area. Can we disable that? I mean, not in the scope of this PR or feature.

Just thoughts!!

martin-gaievski · 2024-11-15T06:21:00Z

One thing: I see in the HLD that when we call the explain API, it triggers explain for the docId before executing the new response processor. Don't you think we can skip that step, since we get raw scores from QueryPhase by default and just have to add them to the explain response? For the normalization explanation we are triggering this new search response processor.

Just thoughts!!

Explain is one of those features that are implemented with a two-phase approach: first it retrieves data in the query phase, and then it compiles the final response in the fetch phase. We need to enable explain in the query by collecting sub-query scores; the scores are then set on SearchHits in the fetch phase automatically. We do not call anything explicitly; we just add implementations for existing interface methods.

vibrantvarun · 2024-11-15T06:23:02Z

Got it, approving this PR now. I will review the final PR once again after the security clearance.

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski · 2024-11-15T18:43:21Z

The reason I asked is that it involves multiple scorers behind the scenes, so it would be good to test. You can add it in the actual PR.

I've added the test, with one change: I used a multi-match + knn query to avoid model deployment.

martin-gaievski changed the title ~~Poc/explain for hybrid v2~~ Explainability in hybrid query Oct 31, 2024

martin-gaievski added the skip-changelog label Oct 31, 2024

martin-gaievski force-pushed the poc/explain_for_hybrid_v2 branch 6 times, most recently from e8a374c to 6761ac7 Compare November 4, 2024 17:16

martin-gaievski marked this pull request as ready for review November 4, 2024 17:21

martin-gaievski requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, sean-zheng-amazon, model-collapse, zane-neo, vibrantvarun, zhichao-aws and yuye-aws as code owners November 4, 2024 17:21

martin-gaievski force-pushed the poc/explain_for_hybrid_v2 branch 5 times, most recently from f590f46 to a19de09 Compare November 4, 2024 17:40

martin-gaievski added Enhancements Increases software capabilities beyond original client specifications and removed skip-changelog labels Nov 4, 2024

martin-gaievski force-pushed the poc/explain_for_hybrid_v2 branch from 0f3813f to 37cd3c1 Compare November 4, 2024 17:57

martin-gaievski added 2 commits November 12, 2024 18:39

Doing some refactoring

a25acc5

Signed-off-by: Martin Gaievski <[email protected]>

Refactor classes and methods

72c0ac3

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski force-pushed the poc/explain_for_hybrid_v2 branch 2 times, most recently from f7348b9 to d04f21f Compare November 13, 2024 17:52

Change response format, switch to hierarchical structure

9830ab3

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski force-pushed the poc/explain_for_hybrid_v2 branch from d04f21f to 9830ab3 Compare November 13, 2024 17:54

heemin32 reviewed Nov 13, 2024

src/main/java/org/opensearch/neuralsearch/processor/explain/ExplanationDetails.java

src/main/java/org/opensearch/neuralsearch/processor/explain/ExplanationUtils.java

Convert record to lombok value, add unit tests

7a95087

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski requested a review from heemin32 November 13, 2024 20:48

heemin32 approved these changes Nov 14, 2024

vibrantvarun reviewed Nov 15, 2024

martin-gaievski force-pushed the poc/explain_for_hybrid_v2 branch from 9ac81cd to d297e0f Compare November 15, 2024 05:54

vibrantvarun reviewed Nov 15, 2024

Address revie comments

9098781

Signed-off-by: Martin Gaievski <[email protected]>

vibrantvarun reviewed Nov 15, 2024

martin-gaievski force-pushed the poc/explain_for_hybrid_v2 branch from d297e0f to 9098781 Compare November 15, 2024 05:59

vibrantvarun approved these changes Nov 15, 2024

Add and refactor integ tests

e21d4ee

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski merged commit a3fd3a2 into opensearch-project:feature/explainability-in-hybrid-query Nov 15, 2024
38 checks passed
