Explainability in hybrid query #970

@martin-gaievski martin-gaievski commented Oct 31, 2024

Description

Adds support for the explain flag in hybrid query; the implementation follows the design shared in RFC #905.

This PR will be merged to the feature branch because some work regarding security still needs to be done.

The search pipeline needs a new response processor in order to format the explanation payload. Example of a search pipeline configuration:

{
    "description": "Post processor for hybrid search",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {
                    "technique": "min_max"
                },
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {
                        "weights": [
                            0.3,
                            0.7
                        ]
                    }
                }
            }
        }
    ],
    "response_processors": [
        {
            "explanation_response_processor": {}
        }
    ]
}
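
As a rough illustration, the configuration above could be registered through the search pipeline API (`PUT /_search/pipeline/<name>`). The sketch below only builds the request and sends nothing. The pipeline name `nlp-search-pipeline` is taken from the search request in this description, and the helper name `build_pipeline_request` is hypothetical.

```python
import json


def build_pipeline_request(name, weights):
    """Return (method, path, body) for registering the pipeline.

    Hypothetical helper for illustration only; it does not call the cluster.
    """
    body = {
        "description": "Post processor for hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": list(weights)},
                    },
                }
            }
        ],
        # The new response processor that formats the explanation payload.
        "response_processors": [{"explanation_response_processor": {}}],
    }
    return "PUT", f"/_search/pipeline/{name}", json.dumps(body)


method, path, payload = build_pipeline_request("nlp-search-pipeline", [0.3, 0.7])
print(method, path)
```

The returned triple can be handed to any HTTP client; keeping the builder free of network calls makes it easy to check the payload before sending it.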

Example of a search request. Nothing changes there; the explain flag is simply supported now:

GET {{base_url}}/index-test/_search?search_pipeline=nlp-search-pipeline&explain=true
{
    "query": {
        "hybrid": {
            "queries": [
                {
                    "range": {
                        "field1": {
                            "gte": 0,
                            "lte": 500
                        }
                    }
                },
                {
                    "knn": {
                        "vector": {
                            "vector": [
                                5.0,
                                4.0,
                                2.1
                            ],
                            "k": 12
                        }
                    }
                }
            ]
        }
    }
}
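
For readers who script their requests, here is a minimal sketch of how the path and body above might be assembled. The function name `build_hybrid_search` is hypothetical; the index, pipeline name, and sub-queries are the ones from the example request.

```python
from urllib.parse import urlencode


def build_hybrid_search(index, pipeline, sub_queries, explain=True):
    """Return (path, body) for a hybrid search; nothing is sent.

    Hypothetical helper for illustration only.
    """
    params = {"search_pipeline": pipeline}
    if explain:
        # The explain flag is passed as a regular query-string parameter.
        params["explain"] = "true"
    path = f"/{index}/_search?{urlencode(params)}"
    body = {"query": {"hybrid": {"queries": list(sub_queries)}}}
    return path, body


range_query = {"range": {"field1": {"gte": 0, "lte": 500}}}
knn_query = {"knn": {"vector": {"vector": [5.0, 4.0, 2.1], "k": 12}}}

path, body = build_hybrid_search(
    "index-test", "nlp-search-pipeline", [range_query, knn_query]
)
print(path)
```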

Example of response:

{
    "took": 32,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_shard": "[index-test][2]",
                "_node": "RXY_lXK9Q9yDsiuBGnDWZQ",
                "_index": "index-test",
                "_id": "bLm3JpMBn-5JX7ZEfRYl",
                "_score": 1.0,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic",
                    "name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
                    "category": "novel",
                    "price": 20
                },
                "_explanation": {
                    "value": 1.0,
                    "description": "arithmetic_mean, weights [0.3, 0.7] combination of:",
                    "details": [
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 1.0,
                                    "description": "field1:[0 TO 500]",
                                    "details": []
                                }
                            ]
                        },
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 0.026301946,
                                    "description": "within top 12",
                                    "details": []
                                }
                            ]
                        }
                    ]
                }
            },
            {
                "_shard": "[index-test][1]",
                "_node": "RXY_lXK9Q9yDsiuBGnDWZQ",
                "_index": "index-test",
                "_id": "cLm3JpMBn-5JX7ZEfRYm",
                "_score": 0.74275696,
                "_source": {
                    "field1": 100,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java",
                    "name": "Does he have a big family?",
                    "category": "biography",
                    "price": 70
                },
                "_explanation": {
                    "value": 0.74275696,
                    "description": "arithmetic_mean, weights [0.3, 0.7] combination of:",
                    "details": [
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 1.0,
                                    "description": "field1:[0 TO 500]",
                                    "details": []
                                }
                            ]
                        },
                        {
                            "value": 0.63287735,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 0.023969319,
                                    "description": "within top 12",
                                    "details": []
                                }
                            ]
                        }
                    ]
                }
            },
            {
                "_shard": "[index-test][2]",
                "_node": "RXY_lXK9Q9yDsiuBGnDWZQ",
                "_index": "index-test",
                "_id": "bbm3JpMBn-5JX7ZEfRYm",
                "_score": 0.74275696,
                "_source": {
                    "name": "I brought home the trophy",
                    "category": "story",
                    "price": 20,
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                },
                "_explanation": {
                    "value": 0.74275696,
                    "description": "arithmetic_mean, weights [0.3, 0.7] combination of:",
                    "details": [
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 1.0,
                                    "description": "field1:[0 TO 500]",
                                    "details": []
                                }
                            ]
                        },
                        {
                            "value": 0.63287735,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 0.023969319,
                                    "description": "within top 12",
                                    "details": []
                                }
                            ]
                        }
                    ]
                }
            },
....
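
The nested `_explanation` object can be read level by level: combination at the top, one normalization entry per sub-query below it, and the raw sub-query explanation at the leaves. A minimal sketch that walks the tree of the first hit above; `flatten_explanation` is a hypothetical helper, not part of the plugin.

```python
def flatten_explanation(node, depth=0):
    """Yield (depth, value, description) for every node, parents first."""
    yield depth, node["value"], node["description"]
    for child in node.get("details", []):
        yield from flatten_explanation(child, depth + 1)


# _explanation of the first hit in the response above.
explanation = {
    "value": 1.0,
    "description": "arithmetic_mean, weights [0.3, 0.7] combination of:",
    "details": [
        {
            "value": 1.0,
            "description": "min_max normalization of:",
            "details": [
                {"value": 1.0, "description": "field1:[0 TO 500]", "details": []}
            ],
        },
        {
            "value": 1.0,
            "description": "min_max normalization of:",
            "details": [
                {"value": 0.026301946, "description": "within top 12", "details": []}
            ],
        },
    ],
}

for depth, value, description in flatten_explanation(explanation):
    print("  " * depth + f"{value} {description}")
```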

The same query framed as a bool query gives the following response when used with explain:

{
    "took": 45,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": 1.026302,
        "hits": [
            {
                "_shard": "[index-test][0]",
                "_node": "XDmTbExDSFGxNBnzUfm6kQ",
                "_index": "index-test",
                "_id": "DBK6DZMBkau9uH9Rb7pV",
                "_score": 1.026302,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic",
                    "name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
                    "category": "novel",
                    "price": 20
                },
                "_explanation": {
                    "value": 1.026302,
                    "description": "sum of:",
                    "details": [
                        {
                            "value": 1.0,
                            "description": "field1:[0 TO 500]",
                            "details": []
                        },
                        {
                            "value": 0.026301946,
                            "description": "within top 12",
                            "details": []
                        }
                    ]
                }
            },
            {
                "_shard": "[index-test][0]",
                "_node": "XDmTbExDSFGxNBnzUfm6kQ",
                "_index": "index-test",
                "_id": "DRK6DZMBkau9uH9Rb7pW",
                "_score": 1.0239693,
                "_source": {
                    "name": "I brought home the trophy",
                    "category": "story",
                    "price": 20,
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                },
                "_explanation": {
                    "value": 1.0239693,
                    "description": "sum of:",
                    "details": [
                        {
                            "value": 1.0,
                            "description": "field1:[0 TO 500]",
                            "details": []
                        },
                        {
                            "value": 0.023969319,
                            "description": "within top 12",
                            "details": []
                        }
                    ]
                }
            },
...
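
In the bool response the top-level value is simply the sum of the raw sub-query scores, with no normalization or weighting step in between. A small sketch that checks this against the two hits shown above; `bool_score` is a hypothetical helper, and the comparison uses a tolerance because the reported scores are rounded.

```python
def bool_score(explanation):
    """Sum the values of the direct children of a 'sum of:' explanation."""
    return sum(detail["value"] for detail in explanation["details"])


# _explanation objects of the two hits in the bool response above.
first_hit = {
    "value": 1.026302,
    "description": "sum of:",
    "details": [
        {"value": 1.0, "description": "field1:[0 TO 500]", "details": []},
        {"value": 0.026301946, "description": "within top 12", "details": []},
    ],
}
second_hit = {
    "value": 1.0239693,
    "description": "sum of:",
    "details": [
        {"value": 1.0, "description": "field1:[0 TO 500]", "details": []},
        {"value": 0.023969319, "description": "within top 12", "details": []},
    ],
}

for hit in (first_hit, second_hit):
    # Each recomputed sum should agree with the reported value within rounding.
    print(bool_score(hit), hit["value"])
```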

We don't support explain by doc ID, as per the design, but I checked that we are not breaking anything. Here is the response for such an explain-by-ID request:

GET {{base_url}}/index-test/_explain/2g505JIBpgWgV4s0KKZf
{
    "_index": "index-test",
    "_id": "2g505JIBpgWgV4s0KKZf",
    "matched": true,
    "explanation": {
        "value": 1.0,
        "description": "base scores from subqueries:",
        "details": [
            {
                "value": 1.0,
                "description": "field1:[0 TO 500]",
                "details": []
            },
            {
                "value": 0.026301946,
                "description": "within top 12",
                "details": []
            }
        ]
    }
}

Related Issues

#658
Change request for documentation: opensearch-project/documentation-website#8645

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski martin-gaievski changed the title Poc/explain for hybrid v2 Explainability in hybrid query Oct 31, 2024
@martin-gaievski martin-gaievski force-pushed the poc/explain_for_hybrid_v2 branch 6 times, most recently from e8a374c to 6761ac7 Compare November 4, 2024 17:16
@martin-gaievski martin-gaievski marked this pull request as ready for review November 4, 2024 17:21
@martin-gaievski martin-gaievski force-pushed the poc/explain_for_hybrid_v2 branch 5 times, most recently from f590f46 to a19de09 Compare November 4, 2024 17:40
@martin-gaievski martin-gaievski added Enhancements Increases software capabilities beyond original client specifications and removed skip-changelog labels Nov 4, 2024
Signed-off-by: Martin Gaievski <[email protected]>
Signed-off-by: Martin Gaievski <[email protected]>
@martin-gaievski martin-gaievski force-pushed the poc/explain_for_hybrid_v2 branch 2 times, most recently from f7348b9 to d04f21f Compare November 13, 2024 17:52
@martin-gaievski
Member Author

martin-gaievski commented Nov 13, 2024

Pushed a commit that changes the format to a hierarchical structure, similar to other queries. This addresses Heemin's request.

The updated format follows; I also put it in the PR description.

                "_explanation": {
                    "value": 1.0,
                    "description": "arithmetic_mean, weights [0.3, 0.7] combination of:",
                    "details": [
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 1.0,
                                    "description": "field1:[0 TO 500]",
                                    "details": []
                                }
                            ]
                        },
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:",
                            "details": [
                                {
                                    "value": 0.026301946,
                                    "description": "within top 12",
                                    "details": []
                                }
                            ]
                        }
                    ]
                }
            },

@martin-gaievski
Member Author

A few more questions. What will the response look like if they pass explain=true but do not include explain_response_processor in their pipeline?

I've added an integ test for this scenario.

@vibrantvarun
Member

@martin-gaievski Can you help me understand what "field1:[0 TO 500]" and "within top 12" mean in the explanation response?

Overall the code looks good to me. One question: aren't we supposed to show the original scores prior to normalization in the search response?

@martin-gaievski
Member Author

@martin-gaievski Can you help me understand what "field1:[0 TO 500]" and "within top 12" mean in the explanation response?

Overall the code looks good to me. One question: aren't we supposed to show the original scores prior to normalization in the search response?

The original scores are there. Please check the breakdown of explanation elements below; I hope it answers all the questions.

"_explanation": {
                    "value": 1.0,
                    "description": "arithmetic_mean, weights [0.3, 0.7] combination of:",  // combination technique used for score combination
                    "details": [
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:",  // normalization technique used for score normalization
                            "details": [
                                {
                                    "value": 1.0,  //raw/source score from sub-query 1
                                    "description": "field1:[0 TO 500]", // this is explanation from sub query 1, in this case this is range query
                                    "details": []
                                }
                            ]
                        },
                        {
                            "value": 1.0,
                            "description": "min_max normalization of:", // normalization technique used for score normalization
                            "details": [
                                {
                                    "value": 0.026301946,  //raw/source score from sub-query 2
                                    "description": "within top 12", // this is explanation from sub query 2, in this case this is knn query
                                    "details": []
                                }
                            ]
                        }
                    ]
                }
            },
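
To connect the annotated levels numerically, the sketch below recombines the two normalized scores from the breakdown above. It assumes the weighted arithmetic mean is computed as sum(weight * score) / sum(weights); that formula is an assumption for illustration and is not quoted from this PR, and `weighted_arithmetic_mean` is a hypothetical helper.

```python
def weighted_arithmetic_mean(scores, weights):
    """Weighted mean of normalized sub-query scores.

    Assumed formula: sum(w * s) / sum(w). Illustration only.
    """
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Values of the two "min_max normalization of:" entries in the breakdown above.
normalized_scores = [1.0, 1.0]
# Weights from the "arithmetic_mean, weights [0.3, 0.7]" description.
weights = [0.3, 0.7]

combined = weighted_arithmetic_mean(normalized_scores, weights)
print(combined)
```

Under that assumption the result agrees with the top-level "value" of 1.0. Note that the inputs are the normalized scores, not the raw leaf scores such as 0.026301946.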

@vibrantvarun
Member

vibrantvarun commented Nov 15, 2024

Can you add one integ test with "multi_match" and "neural" sub-queries? Most customers use this combination, and lately we have seen this query combination creating issues.

List<Pair<Float, String>> scoreDetails;

public ExplanationDetails(List<Pair<Float, String>> scoreDetails) {
this(-1, scoreDetails);
Member

Why hardcode this -1 value?

Member Author

This matches the default docId in SearchHit (https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/SearchHit.java#L170); we set it when we don't have access to the real docId value.

Member

Can you add this explanation at line 24? From the look of it, it would be very difficult to guess. Also, please add the code reference as well.

Member Author

yup, added

Signed-off-by: Martin Gaievski <[email protected]>
@ToString.Include
public static final String TECHNIQUE_NAME = "geometric_mean";
public static final String PARAM_NAME_WEIGHTS = "weights";
Member

Why did we remove this?

Member Author

@martin-gaievski
Member Author

Can you add one integ test with "multi_match" and "neural" sub-queries? Most customers use this combination, and lately we have seen this query combination creating issues.

I'm not sure it's necessary in the scope of explain; it doesn't matter which query we pick, all we care about is the format of the newly added sections. We can add it in a later PR if needed; anyway, the code goes to the feature branch.

@vibrantvarun
Member

The reason I asked is that it involves multiple scorers behind the scenes, so it would be good to test. You can add it in the actual PR.

@vibrantvarun
Member

vibrantvarun commented Nov 15, 2024

One thing: I see in the HLD that when we call the explain API, it triggers explain for the docId before executing the new response processor. Don't you think that step is unnecessary, since by default we get raw scores from QueryPhase and just have to add them to the explain response? For the normalization explanation we are triggering this new search response processor.
I am referring to the circled area. Can we disable that? Not in the scope of this PR or feature, I mean.
Screenshot 2024-11-14 at 10 14 15 PM

Just thoughts!!

@martin-gaievski
Member Author

One thing: I see in the HLD that when we call the explain API, it triggers explain for the docId before executing the new response processor. Don't you think that step is unnecessary, since by default we get raw scores from QueryPhase and just have to add them to the explain response? For the normalization explanation we are triggering this new search response processor.

Just thoughts!!

explain is one of those features that are implemented with a two-phase approach: first it retrieves data in the query phase, and then it compiles the final response in the fetch phase. We need to enable explain in the query by collecting sub-query scores; the scores are then set on the SearchHits in the fetch phase automatically. We do not call anything explicitly, we just add implementations for existing interface methods.

@vibrantvarun
Member

Got it, approving this PR now. I will review the final PR once again after the security clearance.

Signed-off-by: Martin Gaievski <[email protected]>
@martin-gaievski
Member Author

The reason I asked is that it involves multiple scorers behind the scenes, so it would be good to test. You can add it in the actual PR.

I've added the test, with one change: I used a multi-match + knn query to avoid model deployment.

@martin-gaievski martin-gaievski merged commit a3fd3a2 into opensearch-project:feature/explainability-in-hybrid-query Nov 15, 2024
38 checks passed
Labels
Enhancements Increases software capabilities beyond original client specifications
4 participants