fix: trace LLMEvaluator runs #1185
base: main
Conversation
```diff
 if isinstance(self.score_config, CategoricalScoreConfig):
     value = output["score"]
-    explanation = output.get("explanation", None)
+    explanation = output.get(self.reasoning_key, None)
```
I think we should handle the case where any of (`self.reasoning_key`, `"reasoning"`, `"explanation"`, `"comment"`) is provided.
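A rough sketch of that fallback (only `self.reasoning_key` comes from the diff; the other keys are the literal fallbacks suggested in this comment):

```python
# Check each candidate key in order of preference; the tuple below is
# the fallback order proposed in this comment, not existing code.
candidates = (self.reasoning_key, "reasoning", "explanation", "comment")
explanation = next(
    (output[key] for key in candidates if key is not None and key in output),
    None,
)
```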
I'm slightly confused about how something other than `self.score_config.reasoning_key` could exist here. In `_create_score_json_schema`, don't we only set `score_config.reasoning_key`? I'm probably just missing something simple.
Nice! Just a few nits left, then this should be good.
```diff
-    include_explanation: bool = False
-    explanation_description: Optional[str] = None
+    reasoning_key: Optional[str] = None
+    reasoning_description: Optional[str] = None
```
This is a breaking change, so we'll need to handle accepting the other value in the init and log a deprecation warning.
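For instance, something along these lines (a sketch assuming a pydantic-style model; mapping the deprecated flag to `reasoning_key="explanation"` is illustrative, not the PR's actual choice):

```python
import warnings
from typing import Optional

from pydantic import BaseModel


class CategoricalScoreConfig(BaseModel):  # illustrative subset of the real class
    key: str
    reasoning_key: Optional[str] = None

    def __init__(self, **data):
        # Accept the deprecated kwarg, warn, and translate it to the new field.
        if data.pop("include_explanation", False):
            warnings.warn(
                "include_explanation is deprecated; use reasoning_key instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            data.setdefault("reasoning_key", "explanation")
        super().__init__(**data)
```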
```diff
 if isinstance(score_config, CategoricalScoreConfig):
-    properties["score"] = {
+    properties["value"] = {
         "type": "string",
         "enum": score_config.choices,
         "description": f"The score for the evaluation, one of "
```
It isn't a score anymore, is it? It's the selected category.
The descriptions here matter a lot.
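For example, the description could read (suggested wording only, not the PR's final text):

```python
properties["value"] = {
    "type": "string",
    "enum": score_config.choices,
    # Framed as a category selection rather than a score.
    "description": (
        "The selected category for the evaluation, one of "
        f"{', '.join(score_config.choices)}."
    ),
}
```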
```diff
         await self.runnable.ainvoke(variables, config={"run_id": source_run_id}),
     )

+    return self._parse_output(output, str(source_run_id))
```
nit: Could leave as a UUID
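i.e., roughly (sketch):

```python
# Pass the UUID through unchanged instead of stringifying it.
return self._parse_output(output, source_run_id)
```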
```diff
@@ -15,7 +15,7 @@ def test_llm_evaluator_init() -> None:
         key="vagueness",
         choices=["Y", "N"],
         description="Whether the response is vague. Y for yes, N for no.",
-        include_explanation=True,
```
Should keep a couple of these tests with `include_explanation` around for backward-compat testing.
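For example, a legacy-path test along these lines could stay (hypothetical test name; values mirror the existing test above):

```python
def test_llm_evaluator_init_legacy_explanation() -> None:
    # Deliberately uses the deprecated include_explanation flag so a
    # regression in the compat shim shows up here.
    config = CategoricalScoreConfig(
        key="vagueness",
        choices=["Y", "N"],
        description="Whether the response is vague. Y for yes, N for no.",
        include_explanation=True,
    )
    assert config.key == "vagueness"
```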
added some backward compatibility tests
```python
    description (str): Detailed description provided to the LLM judge of what
        this score evaluates.
    reasoning_key (Optional[str]): Key used to store the reasoning/explanation
        for the score. Defaults to None.
```
Say what None means (don't include a reasoning/CoT field).
Ditto with description (defaults to "The explanation for the score.").
I still think we should have a better default than "The explanation for the score." - like "Think step-by-step about what the correct score should be."
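So the defaults might look like this (wording taken from this comment, not the merged text):

```python
# None means: don't include a reasoning/CoT field in the output schema.
reasoning_key: Optional[str] = None
reasoning_description: str = (
    "Think step-by-step about what the correct score should be."
)
```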
Almost there!
```python
if data.get("include_explanation") and data.get("reasoning_key"):
    raise ValueError(
        "Cannot include both include_explanation and reasoning_key, "
        "please just use reasoning_key - include_explanation has been deprecated"  # noqa: E501
```
nit: If we're already splitting across lines, let's split them within the limit rather than using noqa.
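e.g. (illustrative):

```python
raise ValueError(
    "Cannot include both include_explanation and reasoning_key; "
    "please just use reasoning_key - include_explanation has been "
    "deprecated"
)
```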
```diff
     include_explanation: bool = False  # Deprecated
     explanation_description: Optional[str] = None  # Deprecated

+    def __init__(self, **data):
```
A sad byproduct of doing this is that you lose IDE completion. Could we add the keys here?
Ditto below.
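Something like an explicit keyword signature would restore completion (a sketch; the parameter list mirrors the fields shown in the diffs above):

```python
def __init__(
    self,
    *,
    reasoning_key: Optional[str] = None,
    reasoning_description: Optional[str] = None,
    include_explanation: bool = False,  # Deprecated
    explanation_description: Optional[str] = None,  # Deprecated
    **data: Any,
) -> None:
    super().__init__(
        reasoning_key=reasoning_key,
        reasoning_description=reasoning_description,
        include_explanation=include_explanation,
        explanation_description=explanation_description,
        **data,
    )
```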
```python
@pytest.mark.parametrize(
    "config_class", [CategoricalScoreConfig, ContinuousScoreConfig]
)
def test_backwards_compatibility(config_class) -> None:
```
nit: These could probably go in unit_tests/* rather than in integration tests.