
fix: trace LLMEvaluator runs #1185

Open: wants to merge 9 commits into base: main
Conversation

@isahers1 (Contributor) commented Nov 6, 2024

No description provided.

  if isinstance(self.score_config, CategoricalScoreConfig):
      value = output["score"]
-     explanation = output.get("explanation", None)
+     explanation = output.get(self.reasoning_key, None)
Collaborator:

I think we should handle if any of (self.reasoning_key, "reasoning", "explanation", "comment") are provided
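A minimal sketch of what this fallback could look like, assuming the parsed LLM output is a plain dict; the helper name and the exact key order are hypothetical, not the merged code:

```python
def extract_explanation(output: dict, reasoning_key=None):
    # Hypothetical helper: prefer the configured reasoning_key when set,
    # then fall back through the other candidate keys in order.
    candidates = (reasoning_key, "reasoning", "explanation", "comment")
    for key in candidates:
        if key is not None and key in output:
            return output[key]
    return None
```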

Contributor Author:

I am slightly confused about how something other than self.score_config.reasoning_key would exist. In _create_score_json_schema don't we only set score_config.reasoning_key? I am probably just missing something simple.

python/langsmith/evaluation/llm_evaluator.py (outdated; resolved)
python/langsmith/evaluation/llm_evaluator.py (outdated; resolved)
python/langsmith/evaluation/llm_evaluator.py (outdated; resolved)
@hinthornw (Collaborator) left a comment:

Nice! Just a few nits left, then should be good.

- include_explanation: bool = False
- explanation_description: Optional[str] = None
+ reasoning_key: Optional[str] = None
+ reasoning_description: Optional[str] = None
Collaborator:

This is a breaking change, so we'll need to be able to handle accepting the other value in the init and log a deprecation warning
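One way to satisfy this, sketched below without pydantic (the real config classes are pydantic models, so the actual fix would more likely live in a validator); mapping include_explanation to a default reasoning_key of "explanation" is an assumption:

```python
import warnings
from typing import List, Optional


class CategoricalScoreConfig:
    """Minimal sketch (not the real langsmith class) of accepting the
    deprecated kwargs while steering users toward the new ones."""

    def __init__(
        self,
        key: str,
        choices: List[str],
        description: str,
        reasoning_key: Optional[str] = None,
        reasoning_description: Optional[str] = None,
        include_explanation: bool = False,  # deprecated
        explanation_description: Optional[str] = None,  # deprecated
    ) -> None:
        if include_explanation:
            warnings.warn(
                "include_explanation is deprecated; pass reasoning_key "
                "instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            # Assumed mapping: the old flag implies an "explanation" field.
            reasoning_key = reasoning_key or "explanation"
        if explanation_description is not None:
            warnings.warn(
                "explanation_description is deprecated; pass "
                "reasoning_description instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            reasoning_description = (
                reasoning_description or explanation_description
            )
        self.key = key
        self.choices = choices
        self.description = description
        self.reasoning_key = reasoning_key
        self.reasoning_description = reasoning_description
```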

  if isinstance(score_config, CategoricalScoreConfig):
-     properties["score"] = {
+     properties["value"] = {
          "type": "string",
          "enum": score_config.choices,
          "description": f"The score for the evaluation, one of "
Collaborator:

It isn't a score anymore, is it? It's the selected category.
The descriptions here matter a lot
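A hedged sketch of the renamed property with a category-oriented description; the function name and exact wording are illustrative, not the merged code:

```python
def categorical_value_property(choices):
    # Describe the field as the selected category rather than a score;
    # this description is what the LLM judge actually sees in the schema.
    return {
        "type": "string",
        "enum": list(choices),
        "description": (
            "The selected category for the evaluation, one of: "
            + ", ".join(choices)
        ),
    }
```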

python/langsmith/evaluation/llm_evaluator.py (outdated; resolved)
await self.runnable.ainvoke(variables, config={"run_id": source_run_id}),
)

return self._parse_output(output, str(source_run_id))
Collaborator:

nit: Could leave as a UUID

@@ -15,7 +15,7 @@ def test_llm_evaluator_init() -> None:
key="vagueness",
choices=["Y", "N"],
description="Whether the response is vague. Y for yes, N for no.",
include_explanation=True,
Collaborator:

Should keep a couple of these tests with include_explanation around for backward compat testing

Contributor Author:

added some backward compatibility tests

description (str): Detailed description provided to the LLM judge of what
this score evaluates.
reasoning_key (Optional[str]): Key used to store the reasoning/explanation
for the score. Defaults to None.
Collaborator:

Say what None means (None means don't include a reasoning/CoT field).
Ditto with description (defaults to "The explanation for the score.").
I still think we should have a better default than "The explanation for the score." - like "Think step-by-step about what the correct score should be."
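Sketching the suggested defaults; the constant and function names are hypothetical, and the prompt text comes straight from the comment above:

```python
# Hypothetical default prompt text for the reasoning/CoT field, per the
# suggestion above; a reasoning_key of None means the field is omitted.
DEFAULT_REASONING_DESCRIPTION = (
    "Think step-by-step about what the correct score should be."
)


def reasoning_property(description=None):
    # Use the caller's description when given, otherwise the default.
    return {
        "type": "string",
        "description": description or DEFAULT_REASONING_DESCRIPTION,
    }
```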

@hinthornw (Collaborator) left a comment:

Almost there!

if data.get("include_explanation") and data.get("reasoning_key"):
raise ValueError(
"Cannot include both include_explanation and reasoning_key, "
"please just use reasoning_key - include_explanation has been deprecated" # noqa: E501
Collaborator:

nit: If we're already splitting across lines let's split them within the limit rather than using noqa
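Concretely, the message can be broken with implicit string concatenation so every source line fits within the limit; the validation helper name here is illustrative:

```python
def validate_config(data: dict) -> None:
    # Each string fragment stays under the line-length limit, so the
    # noqa: E501 suppression is no longer needed.
    if data.get("include_explanation") and data.get("reasoning_key"):
        raise ValueError(
            "Cannot include both include_explanation and reasoning_key; "
            "please use reasoning_key, as include_explanation has been "
            "deprecated"
        )
```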

include_explanation: bool = False # Deprecated
explanation_description: Optional[str] = None # Deprecated

def __init__(self, **data):
Collaborator:

A sad byproduct of us doing this is that you lose IDE completion. Could we add the keys here?
Ditto below.

@pytest.mark.parametrize(
"config_class", [CategoricalScoreConfig, ContinuousScoreConfig]
)
def test_backwards_compatibility(config_class) -> None:
Collaborator:

nit: These could probably go in unit_tests/* rather than in integration tests
