
Get Accuracy Metrics reported #63

Open
parfeniukink opened this issue Oct 28, 2024 · 0 comments · May be fixed by #64
@parfeniukink
Contributor

Eval Harness enablement - to be scoped and broken up further.

The goal of this ticket is to leverage the LLM Eval Harness code to tie eval benchmarking into the performance benchmarking that GuideLLM already does.

Eval benchmarking on public/private datasets can generally take a long time, so we will have the user specify how long they want a benchmark to run. We can't ask for a specific amount of time; instead the options would be short, medium, or long, and each benchmark can be a task. A rough sketch of how these tiers could work is below.
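
A minimal sketch of the tier idea, assuming the user picks short/medium/long and each eval benchmark is a task run against a pre-built subset. The tier names, budget values, `EvalTask`, `run_with_budget`, and the `score_example` callback are all illustrative placeholders, not existing GuideLLM or eval-harness APIs:

```python
import time
from dataclasses import dataclass

# Illustrative time budgets per tier (seconds); real values to be decided during scoping.
TIER_BUDGETS_S = {"short": 5 * 60, "medium": 30 * 60, "long": 2 * 60 * 60}

@dataclass
class EvalTask:
    name: str
    examples: list[dict]  # a pre-built representative subset (see below)

def run_with_budget(task: EvalTask, tier: str, score_example) -> dict:
    """Run examples from the task until the tier's time budget is spent."""
    budget_s = TIER_BUDGETS_S[tier]
    start = time.monotonic()
    correct = total = 0
    for example in task.examples:
        if time.monotonic() - start > budget_s:
            break
        correct += int(score_example(example))  # score_example returns True/False per example
        total += 1
    return {"task": task.name, "accuracy": correct / max(total, 1), "examples_run": total}
```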

The main challenge is developing subsets that are representative of the massive original datasets so the benchmarks stay accurate. Because of this, the first task, to be done by the research team, is to split these larger benchmark datasets into smaller, benchmarkable subsets so evals can run in a matter of minutes rather than hours. A sketch of one possible subsetting approach follows.
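
One possible approach, sketched here only for illustration: stratified random sampling so the small subset keeps the original dataset's category mix. The `category` field, `stratified_subset` name, and target sizes are assumptions; the research team would decide how representativeness is actually defined:

```python
import random
from collections import defaultdict

def stratified_subset(examples: list[dict], target_size: int,
                      key: str = "category", seed: int = 0) -> list[dict]:
    """Sample proportionally from each stratum so the subset preserves the
    original category distribution while staying small enough to eval in minutes."""
    rng = random.Random(seed)
    strata: dict[str, list[dict]] = defaultdict(list)
    for ex in examples:
        strata[str(ex.get(key, "unknown"))].append(ex)
    subset: list[dict] = []
    for group in strata.values():
        # Each stratum contributes at least one example, proportional to its share.
        share = max(1, round(target_size * len(group) / len(examples)))
        subset.extend(rng.sample(group, min(share, len(group))))
    rng.shuffle(subset)
    return subset[:target_size]
```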

Mark to lay out what we need in order to extend the backend.

@parfeniukink parfeniukink self-assigned this Oct 28, 2024
@parfeniukink parfeniukink linked a pull request Oct 28, 2024 that will close this issue