
Refactor measurements evaluations #351

Merged

Conversation

olichtne

Description

This implements some much-needed refactoring of MeasurementResults and the BaselineEvaluator, reducing and unifying the relevant code.

Tests

Tested locally, with some custom changes to apply the "current result -> baseline result" and "threshold = 1" code adjustments, using the following recipe:

#!/usr/bin/env python3

from lnst.Recipes.ENRT.SimpleNetworkRecipe import SimpleNetworkRecipe
from lnst.Controller.ContainerPoolManager import ContainerPoolManager
from lnst.Controller.MachineMapper import ContainerMapper
from lnst.Controller import Controller
from lnst.Controller.RunSummaryFormatters import HumanReadableRunSummaryFormatter
from lnst.Controller.RecipeResults import ResultLevel
import logging

from lnst.RecipeCommon.Perf.Evaluators.BaselineCPUAverageEvaluator import BaselineCPUAverageEvaluator
from lnst.RecipeCommon.Perf.Evaluators.BaselineEvaluator import BaselineEvaluator

# Subclass the upstream recipe to append the evaluators under test
class SimpleNetworkRecipe(SimpleNetworkRecipe):
    @property
    def net_perf_evaluators(self):
        # keep the parent's evaluators and add the generic baseline evaluator
        parent = list(super().net_perf_evaluators)
        parent.append(BaselineEvaluator())
        return parent

    @property
    def cpu_perf_evaluators(self):
        # restrict baseline CPU evaluation to the aggregated 'cpu' entry on each host
        parent = list(super().cpu_perf_evaluators)
        parent.append(
            BaselineCPUAverageEvaluator(
                evaluation_filter={
                    "host1": ["cpu"],
                    "host2": ["cpu"],
                }
            )
        )
        return parent

recipe = SimpleNetworkRecipe(
    perf_tests=['tcp_stream'],
    perf_duration=5,
    perf_iterations=2,
    perf_msg_sizes=[1400],
    ip_versions=['ipv4'],
    dev_intr_cpu=[0],
    perf_tool_cpu=[0],
    ping_count=1,
    offload_combinations=[
        {'gro': 'on'},
    ],
    do_linuxperf_measurement=False
)

ctl = Controller(
    debug=True,
    poolMgr=ContainerPoolManager,
    mapper=ContainerMapper,
    podman_uri="unix:///run/podman/podman.sock",
    image="lnst"
)
try:
    ctl.run(recipe)
except Exception as e:
    # log the failure but continue so the summary below is still printed
    print(e)

summary_fmt = HumanReadableRunSummaryFormatter(level=ResultLevel.IMPORTANT)
for run in recipe.runs:
    logging.info(summary_fmt.format_run(run))

After running this, the evaluation result descriptions look like this:

    PASS 53_TestResult:
        Baseline evaluation of
        host host1 cpu 'cpu' utilization: 130.86 +-16.58 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
    PASS 54_TestResult:
        Baseline evaluation of
        host host2 cpu 'cpu' utilization: 130.93 +-16.77 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
    PASS 55_TestResult:
        Nonzero evaluation of flow:
        Flow(
            type=tcp_stream,
            generator=Host(machine_id=host1),
            generator_bind=192.168.101.1,
            generator_nic=Device(machine=host1, id=eth0, name=eth1, ifindex=3),
            generator_port=12000,
            receiver=Host(machine_id=host2),
            receiver_bind=192.168.101.2,
            receiver_nic=Device(machine=host2, id=eth0, name=eth1, ifindex=3),
            receiver_port=12000,
            msg_size=1400,
            duration=5,
            parallel_streams=1,
            generator_cpupin=[0],
            receiver_cpupin=[0],
        aggregated_flow=False,
            warmup_duration=0,
        )
        PASS: generator_results reported non-zero throughput
        PASS: receiver_results reported non-zero throughput
    PASS 56_TestResult:
        Baseline evaluation of
        Flow(
            type=tcp_stream,
            generator=Host(machine_id=host1),
            generator_bind=192.168.101.1,
            generator_nic=Device(machine=host1, id=eth0, name=eth1, ifindex=3),
            generator_port=12000,
            receiver=Host(machine_id=host2),
            receiver_bind=192.168.101.2,
            receiver_nic=Device(machine=host2, id=eth0, name=eth1, ifindex=3),
            receiver_port=12000,
            msg_size=1400,
            duration=5,
            parallel_streams=1,
            generator_cpupin=[0],
            receiver_cpupin=[0],
        aggregated_flow=False,
            warmup_duration=0,
        )
        Generator measured throughput: 1221659418.30 +-59245322.92(4.85%) bits per second.
        Generator process CPU data: 44.93 +-0.03 cpu_percent per second.
        Receiver measured throughput: 1213241304.82 +-58536978.08(4.82%) bits per second.
        Receiver process CPU data: 24.82 +-0.23 cpu_percent per second.
        New generator_results average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New generator_cpu_stats average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New receiver_results average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New receiver_cpu_stats average is 0.00% higher from the baseline. Allowed difference: 1.0%

Additionally, if you remove the "cpu evaluation filters", this is what we get:

    PASS 53_TestResult:
        Baseline evaluation of
        host host1 cpu 'cpu' utilization: 193.62 +-8.60 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu0' utilization: 90.81 +-4.04 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu1' utilization: 12.30 +-0.35 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu2' utilization: 13.36 +-1.68 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu3' utilization: 17.52 +-1.86 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu4' utilization: 14.69 +-1.02 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu5' utilization: 13.88 +-1.66 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu6' utilization: 13.26 +-0.42 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu7' utilization: 17.86 +-2.49 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
    PASS 54_TestResult:
        Baseline evaluation of
        host host2 cpu 'cpu' utilization: 194.44 +-8.06 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu0' utilization: 90.59 +-3.71 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu1' utilization: 12.67 +-0.27 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu2' utilization: 13.48 +-1.68 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu3' utilization: 17.81 +-1.82 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu4' utilization: 14.77 +-0.92 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu5' utilization: 14.06 +-1.61 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu6' utilization: 13.28 +-0.34 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu7' utilization: 17.85 +-2.42 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%

So the output is an interleaving of "result.description" and "comparison.description"...
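
For illustration, a minimal sketch of how such an interleaved description could be assembled (the `pairs` structure and the function are hypothetical, not LNST's actual API):

    def interleaved_description(pairs):
        # pairs: (result, comparison) tuples, each exposing a .description
        # string as referenced above; the names here are illustrative only
        lines = ["Baseline evaluation of"]
        for result, comparison in pairs:
            lines.append(result.description)
            lines.append(comparison.description)
        return "\n".join(lines)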

@olichtne olichtne force-pushed the refactor-measurements-evaluations branch 2 times, most recently from 31b8f87 to 38ac87a on November 27, 2023 14:21
jtluka previously approved these changes Nov 28, 2023

@jtluka left a comment

Code-wise, this looks good.

jtluka commented Nov 28, 2023

I have just one question about the output.

I see that there is:

        host host1 cpu 'cpu' utilization: 130.86 +-16.58 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%

and

        Generator measured throughput: 1221659418.30 +-59245322.92(4.85%) bits per second.
        Generator process CPU data: 44.93 +-0.03 cpu_percent per second.
        Receiver measured throughput: 1213241304.82 +-58536978.08(4.82%) bits per second.
        Receiver process CPU data: 24.82 +-0.23 cpu_percent per second.
        New generator_results average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New generator_cpu_stats average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New receiver_results average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New receiver_cpu_stats average is 0.00% higher from the baseline. Allowed difference: 1.0%

I'm wondering if we could change the output so that the individual metric names are the same in both the metric measurement description and the metric evaluation description, e.g.:

        Generator measured throughput (generator_results): 1221659418.30 +-59245322.92(4.85%) bits per second.
        Generator process CPU data (generator_cpu_stats): 44.93 +-0.03 cpu_percent per second.
        Receiver measured throughput (receiver_results): 1213241304.82 +-58536978.08(4.82%) bits per second.
        Receiver process CPU data (receiver_cpu_stats): 24.82 +-0.23 cpu_percent per second.
        New generator_results average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New generator_cpu_stats average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New receiver_results average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New receiver_cpu_stats average is 0.00% higher from the baseline. Allowed difference: 1.0%

or ideally (I'm aware that this would require a metric-name to human-readable-name translation):

        Generator measured throughput (generator_results): 1221659418.30 +-59245322.92(4.85%) bits per second.
        ...
        Generator measured throughput average is 0.00% higher from the baseline. Allowed difference: 1.0%

It is probably out of scope for this MR and could be done separately.
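
The simpler variant could be a plain lookup table; a minimal sketch, assuming a hypothetical METRIC_LABELS mapping (the labels are taken from the output above; the function itself is not part of LNST):

    METRIC_LABELS = {
        "generator_results": "Generator measured throughput",
        "generator_cpu_stats": "Generator process CPU data",
        "receiver_results": "Receiver measured throughput",
        "receiver_cpu_stats": "Receiver process CPU data",
    }

    def humanize_metric(name: str) -> str:
        # fall back to the raw metric name when no translation is known
        return METRIC_LABELS.get(name, name)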

olichtne commented Dec 1, 2023

    Generator measured throughput (generator_results): 1221659418.30 +-59245322.92(4.85%) bits per second.

This should be simple, so I'll do at least that in this PR.

@olichtne olichtne force-pushed the refactor-measurements-evaluations branch 2 times, most recently from 41d45d8 to 5567455 on December 7, 2023 13:58
@olichtne olichtne force-pushed the refactor-measurements-evaluations branch from 5567455 to ad36ad5 on January 4, 2024 13:18
@olichtne olichtne force-pushed the refactor-measurements-evaluations branch from cf97742 to d274332 on February 1, 2024 08:58
jtluka previously approved these changes Feb 1, 2024

@jtluka left a comment

Looks good.


olichtne commented Feb 6, 2024

Added one more commit that makes the OvSDPDKPvPRecipe and VhostNetPvPRecipe use hostids consistent with the rest of the ENRT recipes.

Tested the OvSDPDKPvPRecipe in J:8895512; we don't currently run the VhostNetPvPRecipe, so I didn't test that...

@olichtne olichtne force-pushed the refactor-measurements-evaluations branch from 3a79273 to 7646480 on February 8, 2024 10:46
This will be helpful for automation and easier implementation of classes
that use the results - we can now write generic code that evaluates ANY
result class by simply iterating the metrics to evaluate.

Signed-off-by: Ondrej Lichtner <[email protected]>
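
In spirit, the new `metrics` property enables generic evaluation loops along these lines (a minimal sketch; the attribute names and exact LNST signatures are assumptions):

    def compare_to_baseline(current_result, baseline_result, threshold=1.0):
        # iterate the metric names exported by the result class instead of
        # hard-coding per-measurement attributes
        comparisons = []
        for metric in current_result.metrics:
            current = getattr(current_result, metric).average
            baseline = getattr(baseline_result, metric).average
            difference = (current / baseline - 1) * 100
            comparisons.append((metric, difference, abs(difference) <= threshold))
        return comparisons
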
…sults class

This is a better place for the generic flow description generation, and
is more consistent with the cpu measurement and cpu measurement results
as well.

Signed-off-by: Ondrej Lichtner <[email protected]>
Now that MeasurementResults export a `metrics` property, Baseline
evaluation can be refactored and unified to work with almost any
MeasurementResult in a much simpler manner.

This commit does just that; additionally, there are two related changes:
* MetricComparison class is extended to contain more information so that
  we can unify it and remove the use of a "comparison dictionary"
* BaselineEvaluationResult class is added to store all of the grouped
  MetricComparisons and to generate a nice description for them.

This does change the Result description format, as it was previously more
customized to the individual MeasurementResults, but IMO this is actually
more accurate, although maybe a little less nice to look at.

Finally, this gives us a lot more power to machine post-process the
baseline evaluations, so I've also removed the "good direction ->
improvement warning" override, which IMO should be done as a
post-processing step and not directly as part of the evaluation
method.

The unification of the BaselineEvaluator means that we can completely
remove:
* BaselineFlowAverageEvaluator
* BaselineRDMABandwidthAverageEvaluator
* BaselineTcRunAverageEvaluator

We need to keep the BaselineCPUAverageEvaluator, as it still overrides
some methods w.r.t. how filtering and result grouping work for CPU
Measurements.

Signed-off-by: Ondrej Lichtner <[email protected]>
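
As a rough illustration, the extended MetricComparison could carry roughly this kind of information (the field names here are assumptions, not the actual LNST definition):

    from dataclasses import dataclass

    @dataclass
    class MetricComparison:
        measurement_type: str
        metric_name: str
        current_value: float
        baseline_value: float
        difference: float        # percent difference from the baseline
        threshold: float         # allowed difference in percent
        comparison_result: bool  # True when within the allowed difference
        description: str         # human-readable line for the summary
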
This parameter logically doesn't fit into the generic upstream
BaseEnrtRecipe class.

The reason is that from the upstream point of view it doesn't actually
do anything - it only recognizes the "none" and "nonzero" values,
which are special use cases, and the implementation was a bit clunky.

We could refactor this to make more sense upstream and to be
extensible - I think this should be an extensible Mixin class that
handles evaluator registration for ENRT perf measurements...

But I think it doesn't have a valid use case in upstream yet so I'm
removing it for now. We may want to revisit this later.

Signed-off-by: Ondrej Lichtner <[email protected]>
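
One possible shape for the Mixin mentioned above, purely as a sketch of the idea (the class and attribute names are hypothetical, not existing LNST code):

    class EvaluatorRegistrationMixin:
        """Lets recipe subclasses register extra perf evaluators declaratively."""

        extra_net_perf_evaluators: list = []

        @property
        def net_perf_evaluators(self):
            # merge the parent's evaluators with any registered extras
            return list(super().net_perf_evaluators) + list(self.extra_net_perf_evaluators)
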
These two tests are using hostids that are different than every other
ENRT recipe for no reason.

Making them consistent should make the data migration that we need to do
for our measurement database easier.

Signed-off-by: Ondrej Lichtner <[email protected]>
This function was written back in 2015, when we couldn't use Python 3
yet; the statistics module was only added in 3.4.

At the same time, the formula used is "incorrect": it calculates the
POPULATION standard deviation... whereas we really should be using the
SAMPLE standard deviation formula.

Signed-off-by: Ondrej Lichtner <[email protected]>
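
The distinction is directly visible in the standard library: statistics.pstdev divides by N (population), while statistics.stdev divides by N - 1 (sample). The sample values below are illustrative only:

    import statistics

    samples = [130.86, 147.44, 114.28]  # e.g. repeated utilization measurements
    print(statistics.pstdev(samples))   # population std dev: ~13.54
    print(statistics.stdev(samples))    # sample std dev: ~16.58
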
@olichtne olichtne force-pushed the refactor-measurements-evaluations branch from 7646480 to 3d07e4e on February 12, 2024 11:48
@olichtne olichtne merged commit 35f1057 into LNST-project:master Feb 12, 2024
3 checks passed
@olichtne olichtne deleted the refactor-measurements-evaluations branch February 12, 2024 14:53