
Refactor measurements evaluations #351

Merged

Conversation

olichtne

Description

This implements some much-needed refactoring of MeasurementResults and the BaselineEvaluator, reducing and unifying the relevant code.

Tests

Tested locally, with some custom changes to apply the "current result -> baseline result" and "threshold = 1" code adjustments, using the following recipe:

#!/usr/bin/env python3

from lnst.Recipes.ENRT.SimpleNetworkRecipe import SimpleNetworkRecipe
from lnst.Controller.ContainerPoolManager import ContainerPoolManager
from lnst.Controller.MachineMapper import ContainerMapper
from lnst.Controller import Controller
from lnst.Controller.RunSummaryFormatters import HumanReadableRunSummaryFormatter
from lnst.Controller.RecipeResults import ResultLevel
import logging

from lnst.RecipeCommon.Perf.Evaluators.BaselineCPUAverageEvaluator import BaselineCPUAverageEvaluator
from lnst.RecipeCommon.Perf.Evaluators.BaselineEvaluator import BaselineEvaluator

# Subclass the upstream recipe to append the evaluators under test
class SimpleNetworkRecipe(SimpleNetworkRecipe):
    @property
    def net_perf_evaluators(self):
        # keep the parent's evaluators and add the generic baseline evaluator
        parent = list(super().net_perf_evaluators)
        parent.append(BaselineEvaluator())
        return parent

    @property
    def cpu_perf_evaluators(self):
        # restrict baseline CPU evaluation to the aggregated 'cpu' entry on each host
        parent = list(super().cpu_perf_evaluators)
        parent.append(
            BaselineCPUAverageEvaluator(
                evaluation_filter={
                    "host1": ["cpu"],
                    "host2": ["cpu"],
                }
            )
        )
        return parent

recipe = SimpleNetworkRecipe(
    perf_tests=['tcp_stream'],
    perf_duration=5,
    perf_iterations=2,
    perf_msg_sizes=[1400],
    ip_versions=['ipv4'],
    dev_intr_cpu=[0],
    perf_tool_cpu=[0],
    ping_count=1,
    offload_combinations=[
        {'gro': 'on'},
    ],
    do_linuxperf_measurement=False
)

ctl = Controller(
    debug=True,
    poolMgr=ContainerPoolManager,
    mapper=ContainerMapper,
    podman_uri="unix:///run/podman/podman.sock",
    image="lnst"
)
try:
    ctl.run(recipe)
except Exception as e:
    # log the failure but continue so the summary below is still printed
    print(e)

summary_fmt = HumanReadableRunSummaryFormatter(level=ResultLevel.IMPORTANT)
for run in recipe.runs:
    logging.info(summary_fmt.format_run(run))

After running this, the evaluation result descriptions look like this:

    PASS 53_TestResult:
        Baseline evaluation of
        host host1 cpu 'cpu' utilization: 130.86 +-16.58 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
    PASS 54_TestResult:
        Baseline evaluation of
        host host2 cpu 'cpu' utilization: 130.93 +-16.77 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
    PASS 55_TestResult:
        Nonzero evaluation of flow:
        Flow(
            type=tcp_stream,
            generator=Host(machine_id=host1),
            generator_bind=192.168.101.1,
            generator_nic=Device(machine=host1, id=eth0, name=eth1, ifindex=3),
            generator_port=12000,
            receiver=Host(machine_id=host2),
            receiver_bind=192.168.101.2,
            receiver_nic=Device(machine=host2, id=eth0, name=eth1, ifindex=3),
            receiver_port=12000,
            msg_size=1400,
            duration=5,
            parallel_streams=1,
            generator_cpupin=[0],
            receiver_cpupin=[0],
        aggregated_flow=False,
            warmup_duration=0,
        )
        PASS: generator_results reported non-zero throughput
        PASS: receiver_results reported non-zero throughput
    PASS 56_TestResult:
        Baseline evaluation of
        Flow(
            type=tcp_stream,
            generator=Host(machine_id=host1),
            generator_bind=192.168.101.1,
            generator_nic=Device(machine=host1, id=eth0, name=eth1, ifindex=3),
            generator_port=12000,
            receiver=Host(machine_id=host2),
            receiver_bind=192.168.101.2,
            receiver_nic=Device(machine=host2, id=eth0, name=eth1, ifindex=3),
            receiver_port=12000,
            msg_size=1400,
            duration=5,
            parallel_streams=1,
            generator_cpupin=[0],
            receiver_cpupin=[0],
        aggregated_flow=False,
            warmup_duration=0,
        )
        Generator measured throughput: 1221659418.30 +-59245322.92(4.85%) bits per second.
        Generator process CPU data: 44.93 +-0.03 cpu_percent per second.
        Receiver measured throughput: 1213241304.82 +-58536978.08(4.82%) bits per second.
        Receiver process CPU data: 24.82 +-0.23 cpu_percent per second.
        New generator_results average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New generator_cpu_stats average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New receiver_results average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New receiver_cpu_stats average is 0.00% higher from the baseline. Allowed difference: 1.0%

Additionally, if you remove the "cpu evaluation filters", this is what we get:

    PASS 53_TestResult:
        Baseline evaluation of
        host host1 cpu 'cpu' utilization: 193.62 +-8.60 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu0' utilization: 90.81 +-4.04 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu1' utilization: 12.30 +-0.35 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu2' utilization: 13.36 +-1.68 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu3' utilization: 17.52 +-1.86 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu4' utilization: 14.69 +-1.02 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu5' utilization: 13.88 +-1.66 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu6' utilization: 13.26 +-0.42 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host1 cpu 'cpu7' utilization: 17.86 +-2.49 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
    PASS 54_TestResult:
        Baseline evaluation of
        host host2 cpu 'cpu' utilization: 194.44 +-8.06 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu0' utilization: 90.59 +-3.71 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu1' utilization: 12.67 +-0.27 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu2' utilization: 13.48 +-1.68 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu3' utilization: 17.81 +-1.82 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu4' utilization: 14.77 +-0.92 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu5' utilization: 14.06 +-1.61 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu6' utilization: 13.28 +-0.34 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%
        host host2 cpu 'cpu7' utilization: 17.85 +-2.42 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%

So the output is an interleaving of "result.description" and "comparison.description"...
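
For illustration, a minimal sketch of how such an interleaved description could be assembled (the `pairs` structure and the function are hypothetical, not LNST's actual API):

    def interleaved_description(pairs):
        # pairs: (result, comparison) tuples, each exposing a .description
        # string as referenced above; the names here are illustrative only
        lines = ["Baseline evaluation of"]
        for result, comparison in pairs:
            lines.append(result.description)
            lines.append(comparison.description)
        return "\n".join(lines)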

@olichtne olichtne force-pushed the refactor-measurements-evaluations branch 2 times, most recently from 31b8f87 to 38ac87a on November 27, 2023 14:21
jtluka previously approved these changes Nov 28, 2023

@jtluka left a comment

Code-wise, this looks good.

jtluka commented Nov 28, 2023

I have just one question about the output.

I see that there is:

        host host1 cpu 'cpu' utilization: 130.86 +-16.58 time units per second
        New utilization average is 0.00% higher from the baseline. Allowed difference: 1.0%

and

        Generator measured throughput: 1221659418.30 +-59245322.92(4.85%) bits per second.
        Generator process CPU data: 44.93 +-0.03 cpu_percent per second.
        Receiver measured throughput: 1213241304.82 +-58536978.08(4.82%) bits per second.
        Receiver process CPU data: 24.82 +-0.23 cpu_percent per second.
        New generator_results average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New generator_cpu_stats average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New receiver_results average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New receiver_cpu_stats average is 0.00% higher from the baseline. Allowed difference: 1.0%

I'm wondering if we could change the output so that the individual metric names are the same in both the metric measurement description and the metric evaluation description, e.g.:

        Generator measured throughput (generator_results): 1221659418.30 +-59245322.92(4.85%) bits per second.
        Generator process CPU data (generator_cpu_stats): 44.93 +-0.03 cpu_percent per second.
        Receiver measured throughput (receiver_results): 1213241304.82 +-58536978.08(4.82%) bits per second.
        Receiver process CPU data (receiver_cpu_stats): 24.82 +-0.23 cpu_percent per second.
        New generator_results average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New generator_cpu_stats average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New receiver_results average is 0.00% higher from the baseline. Allowed difference: 1.0%
        New receiver_cpu_stats average is 0.00% higher from the baseline. Allowed difference: 1.0%

or ideally (I'm aware that this would require a metric-name to human-readable-name translation):

        Generator measured throughput (generator_results): 1221659418.30 +-59245322.92(4.85%) bits per second.
        ...
        Generator measured throughput average is 0.00% higher from the baseline. Allowed difference: 1.0%

It is probably out of scope for this MR and could be done separately.
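
The simpler variant could be a plain lookup table; a minimal sketch, assuming a hypothetical METRIC_LABELS mapping (the labels are taken from the output above; the function itself is not part of LNST):

    METRIC_LABELS = {
        "generator_results": "Generator measured throughput",
        "generator_cpu_stats": "Generator process CPU data",
        "receiver_results": "Receiver measured throughput",
        "receiver_cpu_stats": "Receiver process CPU data",
    }

    def humanize_metric(name: str) -> str:
        # fall back to the raw metric name when no translation is known
        return METRIC_LABELS.get(name, name)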

olichtne commented Dec 1, 2023

    Generator measured throughput (generator_results): 1221659418.30 +-59245322.92(4.85%) bits per second.

This should be simple, so I'll do at least that in this PR.

@olichtne olichtne force-pushed the refactor-measurements-evaluations branch 2 times, most recently from 41d45d8 to 5567455 on December 7, 2023 13:58
@olichtne olichtne force-pushed the refactor-measurements-evaluations branch from 5567455 to ad36ad5 on January 4, 2024 13:18
@olichtne olichtne force-pushed the refactor-measurements-evaluations branch from cf97742 to d274332 on February 1, 2024 08:58
jtluka previously approved these changes Feb 1, 2024

@jtluka left a comment

Looks good.


olichtne commented Feb 6, 2024

Added one more commit that makes the OvSDPDKPvPRecipe and VhostNetPvPRecipe use hostids consistent with the rest of the ENRT recipes.

Tested the OvSDPDKPvPRecipe in J:8895512; we don't currently run the VhostNetPvPRecipe, so I didn't test that...

@olichtne olichtne force-pushed the refactor-measurements-evaluations branch from 3a79273 to 7646480 on February 8, 2024 10:46
This will be helpful for automation and easier implementation of classes
that use the results - we can now write generic code that evaluates ANY
result class by simply iterating the metrics to evaluate.

Signed-off-by: Ondrej Lichtner <[email protected]>
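
In spirit, the new `metrics` property enables generic evaluation loops along these lines (a minimal sketch; the attribute names and exact LNST signatures are assumptions):

    def compare_to_baseline(current_result, baseline_result, threshold=1.0):
        # iterate the metric names exported by the result class instead of
        # hard-coding per-measurement attributes
        comparisons = []
        for metric in current_result.metrics:
            current = getattr(current_result, metric).average
            baseline = getattr(baseline_result, metric).average
            difference = (current / baseline - 1) * 100
            comparisons.append((metric, difference, abs(difference) <= threshold))
        return comparisons
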
…sults class

This is a better place for the generic flow description generation, and
is more consistent with the cpu measurement and cpu measurement results
as well.

Signed-off-by: Ondrej Lichtner <[email protected]>
Now that MeasurementResults export a `metrics` property, Baseline
evaluation can be refactored and unified to work with almost any
MeasurementResult in a much simpler manner.

This commit does just that; additionally, there are two related changes:
* MetricComparison class is extended to contain more information so that
  we can unify it and remove the use of a "comparison dictionary"
* BaselineEvaluationResult class is added to store all of the grouped
  MetricComparisons and to generate a nice description for them.

This does change the Result description format, as it was previously more
customized to the individual MeasurementResults, but IMO this is actually
more accurate, although maybe a little less nice to look at.

Finally, this gives us a lot more power to machine post-process the
baseline evaluations, so I've also removed the "good direction ->
improvement warning" override, which IMO should be done as a
post-processing step and not directly as part of the evaluation
method.

The unification of the BaselineEvaluator means that we can completely
remove:
* BaselineFlowAverageEvaluator
* BaselineRDMABandwidthAverageEvaluator
* BaselineTcRunAverageEvaluator

We need to keep the BaselineCPUAverageEvaluator, as it still overrides
some methods w.r.t. how filtering and result grouping work for CPU
Measurements.

Signed-off-by: Ondrej Lichtner <[email protected]>
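
As a rough illustration, the extended MetricComparison could carry roughly this kind of information (the field names here are assumptions, not the actual LNST definition):

    from dataclasses import dataclass

    @dataclass
    class MetricComparison:
        measurement_type: str
        metric_name: str
        current_value: float
        baseline_value: float
        difference: float        # percent difference from the baseline
        threshold: float         # allowed difference in percent
        comparison_result: bool  # True when within the allowed difference
        description: str         # human-readable line for the summary
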
This parameter logically doesn't fit into the generic upstream
BaseEnrtRecipe class.

The reason is that from the upstream point of view it doesn't actually
do anything - it only recognizes the "none" and "nonzero" values,
which are special use cases, and the implementation was a bit clunky.

We could refactor this to make more sense upstream and to be
extensible - I think this should be an extensible Mixin class that
handles evaluator registration for ENRT perf measurements...

But I think it doesn't have a valid use case in upstream yet so I'm
removing it for now. We may want to revisit this later.

Signed-off-by: Ondrej Lichtner <[email protected]>
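
One possible shape for the Mixin mentioned above, purely as a sketch of the idea (the class and attribute names are hypothetical, not existing LNST code):

    class EvaluatorRegistrationMixin:
        """Lets recipe subclasses register extra perf evaluators declaratively."""

        extra_net_perf_evaluators: list = []

        @property
        def net_perf_evaluators(self):
            # merge the parent's evaluators with any registered extras
            return list(super().net_perf_evaluators) + list(self.extra_net_perf_evaluators)
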
These two tests are using hostids that are different than every other
ENRT recipe for no reason.

Making them consistent should make the data migration that we need to do
for our measurement database easier.

Signed-off-by: Ondrej Lichtner <[email protected]>
This function was written back in 2015, when we couldn't use Python 3
yet; the statistics module was only added in 3.4.

At the same time, the formula used is "incorrect": it calculates the
POPULATION standard deviation... whereas we really should be using the
SAMPLE standard deviation formula.

Signed-off-by: Ondrej Lichtner <[email protected]>
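
The distinction is directly visible in the standard library: statistics.pstdev divides by N (population), while statistics.stdev divides by N - 1 (sample). The sample values below are illustrative only:

    import statistics

    samples = [130.86, 147.44, 114.28]  # e.g. repeated utilization measurements
    print(statistics.pstdev(samples))   # population std dev: ~13.54
    print(statistics.stdev(samples))    # sample std dev: ~16.58
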
@olichtne olichtne force-pushed the refactor-measurements-evaluations branch from 7646480 to 3d07e4e on February 12, 2024 11:48
@olichtne olichtne merged commit 35f1057 into LNST-project:master Feb 12, 2024
3 checks passed
@olichtne olichtne deleted the refactor-measurements-evaluations branch February 12, 2024 14:53