Add numerical perturbation detector #2040

Kranium2002 · 2024-10-13T22:26:54Z

Description

This PR adds a numerical perturbation detector to the robustness detectors. It is still a work in progress.

The detector perturbs numerical values by 1% and then checks for errors in classification mode. For regression, it flags any prediction that is more than 5% away from the real value. These percentages could be made user-adjustable, but for now they are fixed. Review needed on this.
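The failure criteria described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the function names are hypothetical, and it assumes the "real value" in regression means the prediction on the unperturbed input.

```python
import numpy as np

PERTURBATION_FRACTION = 0.01  # inputs are scaled by +/-1% (fixed in this PR)
REGRESSION_TOLERANCE = 0.05   # regression: flag relative changes above 5%


def classification_fail_rate(original_pred, perturbed_pred):
    """Fraction of samples whose predicted label changed after perturbation."""
    original_pred = np.asarray(original_pred)
    perturbed_pred = np.asarray(perturbed_pred)
    return float(np.mean(original_pred != perturbed_pred))


def regression_fail_rate(original_pred, perturbed_pred, tolerance=REGRESSION_TOLERANCE):
    """Fraction of samples whose prediction moved by more than `tolerance` (relative)."""
    original_pred = np.asarray(original_pred, dtype=float)
    perturbed_pred = np.asarray(perturbed_pred, dtype=float)
    # Guard against division by zero when the original prediction is 0.
    scale = np.maximum(np.abs(original_pred), np.finfo(float).eps)
    relative_change = np.abs(perturbed_pred - original_pred) / scale
    return float(np.mean(relative_change > tolerance))


# One of four labels flips; one of two regression outputs moves by 10%.
clf_rate = classification_fail_rate([0, 1, 1, 0], [0, 1, 0, 0])
reg_rate = regression_fail_rate([100.0, 200.0], [104.0, 220.0])
```

Exposing `tolerance` as a parameter is one way the fixed percentages could later be made configurable.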

Related Issue

closes #1846

Type of Change

📚 Examples / docs / tutorials / dependencies update
🔧 Bug fix (non-breaking change which fixes an issue)
🥂 Improvement (non-breaking change which improves an existing feature)
🚀 New feature (non-breaking change which adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to change)
🔐 Security fix

Checklist

I've read the CODE_OF_CONDUCT.md document.
I've read the CONTRIBUTING.md guide.
I've written tests for all new methods and classes that I created.
I've written the docstring in Google format for all the methods and classes that I used.
I've updated the pdm.lock running pdm update-lock (only applicable when pyproject.toml has been modified)

Kranium2002 · 2024-10-17T14:39:00Z

@kevinmessiaen Please review

alexcombessie · 2024-10-31T08:58:41Z

Hey @henchaves - Would you have some bandwidth to review this PR ?

Kranium2002 · 2024-11-12T00:14:25Z

@henchaves Did you get a chance to look at this? Do I need to make any changes?

henchaves · 2024-11-14T19:51:01Z

Hello @Kranium2002,
A researcher will review your PR soon, to check whether anything is still needed in the algorithm logic.
After that, @kevinmessiaen or I will review the code itself. Sorry for the delay!

mattbit

Hi @Kranium2002, thanks so much for the contribution! I think numerical perturbation is a great addition to the library.

I've noted a few issues with the proposed implementation, in summary:

We should check for features not only based on their data type, but column type (sometimes we have categorical features represented as numbers, but we don't want to apply the numerical perturbation to those)
The perturbation functions used should be defined as instances of TransformationFunction (the same we use for text perturbation or other transformations)
We need to be careful not to alter the underlying data type. Certain transformations should be applied depending on whether the feature has integer or float values (e.g. we don't want to add Gaussian noise to integers)

I also feel there is a lot of redundancy between the BaseNumericalPerturbationDetector class and BaseTextPerturbationDetector. It's probably worth refactoring this to have a common base class covering the shared behavior instead of duplicating code; @henchaves can help with that ;)

mattbit · 2024-11-15T09:21:15Z

giskard/scanner/robustness/base_numerical_detector.py

+
+    def run(self, model: BaseModel, dataset: Dataset, features: Sequence[str]) -> Sequence[Issue]:
+        """Run the numerical perturbation detector."""
+        numerical_features = [f for f in features if pd.api.types.is_numeric_dtype(dataset.df[f])]


This check is not enough, because sometimes categorical features are represented with numerical types (e.g. integers). The more reliable way would be to extract this from the dataset.column_types.

Suggested change

numerical_features = [f for f in features if pd.api.types.is_numeric_dtype(dataset.df[f])]

numerical_features = [f for f in features if dataset.column_types[f] == "numeric"]
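To illustrate why the dtype check alone is not enough, here is a small pandas-only sketch. The `column_types` dict is a hypothetical stand-in for `dataset.column_types`, and the column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [22.0, 35.0, 58.0],          # genuinely numeric
        "zip_code": [75001, 10115, 94103],  # categorical, stored as integers
    }
)

# Hypothetical stand-in for `dataset.column_types`.
column_types = {"age": "numeric", "zip_code": "category"}
features = ["age", "zip_code"]

# The dtype check also selects the integer-encoded categorical column.
by_dtype = [f for f in features if pd.api.types.is_numeric_dtype(df[f])]

# The declared column type keeps only the truly numeric feature.
by_column_type = [f for f in features if column_types[f] == "numeric"]
```

Here `by_dtype` contains both columns, while `by_column_type` contains only `age`.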

mattbit · 2024-11-15T09:36:33Z

giskard/scanner/robustness/base_numerical_detector.py

+                    "failed_size": failed_size,
+                    "slice_size": slice_size,
+                    "threshold": threshold,
+                    "output_sensitivity": output_sensitivity,


Add the params metric and metric_value (these are used in standardized export, e.g. AVID report).

Suggested change

"output_sensitivity": output_sensitivity,

"output_sensitivity": output_sensitivity,

"metric": "Fail rate",

"metric_value": fail_rate,

mattbit · 2024-11-15T09:37:31Z

giskard/scanner/robustness/base_numerical_detector.py

+                features=[feature],
+                meta={
+                    "feature": feature,
+                    "perturbation_fraction": self.perturbation_fraction,


Add domain and deviation (these are used for visualization in the scan widget).
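For illustration, the two fields could be assembled as in the sketch below. The helper name `build_issue_meta` is hypothetical; only the key names and the message format come from this review.

```python
def build_issue_meta(feature: str, failed_size: int, slice_size: int) -> dict:
    """Build the display-related metadata for one perturbed feature."""
    fail_rate = failed_size / slice_size
    return {
        "feature": feature,
        # Shown in the scan widget as the scope of the issue.
        "domain": f"Feature `{feature}`",
        # Human-readable summary of how many predictions changed.
        "deviation": (
            f"{failed_size}/{slice_size} tested samples "
            f"({round(fail_rate * 100, 2)}%) changed prediction after perturbation"
        ),
    }


meta = build_issue_meta("age", failed_size=25, slice_size=100)
```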

Suggested change

"perturbation_fraction": self.perturbation_fraction,

"perturbation_fraction": self.perturbation_fraction,

"domain": f"Feature `{feature}`",

"deviation": f"{failed_size}/{slice_size} tested samples ({round(fail_rate * 100, 2)}%) changed prediction after perturbation",

mattbit · 2024-11-15T09:39:09Z

giskard/scanner/robustness/base_numerical_detector.py

+                group=self._issue_group,
+                level=issue_level,
+                description=desc,
+                features=[feature],


Add transformation_fn to indicate which transformation (i.e. the perturbation) was performed.

Suggested change

features=[feature],

features=[feature],

transformation_fn="Numerical perturbation", # TODO: define proper transformation functions

mattbit · 2024-11-15T09:43:45Z

giskard/scanner/robustness/numerical_perturbation_detector.py

+            lambda x: x * 1.01,
+            lambda x: x * 0.99,
+            lambda x: x + np.random.normal(0, 0.01, x.shape),


These should be proper transformation functions (extending TransformationFunction).

IMPORTANT: all of these perturbations assume that the numerical values are floats. This is not always the case; integer features are common, and when such data types are present we must not alter them. Applying the transformations above (for example Gaussian noise) to an integer feature like num_passengers would silently convert the data type to float, potentially breaking the model inference.
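The dtype issue can be reproduced with plain pandas and NumPy. The sketch below also shows one possible dtype-preserving perturbation; `perturb_preserving_dtype` is a hypothetical helper, and stepping integers by at least one unit is an assumption, not a decision made in this PR.

```python
import numpy as np
import pandas as pd


def perturb_preserving_dtype(values: pd.Series, fraction: float = 0.01) -> pd.Series:
    """Scale values by (1 + fraction) while keeping the original dtype."""
    if pd.api.types.is_integer_dtype(values):
        # Move each integer by at least one unit in the direction of the
        # perturbation, then cast back so the column stays integer-typed.
        step = np.maximum(np.ceil(np.abs(values) * abs(fraction)), 1)
        return (values + np.sign(fraction) * step).astype(values.dtype)
    return (values * (1 + fraction)).astype(values.dtype)


num_passengers = pd.Series([1, 4, 150], dtype="int64")

# Naive Gaussian noise silently turns the integer column into floats.
rng = np.random.default_rng(0)
noisy = num_passengers + rng.normal(0, 0.01, num_passengers.shape)

# The dtype-aware version keeps int64.
perturbed = perturb_preserving_dtype(num_passengers, fraction=0.01)
```

In a real implementation this logic would live inside a proper transformation function, as requested above; the standalone helper is only meant to show the dtype handling.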

chore(add): Base class for numerical perturbation detector

6c590d6

Kranium2002 mentioned this pull request Oct 13, 2024

Scan: Add a robustness detector to the scan that perturbs numerical values #1846

Open

Kranium2002 added 4 commits October 17, 2024 16:34

fix: minor issues with base class

4e21ed9

add: detector file with default values

f51e560

add: mock models for test

13d543f

add: tests for numerical perturbation detector

fe0e2b0

henchaves assigned kevinmessiaen Oct 18, 2024

henchaves requested a review from kevinmessiaen October 18, 2024 11:47

henchaves and others added 2 commits October 18, 2024 08:47

Merge branch 'main' into main

a44bca4

Merge branch 'main' into main

d1b240b

alexcombessie requested a review from henchaves October 31, 2024 08:58

henchaves added 2 commits October 31, 2024 10:17

Merge branch 'main' into main

f9cceda

Format files

ae9323c

henchaves added and removed the Lockfile label (Temporary label to update pdm.lock) Oct 31, 2024

Merge branch 'main' into main

dd04ec7

henchaves requested a review from mattbit November 14, 2024 13:14

Merge branch 'main' into main

ce3f39f

mattbit requested changes Nov 15, 2024


henchaves self-assigned this Nov 18, 2024
