Add numerical perturbation detector #2040

Open
wants to merge 11 commits into main
Conversation

@Kranium2002 (Contributor) commented Oct 13, 2024

Description

This PR adds a numerical perturbation detector to the robustness detectors. It is still a work in progress.

The detector perturbs numerical values by 1% and then checks for prediction changes in classification mode; in regression mode it flags any prediction that is more than 5% away from the real value. These percentages could be made user-adjustable, but for now they are fixed. Review needed on this.
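For clarity, an illustrative sketch of that pass/fail criterion follows; the function and parameter names are mine for illustration, not the PR's actual code.

```python
# Illustrative sketch only (hypothetical names, not the PR's code): a sample fails if
# its prediction changes after the 1% perturbation (classification), or if the new
# prediction is more than 5% away from the reference value (regression).
def prediction_failed(reference, perturbed_prediction, is_classification: bool,
                      regression_threshold: float = 0.05) -> bool:
    if is_classification:
        # Any change of the predicted label counts as a robustness failure.
        return perturbed_prediction != reference
    # Regression: a relative deviation above the 5% threshold is flagged.
    return abs(perturbed_prediction - reference) > regression_threshold * abs(reference)
```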

Related Issue

closes #1846

Type of Change

  • 📚 Examples / docs / tutorials / dependencies update
  • 🔧 Bug fix (non-breaking change which fixes an issue)
  • 🥂 Improvement (non-breaking change which improves an existing feature)
  • 🚀 New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to change)
  • 🔐 Security fix

Checklist

  • I've read the CODE_OF_CONDUCT.md document.
  • I've read the CONTRIBUTING.md guide.
  • I've written tests for all new methods and classes that I created.
  • I've written the docstring in Google format for all the methods and classes that I used.
  • I've updated the pdm.lock by running pdm update-lock (only applicable when pyproject.toml has been modified)

@Kranium2002 (Contributor, Author) commented:

@kevinmessiaen Please review

@alexcombessie (Member) commented:

Hey @henchaves - Would you have some bandwidth to review this PR?

@henchaves added and then removed the Lockfile label (temporary label to update pdm.lock) on Oct 31, 2024
@Kranium2002 (Contributor, Author) commented:

@henchaves Did you get a chance to look at this? Do I need to make any changes?

@henchaves (Member) commented:

Hello @Kranium2002,
A researcher will review your PR as soon as possible to check whether anything is still needed in the algorithm logic. After that, either @kevinmessiaen or I will review the code itself. Sorry for the delay!

@mattbit (Member) left a comment:

Hi @Kranium2002, thanks so much for the contribution! I think numerical perturbation is a great addition to the library.

I've noted a few issues with the proposed implementation, in summary:

  • We should check for features not only based on their data type, but column type (sometimes we have categorical features represented as numbers, but we don't want to apply the numerical perturbation to those)
  • The perturbation functions used should be defined as instances of TransformationFunction (the same mechanism we use for text perturbation and other transformations)
  • We need to be careful not to alter the underlying data type. Certain transformations should be applied depending on whether the feature has integer or float values (e.g. we don't want to add Gaussian noise to integers)

I also feel there is a lot of redundancy between the BaseNumericalPerturbationDetector class and BaseTextPerturbationDetector. It's probably worth refactoring this to have a common base class covering the shared behavior instead of duplicating code (a rough sketch is shown below); @henchaves can help with that ;)
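
A rough sketch of what such a refactoring could look like; the class and method names below are hypothetical and use plain pandas objects rather than the library's BaseModel/Dataset, just to illustrate the split between shared and subclass-specific behavior.

```python
# Hypothetical refactoring sketch (not the library's actual classes): a common base
# class owns the perturb-predict-compare loop, while each subclass only declares which
# features it handles and which transformations to apply.
from abc import ABC, abstractmethod
from typing import Callable, Sequence

import pandas as pd

class BasePerturbationDetector(ABC):
    threshold = 0.05  # max tolerated fraction of samples whose prediction changes

    @abstractmethod
    def _select_features(self, dataset: pd.DataFrame, features: Sequence[str]) -> Sequence[str]:
        """Return the subset of features this detector should perturb."""

    @abstractmethod
    def _get_transformations(self, dataset: pd.DataFrame, feature: str) -> Sequence[Callable[[pd.Series], pd.Series]]:
        """Return the transformations to apply to a given feature."""

    def run(self, predict: Callable[[pd.DataFrame], pd.Series], dataset: pd.DataFrame, features: Sequence[str]):
        issues = []
        original = predict(dataset)
        for feature in self._select_features(dataset, features):
            for transform in self._get_transformations(dataset, feature):
                perturbed = dataset.copy()
                perturbed[feature] = transform(perturbed[feature])
                # Classification-style comparison; a regression variant would use a
                # relative-deviation threshold instead of strict inequality.
                fail_rate = (predict(perturbed) != original).mean()
                if fail_rate > self.threshold:
                    issues.append((feature, transform, fail_rate))
        return issues

class NumericalPerturbationDetector(BasePerturbationDetector):
    def _select_features(self, dataset, features):
        # In the real detector this should come from dataset.column_types, not dtypes.
        return [f for f in features if pd.api.types.is_numeric_dtype(dataset[f])]

    def _get_transformations(self, dataset, feature):
        return [lambda s: s * 1.01, lambda s: s * 0.99]
```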


def run(self, model: BaseModel, dataset: Dataset, features: Sequence[str]) -> Sequence[Issue]:
"""Run the numerical perturbation detector."""
numerical_features = [f for f in features if pd.api.types.is_numeric_dtype(dataset.df[f])]

This check is not enough, because sometimes categorical features are represented with numerical types (e.g. integers). The more reliable way would be to extract this from the dataset.column_types.

Suggested change
numerical_features = [f for f in features if pd.api.types.is_numeric_dtype(dataset.df[f])]
numerical_features = [f for f in features if dataset.column_types[f] == "numeric"]
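
To illustrate why this matters, a small made-up example (assuming giskard's Dataset accepts a column_types mapping, as in its documented API): an integer-coded categorical column that a dtype check would wrongly select.

```python
# Made-up data: "port_code" is categorical but stored as integers, so a dtype check
# would wrongly include it; column_types declares it correctly.
import pandas as pd
import giskard

df = pd.DataFrame({
    "age": [23, 45, 31],        # genuinely numeric
    "port_code": [1, 2, 1],     # categorical, merely encoded as integers
})
dataset = giskard.Dataset(df, column_types={"age": "numeric", "port_code": "category"})

# pd.api.types.is_numeric_dtype would select both columns; column_types keeps only "age".
numerical_features = [c for c in df.columns if dataset.column_types[c] == "numeric"]
```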

"failed_size": failed_size,
"slice_size": slice_size,
"threshold": threshold,
"output_sensitivity": output_sensitivity,

Add the params metric and metric_value (these are used in the standardized export, e.g. the AVID report).

Suggested change
"output_sensitivity": output_sensitivity,
"output_sensitivity": output_sensitivity,
"metric": "Fail rate",
"metric_value": fail_rate,

features=[feature],
meta={
"feature": feature,
"perturbation_fraction": self.perturbation_fraction,

Add domain and deviation (these are used for visualization in the scan widget).

Suggested change
"perturbation_fraction": self.perturbation_fraction,
"perturbation_fraction": self.perturbation_fraction,
"domain": f"Feature `{feature}`",
"deviation": f"{failed_size}/{slice_size} tested samples ({round(fail_rate * 100, 2)}%) changed prediction after perturbation",

group=self._issue_group,
level=issue_level,
description=desc,
features=[feature],

Add transformation_fn to indicate which transformation (i.e. the perturbation) was performed.

Suggested change
features=[feature],
features=[feature],
transformation_fn="Numerical perturbation", # TODO: define proper transformation functions

Comment on lines +25 to +27
lambda x: x * 1.01,
lambda x: x * 0.99,
lambda x: x + np.random.normal(0, 0.01, x.shape),

These should be proper transformation functions (extending TransformationFunction).

IMPORTANT: all of these perturbations assume that the numerical values are floats. This is not always the case; integer features are common, and when such data types are present we must not alter them. Applying the transformations above (for example Gaussian noise) to an integer feature like num_passengers would silently convert the data type to float, potentially breaking the model inference.
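
As an illustration, a minimal sketch of a dtype-preserving relative perturbation (the function name and rounding strategy are my assumptions; in the PR this logic should live inside a proper TransformationFunction as noted above):

```python
import pandas as pd

def perturb_relative(values: pd.Series, fraction: float = 0.01) -> pd.Series:
    """Scale values by (1 + fraction) without changing the column's dtype."""
    perturbed = values.astype(float) * (1.0 + fraction)
    if pd.api.types.is_integer_dtype(values):
        # Round and cast back so integer features (e.g. num_passengers) stay integers
        # instead of being silently converted to float.
        perturbed = perturbed.round().astype(values.dtype)
    return perturbed
```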

@henchaves self-assigned this on Nov 18, 2024

Successfully merging this pull request may close these issues.

Scan: Add a robustness detector to the scan that perturbs numerical values
5 participants