[python-package] enable early stopping automatically in scikit-learn interface (fixes #3313) #5808

Open
wants to merge 12 commits into base: master

ClaudioSalvatoreArcidiacono

Fixes #3313

Implements Scikit-learn like interface for early stopping.

@ClaudioSalvatoreArcidiacono

This comment was marked as resolved.

@jameslamb changed the title from "3313 enable auto early stopping" to "[python-package] enable early stopping automatically in scikit-learn interface (fixes #3313)" on Mar 26, 2023
Collaborator

@jameslamb left a comment

Thanks for taking the time to contribute to LightGBM! I haven't reviewed this yet, but just leaving a blocking review to make it clear to other maintainers that I'd like the opportunity to review before this is merged. The discussion in #3313 talked about waiting for scikit-learn's interface for HistGradientBoostingClassifier to stabilize and mature and then adapting to the choices they made... I'd like the opportunity to look at that interface thoroughly prior to reviewing this.

@jameslamb
Collaborator

Until we have a chance to review, you can improve the chances of this being merged by addressing any CI failures you see. You can safely ignore failing R-package jobs, there are some known issues with those (fixed in #5807 ).

@ClaudioSalvatoreArcidiacono
Author

Hey @jameslamb, thanks for picking this up. I have started to look at the CI failures, and I think I can solve most of them easily.

There is one check for which I need some input.

I see that one of the tests checks that the init parameters for the sklearn API and the Dask API have the same arguments (see this one). Early stopping is not available yet for the Dask API, so I do not see how we can easily iron that out. Shall I add some more exceptions to that specific test?

@jameslamb
Collaborator

Shall I add some more exceptions to that specific test?

@ClaudioSalvatoreArcidiacono no, please do not do that. This test you've linked to is there exactly to catch such deviations.

If you absolutely need to add arguments to the constructors of the scikit-learn estimators in this project, add the same arguments in the same order to the Dask estimators, with default values of None or something, and raise NotImplementedError in the Dask interface when any non-None values are passed to those arguments.
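
The pattern described above can be sketched as follows. This is a minimal illustration only: `SklearnLikeEstimator`, `DaskLikeEstimator`, and `constructor_params` are hypothetical stand-ins, not the real lightgbm classes, and the argument names `early_stopping` and `validation_fraction` are assumed for the example.

```python
import inspect
from typing import List, Optional, Tuple, Union


class SklearnLikeEstimator:
    """Hypothetical stand-in for a scikit-learn-style estimator."""

    def __init__(
        self,
        n_estimators: int = 100,
        early_stopping: Optional[Union[bool, str]] = None,
        validation_fraction: Optional[float] = None,
    ) -> None:
        self.n_estimators = n_estimators
        self.early_stopping = early_stopping
        self.validation_fraction = validation_fraction


class DaskLikeEstimator(SklearnLikeEstimator):
    """Hypothetical Dask counterpart: same arguments, in the same order."""

    def __init__(
        self,
        n_estimators: int = 100,
        early_stopping: Optional[Union[bool, str]] = None,
        validation_fraction: Optional[float] = None,
    ) -> None:
        # Accept the arguments so the signatures stay identical,
        # but reject any non-None value because the feature is not implemented.
        unsupported = {
            "early_stopping": early_stopping,
            "validation_fraction": validation_fraction,
        }
        for name, value in unsupported.items():
            if value is not None:
                raise NotImplementedError(
                    f"'{name}' is not supported in this interface yet"
                )
        super().__init__(n_estimators=n_estimators)


def constructor_params(cls: type) -> List[Tuple[str, object]]:
    """Return (name, default) pairs of a class constructor, in order."""
    signature = inspect.signature(cls.__init__)
    return [(p.name, p.default) for p in signature.parameters.values()]
```

With this shape, a signature-parity check comparing `constructor_params(SklearnLikeEstimator)` with `constructor_params(DaskLikeEstimator)` keeps passing without any new exceptions in the test.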

For what it's worth, I am not totally convinced yet that we should take on the exact same interface as scikit-learn (especially the large added complexity of this new validation_set_split_strategy argument). #3313 was primarily about whether or not to enable early stopping by default... not explicitly about changing the signature of the lightgbm.sklearn estimators to match HistGradientBoostingClassifier.

If you are committed to getting the CI passing here I'm willing to consider it, but just want to set the right expectation that I expect you to also explain specifically the benefit of adding all this new complexity.

@jameslamb
Collaborator

By the way, it looks like you are not signing your commits with an email address tied to your GitHub account.

[Image: Screen Shot 2023-04-18 at 8 52 12 PM]

See #5532 (comment) and the comments linked from it for an explanation of what I mean by that and an explanation of how to fix it.

@jameslamb
Collaborator

It has been about 6 weeks since I last provided a review on this PR, and there has not been any activity on it since then.

I'm closing this, assuming it's been abandoned, to make it clear to others interested in this feature that they shouldn't be waiting on this PR.

@ClaudioSalvatoreArcidiacono if you have time in the future to work with maintainers here, we'd welcome future contributions.

@jameslamb jameslamb closed this May 30, 2023
@ClaudioSalvatoreArcidiacono
Author

Hey @jameslamb, I have not had much time to look at this lately, but I should be more available now and in the coming weeks.

If you also have some time to help by reviewing it, I can pick this PR up again.

Regarding your previous comments: thanks for the heads-up on signing commits. I will sign the next commits as you described.

About the complexity of the proposed implementation, I am definitely open to feedback from the maintainers and I am willing to change the proposed implementation if necessary.

In the proposed implementation I tried to stick to what is written in the FAQ:

The appropriate splitting strategy depends on the task and domain of the data, information that a modeler has but which LightGBM as a general-purpose tool does not.

So, in the proposed implementation I tried to find a common ground between convenience of activating early stopping using only init params and customisability of the splitting strategy.

@jameslamb
Collaborator

Just to set the right expectation...I personally will not be able to look at this for at least another week, and I think it's unlikely it'll make it into LightGBM 4.0 (#5952).

I'm sorry, but this is quite complex and will require a significant investment of time to review. Some questions I'll be looking to answer when I start reviewing this:

  • what does "Scikit-learn like interface" mean, precisely?
    • does it mean you've implemented exactly the same interface as HistGradientBoostingClassifier and HistGradientBoostingRegressor ? If so can you please link to code and docs showing that?
    • and more broadly, why does this need to change ANYTHING about the public interface of lightgbm? And couldn't the existing mechanisms used inside lightgbm.cv() be used instead of adding all this new code for splitting? e.g.
      If object, it should be one of the scikit-learn splitter classes
      (https://scikit-learn.org/stable/modules/classes.html#splitter-classes)
      and have ``split`` method.
  • what has to happen to make these changes consistent with how lightgbm currently works? For example...
    • what happens when early_stopping_rounds is passed to the estimator constructor via **kwargs and n_iter_no_change is set to a non-default value in the constructor... which value wins?
    • What happens if early_stopping=True is passed but valid_sets are also passed to .fit()? Does that disable the automatic splitting and just use the provided validation sets?

@ClaudioSalvatoreArcidiacono
Author

ClaudioSalvatoreArcidiacono commented Jul 1, 2023

Hey @jameslamb, no problem. Thanks for being transparent on your availability on this PR and for your feedback. I followed it and I made the PR easier to review now.

I have removed the functionality to use a custom splitter and I made the changes much smaller.

  • what does "Scikit-learn like interface" mean, precisely?

    • does it mean you've implemented exactly the same interface as HistGradientBoostingClassifier and HistGradientBoostingRegressor ? If so can you please link to code and docs showing that?

I have now implemented the same interface as HistGradientBoostingClassifier and HistGradientBoostingRegressor.

  • and more broadly, why does this need to change ANYTHING about the public interface of lightgbm? And couldn't the existing mechanisms used inside lightgbm.cv() be used instead of adding all this new code for splitting? e.g.
    If object, it should be one of the scikit-learn splitter classes
    (https://scikit-learn.org/stable/modules/classes.html#splitter-classes)
    and have ``split`` method.

In this implementation I tried to reuse the splitting function used in lightgbm.cv(). Thanks for the tip.

  • what has to happen to make these changes consistent with how lightgbm currently works? For example...

    • what happens when early_stopping_rounds is passed to the estimator constructor via **kwargs and n_iter_no_change is set to a non-default value in the constructor... which value wins?

Good observation. I think the arguments do need to be renamed so that they are more consistent with LightGBM naming conventions.

  • What happens if early_stopping=True is passed but valid_sets are also passed to .fit()? Does that disable the automatic splitting and just use the provided validation sets?

Correct.

Collaborator

@jameslamb left a comment

@ClaudioSalvatoreArcidiacono now that LightGBM v4.0 is out, I'd be happy to return to reviewing this.

Can you please update it to the latest master? Please also see my next round of comments.

@@ -412,6 +412,9 @@ def __init__(
         random_state: Optional[Union[int, np.random.RandomState]] = None,
         n_jobs: Optional[int] = None,
         importance_type: str = 'split',
+        use_early_stopping: bool = False,
+        validation_fraction: Optional[float] = 0.1,
+        early_stopping_round: int = 10,
Collaborator

It shouldn't be necessary to add n_iter_no_change or early_stopping_round to the public interface of these classes.

All keyword arguments passed through **kwargs are appended to the params stored on the object and passed through to LightGBM's C++ code.

self.set_params(**kwargs)

for key, value in params.items():
    setattr(self, key, value)
    if hasattr(self, f"_{key}"):
        setattr(self, f"_{key}", value)
    self._other_params[key] = value

n_iter_no_change is supported as a parameter alias for early_stopping_rounds in LightGBM.

{"n_iter_no_change", "early_stopping_round"},

And all aliases will be resolved in lightgbm.train(), which is called by the scikit-learn estimators' fit() methods.

self._Booster = train(

params = _choose_param_value(
    main_param_name="early_stopping_round",
    params=params,
    default_value=None
)

Can you please explain why you added this? If you just didn't realize LightGBM had this mechanism, please remove this keyword argument from all these classes (but keep passing it explicitly in the unit tests, if it's the preferred argument for scikit-learn).
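
To make the alias mechanism above concrete, here is a simplified sketch of how several spellings of one parameter can be collapsed into a single main name. It is not LightGBM's actual `_choose_param_value` implementation: `resolve_early_stopping_round` and `EARLY_STOPPING_ALIASES` are hypothetical names, the alias tuple is assumed for illustration, and the precedence among aliases when the main name is absent is a choice made for this sketch only.

```python
from typing import Any, Dict, Optional

# Assumed alias set for illustration; the authoritative list lives in LightGBM itself.
EARLY_STOPPING_ALIASES = (
    "early_stopping_round",  # treated as the main name in this sketch
    "early_stopping_rounds",
    "early_stopping",
    "n_iter_no_change",
)


def resolve_early_stopping_round(
    params: Dict[str, Any], default_value: Optional[int] = None
) -> Dict[str, Any]:
    """Collapse every alias into the main parameter name (simplified)."""
    resolved = dict(params)  # copy, so the caller's dict is not mutated
    main_name = EARLY_STOPPING_ALIASES[0]
    found = [name for name in EARLY_STOPPING_ALIASES if name in resolved]
    # The main name wins if present; otherwise the first alias found is used.
    value = resolved[found[0]] if found else default_value
    for name in found:
        del resolved[name]
    resolved[main_name] = value
    return resolved
```

Under these assumptions, an estimator that receives `n_iter_no_change=5` through `**kwargs` ends up training with `early_stopping_round=5`, without the constructor needing a dedicated keyword argument.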

Author

I did not fully understand how aliases worked in LightGBM, this comment was super helpful in unblocking me, thanks a lot for it.

I have removed n_iter_no_change from the public interface.

@@ -412,6 +412,9 @@ def __init__(
         random_state: Optional[Union[int, np.random.RandomState]] = None,
         n_jobs: Optional[int] = None,
         importance_type: str = 'split',
+        use_early_stopping: bool = False,
Collaborator

The Dask and scikit-learn interfaces should be identical.

Since HistGradientBoostingClassifier calls this early_stopping (docs link), please also call it early_stopping here and in the Dask interface.

Collaborator

Also why is this defaulting to False? This PR's title says "enable early stopping automatically", but if it was merged as-is early stopping would be OFF by default.

Since the goal of this PR is to make LightGBM's scikit-learn interface consistent with scikit-learn itself, please follow what HistGradientBoostingClassifier does. As seen in their docs: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html

early_stopping: ‘auto’ or bool, default=‘auto’
If ‘auto’, early stopping is enabled if the sample size is larger than 10000. If True, early stopping is enabled, otherwise early stopping is disabled.

Please match that behavior.
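
The documented rule quoted above can be written as a small decision function. This is a sketch of the described behavior only, not scikit-learn's or LightGBM's code; `early_stopping_enabled` is a hypothetical helper name.

```python
from typing import Union


def early_stopping_enabled(early_stopping: Union[bool, str], n_samples: int) -> bool:
    """Decide whether early stopping is on, following the quoted docs."""
    if isinstance(early_stopping, bool):
        # An explicit True/False always wins over the sample-size rule.
        return early_stopping
    if early_stopping == "auto":
        # 'auto' enables early stopping only for more than 10000 samples.
        return n_samples > 10_000
    raise ValueError(
        f"early_stopping must be 'auto' or a bool, got {early_stopping!r}"
    )
```

Note the strict inequality: with exactly 10000 samples, `'auto'` leaves early stopping disabled.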

Author

This parameter has been removed from the public interface.

tests/python_package_test/test_dask.py (outdated, resolved)
python-package/lightgbm/sklearn.py (outdated, resolved)
@ClaudioSalvatoreArcidiacono
Author

Hey @jameslamb thanks a lot for your review. I am still interested in solving this issue, but I will be on holidays for the next two weeks. I will take a look at your comments once I am back.

@jameslamb
Collaborator

No problem, thanks for letting us know! I'll be happy to continue working on this whenever you have time.

@ClaudioSalvatoreArcidiacono
Author

Hey @jameslamb, Thanks again for your review comments! I was aware of the aliases mechanism but I did not fully understand how it worked. Your comment really helped me in understanding them.

I have worked on your feedback and I think now the PR is in good shape.

In this implementation I tried to stick to the scikit-learn interface of HistGradientBoostingClassifier, so the parameter early_stopping is 'auto' by default.

Since we are not adding early_stopping_round to the public interface, its default value has to live somewhere else in the code. I think it would be better to create a constant in the sklearn.py file where we set the default value, but I am happy to hear alternative solutions from you.

Lastly, we should still mention somewhere in the documentation that early stopping is enabled by default in the scikit-learn interface of LightGBM and also we should mention how to forcefully disable it. Where would you suggest to add it?

Collaborator

@jameslamb left a comment

Thanks for your continued effort on this. I did a quick review this morning and left comments on a few things, but that set of suggestions isn't comprehensive. It will take some time to provide a more thorough review... I'll do that when I can.

In addition to the comments I left inline, please add tests covering each of the new if-else code branches you've introduced:

  • early stopping is only enabled automatically if the training data has > 10_000 rows
  • passing early_stopping_round=True / early_stopping_round=False does the expected thing

@@ -2538,14 +2538,12 @@ def set_categorical_feature(
         self : Dataset
             Dataset with set categorical features.
         """
-        if self.categorical_feature == categorical_feature:
+        if self.categorical_feature == categorical_feature or categorical_feature == 'auto':
Collaborator

I support moving this up here to reduce the complexity of the code but how is this related to this pull request? If it's not, can you please put up a separate PR with this change?

That would also have the added benefit of removing your status as a first-time contributor to LightGBM... so you'd no longer have to wait for a maintainer to click a button to run the CI jobs when you push a commit 😊

Author

This change is necessary in order to make early stopping work with setting categorical features during fit. Without this change the following error is raised

lightgbm.basic.LightGBMError: Cannot set categorical feature after freed raw data, set free_raw_data=False when construct Dataset to avoid this.

I will add a test to check that early_stopping=True works together with setting categorical features in fit.

Collaborator

oh! Right right, setting categorical_feature after the Dataset's constructed and when you don't have the raw data around any more is problematic.

Now I'm even more confused though... how does adding automatic early stopping interact with anything related to setting categorical features?

You are just automatically enabling something that is already possible in lightgbm, so changes like this shouldn't be necessary.

Author

The thing is that in order to enable automatic early stopping, we also need to automatically create an evaluation set.

In the current implementation, the evaluation set is created by splitting the train_set (python-package/lightgbm/sklearn.py#L818-831) instead of splitting the raw data passed to fit (X, y). This has been done to reuse the existing code for creating evaluation sets which relies on the Dataset API.

I am not very fluent with the Dataset API of lightgbm, so if you could recommend better ways of splitting train and validation data, I would be open to changes.
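
For comparison, the alternative mentioned above (splitting the raw data passed to fit rather than the constructed train_set) could look roughly like this. It is a plain-Python sketch and not what the PR implements; `split_train_validation` is a hypothetical helper, and shuffling with a fixed seed is an assumption made for the example.

```python
import random
from typing import List, Tuple


def split_train_validation(
    n_rows: int, validation_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, validation_indices) for a shuffled hold-out split."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    n_valid = max(1, int(round(n_rows * validation_fraction)))
    if n_valid >= n_rows:
        raise ValueError("not enough rows to hold out a validation set")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    valid_idx = sorted(indices[:n_valid])
    train_idx = sorted(indices[n_valid:])
    return train_idx, valid_idx
```

The returned indices would then be used to slice the raw X, y, and any sample weights before any Dataset objects are built, for example `X_train = [X[i] for i in train_idx]` for list-like inputs.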

@@ -566,6 +571,7 @@ def __init__(
         self._n_features_in: int = -1
         self._classes: Optional[np.ndarray] = None
         self._n_classes: int = -1
+        self.validation_fraction = validation_fraction
Collaborator

Please move this up to right after self.importance_type = importance_type.

If you look carefully, you'll see that all such assignments are ordered by the order of the keyword arguments in the constructor... let's please preserve that ordering, to avoid mistakes and make the code easier to read.

    default_value="auto",
)
if params["early_stopping_round"] == "auto":
    params["early_stopping_round"] = 10 if hasattr(self, "n_rows_train") and self.n_rows_train > 10000 else None
Collaborator

Please break this into an if-else block instead of using such a long 1-line statement. This is very difficult to read (at least for me).
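
One possible shape for the requested if-else block is sketched below. It is written as a standalone function so it can run on its own: `resolve_auto_early_stopping` and the two constants are hypothetical names, and the `hasattr(self, "n_rows_train")` check from the line under review is replaced here by an optional `n_rows_train` argument.

```python
from typing import Any, Dict, Optional

# Hypothetical module-level constants holding the values from the one-liner.
AUTO_EARLY_STOPPING_ROUNDS = 10
AUTO_EARLY_STOPPING_MIN_ROWS = 10_000


def resolve_auto_early_stopping(
    params: Dict[str, Any], n_rows_train: Optional[int]
) -> Dict[str, Any]:
    """Replace early_stopping_round='auto' with a concrete value."""
    resolved = dict(params)  # copy, so the caller's dict is not mutated
    if resolved.get("early_stopping_round") == "auto":
        if n_rows_train is not None and n_rows_train > AUTO_EARLY_STOPPING_MIN_ROWS:
            # Large training set: turn early stopping on with the default patience.
            resolved["early_stopping_round"] = AUTO_EARLY_STOPPING_ROUNDS
        else:
            # Small or unknown training set: leave early stopping off.
            resolved["early_stopping_round"] = None
    return resolved
```

Any value other than `"auto"` passes through unchanged, so an explicit user setting is never overridden.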

@@ -789,44 +809,61 @@ def fit(
train_set = Dataset(data=_X, label=_y, weight=sample_weight, group=group,
                    init_score=init_score, categorical_feature=categorical_feature,
                    params=params)
self._n_rows_train = _X.shape[0]
if params["early_stopping_round"] == "auto":
    params["early_stopping_round"] = 10 if self.n_rows_train > 10000 else None
Collaborator

please make this an if-else block instead of a one-liner

)
weight = np.full_like(y, 2) if use_weight else None
gbm.fit(X, y, sample_weight=weight)
assert bool(gbm.best_iteration_)
Collaborator

Please make these tests stricter. Instead of just "best_iteration_ is set", please add assertions checking that:

  • best_iteration_ is set to the expected value (e.g. that early stopping actually happened)
  • the number of trees (gbm.booster_.num_trees()) is set to the expected value

For example, you're not testing here that early_stopping_rounds is set to 10 automatically if it's not provided.

If you need help with testing that early stopping actually happened, look around the other tests in the project or the one I've added in #6095.
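
To illustrate why exact assertions are stronger than `bool(gbm.best_iteration_)`, here is a toy, pure-Python simulation of patience-based early stopping. It is not LightGBM code: `simulate_early_stopping` is a hypothetical helper and the validation losses are made up for the example.

```python
from typing import List, Tuple


def simulate_early_stopping(
    valid_losses: List[float], stopping_rounds: int
) -> Tuple[int, int]:
    """Return (best_iteration, iterations_run), both as 1-based counts."""
    best_loss = float("inf")
    best_iteration = 0
    for iteration, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_iteration = iteration
        elif iteration - best_iteration >= stopping_rounds:
            # No improvement for `stopping_rounds` iterations: stop here.
            return best_iteration, iteration
    # Ran out of iterations without triggering early stopping.
    return best_iteration, len(valid_losses)


losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.64, 0.7, 0.8]
best_iteration, iterations_run = simulate_early_stopping(losses, stopping_rounds=3)

# A weak check passes for almost any run, even one that never stopped early.
assert bool(best_iteration)
# Strict checks pin down exactly where training stopped.
assert best_iteration == 3
assert iterations_run == 6
assert iterations_run < len(losses)
```

The last three assertions fail if the patience value changes or if early stopping silently never triggers, which is the kind of regression a stricter test is meant to catch.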

Collaborator

(this applies to all 3 tests you've added)

def test_binary_classification_with_auto_early_stopping(use_weight):

    X, y = load_breast_cancer(return_X_y=True)
    n_estimators = 1000
Collaborator

Suggested change
-    n_estimators = 1000
+    n_estimators = 15

Please try to keep these tests as small and fast as is necessary to test the correctness of the code. We run 500+ unit tests on every build of the Python package... the smaller they are, the more testing cycles we can get through and the faster development on the project moves.

It would also be much appreciated if you'd set num_leaves=5 or some similar smaller-than-default value.

@@ -609,21 +654,21 @@ def test_pandas_categorical():
     X[cat_cols_actual] = X[cat_cols_actual].astype('category')
     X_test[cat_cols_actual] = X_test[cat_cols_actual].astype('category')
     cat_values = [X[col].cat.categories.tolist() for col in cat_cols_to_store]
-    gbm0 = lgb.sklearn.LGBMClassifier(n_estimators=10).fit(X, y)
+    gbm0 = lgb.sklearn.LGBMClassifier(n_estimators=10, random_state=42).fit(X, y)
Collaborator

Why was this change necessary?

Author

Thanks for spotting this, it comes from a previous implementation where early stopping was true by default, now it is 'auto' by default. This test does not use early stopping indeed.

Collaborator

Ok thank you. Please, as you continue on this, check the diff and be confident that every change you're proposing is necessary and can be justified. The more you can eliminate these "what is this? ... oh nevermind we don't need it any more" rounds of reviews, the faster this will get to a state that we feel confident merging.

Author

Sure, thanks for the suggestion.

@jameslamb
Collaborator

Lastly, we should still mention somewhere in the documentation that early stopping is enabled by default in the scikit-learn interface of LightGBM and also we should mention how to forcefully disable it. Where would you suggest to add it?

I think here would be appropriate:

Early Stopping

@ClaudioSalvatoreArcidiacono
Author

Lastly, we should still mention somewhere in the documentation that early stopping is enabled by default in the scikit-learn interface of LightGBM and also we should mention how to forcefully disable it. Where would you suggest to add it?

I think here would be appropriate:

Early Stopping

Thanks, I have added something there.
Shall we also mention something here?

@ClaudioSalvatoreArcidiacono
Author

Hey @jameslamb, I think the PR is ready for a second look. Please let me know if there are any changes you would like me to make :)

@jameslamb
Collaborator

Thanks for returning to this.

@jmoralez could you take the next round of reviews on this?

@jmoralez
Collaborator

jmoralez commented Feb 1, 2024

The original issue (#3313) requested the possibility of doing early stopping with the scikit-learn API by specifying arguments in the constructor, not automatically performing early stopping. I understand that's what HistGradientBoosting(Classifier|Regressor) do, but I think we could also consider doing what GradientBoosting(Classifier|Regressor) do, where early stopping is off by default but the arguments needed to enable it are in the init signature. Otherwise this would be a silently breaking change (I know we would list it as a breaking change, but it could confuse people who don't read the release notes).

I would like to have this clear before reviewing, since we would be doing many things behind the user's back (automatically enabling early stopping for >10,000 rows, stratified if classification, setting the number of folds, shuffling). I would be more comfortable if it was an explicit decision (like setting early_stopping_rounds>0 in the params for example).

@ClaudioSalvatoreArcidiacono
Author

ClaudioSalvatoreArcidiacono commented Feb 2, 2024

Hey @jmoralez, thanks for your comment. I agree with you, my preference would be for having early stopping off by default and to have it on only when it is explicitly set by an init parameter. This was also my original implementation. I have changed it due to this comment from @jameslamb.

Could the two of you maybe agree on what you would like to see as the default behaviour :)?

@jmoralez
Collaborator

jmoralez commented Feb 2, 2024

What are your thoughts on this (enabling early stopping by default) @borchero?

@borchero
Collaborator

What are your thoughts on this (enabling early stopping by default) @borchero?

Sorry, I only saw this comment now 🫣 I wouldn't enable it by default, as (1) it is not integral to using boosted trees and (2) early stopping is not supported for all boosting strategies (I recently learnt that dart does not support it), which might cause confusion.


Successfully merging this pull request may close these issues.

[Feature Request] Auto early stopping in Sklearn API
4 participants