[python-package] enable early stopping automatically in scikit-learn interface (fixes #3313) #5808
base: master
Conversation
Thanks for taking the time to contribute to LightGBM! I haven't reviewed this yet, but I'm leaving a blocking review to make it clear to other maintainers that I'd like the opportunity to review before this is merged. The discussion in #3313 talked about waiting for scikit-learn's interface for HistGradientBoostingClassifier to stabilize and mature and then adapting to the choices they made... I'd like the opportunity to look at that interface thoroughly prior to reviewing this.

Until we have a chance to review, you can improve the chances of this being merged by addressing any CI failures you see.
Hey @jameslamb, thanks for picking this up. I have started to take a look at the CI failures, and I think I can solve most of them easily. There is one check for which I need some input: one of the tests checks that the init parameters for the sklearn API and the Dask API have the same arguments (see this one). Early stopping is not available yet for the Dask API, so I do not see how we can easily iron that out. Shall I add some more exceptions to that specific test?
@ClaudioSalvatoreArcidiacono no, please do not do that. This test you've linked to is there exactly to catch such deviations. If you absolutely need to add arguments to the constructors of the scikit-learn estimators in this project, add the same arguments in the same order to the Dask estimators, with the same default values.

For what it's worth, I am not totally convinced yet that we should take on the exact same interface as HistGradientBoostingClassifier. If you are committed to getting the CI passing here I'm willing to consider it, but I just want to set the right expectation that I expect you to also explain specifically the benefit of adding all this new complexity.
By the way, it looks like you are not signing your commits with an email address tied to your GitHub account. See #5532 (comment) and the comments linked from it for an explanation of what I mean by that and an explanation of how to fix it.
It has been about 6 weeks since I last provided a review on this PR, and there has not been any activity on it since then. I'm closing this, assuming it's been abandoned, to make it clear to others interested in this feature that they shouldn't be waiting on this PR. @ClaudioSalvatoreArcidiacono if you have time in the future to work with maintainers here, we'd welcome future contributions.
Hey @jameslamb, I did not have much time to take a look at this lately, but I should be more available now and in the coming weeks. If you also have some time to help me review it, I can pick this PR up again. Regarding your previous comments: thanks for the heads up on signing commits, I will sign the next commits as you mentioned. About the complexity of the proposed implementation, I am definitely open to feedback from the maintainers and I am willing to change the proposed implementation if necessary. In the proposed implementation I tried to stick to what is written in the FAQ, so I tried to find a common ground between the convenience of activating early stopping using only init params and the customisability of the splitting strategy.
Just to set the right expectation... I personally will not be able to look at this for at least another week, and I think it's unlikely it'll make it into LightGBM 4.0 (#5952). I'm sorry, but this is quite complex and will require a significant investment of time to review. There are some questions I'll be looking to answer when I start reviewing this.
Hey @jameslamb, no problem. Thanks for being transparent about your availability for this PR and for your feedback. I followed it and made the PR easier to review: I removed the functionality to use a custom splitter, which makes the changes much smaller.
I have now implemented the same interface as HistGradientBoostingClassifier.

In this implementation I tried to reuse the existing splitting function.

Good observation. I think it is indeed necessary to rename the arguments so that they are more consistent with LightGBM naming conventions.

Correct.
@ClaudioSalvatoreArcidiacono now that LightGBM v4.0 is out, I'd be happy to return to reviewing this. Can you please update it to the latest master? Please also see my next round of comments.
python-package/lightgbm/sklearn.py (outdated)

```diff
@@ -412,6 +412,9 @@ def __init__(
     random_state: Optional[Union[int, np.random.RandomState]] = None,
     n_jobs: Optional[int] = None,
     importance_type: str = 'split',
+    use_early_stopping: bool = False,
+    validation_fraction: Optional[float] = 0.1,
+    early_stopping_round: int = 10,
```
It shouldn't be necessary to add `n_iter_no_change` or `early_stopping_round` to the public interface of these classes. All keyword arguments passed through `**kwargs` are appended to the `params` stored on the object and passed through to LightGBM's C++ code.

LightGBM/python-package/lightgbm/sklearn.py, line 569 in 7a801f7:

```python
self.set_params(**kwargs)
```

LightGBM/python-package/lightgbm/sklearn.py, lines 617 to 621 in 7a801f7:

```python
for key, value in params.items():
    setattr(self, key, value)
    if hasattr(self, f"_{key}"):
        setattr(self, f"_{key}", value)
    self._other_params[key] = value
```

`n_iter_no_change` is supported as a parameter alias for `early_stopping_rounds` in LightGBM.

LightGBM/src/io/config_auto.cpp, line 83 in 7a801f7:

```cpp
{"n_iter_no_change", "early_stopping_round"},
```

And all aliases will be resolved in `lightgbm.train()`, which is called by the scikit-learn estimators' `fit()` methods.

LightGBM/python-package/lightgbm/sklearn.py, line 842 in 7a801f7:

```python
self._Booster = train(
```

LightGBM/python-package/lightgbm/engine.py, lines 175 to 179 in 7a801f7:

```python
params = _choose_param_value(
    main_param_name="early_stopping_round",
    params=params,
    default_value=None
)
```

Can you please explain why you added this? If you just didn't realize LightGBM had this mechanism, please remove this keyword argument from all these classes (but keep passing it explicitly in the unit tests, if it's the preferred argument for scikit-learn).
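A minimal sketch of the alias mechanism described above (illustrative only, not code from this PR; it assumes a standard lightgbm install and uses synthetic scikit-learn data):

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

# `n_iter_no_change` is not an explicit keyword of LGBMClassifier, so it is absorbed
# by **kwargs, stored in the estimator's params, and resolved to `early_stopping_round`
# by LightGBM's alias handling when fit() calls train().
clf = lgb.LGBMClassifier(n_estimators=100, n_iter_no_change=5)
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
print(clf.best_iteration_)  # set, because the alias enabled early stopping
```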
I did not fully understand how aliases worked in LightGBM; this comment was super helpful in unblocking me, thanks a lot for it. I have removed `n_iter_no_change` from the public interface.
python-package/lightgbm/sklearn.py (outdated)

```diff
@@ -412,6 +412,9 @@ def __init__(
     random_state: Optional[Union[int, np.random.RandomState]] = None,
     n_jobs: Optional[int] = None,
     importance_type: str = 'split',
+    use_early_stopping: bool = False,
```
The Dask and scikit-learn interfaces should be identical. Since HistGradientBoostingClassifier calls this `early_stopping` (docs link), please also call it `early_stopping` here and in the Dask interface.
Also, why is this defaulting to `False`? This PR's title says "enable early stopping automatically", but if it was merged as-is, early stopping would be OFF by default.

Since the goal of this PR is to make LightGBM's scikit-learn interface consistent with scikit-learn itself, please follow what HistGradientBoostingClassifier does. As seen in their docs (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html):

> early_stopping : 'auto' or bool, default='auto'
> If 'auto', early stopping is enabled if the sample size is larger than 10000. If True, early stopping is enabled, otherwise early stopping is disabled.

Please match that behavior.
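A tiny sketch of the resolution rule quoted above, as it might look in code (the helper and its names are illustrative, not the PR's actual implementation):

```python
# Illustrative helper: resolve a HistGradientBoostingClassifier-style
# early_stopping='auto' value into an on/off decision based on training-set size.
def resolve_early_stopping(early_stopping, n_samples, threshold=10_000):
    if early_stopping == "auto":
        return n_samples > threshold
    return bool(early_stopping)


assert resolve_early_stopping("auto", n_samples=50_000) is True
assert resolve_early_stopping("auto", n_samples=500) is False
assert resolve_early_stopping(True, n_samples=500) is True
```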
This parameter has been removed from the public interface.
Hey @jameslamb, thanks a lot for your review. I am still interested in solving this issue, but I will be on holiday for the next two weeks. I will take a look at your comments once I am back.
No problem, thanks for letting us know! I'll be happy to continue working on this whenever you have time.
Hey @jameslamb, thanks again for your review comments! I was aware of the aliases mechanism but I did not fully understand how it worked; your comment really helped me understand it. I have worked on your feedback and I think the PR is now in good shape.

In this implementation I tried to stick to the scikit-learn interface of HistGradientBoostingClassifier, so the early stopping parameter now defaults to 'auto'.

Lastly, we should still mention somewhere in the documentation that early stopping is enabled by default in the scikit-learn interface of LightGBM, and also how to forcefully disable it. Where would you suggest adding that?
Thanks for your continued effort on this. I did a quick review this morning and left comments on a few things, but that set of suggestions isn't comprehensive. It will take some time to provide a more thorough review... I'll do that when I can.

In addition to the comments I left inline, please add tests covering each of the new if-else code branches you've introduced (a rough sketch of such tests follows this list):

- early stopping is only enabled automatically if the training data has > 10_000 rows
- passing `early_stopping_round=True` / `early_stopping_round=False` does the expected thing
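A hedged sketch of what such tests might look like; the dataset sizes, parameter values, and the 'auto' behaviour being checked come from the discussion above, and none of this is actual LightGBM test code:

```python
import numpy as np
import lightgbm as lgb


def _make_data(n_rows, seed=42):
    # small synthetic, learnable binary-classification problem
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, 5))
    y = (X[:, 0] + 0.1 * rng.normal(size=n_rows) > 0).astype(int)
    return X, y


def test_auto_early_stopping_off_for_small_data():
    # fewer than 10_000 rows: 'auto' should leave early stopping disabled,
    # so all n_estimators trees are built
    X, y = _make_data(1_000)
    gbm = lgb.LGBMClassifier(n_estimators=15, num_leaves=5).fit(X, y)
    assert gbm.booster_.num_trees() == 15


def test_early_stopping_round_false_disables_early_stopping():
    # more than 10_000 rows, but early stopping explicitly turned off
    X, y = _make_data(20_000)
    gbm = lgb.LGBMClassifier(n_estimators=15, num_leaves=5, early_stopping_round=False).fit(X, y)
    assert gbm.booster_.num_trees() == 15
```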
```diff
@@ -2538,14 +2538,12 @@ def set_categorical_feature(
         self : Dataset
             Dataset with set categorical features.
         """
-        if self.categorical_feature == categorical_feature:
+        if self.categorical_feature == categorical_feature or categorical_feature == 'auto':
```
I support moving this up here to reduce the complexity of the code, but how is this related to this pull request? If it's not, can you please put up a separate PR with this change?

That would also have the added benefit of removing your status as a first-time contributor to LightGBM... so you'd no longer have to wait for a maintainer to click a button to run the CI jobs when you push a commit 😊
This change is necessary in order to make early stopping work with setting categorical features during fit. Without this change, the following error is raised:

```
lightgbm.basic.LightGBMError: Cannot set categorical feature after freed raw data, set free_raw_data=False when construct Dataset to avoid this.
```

I will add a test to check that early_stopping=True works together with setting categorical features in fit.
Oh! Right, right: setting `categorical_feature` after the Dataset is constructed, when you don't have the raw data around any more, is problematic.

Now I'm even more confused though... how does adding automatic early stopping interact with anything related to setting categorical features? You are just automatically enabling something that is already possible in lightgbm; changes like this shouldn't be necessary.
The thing is that in order to enable automatic early stopping, we also need to automatically create an evaluation set. In the current implementation, the evaluation set is created by splitting the train_set (python-package/lightgbm/sklearn.py#L818-831) instead of splitting the raw data passed to fit (X, y). This has been done to reuse the existing code for creating evaluation sets, which relies on the Dataset API.

I am not very fluent with the Dataset API of lightgbm, so if you could recommend a better way to split train and validation data I would be open to changes.
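A rough sketch of one way to do that split with the Dataset API (an illustration of the idea being discussed, not the PR's code; it assumes the Dataset can still be constructed, i.e. the raw data is available):

```python
import numpy as np
import lightgbm as lgb


def split_train_valid(full_set: lgb.Dataset, validation_fraction: float = 0.1, seed: int = 0):
    """Carve train/validation subsets out of one Dataset by row index."""
    full_set = full_set.construct()   # materialize so num_data() is available
    n_rows = full_set.num_data()
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_rows)
    n_valid = max(1, int(n_rows * validation_fraction))
    valid_idx = np.sort(indices[:n_valid]).tolist()
    train_idx = np.sort(indices[n_valid:]).tolist()
    # Dataset.subset() keeps the validation set tied to the same underlying data
    return full_set.subset(train_idx), full_set.subset(valid_idx)
```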
python-package/lightgbm/sklearn.py (outdated)

```diff
@@ -566,6 +571,7 @@ def __init__(
         self._n_features_in: int = -1
         self._classes: Optional[np.ndarray] = None
         self._n_classes: int = -1
+        self.validation_fraction = validation_fraction
```
Please move this up to right after `self.importance_type = importance_type`.

If you look carefully, you'll see that all such assignments are ordered by the order of the keyword arguments in the constructor... let's please preserve that ordering, to avoid mistakes and make the code easier to read.
python-package/lightgbm/sklearn.py (outdated)

```python
    default_value="auto",
)
if params["early_stopping_round"] == "auto":
    params["early_stopping_round"] = 10 if hasattr(self, "n_rows_train") and self.n_rows_train > 10000 else None
```
Please break this into an if-else block instead of using such a long one-line statement. This is very difficult to read (at least for me).
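For example, the one-liner above could be unrolled into something like the following (a sketch of the requested restructuring, not the PR's final code; `self` and `params` come from the surrounding method shown in the diff):

```python
# Same logic as the one-liner, written as an explicit if-else block.
if params["early_stopping_round"] == "auto":
    if hasattr(self, "n_rows_train") and self.n_rows_train > 10_000:
        params["early_stopping_round"] = 10
    else:
        params["early_stopping_round"] = None
```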
python-package/lightgbm/sklearn.py (outdated)

```diff
@@ -789,44 +809,61 @@ def fit(
         train_set = Dataset(data=_X, label=_y, weight=sample_weight, group=group,
                             init_score=init_score, categorical_feature=categorical_feature,
                             params=params)
+        self._n_rows_train = _X.shape[0]
+        if params["early_stopping_round"] == "auto":
+            params["early_stopping_round"] = 10 if self.n_rows_train > 10000 else None
```
Please make this an if-else block instead of a one-liner.
```python
    )
    weight = np.full_like(y, 2) if use_weight else None
    gbm.fit(X, y, sample_weight=weight)
    assert bool(gbm.best_iteration_)
```
Please make these tests stricter. Instead of just "`best_iteration_` is set", please add assertions checking that:

- `best_iteration_` is set to the expected value (e.g. that early stopping actually happened)
- the number of trees (`gbm.booster_.num_trees()`) is set to the expected value

For example, you're not testing here that `early_stopping_rounds` is set to `10` automatically if it's not provided.

If you need help with testing that early stopping actually happened, look around the other tests in the project or the one I've added in #6095.
(this applies to all 3 tests you've added)
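A hedged sketch of assertions along those lines; the dataset, parameter values, and the expectation that early stopping triggers before the cap are illustrative, not taken from the PR's tests:

```python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

n_estimators = 200
gbm = lgb.LGBMClassifier(n_estimators=n_estimators, num_leaves=5, early_stopping_round=10)
gbm.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])

# early stopping actually happened (training stopped before the n_estimators cap) ...
assert gbm.best_iteration_ < n_estimators
# ... and, for binary classification (one tree per boosting round), the booster holds
# fewer trees than the cap but at least as many as the best iteration
assert gbm.best_iteration_ <= gbm.booster_.num_trees() < n_estimators
```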
```python
def test_binary_classification_with_auto_early_stopping(use_weight):
    X, y = load_breast_cancer(return_X_y=True)
    n_estimators = 1000
```
```diff
-    n_estimators = 1000
+    n_estimators = 15
```

Please try to keep these tests as small and fast as is necessary to test the correctness of the code. We run 500+ unit tests on every build of the Python package... the smaller they are, the more testing cycles we can get through and the faster development on the project moves.

It would also be much appreciated if you'd set `num_leaves=5` or some similar smaller-than-default value.
```diff
@@ -609,21 +654,21 @@ def test_pandas_categorical():
     X[cat_cols_actual] = X[cat_cols_actual].astype('category')
     X_test[cat_cols_actual] = X_test[cat_cols_actual].astype('category')
     cat_values = [X[col].cat.categories.tolist() for col in cat_cols_to_store]
-    gbm0 = lgb.sklearn.LGBMClassifier(n_estimators=10).fit(X, y)
+    gbm0 = lgb.sklearn.LGBMClassifier(n_estimators=10, random_state=42).fit(X, y)
```
Why was this change necessary?
Thanks for spotting this, it comes from a previous implementation where early stopping was True by default; now it is 'auto' by default. This test indeed does not use early stopping.
Ok, thank you. Please, as you continue on this, check the diff and be confident that every change you're proposing is necessary and can be justified. The more you can eliminate these "what is this? ... oh never mind, we don't need it any more" rounds of review, the faster this will get to a state that we feel confident merging.
Sure, thanks for the suggestion.

I think here would be appropriate: LightGBM/docs/Python-Intro.rst, line 223 in 921479b.

Thanks, I have added something there.
Final number of trees changes in CUDA and linux
This reverts commit 457c7f6.
Hey @jameslamb, I think the PR is ready for a second look. Please let me know if there are any changes you would like me to make :)
Thanks for returning to this. @jmoralez could you take the next round of reviews on this?
The original issue (#3313) requested the possibility of doing early stopping with the scikit-learn API by specifying arguments in the constructor, not automatically performing early stopping. I understand that's what HistGradientBoosting(Classifier|Regressor) do, but I think we could also consider doing what GradientBoosting(Classifier|Regressor) do, where the default is not to do it but the arguments are in the init signature to support it. Otherwise this would be a silently breaking change (I know we would list it as a breaking change, but it could confuse people who don't read the release notes).

I would like to have this clear before reviewing, since we would be doing many things behind the user's back (automatically enabling early stopping for >10,000 rows, stratifying if classification, setting the number of folds, shuffling). I would be more comfortable if it was an explicit decision (like setting a constructor argument explicitly).
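For reference, a small sketch contrasting the two scikit-learn behaviours mentioned above (these estimators and parameters are scikit-learn's public API; the comments paraphrase their documented defaults):

```python
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier

# Opt-in: early stopping only happens when n_iter_no_change is set explicitly
# (its default is None, i.e. no early stopping).
gbc = GradientBoostingClassifier(n_iter_no_change=10, validation_fraction=0.1)

# 'auto': early stopping is turned on automatically when the training set
# has more than 10_000 samples (default early_stopping="auto").
hgbc = HistGradientBoostingClassifier(early_stopping="auto", validation_fraction=0.1)
```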
Hey @jmoralez, thanks for your comment. I agree with you: my preference would be to have early stopping off by default and to turn it on only when it is explicitly requested via an init parameter. This was also my original implementation; I changed it due to this comment from @jameslamb. Could the two of you maybe agree on what you would like to see as the default behaviour? :)
What are your thoughts on this (enabling early stopping by default), @borchero?
Sorry, I only saw this comment now 🫣 I wouldn't enable it by default as (1) it is not integral to using boosted trees and (2) early stopping is not supported for all boosting strategies (I recently learnt, for example, that dart does not support early stopping).
Fixes #3313

Implements a scikit-learn-like interface for early stopping.