- Hi @dvirginz, thanks for your interest in Kedro!
  Let me know if this helps!
- Dear Noklam and astrojuanlu, thank you for your response. We are using serialized datasets. The problem arises when we execute multiple runs that generate different intermediate datasets, let's say:
  and
  In the current data model of Kedro (as we see it), these runs will overwrite each other. We are looking to save these two intermediate datasets side by side and be able to do:
  and
  without overwriting each other. Does that make sense? Thanks.
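For reference, Kedro's built-in dataset versioning addresses exactly this situation: each save goes into a new timestamped subfolder under the dataset's path, and a load resolves to the latest (or an explicitly pinned) version, so two runs can sit side by side. Below is a minimal sketch using the Python catalog API; the dataset name and file path are made up for illustration, and it assumes a Kedro version that ships `kedro.extras.datasets`:

```python
from kedro.io import DataCatalog, Version
from kedro.extras.datasets.pandas import ParquetDataSet

# A versioned dataset: every save lands in a new timestamped subfolder, e.g.
# data/02_intermediate/samples.parquet/<timestamp>/samples.parquet,
# and a load with version None resolves to the most recent version.
catalog = DataCatalog(
    {
        "preprocessed_samples": ParquetDataSet(
            filepath="data/02_intermediate/samples.parquet",
            version=Version(load=None, save=None),
        )
    }
)

# catalog.save("preprocessed_samples", df)  # writes a new version
# catalog.load("preprocessed_samples")      # loads the latest version
```

In `catalog.yml` the equivalent is `versioned: true` on the dataset entry.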
- We have a pipeline where the preprocessing stage can take over 12 hours to run.
  While we can run it separately, we prefer not to, because its results have a significant impact on the model (positive and negative sample selection, sampling strategy, etc.).
  The issue is that every time we run the Kedro pipeline, we either have to wait the full 12 hours or start the run from the node after preprocessing. The problem with the latter approach is that something might have changed in the preprocessing strategy, such as the way we sample positive labels, and the downstream nodes won't pick that change up.
  The two questions are:
  1. Can we cache the results of one node (not in memory, as CachedDataSet does) and only rerun it if its input parameters or code have changed?
  2. Can we store two different dataset configurations in parallel? For example, if we want to test two different sampling strategies and create two different datasets, we don't want one run to overwrite the dataset produced by the other.
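To make the second question concrete, one straightforward pattern is to give each sampling strategy its own catalog entry (or a separate configuration environment selected with `kedro run --env <name>`), so the two runs write to different paths. A minimal sketch using the Python catalog API; the dataset names and file paths are made up for illustration, and it assumes a Kedro version that ships `kedro.extras.datasets`:

```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import ParquetDataSet

# One entry per sampling strategy, each writing to its own file,
# so running one strategy never overwrites the other's output.
catalog = DataCatalog(
    {
        "samples_strategy_a": ParquetDataSet(
            filepath="data/02_intermediate/samples_strategy_a.parquet"
        ),
        "samples_strategy_b": ParquetDataSet(
            filepath="data/02_intermediate/samples_strategy_b.parquet"
        ),
    }
)
```

The same idea expressed in `catalog.yml` is simply two entries with different `filepath` values.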