- Hi @dvirginz, thanks for your interest in Kedro!
  Let me know if this helps!
- Dear Noklam and astrojuanlu, thank you for your response. We are using serialized datasets. The problem arises when we execute multiple runs that generate different intermediate datasets, let's say:
  and
  In the current data model of Kedro (as we see it), these runs will overwrite each other. We are looking to save these two intermediate datasets side by side and be able to do:
  and
  without overwriting each other. Does that make sense? Thanks.
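For reference, Kedro's built-in dataset versioning addresses exactly this situation: each save goes into a new timestamped subfolder under the dataset's path, and a load resolves to the latest (or an explicitly pinned) version, so two runs can sit side by side. Below is a minimal sketch using the Python catalog API; the dataset name and file path are made up for illustration, and it assumes a Kedro version that ships `kedro.extras.datasets`:

```python
from kedro.io import DataCatalog, Version
from kedro.extras.datasets.pandas import ParquetDataSet

# A versioned dataset: every save lands in a new timestamped subfolder, e.g.
# data/02_intermediate/samples.parquet/<timestamp>/samples.parquet,
# and a load with version None resolves to the most recent version.
catalog = DataCatalog(
    {
        "preprocessed_samples": ParquetDataSet(
            filepath="data/02_intermediate/samples.parquet",
            version=Version(load=None, save=None),
        )
    }
)

# catalog.save("preprocessed_samples", df)  # writes a new version
# catalog.load("preprocessed_samples")      # loads the latest version
```

In `catalog.yml` the equivalent is `versioned: true` on the dataset entry.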
- We have a pipeline where the preprocessing stage can take over 12 hours to run.
  While we can run it separately, we prefer not to, because its results have a significant impact on the model (positive and negative sample selection, sampling strategy, etc.).
  The issue is that every time we run the Kedro pipeline, we either have to wait the full 12 hours or start the run from the node after preprocessing. The problem with the latter approach is that something might have changed in the preprocessing strategy, such as the way we sample positive labels, and the downstream nodes won't pick that change up.
  The two questions are:
  1. Can we cache the results of one node (not in memory, as CachedDataSet does) and only rerun it if its input parameters or code have changed?
  2. Can we store two different dataset configurations in parallel? For example, if we want to test two different sampling strategies and create two different datasets, we don't want one run to overwrite the dataset produced by the other.
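To make the second question concrete, one straightforward pattern is to give each sampling strategy its own catalog entry (or a separate configuration environment selected with `kedro run --env <name>`), so the two runs write to different paths. A minimal sketch using the Python catalog API; the dataset names and file paths are made up for illustration, and it assumes a Kedro version that ships `kedro.extras.datasets`:

```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import ParquetDataSet

# One entry per sampling strategy, each writing to its own file,
# so running one strategy never overwrites the other's output.
catalog = DataCatalog(
    {
        "samples_strategy_a": ParquetDataSet(
            filepath="data/02_intermediate/samples_strategy_a.parquet"
        ),
        "samples_strategy_b": ParquetDataSet(
            filepath="data/02_intermediate/samples_strategy_b.parquet"
        ),
    }
)
```

The same idea expressed in `catalog.yml` is simply two entries with different `filepath` values.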