Replies: 5 comments
-
Thank you for reporting this; we will have a look at it soon. This example looks artificial. Did you encounter this problem while using certain libraries?
-
Potentially related to #2161.
-
No, my use case is that I create a parameterized sampler with pretty complicated behavior, where each iteration depends on the state of the previous iteration, and I don't know in advance how many iterations it will generate, so it makes sense to implement it as an iterator. I have several downstream tasks that consume the sampler and generate analytics.
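For concreteness, such a sampler might be sketched like this. This is a hypothetical illustration, not the actual code; the name parameterized_sampler and the stopping condition are invented:

```python
import random
from collections.abc import Iterator


def parameterized_sampler(start: float, step_scale: float) -> Iterator[float]:
    # Each sample depends on the previous one, and the number of
    # iterations isn't known in advance, so an iterator is a natural fit.
    state = start
    while abs(state) < 10.0:  # stopping condition only discovered while iterating
        yield state
        state = state + random.gauss(0.0, step_scale)
```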
-
I think generator functions only work for passing to datasets, but can't be passed through to other nodes (via other datasets)? This is something I expected to work but ran into an issue with, in https://github.com/deepyaman/partitioned-dataset-demo/tree/dd2d05f14fac0d2ff7fb4a949e8aac062dc70431:

```
(kedro) deepyaman@deepyaman-mac new-kedro-project % kedro run
[05/31/24 13:48:45] INFO  Kedro project new-kedro-project  session.py:324
[05/31/24 13:48:46] INFO  Using synchronous mode for loading and saving data. Use the --async flag for potential performance gains.
                          https://docs.kedro.org/en/stable/nodes_and_pipelines/run_a_pipeline.html#load-and-save-asynchronously  sequential_runner.py:64
                    INFO  Loading data from params:n (MemoryDataset)...  data_catalog.py:483
                    INFO  Running node: generate_emails([params:n]) -> [emails]  node.py:361
                    INFO  Saving data to emails (PartitionedDataset)...  data_catalog.py:525
                    INFO  Completed 1 out of 4 tasks  sequential_runner.py:90
                    INFO  Loading data from emails (PartitionedDataset)...  data_catalog.py:483
                    INFO  Running node: capitalize_content([emails]) -> [capitalized_emails]  node.py:361
                    INFO  Saving data to capitalized_emails (PartitionedDataset)...  data_catalog.py:525
                    INFO  Completed 2 out of 4 tasks  sequential_runner.py:90
                    INFO  Loading data from capitalized_emails (PartitionedDataset)...  data_catalog.py:483
                    INFO  Running node: extract_content([capitalized_emails]) -> [contents]  node.py:361
                    INFO  Saving data to contents (MemoryDataset)...  data_catalog.py:525
                    INFO  Saving data to contents (MemoryDataset)...  data_catalog.py:525
                    INFO  Saving data to contents (MemoryDataset)...  data_catalog.py:525
                    INFO  Completed 3 out of 4 tasks  sequential_runner.py:90
                    INFO  Loading data from contents (MemoryDataset)...  data_catalog.py:483
                    INFO  Running node: tokenize([contents]) -> [tokens]  node.py:361
                    INFO  Saving data to tokens (PartitionedDataset)...  data_catalog.py:525
                    INFO  Completed 4 out of 4 tasks  sequential_runner.py:90
                    INFO  Pipeline execution completed successfully.  runner.py:119
```

Would expect
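Judging from the three consecutive saves to contents in the log, extract_content is the generator node here. A hedged reconstruction (not the repo's exact code) that would produce log output like the above:

```python
from collections.abc import Callable, Iterator
from typing import Any


def extract_content(capitalized_emails: dict[str, Callable[[], Any]]) -> Iterator[str]:
    # Generator function: Kedro iterates it and saves each yielded chunk
    # separately, hence the three "Saving data to contents" lines above.
    # With a MemoryDataset, each save overwrites the previous chunk, so
    # the downstream tokenize node only ever sees the last chunk.
    for partition_id, load_func in sorted(capitalized_emails.items()):
        yield load_func()
```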
-
Turning this into a discussion.
-
Description
When you define a node that returns an iterator, the Kedro pipeline iterates over (and exhausts) the iterator.
Context
I am aware that Kedro supports generator functions in nodes, which is good. However, Kedro shouldn't assume what the user wants to do with a generator: it should pass it through as-is by default, rather than implicitly guessing when the user wants the generator executed, and it should provide an explicit argument in pipeline.node for choosing whether to keep the generator as-is or execute it.
In general, people expect the pipeline to behave like a pure function by default; any other behavior should be explicit.
Currently, there is no way to specify that I want to keep the iterator as-is as an output; the iterator is always exhausted when the node runs.
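For illustration, such an explicit opt-out could look like the following. This is purely a hypothetical sketch of the proposed API; no iterate_output argument exists in Kedro today, and make_sampler is an invented name:

```python
from kedro.pipeline import node

# Hypothetical flag: keep the returned iterator as the node's output
# instead of exhausting it (iterate_output=True would restore the
# current chunk-by-chunk behavior). NOT a real Kedro argument.
sampler_node = node(
    make_sampler,
    inputs="params:sampler_config",
    outputs="sampler",
    iterate_output=False,
)
```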
Steps to Reproduce
in nodes.py:
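The original snippet was not preserved; below is a minimal sketch consistent with the Expected/Actual Results that follow. The function names make_iterator and consume are hypothetical:

```python
# nodes.py
from collections.abc import Iterator


def make_iterator() -> Iterator[int]:
    # Return (not yield) an iterator over 1..10; sum(1..10) == 55.
    return iter(range(1, 11))


def consume(numbers: Iterator[int]) -> int:
    # If the iterator arrives here already exhausted, sum() returns 0.
    return sum(numbers)
```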
in pipeline.py:
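Again a hypothetical reconstruction, wiring the two nodes together so that the iterator produced by the first node is consumed by the second:

```python
# pipeline.py
from kedro.pipeline import Pipeline, node

from .nodes import consume, make_iterator


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(make_iterator, inputs=None, outputs="numbers"),
            node(consume, inputs="numbers", outputs="res"),
        ]
    )
```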
Expected Result
res should be 55
Actual Result
res is 0
Kedro version used (pip show kedro or kedro -V): kedro, version 0.18.11
Python version used (python -V): Python 3.10.12