Should we change the output of `session.run`? #3745

noklam · 2022-08-23T21:39:35Z

noklam
Aug 23, 2022
Collaborator

Background

What's the output of session.run()? Currently, this is not clear as you think and it isn't documented anywhere. The logic is defined in runner.py, this can be counter-intuitive in some cases, is there a good reason why we want to do this?

kedro/kedro/runner/runner.py

Lines 78 to 91 in f491420

    
           free_outputs = pipeline.outputs() - set(catalog.list()) 
        
           unregistered_ds = pipeline.data_sets() - set(catalog.list()) 
        
           for ds_name in unregistered_ds: 
        
               catalog.add(ds_name, self.create_default_data_set(ds_name)) 
        
           if self._is_async: 
        
               self._logger.info( 
        
                   "Asynchronous mode is enabled for loading and saving data" 
        
               ) 
        
           self._run(pipeline, catalog, hook_manager, session_id) 
        
           self._logger.info("Pipeline execution completed successfully.") 
        
           return {ds_name: catalog.load(ds_name) for ds_name in free_outputs}

kedro has improved a lot in terms of how to run the pipeline with packaging & KedroSession as a standalone application, #1423 documents different ways to do it. Personally, I think it is still not easy enough to integrate with kedro for someone who is inexperienced with kedro. In #1423, It mentioned how a pipeline can be called programmatically. Even though the pipeline itself is a function call, it doesn't behave like a function, i.e. you can't really define an input as an argument easily (it has to be a Catalog entry), the output of the pipeline is also very restricted.

Motivation

Kedro works really well within the kedro world, but it also mean that kedro works very differently from the rest of the Python world.

This issue mainly focuses on the output side, this will improve the experience to integrate the kedro pipeline as an upstream. In a over-simplified world, this should be straight forward to do. Currently I think we a strong assumption that people work with "Kedro Project", but if we are moving towards a kedro package, i.e. using from kedro_package import main, it should behave just like a Python function, I think this is a reasonable expectation.

1. df = get_some_data()
2. model = my_kedro_pipeline(input={'my_pipeline_input_df': df})
3. app = PredictionWebService(model)

Questions

What should be return with session.run?

Things to consider

How can any Python developer integrate with the kedro pipeline easily? Can it behave just like a function?
In an interactive workflow, it may make sense to keep all intermediate output in the resulting dict
Is there a known reason why the output is defined as it is?

Related Issue:

It would enable a better interactive workflow Improve DataCatalog and ConfigLoader with autocompletion and meaningful representation when it get printed #1721
Improve kedro run as a package #1423 Is trying to improve/simplifies how we run kedro pipeline as a standalone package
How to distribute and extend kedro pipelines #795 Discussion of how to use kedro as upstream/downstream

Workflow of debugging Kedro pipeline in notebook #1832

antonymilne · 2022-08-23T22:54:16Z

antonymilne
Aug 23, 2022

This is a very interesting question. I think it's right to focus just on the output side here so I'll save my comments on input for another time 🙂

I think we'd need @idanov or maybe even @tsanikgr to explain exactly why session.run returns what it does. AFAIK it's always been this way. Intuitively it kind of feels like the right thing to me, since those are the "unprocessed" datasets which you might want to work with further. All intermediate datasets have already been consumed by the pipeline and so shouldn't be required further downstream. If you really want to make them available then you could make a mock identify node that copies them to a free output. Returning all intermediate datasets feels like too much to me.

The reason I only say kind of above is that it seems more questionable to me that we only return those outputs that are MemoryDataSet. I think there's an argument that we should return all unconsumed outputs, even if they have been persisted to disk. i.e. we could have free_output = pipeline.outputs()

Also, technically it looks to me like the code that finds free_outputs is not quite right. If I define something explicitly as a MemoryDataSet in my catalog (unusual, but not unheard of, e.g. to change the copy_mode) then it won't count as a free_output when probably it should do. It's an edge case, but worth mentioning since we're discussing it here. What free_output means in the code is just "output that's not defined in the catalog", which is a subset of "output that's not a MemoryDataSet".

0 replies

noklam · 2022-10-03T15:15:53Z

noklam
Oct 3, 2022
Collaborator Author

Add this related SO Question - How to run a kedro pipeline interactively like a fuction - this issues only focus on the output of a pipeline, what about input? I think this will be the next question.

0 replies

noklam · 2022-10-04T10:06:41Z

noklam
Oct 4, 2022
Collaborator Author

Notes for Tech Design

The reason I only say kind of above is that it seems more questionable to me that we only return those outputs that are MemoryDataSet. I think there's an argument that we should return all unconsumed outputs, even if they have been persisted to disk. i.e. we could have free_output = pipeline.outputs()

Less controversial - Change the default - the definition of free_output is a bit buggy, we should change it.

The reason I only say kind of above is that it seems more questionable to me that we only return those outputs that are MemoryDataSet. I think there's an argument that we should return all unconsumed outputs, even if they have been persisted to disk. i.e. we could have free_output = pipeline.outputs()

Open up an optional argument for session.run to return any targeted datasets - even if it's an intermediate dataset or persisted dataset - this is more useful for interactive workflow (i.e. notebook) or debugging purpose. Currently it's tricky to make it work. This one is highly related to Workflow of debugging Kedro pipeline in notebook #1832
2.1. If it's an intermediate Memory dataset - you can't really get it.
2.2. If it's an intermediate persisted dataset - you need to first session.run and then do catalog.load

0 replies

merelcht · 2022-10-05T14:05:00Z

merelcht
Oct 5, 2022
Maintainer

Notes from Technical Design session:

There was agreement that the "free outputs" output from session isn't very clear. It was suggested to simply return all output from nodes that is not consumed, even if it's defined in the catalog. However, this could lead to very large amounts of data being returned. Instead we'll change it to return all free outputs and additionally any MemoryDataSets that are defined in the catalog.

The second point about adding an optional argument for session.run() to return any targeted datasets was discussed briefly, but it was decided to talk about it more thoroughly in a separate workstream about node debugging.

Change "free outputs" to also return MemoryDataSet entries from the catalog #1900

0 replies

noklam · 2022-10-05T15:16:50Z

noklam
Oct 5, 2022
Collaborator Author

Supplement on the above comments to address @AntonyMilneQB question:

i.e. we could have free_output = pipeline.outputs()

The answer to that is there is a catalog.load call at the end, it's an expensive call and potentially memory hungry. So persisted datasets are deleted from memory as long as they are not needed. For MemoryDataSet, it's loaded in memory already, so there is no harm to return it.

0 replies

noklam · 2022-10-06T13:55:07Z

noklam
Oct 6, 2022
Collaborator Author

I just give it a go to see what would it takes to make the initial idea works, partly because I want to test how the nbdev system works. See DebugRunner

https://noklam.github.io/kedro-debug-runner/core.html

0 replies

noklam · 2023-03-23T15:01:31Z

noklam
Mar 23, 2023
Collaborator Author

Adding this as inspiration on whether we should have some kind of argument or debug mode that can specifically return output easily without editing configuration.

At the moment, the proper way to inspect is

For "free memory data" - it will return by session.run
"intermediate memory data" - it will be deleted as soon as it not needed (Not possible to be returned by a session, user need to do a session.run which make the "targeted dataset" as "free output"
For "persisted dataset" - user need to do `catalog.load("dataset_name")

The complication is mainly due to the kedro run need to be efficient and thus some data is deleted on the fly to reduce memory footprint.

The question is how can we improve the user experience? It's hard to reason what is "free output" and what is not.
I would also question that there are significant users working with moderate size of data, keeping everything in memory isn't a problem and make the development experience smoother. kedro-org/kedro-plugins#44
Is there a way to let users do what they want without touching any configuration?

Minor improvements in the IPython and Jupyter Notebook workflows #1075 (comment)

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should we change the output of `session.run`? #3745

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 7 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Should we change the output of session.run? #3745

noklam Aug 23, 2022 Collaborator

Background

Motivation

Questions

Things to consider

Related Issue:

Replies: 7 comments

antonymilne Aug 23, 2022

noklam Oct 3, 2022 Collaborator Author

noklam Oct 4, 2022 Collaborator Author

merelcht Oct 5, 2022 Maintainer

noklam Oct 5, 2022 Collaborator Author

noklam Oct 6, 2022 Collaborator Author

noklam Mar 23, 2023 Collaborator Author

Should we change the output of `session.run`? #3745

noklam
Aug 23, 2022
Collaborator

antonymilne
Aug 23, 2022

noklam
Oct 3, 2022
Collaborator Author

noklam
Oct 4, 2022
Collaborator Author

merelcht
Oct 5, 2022
Maintainer

noklam
Oct 5, 2022
Collaborator Author

noklam
Oct 6, 2022
Collaborator Author

noklam
Mar 23, 2023
Collaborator Author