Distributed support (rework) #996

lrzpellegrini · 2022-04-22T12:33:48Z

This draft contains the implementation for supporting distributed (multi-GPU) training.

The examples folder contains a simple example (to run naive, replay, replay+scheduler) and a bash script to launch it.

This is an alternative to #907 created with the idea to minimize the changes required to the strategies.

…rt_rework

lrzpellegrini · 2022-04-26T10:30:15Z

It seems that unit tests failed due to a temporary error while downloading CIFAR. I triggered a re-run and unit tests are ok!

avalanche/training/plugins/evaluation.py

AntonioCarta · 2022-04-26T15:51:41Z

avalanche/training/templates/supervised.py

        return avalanche_forward(self.model, self.mb_x, self.mb_task_id)

+    @final
    def model_adaptation(self, model=None):


Not sure about this. Local adaptation probably needs all the data.

The reason here is that we want the users to override _model_adaptation instead of model_adaptation. The _model_adaptation is then called from within the with self.use_local_model() context. This is needed to ensure the users are changing the local model instead of the model (which may be the wrapped one).

examples/distributed_training.py

avalanche/training/templates/supervised.py

AntonioCarta · 2022-04-26T15:59:25Z

avalanche/training/templates/supervised.py

    def forward(self):
-        """Compute the model's output given the current mini-batch."""


do we need _forward and forward? I would keep only one of them. Does it make sense to move this method to the DistributedModelStrategy helper?

This is the same as _model_adaptation and model_adaptation. The wrapper is mainly needed to add the context managers.

AntonioCarta · 2022-04-26T15:59:58Z

avalanche/training/templates/supervised.py

        return avalanche_forward(self.model, self.mb_x, self.mb_task_id)

+    @final
    def model_adaptation(self, model=None):


same comments as _forward above.

avalanche/distributed/distributed_batch.py

AntonioCarta · 2022-04-26T16:02:45Z

avalanche/distributed/distributed_batch.py

+    An intermediate abstract class in charge of synchronizing data batches.
+
+    This class can handle batches as either tuples of elements (as usual) or
+    even single values.


I don't understand this. Why do we need a single class to deal with both tuples and single values? It seems easier to have different classes each with their own merge and sync methods.

A class that can handle both situations may be useful for unsupervised strategies that may have a batch made of the X value only. In that case, the batch is not a tuple and things need to be managed in a slightly different way. Switching classes depending on that may be more complicated than doing this...

avalanche/distributed/distributed_tensor.py

AntonioCarta · 2022-04-26T16:17:53Z

I left some general comments about the API and some things I didn't understand about the implementation. Overall, I think it's now much easier to add distributed support and integrate it. I think the API is a bit more verbose than it needs to be, especially due to the large number of methods that we need to add to a new distributed value. Personally, I would have chosen a more minimal set of methods but the solution as is is already quite good.

One thing that is missing, and it's quite important, is a test that checks which plugins actually work with the distributed training, which may not be obvious right now.

lrzpellegrini · 2022-04-27T15:55:42Z

Yes, the current issue is to check which plugins may not work. It seems that replay and the scheduler plugin are currently working as expected, but there are many more to test. I'm working on this!

…rt_rework

lrzpellegrini · 2022-04-29T14:29:33Z

@AntonioCarta I fixed the naming of the methods you mentioned and added the use_local strategy method. I also added, in the example, an in-code description of the changes to be made in the main script to support distributed training. I'll add more docstrings and fix the remaining things ASAP.

We still have to finalize a choice regarding the _forward, _model_adaptation, ... methods.

coveralls · 2022-04-29T14:39:22Z

Pull Request Test Coverage Report for Build 2699048914

508 of 966 (52.59%) changed or added relevant lines in 26 files are covered.
2 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-1.07%) to 71.195%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
avalanche/benchmarks/utils/data_loader.py	5	6	83.33%
avalanche/distributed/distributed_commons.py	7	8	87.5%
avalanche/logging/base_logger.py	2	3	66.67%
avalanche/training/plugins/evaluation.py	9	10	90.0%
avalanche/training/supervised/deep_slda.py	10	11	90.91%
avalanche/training/templates/supervised.py	19	20	95.0%
avalanche/training/supervised/strategy_wrappers.py	1	3	33.33%
avalanche/distributed/strategies/distributed_loss_strategy.py	22	26	84.62%
avalanche/distributed/strategies/distributed_model_strategy.py	20	24	83.33%
avalanche/distributed/strategies/distributed_strategy_support.py	8	12	66.67%

Files with Coverage Reduction	New Missed Lines	%
avalanche/benchmarks/scenarios/classification_scenario.py	2	86.08%

Totals
Change from base Build 2696355092:	-1.07%
Covered Lines:	13238
Relevant Lines:	18594

💛 - Coveralls

AntonioCarta · 2022-04-29T16:04:33Z

Thanks!

We still have to finalize a choice regarding the _forward, _model_adaptation, ... methods.

Maybe we could have a forward and a _forward_impl, with the idea that users call forward but override _forward_impl? Should fine if documented.

…rt_rework

…ed_support_rework

…Additional tests. Detection WIP.

ContinualAI-bot · 2022-11-22T16:06:54Z

Oh no! It seems there are some PEP8 errors! 😕
Don't worry, you can fix them! 💪
Here's a report about the errors and where you can find them:

avalanche/benchmarks/utils/collate_functions.py:131:81: E501 line too long (86 > 80 characters)
avalanche/benchmarks/utils/collate_functions.py:180:81: E501 line too long (81 > 80 characters)
avalanche/benchmarks/utils/collate_functions.py:183:81: E501 line too long (82 > 80 characters)
avalanche/benchmarks/utils/collate_functions.py:201:81: E501 line too long (84 > 80 characters)
avalanche/benchmarks/utils/collate_functions.py:220:81: E501 line too long (105 > 80 characters)
avalanche/benchmarks/utils/data_attribute.py:39:81: E501 line too long (94 > 80 characters)
avalanche/benchmarks/utils/detection_dataset.py:84:81: E501 line too long (94 > 80 characters)
avalanche/benchmarks/utils/detection_dataset.py:87:81: E501 line too long (94 > 80 characters)
avalanche/benchmarks/utils/detection_dataset.py:147:81: E501 line too long (82 > 80 characters)
avalanche/benchmarks/utils/detection_dataset.py:369:81: E501 line too long (84 > 80 characters)
avalanche/benchmarks/utils/detection_dataset.py:489:81: E501 line too long (85 > 80 characters)
avalanche/benchmarks/utils/flat_data.py:75:81: E501 line too long (83 > 80 characters)
avalanche/benchmarks/utils/flat_data.py:246:81: E501 line too long (84 > 80 characters)
avalanche/benchmarks/scenarios/classification_scenario.py:34:81: E501 line too long (84 > 80 characters)
avalanche/benchmarks/scenarios/detection_scenario.py:109:81: E501 line too long (84 > 80 characters)
avalanche/benchmarks/scenarios/detection_scenario.py:112:81: E501 line too long (83 > 80 characters)
avalanche/benchmarks/scenarios/lazy_dataset_sequence.py:15:81: E501 line too long (84 > 80 characters)
avalanche/training/templates/problem_type/supervised_problem.py:3:81: E501 line too long (113 > 80 characters)
avalanche/training/templates/problem_type/supervised_problem.py:26:81: E501 line too long (90 > 80 characters)
avalanche/training/templates/problem_type/supervised_problem.py:27:81: E501 line too long (88 > 80 characters)
avalanche/training/templates/observation_type/online_observation.py:69:81: E501 line too long (84 > 80 characters)
avalanche/distributed/distributed_batch.py:96:81: E501 line too long (84 > 80 characters)
avalanche/distributed/distributed_batch.py:128:81: E501 line too long (85 > 80 characters)
avalanche/distributed/distributed_helper.py:14:81: E501 line too long (82 > 80 characters)
tests/run_dist_tests.py:58:81: E501 line too long (84 > 80 characters)
tests/distributed/test_distributed_batch.py:79:81: E501 line too long (85 > 80 characters)
tests/distributed/test_distributed_batch.py:104:81: E501 line too long (81 > 80 characters)
tests/distributed/test_distributed_helper.py:10:81: E501 line too long (93 > 80 characters)
tests/distributed/test_distributed_helper.py:28:81: E501 line too long (91 > 80 characters)
tests/distributed/test_distributed_helper.py:29:81: E501 line too long (83 > 80 characters)
tests/distributed/test_distributed_helper.py:32:81: E501 line too long (95 > 80 characters)
tests/distributed/test_distributed_helper.py:38:81: E501 line too long (101 > 80 characters)
tests/distributed/test_distributed_helper.py:41:81: E501 line too long (82 > 80 characters)
tests/distributed/test_distributed_helper.py:43:81: E501 line too long (95 > 80 characters)
tests/distributed/test_distributed_helper.py:59:81: E501 line too long (95 > 80 characters)
tests/distributed/test_distributed_helper.py:71:81: E501 line too long (95 > 80 characters)
36      E501 line too long (86 > 80 characters)

…efault loggers creation. Added distributed training integration unit tests.

ContinualAI-bot · 2022-11-23T15:54:38Z

Oh no! It seems there are some PEP8 errors! 😕
Don't worry, you can fix them! 💪
Here's a report about the errors and where you can find them:

avalanche/benchmarks/utils/collate_functions.py:131:81: E501 line too long (86 > 80 characters)
avalanche/benchmarks/utils/collate_functions.py:180:81: E501 line too long (81 > 80 characters)
avalanche/benchmarks/utils/collate_functions.py:183:81: E501 line too long (82 > 80 characters)
avalanche/benchmarks/utils/collate_functions.py:201:81: E501 line too long (84 > 80 characters)
avalanche/benchmarks/utils/collate_functions.py:220:81: E501 line too long (105 > 80 characters)
avalanche/benchmarks/utils/data_attribute.py:39:81: E501 line too long (94 > 80 characters)
avalanche/benchmarks/utils/detection_dataset.py:84:81: E501 line too long (94 > 80 characters)
avalanche/benchmarks/utils/detection_dataset.py:87:81: E501 line too long (94 > 80 characters)
avalanche/benchmarks/utils/detection_dataset.py:147:81: E501 line too long (82 > 80 characters)
avalanche/benchmarks/utils/detection_dataset.py:369:81: E501 line too long (84 > 80 characters)
avalanche/benchmarks/utils/detection_dataset.py:489:81: E501 line too long (85 > 80 characters)
avalanche/benchmarks/utils/flat_data.py:75:81: E501 line too long (83 > 80 characters)
avalanche/benchmarks/utils/flat_data.py:246:81: E501 line too long (84 > 80 characters)
avalanche/benchmarks/scenarios/classification_scenario.py:34:81: E501 line too long (84 > 80 characters)
avalanche/benchmarks/scenarios/detection_scenario.py:109:81: E501 line too long (84 > 80 characters)
avalanche/benchmarks/scenarios/detection_scenario.py:112:81: E501 line too long (83 > 80 characters)
avalanche/benchmarks/scenarios/lazy_dataset_sequence.py:15:81: E501 line too long (84 > 80 characters)
avalanche/training/templates/base_sgd.py:53:81: E501 line too long (95 > 80 characters)
avalanche/training/templates/base_sgd.py:312:5: E303 too many blank lines (2)
avalanche/training/templates/base_sgd.py:338:81: E501 line too long (83 > 80 characters)
avalanche/training/templates/base_sgd.py:350:81: E501 line too long (90 > 80 characters)
avalanche/training/templates/base_sgd.py:377:81: E501 line too long (81 > 80 characters)
avalanche/training/templates/base_sgd.py:417:81: E501 line too long (81 > 80 characters)
avalanche/training/templates/problem_type/supervised_problem.py:3:81: E501 line too long (113 > 80 characters)
avalanche/training/templates/problem_type/supervised_problem.py:26:81: E501 line too long (90 > 80 characters)
avalanche/training/templates/problem_type/supervised_problem.py:27:81: E501 line too long (88 > 80 characters)
avalanche/training/templates/observation_type/online_observation.py:69:81: E501 line too long (84 > 80 characters)
avalanche/training/supervised/naive_object_detection.py:143:81: E501 line too long (81 > 80 characters)
avalanche/training/supervised/naive_object_detection.py:173:9: E125 continuation line with same indent as next logical line
avalanche/training/supervised/naive_object_detection.py:182:81: E501 line too long (81 > 80 characters)
avalanche/distributed/distributed_batch.py:96:81: E501 line too long (84 > 80 characters)
avalanche/distributed/distributed_batch.py:128:81: E501 line too long (85 > 80 characters)
avalanche/distributed/distributed_helper.py:14:81: E501 line too long (82 > 80 characters)
avalanche/distributed/strategies/distributed_mbatch_strategy.py:5:81: E501 line too long (111 > 80 characters)
avalanche/distributed/strategies/distributed_mbatch_strategy.py:171:81: E501 line too long (91 > 80 characters)
tests/run_dist_tests.py:46:81: E501 line too long (83 > 80 characters)
tests/run_dist_tests.py:54:81: E501 line too long (82 > 80 characters)
tests/run_dist_tests.py:71:81: E501 line too long (84 > 80 characters)
tests/distributed/distributed_test_utils.py:10:81: E501 line too long (82 > 80 characters)
tests/distributed/test_distributed_batch.py:10:81: E501 line too long (110 > 80 characters)
tests/distributed/test_distributed_batch.py:69:81: E501 line too long (85 > 80 characters)
tests/distributed/test_distributed_batch.py:94:81: E501 line too long (81 > 80 characters)
tests/distributed/test_distributed_helper.py:9:81: E501 line too long (93 > 80 characters)
tests/distributed/test_distributed_helper.py:12:81: E501 line too long (110 > 80 characters)
tests/distributed/test_distributed_helper.py:26:81: E501 line too long (101 > 80 characters)
tests/distributed/test_distributed_helper.py:29:81: E501 line too long (82 > 80 characters)
tests/distributed/test_distributed_model.py:7:81: E501 line too long (110 > 80 characters)
tests/distributed/test_distributed_strategy_support.py:11:81: E501 line too long (83 > 80 characters)
tests/distributed/test_distributed_strategy_support.py:15:81: E501 line too long (110 > 80 characters)
tests/distributed/test_distributed_strategy_support.py:92:81: E501 line too long (108 > 80 characters)
tests/distributed/test_distributed_strategy_support.py:111:81: E501 line too long (85 > 80 characters)
tests/distributed/test_distributed_strategy_support.py:115:81: E501 line too long (81 > 80 characters)
tests/distributed/test_distributed_strategy_support.py:123:81: E501 line too long (116 > 80 characters)
tests/distributed/test_distributed_strategy_support.py:129:81: E501 line too long (90 > 80 characters)
tests/distributed/test_distributed_strategy_support.py:138:81: E501 line too long (81 > 80 characters)
tests/distributed/test_distributed_strategy_support.py:170:81: E501 line too long (100 > 80 characters)
tests/distributed/test_distributed_strategy_support.py:178:81: E501 line too long (100 > 80 characters)
tests/distributed/test_distributed_tensor.py:8:81: E501 line too long (110 > 80 characters)
1       E125 continuation line with same indent as next logical line
1       E303 too many blank lines (2)
56      E501 line too long (86 > 80 characters)

lrzpellegrini · 2022-11-28T11:13:21Z

@AntonioCarta The PR is almost completed and I'd like to add more unit tests to be sure everything works fine.

The only real part that I still need to figure out is the mb_output collate. Currently, I slightly reworked the whole collate things to have an abstract class that contains:

The collate_fn (the usual function passed to DataLoader)
A collate for single values (that can be used to collate only x or y o t)
A collate that merges batches (that were already collated) [needed for distributed training support]
A collate that merges single value batches (that were already collated) [will be needed in the future to make metrics/logging in distributed training more efficient]

The good part is that we can use instances of this class to populate the collate_fn field of AvalancheDatasets because they have a call method that calls the collate_fn. For the moment, I have implemented the Collate class for classification and detection. However, this works for the input batch.

For distributed support, we also need to implement the same thing, but on the outputs. Synchronizing the output across processes is important to have the plugins and metrics aligned. I already took care of the loss by averaging it across processes, but mb_output needs a proper collate. The problem is that it has to be different from the input one (as the format may be different for inputs and outputs). We can't leverage the collate_fn from the experience dataset unless we enrich the Collate class to also contain the collate methods needed to manage the output (this is one possible solution). A different solution is to have the strategy itself define the collate for the output.

Do you any suggestion on this?

AntonioCarta · 2022-11-28T12:39:28Z

I need to think about this. Unfortunately the output collate cannot be part of the dataset since it depends on the model's definition and its output.

At this point, I'm wondering whether it's just easier to enforce a more structured and general mini-batch definition. Something like tensordict. It would provide a single and well defined method to collate values (also for scalar and tensor metrics).

AndreaCossu · 2022-11-30T08:47:39Z

I agree that TensorDict (the official torch API for that is still experimental) or simply dictionaries are a more flexible choice to manipulate the minibatch information (x, y, task labels etc). However, even with TensorDict you need to know the semantics of the tensors to know how to collate them (how to collate multiple x together, multiple y together etc).

I think if we use dictionaries and 1. a collate for single values + 2. a collate that merges single value batches (often the same as 1.) we could recover all the other collate operations.
Creating the minibatch in the dataloader amounts to use 1 + dictionary. Stacking together multiple mini-batches amounts to use 2 + dictionary.

AntonioCarta · 2022-11-30T09:20:10Z

I was definitely underestimating the complexity of possible collate functions.

I think if we use dictionaries and 1. a collate for single values + 2. a collate that merges single value batches (often the same as 1.) we could recover all the other collate operations.

There is also an additional problem. In general, it may not be true that the single collate is the same for all the values in the minibatch.

ContinualAI-bot · 2022-12-11T14:35:56Z

Oh no! It seems there are some PEP8 errors! 😕
Don't worry, you can fix them! 💪
Here's a report about the errors and where you can find them:

tests/test_avalanche_classification_dataset.py:1716:27: E741 ambiguous variable name 'l'
1       E741 ambiguous variable name 'l'

ContinualAI-bot · 2023-01-10T15:21:36Z

Oh no! It seems there are some PEP8 errors! 😕
Don't worry, you can fix them! 💪
Here's a report about the errors and where you can find them:

avalanche/distributed/distributed_helper.py:285:81: E501 line too long (109 > 80 characters)
avalanche/distributed/distributed_helper.py:286:81: E501 line too long (114 > 80 characters)
avalanche/distributed/distributed_helper.py:287:81: E501 line too long (89 > 80 characters)
tests/test_avalanche_classification_dataset.py:1716:27: E741 ambiguous variable name 'l'
tests/distributed/test_distributed_helper.py:15:81: E501 line too long (82 > 80 characters)
tests/distributed/test_distributed_helper.py:16:81: E501 line too long (96 > 80 characters)
tests/distributed/test_distributed_helper.py:26:81: E501 line too long (95 > 80 characters)
tests/distributed/test_distributed_helper.py:165:81: E501 line too long (81 > 80 characters)
tests/distributed/test_distributed_helper.py:174:81: E501 line too long (97 > 80 characters)
tests/distributed/test_distributed_helper.py:188:81: E501 line too long (81 > 80 characters)
tests/distributed/test_distributed_helper.py:191:81: E501 line too long (96 > 80 characters)
tests/distributed/test_distributed_helper.py:194:81: E501 line too long (91 > 80 characters)
tests/distributed/test_distributed_helper.py:198:81: E501 line too long (98 > 80 characters)
tests/distributed/test_distributed_helper.py:203:81: E501 line too long (100 > 80 characters)
tests/distributed/test_distributed_helper.py:204:81: E501 line too long (104 > 80 characters)
tests/distributed/test_distributed_helper.py:205:81: E501 line too long (136 > 80 characters)
tests/distributed/test_distributed_helper.py:213:81: E501 line too long (83 > 80 characters)
tests/distributed/test_distributed_helper.py:219:81: E501 line too long (88 > 80 characters)
tests/distributed/test_distributed_helper.py:223:81: E501 line too long (106 > 80 characters)
tests/distributed/test_distributed_helper.py:232:81: E501 line too long (110 > 80 characters)
tests/distributed/test_distributed_helper.py:241:81: E501 line too long (117 > 80 characters)
tests/distributed/test_distributed_helper.py:246:81: E501 line too long (97 > 80 characters)
tests/distributed/test_distributed_helper.py:251:81: E501 line too long (99 > 80 characters)
tests/distributed/test_distributed_helper.py:252:81: E501 line too long (104 > 80 characters)
tests/distributed/test_distributed_helper.py:253:81: E501 line too long (136 > 80 characters)
tests/distributed/test_distributed_helper.py:260:81: E501 line too long (88 > 80 characters)
tests/distributed/test_distributed_helper.py:266:81: E501 line too long (97 > 80 characters)
tests/distributed/test_distributed_helper.py:271:81: E501 line too long (99 > 80 characters)
tests/distributed/test_distributed_helper.py:272:81: E501 line too long (104 > 80 characters)
tests/distributed/test_distributed_helper.py:273:81: E501 line too long (136 > 80 characters)
tests/distributed/test_distributed_helper.py:280:81: E501 line too long (106 > 80 characters)
tests/distributed/test_distributed_helper.py:291:81: E501 line too long (112 > 80 characters)
tests/distributed/test_distributed_helper.py:292:81: E501 line too long (104 > 80 characters)
tests/distributed/test_distributed_helper.py:293:81: E501 line too long (136 > 80 characters)
tests/distributed/test_distributed_helper.py:300:81: E501 line too long (83 > 80 characters)
tests/distributed/test_distributed_helper.py:389:13: E265 block comment should start with '# '
tests/distributed/test_distributed_helper.py:391:17: E265 block comment should start with '# '
tests/distributed/test_distributed_helper.py:394:21: E265 block comment should start with '# '
tests/distributed/test_distributed_helper.py:403:17: E265 block comment should start with '# '
tests/distributed/test_distributed_helper.py:408:21: E265 block comment should start with '# '
tests/distributed/test_distributed_helper.py:417:17: E265 block comment should start with '# '
tests/distributed/test_distributed_helper.py:421:81: E501 line too long (87 > 80 characters)
6       E265 block comment should start with '# '
35      E501 line too long (109 > 80 characters)
1       E741 ambiguous variable name 'l'

botcs · 2023-01-18T01:19:34Z

Hi,

Thanks for the amazing job!
Any update on this?

lrzpellegrini · 2023-03-07T14:49:03Z

I'm moving the development status to #1315. This PR, in its current state, is very difficult to merge in a single step as it is too big and based on a too old codebase.

lrzpellegrini added 7 commits April 8, 2022 15:27

Merge remote-tracking branch 'upstream/master'

6bd297b

Reworking distributed support (WIP).

55a5480

Working strategy composition and example (naive, replay, scheduler).

b0ce2e3

Fixed pep8 issues.

976e5c5

Fixed typing error. Removed debug code.

efb7f86

Merge remote-tracking branch 'upstream/master' into distributed_suppo…

e13f067

…rt_rework

Removed debug prints.

3017aeb

lrzpellegrini requested a review from AntonioCarta April 26, 2022 10:30

AntonioCarta reviewed Apr 26, 2022

View reviewed changes

AntonioCarta mentioned this pull request Apr 27, 2022

How to run programs on multiple GPUs？ #1003

Open

lrzpellegrini added 4 commits April 29, 2022 14:56

Implemented lazy creation of the default logger.

f8882d7

[Distributed] Simplified internal API and example. Added in-code guide.

8571b91

Added support for general use_local in strategies.

b752568

Merge remote-tracking branch 'upstream/master' into distributed_suppo…

f5eaf96

…rt_rework

AntonioCarta mentioned this pull request May 27, 2022

Multiple GPU training #1032

Closed

AntonioCarta mentioned this pull request Jul 4, 2022

Support distributed training with torch.nn.DataParallel() #850

Open

lrzpellegrini mentioned this pull request Jul 15, 2022

Implement sampler-based dataloading logic #1095

Merged

lrzpellegrini added 4 commits July 19, 2022 17:13

Merge remote-tracking branch 'upstream/master' into distributed_suppo…

b13cc9b

…rt_rework

Add type hints to _make_data_loader. Fix distributed training example.

d1b9d28

Partial merge remote-tracking branch 'upstream/master' into distribut…

f104a0e

…ed_support_rework

Integrated distributed training with RNGManager, new collate system. …

88f75a9

…Additional tests. Detection WIP.

Improved management of dataloader arguments in strategies. Improved d…

1717b8d

…efault loggers creation. Added distributed training integration unit tests.

lrzpellegrini added 2 commits November 23, 2022 17:20

Improved distributed strategy unit tests. Fixed PEP8 issues.

da5c58c

Aligned environment update action content.

cdcd8c4

Fix multitask issues. Improve distributed training support and tests.

2a93ad8

Added additional unit tests. Issue with all_gather to be fixed.

1174f33

Tests for DistributedHelper. Distributed support field in plugins.

6a3dd1f

lrzpellegrini mentioned this pull request Mar 7, 2023

Port distributed training support from existing PR #1315

Open

lrzpellegrini mentioned this pull request May 10, 2023

Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Distributed support (rework) #996

Distributed support (rework) #996

lrzpellegrini commented Apr 22, 2022

lrzpellegrini commented Apr 26, 2022

AntonioCarta Apr 26, 2022

lrzpellegrini Apr 27, 2022 •

edited

Loading

AntonioCarta Apr 26, 2022

lrzpellegrini Apr 27, 2022

AntonioCarta Apr 26, 2022

AntonioCarta Apr 26, 2022

lrzpellegrini Apr 27, 2022

AntonioCarta commented Apr 26, 2022

lrzpellegrini commented Apr 27, 2022

lrzpellegrini commented Apr 29, 2022

coveralls commented Apr 29, 2022 •

edited

Loading

AntonioCarta commented Apr 29, 2022

ContinualAI-bot commented Nov 22, 2022

ContinualAI-bot commented Nov 23, 2022

lrzpellegrini commented Nov 28, 2022

AntonioCarta commented Nov 28, 2022

AndreaCossu commented Nov 30, 2022

AntonioCarta commented Nov 30, 2022

ContinualAI-bot commented Dec 11, 2022

ContinualAI-bot commented Jan 10, 2023

botcs commented Jan 18, 2023

lrzpellegrini commented Mar 7, 2023

		def forward(self):
		"""Compute the model's output given the current mini-batch."""

Distributed support (rework) #996

Are you sure you want to change the base?

Distributed support (rework) #996

Conversation

lrzpellegrini commented Apr 22, 2022

lrzpellegrini commented Apr 26, 2022

AntonioCarta Apr 26, 2022

Choose a reason for hiding this comment

lrzpellegrini Apr 27, 2022 • edited Loading

Choose a reason for hiding this comment

AntonioCarta Apr 26, 2022

Choose a reason for hiding this comment

lrzpellegrini Apr 27, 2022

Choose a reason for hiding this comment

AntonioCarta Apr 26, 2022

Choose a reason for hiding this comment

AntonioCarta Apr 26, 2022

Choose a reason for hiding this comment

lrzpellegrini Apr 27, 2022

Choose a reason for hiding this comment

AntonioCarta commented Apr 26, 2022

lrzpellegrini commented Apr 27, 2022

lrzpellegrini commented Apr 29, 2022

coveralls commented Apr 29, 2022 • edited Loading

Pull Request Test Coverage Report for Build 2699048914

💛 - Coveralls

AntonioCarta commented Apr 29, 2022

ContinualAI-bot commented Nov 22, 2022

ContinualAI-bot commented Nov 23, 2022

lrzpellegrini commented Nov 28, 2022

AntonioCarta commented Nov 28, 2022

AndreaCossu commented Nov 30, 2022

AntonioCarta commented Nov 30, 2022

ContinualAI-bot commented Dec 11, 2022

ContinualAI-bot commented Jan 10, 2023

botcs commented Jan 18, 2023

lrzpellegrini commented Mar 7, 2023

lrzpellegrini Apr 27, 2022 •

edited

Loading

coveralls commented Apr 29, 2022 •

edited

Loading