Regarding the parameter settings of the MetaDrive environment #291
Hello, thank you very much for your support and recognition! Regarding the hyperparameters for training with MetaDrive: the current configuration metadrive_sampled_efficientzero_config.py can converge to a reasonable return (~250) at around 500K environment steps. However, please note that we have not yet conducted comprehensive tests across a wide range of MetaDrive environments, so we recommend first running preliminary tests with the default configuration and observing the training performance. Based on those results, you can make targeted adjustments to the hyperparameters to accelerate convergence and improve training performance (an illustrative sketch of the typical fields is included at the end of this reply).
It is also recommended that you analyze the TensorBoard logs during training. Wishing you success with your training! If you have any further questions, feel free to reach out anytime.
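For concreteness, here is a minimal, purely illustrative sketch of the kind of fields one would typically tune in a LightZero-style config. All key names and values below are assumptions for illustration only; the authoritative names and defaults are those in metadrive_sampled_efficientzero_config.py.

```python
# Illustrative only: these keys mirror the usual LightZero config layout, but the
# real values live in metadrive_sampled_efficientzero_config.py.
metadrive_sampled_efficientzero_config = dict(
    env=dict(
        collector_env_num=8,     # more parallel envs -> faster data collection (CPU-bound)
        evaluator_env_num=3,
        n_evaluator_episode=3,
    ),
    policy=dict(
        batch_size=256,          # larger batches need more GPU memory
        num_simulations=50,      # MCTS simulations per step; higher is slower but usually stronger
        update_per_collect=200,  # training updates performed after each collection phase
        learning_rate=3e-3,
        discount_factor=0.99,
    ),
)
```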
Thank you very much for your response; it has been very helpful. I have another question about multi-GPU training. As I understand it, multi-GPU mainly means data parallelism, so if a single GPU's memory is large enough, the benefit of using multiple GPUs might not be significant? I mainly want to confirm whether using multiple GPUs would speed up data collection in the environment (as far as I understand, it should not?). Thanks again!
Hello! Thank you for your question. Regarding the use of multiple GPUs, your understanding is generally correct. Multi-GPU setups are mainly used for data parallelism: the same model is replicated across GPUs, and each GPU processes a different data batch in parallel. This speeds up model training, especially with large datasets, or when a single GPU's memory cannot accommodate a large model or the required batch size; in such cases, multiple GPUs effectively distribute the load. However, if a single GPU's memory is already sufficient, using multiple GPUs may not yield a significant speedup in every situation.
Additionally, regarding data collection in the environment, using multiple GPUs generally will not significantly accelerate it. Data collection typically depends more on the CPU, I/O, and the response speed of the environment itself (for example, in reinforcement learning, collection is bounded by the environment's step speed and the agent's interaction rate). GPUs mainly accelerate computationally intensive tasks such as forward inference and backpropagation, rather than participating directly in data collection. You can refer to our example to test how distributed data parallelism (DDP) speeds up the training process; in our tests, the speedup is almost linear in the number of GPUs. See #223 for specific usage instructions (a generic DDP sketch is also included below). I hope this answer is helpful to you! If you have further questions, feel free to continue the discussion. (Partially modified from GPT-4o-latest's response :)
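To illustrate the data-parallelism point, below is a minimal, self-contained PyTorch DDP sketch. It is not LightZero's actual entry point (see #223 for that); it only shows how each GPU trains on a disjoint shard of every batch and how gradients are synchronized, which is why training throughput scales nearly linearly with the number of GPUs while environment stepping is unaffected.

```python
# Minimal PyTorch DDP sketch (illustrative only).
# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_sketch.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets LOCAL_RANK / WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and dataset; a real setup would use the policy network and replay data.
    model = torch.nn.Linear(64, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(4096, 64), torch.randn(4096, 1))
    # DistributedSampler gives each GPU a disjoint shard of the data:
    # this is the data parallelism that yields the near-linear training speedup.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=256, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # gradients are all-reduced across GPUs here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```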
Oh, I understand now, thank you very much!
Hi,
Thank you very much to the author for creating such a convenient codebase.
I have a question about MetaDrive that I would like to ask, as I am new to it. I am hoping to get some suggestions for hyperparameter settings, such as collector_env_num and the batch size. I am trying to use MetaDrive to replay some scenarios from nuScenes and then train the agent on them. Do you have any suggestions on parameter settings to speed up convergence?