
Adding support to PPO for continuous action spaces #103

Open
wants to merge 6 commits into main

Conversation

@kuds commented Sep 10, 2024

#77

Design

Created a new class called ContinuousProximalPolicyOptimization to handle continuous action spaces for PPO. Updated the README with a new tutorial, added new unit tests, and did general comment clean-up.
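For context, the core algorithmic difference for continuous action spaces is that the actor parameterizes a Gaussian policy and the PPO clipped surrogate loss is computed from its log-probabilities rather than from a categorical distribution. A minimal, self-contained sketch of that idea (illustrative names only, not Pearl's actual classes):

```python
# Illustrative sketch only; not Pearl's actual implementation.
import torch
from torch import nn
from torch.distributions import Normal


class GaussianActor(nn.Module):
    """Maps states to a diagonal Gaussian over continuous actions."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64) -> None:
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        # State-independent log standard deviation, a common PPO parameterization.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor) -> Normal:
        mean = self.mean_head(self.body(state))
        return Normal(mean, self.log_std.exp())


def clipped_actor_loss(
    dist: Normal,
    action: torch.Tensor,        # actions taken during rollout
    old_log_prob: torch.Tensor,  # log-probs under the behavior policy
    advantage: torch.Tensor,     # estimated advantages (e.g., from GAE)
    epsilon: float = 0.2,
) -> torch.Tensor:
    """PPO clipped surrogate objective for a continuous (Gaussian) policy."""
    log_prob = dist.log_prob(action).sum(dim=-1)  # sum over action dimensions
    ratio = torch.exp(log_prob - old_log_prob)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```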

@facebook-github-bot added the CLA Signed label on Sep 10, 2024
@kuds changed the title from "Add support to PPO for continuous action spaces" to "Adding support to PPO for continuous action spaces" on Sep 11, 2024
@rodrigodesalvobraz (Contributor)

Thank you! I am currently reviewing the PR.

@rodrigodesalvobraz (Contributor) commented Sep 13, 2024

Thanks, I've reviewed it and have two requests:

  • You duplicated quite a bit of code from ppo.py to continuous_ppo.py. It seems that almost all changes are in _actor_loss. Would you be able to define a base class BasePPO containing most of the common code now in PPO, and subclass it into two subclasses PPO and ContinuousPPO, each of them defining the appropriate _actor_loss? (A rough sketch of this layout follows the list.)
  • Can you replace tuple[Tensor, Normal] with Tuple[Tensor, Normal]? The former does not work with Python 3.8, which we still support in pyproject.toml.
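A minimal sketch of the suggested split, assuming the names from the request above (BasePPO, PPO, ContinuousPPO, _actor_loss); everything else is illustrative rather than Pearl's actual code, and the Tuple annotation shows the Python 3.8-compatible form:

```python
# Illustrative layout only; not Pearl's actual code.
from abc import ABC, abstractmethod
from typing import Tuple

from torch import Tensor
from torch.distributions import Normal


class BasePPO(ABC):
    """Shared PPO machinery (advantage estimation, critic loss, training loop)."""

    @abstractmethod
    def _actor_loss(self, batch) -> Tensor:
        """Each subclass supplies its own clipped surrogate actor loss."""


class PPO(BasePPO):
    """Discrete-action variant: actor loss from a Categorical policy."""

    def _actor_loss(self, batch) -> Tensor:
        raise NotImplementedError  # discrete clipped-surrogate loss goes here


class ContinuousPPO(BasePPO):
    """Continuous-action variant: actor loss from a Normal (Gaussian) policy."""

    def _actor_loss(self, batch) -> Tensor:
        raise NotImplementedError  # continuous clipped-surrogate loss goes here

    def _sample_action(self, state: Tensor) -> Tuple[Tensor, Normal]:
        # typing.Tuple rather than the built-in tuple[...] keeps Python 3.8 support.
        raise NotImplementedError
```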

@kuds (Author) commented Sep 14, 2024

Sure, it makes sense to consolidate the core PPO logic into a base class and have the discrete and continuous versions override where needed (e.g., the _actor_loss method). I will work on incorporating both requests into my pull request and should have those done sometime next week. Thanks for the feedback!

@kuds (Author) commented Sep 30, 2024

Sorry for the delay on this. I have the base class created for PPO with the discrete and continuous versions extending it. I will get my changes checked in by end of this week.

@rodrigodesalvobraz (Contributor)

> Sorry for the delay on this. I have the base class created for PPO with the discrete and continuous versions extending it. I will get my changes checked in by end of this week.

No problem, thanks for the update! Looking forward to it.

@kuds (Author) commented Oct 8, 2024

I appreciate your patience on this! I have finished creating the PPO base class, with the discrete and continuous versions inheriting from it and overriding where needed. Let me know your thoughts and if I can help with any other issues or development priorities!

@rodrigodesalvobraz (Contributor) commented Oct 18, 2024

It's looking good! I was running the Lunar Lander tutorial (nice!) and encountered an error. I am curious if you are seeing the same thing by any chance? Here's the notebook with the error.

@kuds (Author) commented Oct 18, 2024

So the issue you are running into is a known one (Issue #1142) with Gymnasium 0.29.1 and NumPy version 2+. You have to downgrade NumPy to a 1.x version. There is a comment in the Lunar Lander tutorial about this to help during the transition.

[screenshot: the Lunar Lander tutorial's note about downgrading NumPy]

The Farama Foundation just released version 1.0.0 of Gymnasium about a week ago, which should resolve this issue with NumPy 2+, but I have not tried it with Pearl. The library upgrade for Gymnasium and NumPy should probably be its own effort/issue/pull request, if you are ok with that.
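For reference, a small illustrative guard (not part of the tutorial or Pearl) that captures the constraint described above until the Gymnasium/NumPy upgrade is handled separately:

```python
# Illustrative check only, assuming Gymnasium 0.29.1 is installed; that release
# is known to break with NumPy 2.x, so require a 1.x NumPy
# (installed, for example, with: pip install "numpy<2").
import numpy as np

if not np.__version__.startswith("1."):
    raise RuntimeError(
        f"Gymnasium 0.29.1 does not work with NumPy {np.__version__}; "
        "downgrade to a 1.x release."
    )
```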

Let me know if you have any other issues or additional questions!

@kuds (Author) commented Oct 25, 2024

@rodrigodesalvobraz

It looks like the Pearl repo has undergone some significant changes in the last couple of days, particularly around the replay buffer. Would you like me to rework my pull request to handle these latest updates?

@rodrigodesalvobraz (Contributor)

@kuds, that would be greatly appreciated, thank you!
Thank you also for diagnosing that last issue!

…fers and data classes

- Rollback PPO changes and sync up with latest round of changes
- Updated PPO to use the PPOReplayBuffer, PPOTransition, and PPOTransitionBatch
- Update Lunar Lander Tutorial with BasicReplayBuffer and PPOReplayBuffer
- Update Tutorials to use new replay buffer
@kuds reopened this on Oct 29, 2024
@kuds (Author) commented Oct 29, 2024

I have merged in the latest round of changes from the main pearl branch and assimilated the renamed replay buffers into the PPO code base. I am still running some tests, but I wanted to get this up for an initial review.

@rodrigodesalvobraz (Contributor)

> I have merged in the latest round of changes from the main pearl branch and assimilated the renamed replay buffers into the PPO code base. I am still running some tests, but I wanted to get this up for an initial review.

Sounds good, thanks. I am looking things over, but I see you are still making changes, so let me know when you're ready.

@rodrigodesalvobraz (Contributor)

Hi @kuds,
I checked out your PR today and a few unit tests failed:
[screenshot of the failing unit tests]

Errors were:

pearl\policy_learners\sequential_decision_making\ppo.py", line 68, in __init__:
TypeError: ProximalPolicyOptimizationBase.__init__() got an unexpected keyword argument 'history_summarization_learning_rate'
pearl\policy_learners\sequential_decision_making\ppo_continuous.py", line 101, in __init__:
AttributeError: 'ContinuousProximalPolicyOptimization' object has no attribute 'is_action_continuous'
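
The two failures read like mismatches introduced by the refactor: the base-class __init__ rejects a keyword argument the subclass forwards, and the continuous subclass never gets an is_action_continuous attribute. A hypothetical sketch of the shape of a fix (the real signatures depend on Pearl's policy learner classes):

```python
# Hypothetical sketch of the kind of change the two failures point at;
# the real fix depends on Pearl's actual class signatures.
from typing import Optional


class ProximalPolicyOptimizationBase:
    def __init__(
        self,
        *,
        # Accept the keyword the subclasses forward instead of rejecting it.
        history_summarization_learning_rate: Optional[float] = None,
        is_action_continuous: bool = False,
        **kwargs,
    ) -> None:
        self.history_summarization_learning_rate = history_summarization_learning_rate
        # Define the attribute on the base class so both subclasses expose it.
        self.is_action_continuous = is_action_continuous


class ContinuousProximalPolicyOptimization(ProximalPolicyOptimizationBase):
    def __init__(self, **kwargs) -> None:
        super().__init__(is_action_continuous=True, **kwargs)
```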
