# Proximal Policy Gradient with Dual Network Architecture (PPO-DNA)

## Overview
PPO-DNA is a more sample-efficient variant of PPO that uses separate optimizers and hyperparameters for the actor (policy) and critic (value) networks.
Original paper:
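To make the dual-network idea concrete, below is a minimal PyTorch sketch (not the actual `ppo_dna_atari_envpool.py` code) of what separate optimizers and hyperparameters for the actor and critic look like. The network sizes, learning rates, and dummy data are illustrative assumptions, and the real script optimizes the clipped PPO objective rather than the stand-in losses used here.

```python
# Minimal sketch of the dual-network / dual-optimizer idea (illustrative only).
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2  # assumed toy dimensions

# Separate networks for the actor (policy) and the critic (value function).
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

# Separate optimizers so each network can use its own learning rate
# (and, in the full algorithm, its own number of update epochs / minibatches).
policy_optimizer = torch.optim.Adam(policy_net.parameters(), lr=2.5e-4)
value_optimizer = torch.optim.Adam(value_net.parameters(), lr=1.0e-3)

obs = torch.randn(8, obs_dim)   # dummy batch of observations
returns = torch.randn(8, 1)     # dummy bootstrapped returns
advantages = torch.randn(8)     # dummy advantage estimates

# Critic update: regress predicted values toward the returns.
value_loss = ((value_net(obs) - returns) ** 2).mean()
value_optimizer.zero_grad()
value_loss.backward()
value_optimizer.step()

# Actor update: the real script uses the clipped PPO surrogate;
# a plain policy-gradient-style term is shown here only as a stand-in.
dist = torch.distributions.Categorical(logits=policy_net(obs))
actions = dist.sample()
policy_loss = -(dist.log_prob(actions) * advantages).mean()
policy_optimizer.zero_grad()
policy_loss.backward()
policy_optimizer.step()
```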
## Implemented Variants

| Variants Implemented | Description |
| --- | --- |
| `ppo_dna_atari_envpool.py`, docs | Uses the blazing fast Envpool Atari vectorized environment. |
Below are our single-file implementations of PPO-DNA:
## `ppo_dna_atari_envpool.py`

The `ppo_dna_atari_envpool.py` has the following features:

- Uses the blazing fast Envpool vectorized environment.
- For Atari games. It uses convolutional layers and common Atari-based pre-processing techniques (a sketch of a typical encoder is shown after this list).
- Works with Atari's pixel `Box` observation space of shape `(210, 160, 3)`.
- Works with the `Discrete` action space.
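For reference, the convolutional encoder typically used for Atari pixel observations in CleanRL-style agents looks roughly like the sketch below. It assumes the raw `(210, 160, 3)` frames have already been preprocessed into a stack of four grayscale 84x84 frames; the layer sizes are the usual Nature-CNN choices and are assumptions here, not the script's exact definitions. In the dual-network setup the policy and value networks do not share parameters, so each would have its own copy of such an encoder.

```python
# Sketch of a Nature-CNN style encoder for preprocessed Atari frames (assumed sizes).
import torch
import torch.nn as nn

class AtariEncoder(nn.Module):
    def __init__(self, hidden_dim: int = 512):
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4 stacked 84x84 frames in
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale pixel values to [0, 1] before the convolutions.
        return self.network(x / 255.0)

encoder = AtariEncoder()
dummy_obs = torch.randint(0, 256, (1, 4, 84, 84)).float()  # fake frame stack
print(encoder(dummy_obs).shape)  # torch.Size([1, 512])
```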
Warning

Note that `ppo_dna_atari_envpool.py` does not work on Windows or macOS. See envpool's built wheels here: https://pypi.org/project/envpool/#files
### Usage

```bash
poetry install -E envpool
python cleanrl/ppo_dna_atari_envpool.py --help
python cleanrl/ppo_dna_atari_envpool.py --env-id Breakout-v5
```
### Explanation of the logged metrics

See related docs for `ppo.py`.
### Implementation details

`ppo_dna_atari_envpool.py` uses a customized `RecordEpisodeStatistics` to work with envpool, but otherwise shares the implementation details of `ppo_atari.py` (see related docs).

Note that the original DNA implementation uses the `StickyAction` environment pre-processing wrapper (see (Machado et al., 2018)[^1]), but we did not implement it in `ppo_dna_atari_envpool.py` because envpool does not currently support `StickyAction`.
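As an illustration of what such a customization involves, here is a hedged sketch of an episode-statistics wrapper for a vectorized envpool environment, assuming the old `gym` step API (`obs, reward, done, info`) and envpool's dict-of-arrays info object. The actual class in the script may differ in details (for example, handling of Atari lives and the `terminated` flag).

```python
# Sketch of an envpool-compatible episode-statistics wrapper (illustrative).
import gym
import numpy as np

class RecordEpisodeStatistics(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.num_envs = getattr(env, "num_envs", 1)
        self.episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.episode_lengths = np.zeros(self.num_envs, dtype=np.int32)

    def reset(self, **kwargs):
        obs = super().reset(**kwargs)
        self.episode_returns[:] = 0.0
        self.episode_lengths[:] = 0
        return obs

    def step(self, action):
        obs, rewards, dones, infos = super().step(action)
        self.episode_returns += rewards
        self.episode_lengths += 1
        # Expose the running statistics through the info dict of arrays.
        infos["r"] = self.episode_returns.copy()
        infos["l"] = self.episode_lengths.copy()
        # Zero the counters of environments whose episodes just ended.
        done_mask = np.asarray(dones, dtype=bool)
        self.episode_returns[done_mask] = 0.0
        self.episode_lengths[done_mask] = 0
        return obs, rewards, dones, infos
```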
### Experiment results

Below are the average episodic returns for `ppo_dna_atari_envpool.py` compared to `ppo_atari_envpool.py`.

| Environment | `ppo_dna_atari_envpool.py` | `ppo_atari_envpool.py` |
| --- | --- | --- |
| BattleZone-v5 (40M steps) | 74000 ± 15300 | 28700 ± 6300 |
| BeamRider-v5 (10M steps) | 5200 ± 900 | 1900 ± 530 |
| Breakout-v5 (10M steps) | 319 ± 63 | 349 ± 42 |
| DoubleDunk-v5 (40M steps) | -4.1 ± 1.0 | -2.0 ± 0.8 |
| NameThisGame-v5 (40M steps) | 19100 ± 2300 | 4400 ± 1200 |
| Phoenix-v5 (45M steps) | 186000 ± 67000 | 9900 ± 2700 |
| Pong-v5 (3M steps) | 19.5 ± 1.0 | 16.6 ± 2.4 |
| Qbert-v5 (45M steps) | 12800 ± 4200 | 11400 ± 3600 |
| Tennis-v5 (10M steps) | 19.6 ± 0.0 | -12.4 ± 2.9 |
Learning curves:
Tracked experiments:
[^1]: Machado, Marlos C., Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents." Journal of Artificial Intelligence Research 61 (2018): 523-562.