This repo contains a general MAGAIL implementation. It is useful for learning a joint policy that mixes an agent policy and an environment policy: the agent interacts with the environment by taking actions according to the states the environment provides, and the environment emits the next state according to the agent's action.
GAIL covers the most trivial case of a single agent. In general, the multi-agent setting can involve any number of agents; here we focus on two.
Imagine a commodity recommendation scenario: the platform decides which commodities to recommend according to the user's actions (buy, browse, search, add to shopping cart, ...). From the other point of view, the user takes corresponding actions according to what they see (they like the goods, so they buy them; they are interested in the commodities, so they browse them or add them to the shopping cart).
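To make this interaction loop concrete, here is a minimal rollout sketch. Everything in it (`agent_policy`, `env_policy`, the `select_action` method, and concatenating the state with the action as the environment policy's input) is an illustrative assumption, not this repo's actual API.

```python
# Minimal sketch of a joint-policy rollout (illustrative assumptions, not this repo's API):
# the agent policy maps a state to an action, and the environment policy maps the
# concatenated (state, action) pair to the next state.
import torch

def collect_trajectory(agent_policy, env_policy, init_state, trajectory_length=10):
    state = init_state
    trajectory = []
    for _ in range(trajectory_length):
        action = agent_policy.select_action(state)        # agent reacts to the current state
        env_input = torch.cat([state, action], dim=-1)    # state + action fed to the environment policy
        next_state = env_policy.select_action(env_input)  # environment reacts to the agent's action
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory
```

Note that in the example configuration below, the environment policy's num_states (161) equals the user's num_states (155) plus num_actions (6), which is consistent with feeding it the concatenated state-action pair, and trajectory_length is 10.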
Requirements:
1. python >= 3.6
2. pytorch >= 1.3.1
3. pandas >= 1.0.1
4. PyYAML >= 5.3
2. Fill in the model parameters in config/config.yml.
An example configuration file looks like this:
```yaml
# general parameters
general:
  seed: 2020
  expert_batch_size: 2000
  expert_data_path: ../data/train_data_sas.csv
  training_epochs: 500000
  num_states: 155
  num_actions: 6

# parameters for generalized advantage estimation
gae:
  gamma: 0.995
  tau: 0.96

# parameters for PPO algorithm
ppo:
  clip_ratio: 0.1
  ppo_optim_epochs: 1
  ppo_mini_batch_size: 200
  sample_batch_size: 2000

# parameters for joint-policy
jointpolicy:
  learning_rate: !!float 1e-4
  trajectory_length: 10
  user:
    num_states: 155
    num_actions: 6
    num_discrete_actions: 0
    discrete_actions_sections: !!python/tuple [0]
    action_log_std: 0.0
    use_multivariate_distribution: False
    num_hiddens: !!python/tuple [256]
    activation: LeakyReLU
    drop_rate: 0.5
  env:
    num_states: 161
    num_actions: 155
    num_discrete_actions: 132
    discrete_actions_sections: !!python/tuple [5, 2, 4, 3, 2, 9, 2, 32, 35, 7, 2, 21, 2, 3, 3]
    action_log_std: 0.0
    use_multivariate_distribution: False
    num_hiddens: !!python/tuple [256]
    activation: LeakyReLU
    drop_rate: 0.5

# parameters for critic
value:
  num_states: 155
  num_hiddens: !!python/tuple [256, 256]
  activation: LeakyReLU
  drop_rate: 0.5
  learning_rate: !!float 3e-4
  l2_reg: !!float 1e-3

# parameters for discriminator
discriminator:
  num_states: 155
  num_actions: 6
  num_hiddens: !!python/tuple [256, 256]
  activation: LeakyReLU
  drop_rate: 0.5
  learning_rate: !!float 4e-4
  use_noise: True  # trick: add noise
  noise_std: 0.15
  use_label_smoothing: True  # trick: label smoothing
  label_smooth_rate: 0.1
```
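The `!!python/tuple` tags are Python-specific, so the file has to be loaded with a loader that understands them. Below is a minimal loading sketch using PyYAML's `UnsafeLoader`; the repo's own loading code may differ, so treat it as an assumption.

```python
# Minimal sketch of loading the config with PyYAML (an assumption, not necessarily
# how this repo loads it). The !!python/tuple tags need an unsafe/full loader,
# so only use this on configuration files you trust.
import yaml

with open("config/config.yml") as f:
    config = yaml.load(f, Loader=yaml.UnsafeLoader)

env_cfg = config["jointpolicy"]["env"]
# Sanity check: the discrete action sections should sum to num_discrete_actions.
assert sum(env_cfg["discrete_actions_sections"]) == env_cfg["num_discrete_actions"]  # 132
print(config["general"]["num_states"], config["general"]["num_actions"])  # 155 6
```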
To judge the performance of the algorithm, we mainly look at the reward given by the discriminator. You may need some fine-tuning in your experiments; fortunately, almost all of the tips and tricks that apply to GANs can also be used in MAGAIL training.
- Discriminator Loss
It is identical to the original GAN discriminator loss: $$ - \mathbb E_{x \sim p_{expert}} \left[\log D(x)\right] - \mathbb E_{x \sim p_{generated}} \left[ \log (1 - D(x)) \right] $$
At equilibrium the discriminator cannot distinguish expert samples from generated ones ($D(x) \approx 0.5$), so the loss is asymptotic to $2\log 2 \approx 1.386$. A sketch of this loss with the tricks from the config is given after this list.
- Expert Reward (Discriminator output)
At the beginning, the discriminator can easily tell which samples come from the expert and assigns them a high reward of about 0.97. As the policy gradually improves, this value goes down and eventually converges to around 0.6.
- Generator Reward (Discriminator output)
The generator reward behaves in the opposite way and finally converges to about 0.4.
The final convergence ratio is about 6:4 (you may see a slight tendency to break out of this equilibrium; just train for a longer time), which is not perfect but usable (as you may know, training a GAN is hard enough, let alone a MAGAIL).
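As referenced in the Discriminator Loss item above, here is a minimal PyTorch sketch of that loss together with the input-noise and label-smoothing tricks configured under `discriminator`. The function, its argument names, and the assumption that the discriminator outputs raw logits are illustrative, not this repo's actual code.

```python
# Minimal sketch of the discriminator objective with the two tricks from the config
# (illustrative assumptions, not this repo's actual code).
import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, expert_batch, generated_batch,
                       use_noise=True, noise_std=0.15,
                       use_label_smoothing=True, label_smooth_rate=0.1):
    """-E_expert[log D(x)] - E_generated[log(1 - D(x))], with optional tricks."""
    if use_noise:  # trick: add Gaussian noise to both expert and generated inputs
        expert_batch = expert_batch + noise_std * torch.randn_like(expert_batch)
        generated_batch = generated_batch + noise_std * torch.randn_like(generated_batch)

    expert_logits = discriminator(expert_batch)        # assumed to be raw logits
    generated_logits = discriminator(generated_batch)

    expert_target = torch.ones_like(expert_logits)
    if use_label_smoothing:  # trick: smooth expert labels from 1.0 down to 0.9
        expert_target = expert_target - label_smooth_rate
    generated_target = torch.zeros_like(generated_logits)

    return (F.binary_cross_entropy_with_logits(expert_logits, expert_target)
            + F.binary_cross_entropy_with_logits(generated_logits, generated_target))
```

The expert and generator rewards reported above would then correspond to `torch.sigmoid` of the discriminator's outputs on the respective batches.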
After completing steps 1 to 3, the policy should be well trained and ready to be used in real-world cases.
[1]. Generative Adversarial Imitation Learning
[2]. Virtual-Taobao: Virtualizing real-world online retail environment for reinforcement learning
[3]. Tricks of GANs
[4]. Improving Generative Adversarial Imitation Learning with Non-expert Demonstrations
[5]. Disagreement-Regularized Imitation Learning
[6]. SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards