
CartPole

Overview

The inverted pendulum is a classic problem in control and reinforcement learning, and CartPole is its discrete-action variant. The environment consists of a cart with a pole hinged on top of it; the cart slides side to side along a smooth, frictionless track, and the goal is to keep the pole upright, as shown below.

[Animation: the CartPole environment (../_images/cartpole.gif)]

Install

Installation Method

The CartPole environment is built into Gym, so installing Gym is all that is required. Its environment id is CartPole-v0.

pip install gym
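
Note that the snippets on this page use the classic Gym API (env.reset() returns only the observation) and the CartPole-v0 id. If your installed Gym release is newer and has changed this API, one option is to pin an older version, for example:

pip install "gym<0.26"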

Verify Installation

Run the following commands in a Python shell to verify that the installation was successful.

import gym
env = gym.make('CartPole-v0')
obs = env.reset()
print(obs)
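
If the installation succeeded, this prints the initial 4-element observation (cart position, cart velocity, pole angle, pole angular velocity), for example something like:

[ 0.03073904 -0.00145001 -0.03088818  0.03131252]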

Environment Introduction

Action Space

The action space of CartPole is discrete, with two actions: push left and push right.

  • Push Left : 0 applies a force that pushes the cart to the left.

  • Push Right : 1 applies a force that pushes the cart to the right.

Using gym's space classes, it can be expressed as:

from gym import spaces

action_space = spaces.Discrete(2)
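
As a quick sanity check, the action space can also be inspected and sampled on a created environment (a minimal sketch, assuming the installation above succeeded):

import gym

env = gym.make('CartPole-v0')
print(env.action_space)             # Discrete(2)
env.reset()
action = env.action_space.sample()  # 0 (push left) or 1 (push right)
obs, reward, done, info = env.step(action)
env.close()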

State Space

CartPole’s state space has 4 elements (a gym space definition is sketched after the list):

  • Cart Position : position of the cart along the track, in the range [-4.8, 4.8].

  • Cart Velocity : velocity of the cart, in the range (-inf, inf).

  • Pole Angle : angle of the pole from the vertical, in the range [-24 deg, 24 deg].

  • Pole Angular Velocity : angular velocity of the pole, in the range (-inf, inf).
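
For reference, the observation space described above can be written with gym's Box space (a minimal sketch; 24 degrees is roughly 0.418 radians):

import numpy as np
from gym import spaces

# Bounds follow the list above: position, velocity, angle, angular velocity.
high = np.array([4.8, np.inf, 0.418, np.inf], dtype=np.float32)
observation_space = spaces.Box(low=-high, high=high, dtype=np.float32)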

Reward Space

A reward of +1 is given for every step taken until the episode terminates (the termination step also receives a reward of 1), so an episode that lasts the full 200 steps yields a return of 200.

Termination Condition

The termination condition for each episode of the CartPole environment is any of the following:

  • The pole angle deviates from the vertical by more than 12 degrees.

  • The cart moves out of bounds; the distance threshold is usually set to 2.4.

  • The episode reaches its maximum number of steps, which defaults to 200.

When Is the CartPole Task Considered Solved

CartPole-v0 is considered solved when the average episode reward over 100 consecutive episodes reaches 195 or more.
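
This criterion can be checked with a rolling window of recent returns (a minimal sketch; the function name is_solved is hypothetical):

from collections import deque

recent_returns = deque(maxlen=100)  # keeps only the last 100 episode returns

def is_solved(episode_return: float) -> bool:
    """Record one finished episode and test the 100-episode average criterion."""
    recent_returns.append(episode_return)
    return len(recent_returns) == 100 and sum(recent_returns) / 100 >= 195.0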

Others

Store Video

Some environments ship with their own rendering plug-ins. DI-engine does not support these built-in plug-ins; instead, it generates video recordings by saving logs during training. For details, please refer to the Visualization & Logging section under the Quick Start chapter of the DI-engine official documentation.
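
Independently of DI-engine's log-based recording, plain gym can also save an episode as a video (a minimal sketch using the legacy Monitor wrapper; newer gym releases replace it with gym.wrappers.RecordVideo, and ffmpeg must be available on the system):

import gym

env = gym.wrappers.Monitor(gym.make('CartPole-v0'), './video', force=True)
obs = env.reset()
done = False
while not done:
    # Act randomly here; substitute a trained policy for a meaningful recording.
    obs, reward, done, info = env.step(env.action_space.sample())
env.close()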

DI-zoo Runnable Code Example

The following is a complete, runnable CartPole config using the DQN algorithm as the policy; it corresponds to the dqn_nstep.py file in the DI-engine/ding/example directory.

import gym
from ditk import logging
from ding.model import DQN
from ding.policy import DQNPolicy
from ding.envs import DingEnvWrapper, BaseEnvManagerV2
from ding.data import DequeBuffer
from ding.config import compile_config
from ding.framework import task
from ding.framework.context import OnlineRLContext
from ding.framework.middleware import OffPolicyLearner, StepCollector, interaction_evaluator, data_pusher, \
    eps_greedy_handler, CkptSaver, nstep_reward_enhancer, final_ctx_saver
from ding.utils import set_pkg_seed
from dizoo.classic_control.cartpole.config.cartpole_dqn_config import main_config, create_config


def main():
    logging.getLogger().setLevel(logging.INFO)
    main_config.exp_name = 'cartpole_dqn_nstep'
    main_config.policy.nstep = 3
    cfg = compile_config(main_config, create_cfg=create_config, auto=True)
    with task.start(async_mode=False, ctx=OnlineRLContext()):
        # Vectorized environments: one set for collecting data, one for evaluation.
        collector_env = BaseEnvManagerV2(
            env_fn=[lambda: DingEnvWrapper(gym.make("CartPole-v0")) for _ in range(cfg.env.collector_env_num)],
            cfg=cfg.env.manager
        )
        evaluator_env = BaseEnvManagerV2(
            env_fn=[lambda: DingEnvWrapper(gym.make("CartPole-v0")) for _ in range(cfg.env.evaluator_env_num)],
            cfg=cfg.env.manager
        )

        set_pkg_seed(cfg.seed, use_cuda=cfg.policy.cuda)

        model = DQN(**cfg.policy.model)
        buffer_ = DequeBuffer(size=cfg.policy.other.replay_buffer.replay_buffer_size)
        policy = DQNPolicy(cfg.policy, model=model)

        task.use(interaction_evaluator(cfg, policy.eval_mode, evaluator_env))  # periodic evaluation
        task.use(eps_greedy_handler(cfg))  # epsilon-greedy exploration schedule
        task.use(StepCollector(cfg, policy.collect_mode, collector_env))  # collect transitions
        task.use(nstep_reward_enhancer(cfg))  # compute n-step rewards for DQN targets
        task.use(data_pusher(cfg, buffer_))  # push collected transitions into the buffer
        task.use(OffPolicyLearner(cfg, policy.learn_mode, buffer_))  # sample from buffer and train
        task.use(CkptSaver(policy, cfg.exp_name, train_freq=100))  # save checkpoints periodically
        task.use(final_ctx_saver(cfg.exp_name))  # save the final training context
        task.run()


if __name__ == "__main__":
    main()
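
The script can be launched directly, e.g. with python dqn_nstep.py; logs and checkpoints are then written under the experiment directory named by exp_name (here cartpole_dqn_nstep).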

Experimental Results

The experimental results using the DQN algorithm are shown below. The horizontal axis is the environment step (env step), and the vertical axis is the mean episode reward (return).

[Figure: DQN training curve on CartPole (../_images/cartpole_dqn.png)]
