• Hackers Realm

CartPole Balance using Python | OpenAI Gym | Reinforcement Learning Project Tutorial

The CartPole balance problem is the classic inverted pendulum problem: a cart moves along a horizontal track, and the objective is to keep a pole balanced upright on the cart. Instead of applying control theory, the goal here is to solve it with controlled trial and error, also known as reinforcement learning.


In this tutorial, we will use the OpenAI Gym module, together with Stable Baselines3, as a reinforcement learning toolkit to simulate, train, and evaluate an agent on the CartPole environment.



You can watch the video-based tutorial with a step-by-step explanation down below.



Project Description


A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by pushing the cart left or right with a fixed force. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.
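Before turning to Gym, it helps to see what the simulator is actually integrating. Below is a rough pure-Python sketch of one physics step, with constants and Euler update following Gym's CartPole source (the classic dynamics of Barto, Sutton & Anderson); `cartpole_step` is our own illustrative name, not part of Gym's API.

```python
import math

# Physical constants as used by Gym's CartPole implementation
GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
TOTAL_MASS = MASS_CART + MASS_POLE
LENGTH = 0.5                       # half the pole's length
POLEMASS_LENGTH = MASS_POLE * LENGTH
FORCE_MAG = 10.0
TAU = 0.02                         # seconds between state updates

def cartpole_step(state, action):
    """One Euler integration step. state = (x, x_dot, theta, theta_dot);
    action 1 pushes the cart right, action 0 pushes it left."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    temp = (force + POLEMASS_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / TOTAL_MASS))
    x_acc = temp - POLEMASS_LENGTH * theta_acc * cos_t / TOTAL_MASS
    return (x + TAU * x_dot, x_dot + TAU * x_acc,
            theta + TAU * theta_dot, theta_dot + TAU * theta_acc)

# Pushing right from the upright state accelerates the cart to the right
# and tips the pole to the left (negative angle).
s = cartpole_step((0.0, 0.0, 0.0, 0.0), 1)
print(s[1] > 0, s[3] < 0)   # True True
```

An episode terminates once `abs(x) > 2.4` or the pole angle exceeds the threshold, which is why a random policy rarely survives more than a few dozen of these 0.02-second steps.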



# install modules
!pip install gym stable_baselines3
  • Necessary installation to use the Gym module

  • Stable Baselines3 requires PyTorch, so install it beforehand (unlike Stable Baselines 2, it does not depend on TensorFlow).


Import Modules


import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
  • PPO - Proximal Policy Optimization, an algorithm for training AI policies in challenging environments.

  • DummyVecEnv - a vectorized environment wrapper; it runs its environments sequentially in a single process and provides the batched interface Stable Baselines3 expects.

  • evaluate_policy - a pre-defined evaluation helper to test the current model.



Environment Testing with Random Actions


env_name = 'CartPole-v0'
env = gym.make(env_name)
  • Creation of the simulation environment


for episode in range(1, 11):
    score = 0
    state = env.reset()
    done = False
    
    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score += reward
        
    print('Episode:', episode, 'Score:', score)
env.close()


Episode: 1 Score: 19.0
Episode: 2 Score: 17.0
Episode: 3 Score: 77.0
Episode: 4 Score: 36.0
Episode: 5 Score: 17.0
Episode: 6 Score: 22.0
Episode: 7 Score: 18.0
Episode: 8 Score: 25.0
Episode: 9 Score: 11.0
Episode: 10 Score: 15.0

  • Displays a window rendering each CartPole episode.

  • The loop stops after 10 episodes; each episode ends when the pole falls or the cart leaves the track.

  • score - running total of the reward received in the episode

  • state = env.reset() - resets the environment to its initial state

  • env.render() - renders the environment

  • env.action_space.sample() - the action space holds two discrete values, 0 and 1 (push left or push right); sample() returns one of them at random.



Model Training

env = gym.make(env_name)
env = DummyVecEnv([lambda: env])
model = PPO('MlpPolicy', env, verbose=1)

Using cuda device

  • Creation of the environment for training, wrapped in a DummyVecEnv

  • 'MlpPolicy' - a multi-layer perceptron policy that determines how observations are mapped to actions

  • verbose=1 - display all the training information in the console



model.learn(total_timesteps=20000)
-----------------------------
| time/              |      |
|    fps             | 352  |
|    iterations      | 1    |
|    time_elapsed    | 5    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 360         |
|    iterations           | 2           |
|    time_elapsed         | 11          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.009253612 |
|    clip_fraction        | 0.112       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | -0.00415    |
|    learning_rate        | 0.0003      |
|    loss                 | 5.15        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0178     |
|    value_loss           | 50.7        |
-----------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 369         |
|    iterations           | 3           |
|    time_elapsed         | 16          |
|    total_timesteps      | 6144        |
| train/                  |             |
|    approx_kl            | 0.011285827 |
|    clip_fraction        | 0.071       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.666      |
|    explained_variance   | 0.059       |
|    learning_rate        | 0.0003      |
|    loss                 | 13.8        |
|    n_updates            | 20          |
|    policy_gradient_loss | -0.0164     |
|    value_loss           | 36.5        |
-----------------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 373         |
|    iterations           | 4           |
|    time_elapsed         | 21          |
|    total_timesteps      | 8192        |
| train/                  |             |
|    approx_kl            | 0.008387755 |
|    clip_fraction        | 0.0782      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.631      |
|    explained_variance   | 0.22        |
|    learning_rate        | 0.0003      |
|    loss                 | 12.9        |
|    n_updates            | 30          |
|    policy_gradient_loss | -0.0171     |
|    value_loss           | 52          |
-----------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 375         |
|    iterations           | 5           |
|    time_elapsed         | 27          |
|    total_timesteps      | 10240       |
| train/                  |             |
|    approx_kl            | 0.009724656 |
|    clip_fraction        | 0.0705      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.605      |
|    explained_variance   | 0.31        |
|    learning_rate        | 0.0003      |
|    loss                 | 18.7        |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.0153     |
|    value_loss           | 58.7        |
-----------------------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 377          |
|    iterations           | 6            |
|    time_elapsed         | 32           |
|    total_timesteps      | 12288        |
| train/                  |              |
|    approx_kl            | 0.0058702496 |
|    clip_fraction        | 0.0446       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.591       |
|    explained_variance   | 0.303        |
|    learning_rate        | 0.0003       |
|    loss                 | 23.8         |
|    n_updates            | 50           |
|    policy_gradient_loss | -0.00995     |
|    value_loss           | 71.7         |
------------------------------------------


------------------------------------------
| time/                   |              |
|    fps                  | 378          |
|    iterations           | 7            |
|    time_elapsed         | 37           |
|    total_timesteps      | 14336        |
| train/                  |              |
|    approx_kl            | 0.0020627829 |
|    clip_fraction        | 0.012        |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.59        |
|    explained_variance   | 0.271        |
|    learning_rate        | 0.0003       |
|    loss                 | 9.21         |
|    n_updates            | 60           |
|    policy_gradient_loss | -0.00347     |
|    value_loss           | 68.9         |
------------------------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 379          |
|    iterations           | 8            |
|    time_elapsed         | 43           |
|    total_timesteps      | 16384        |
| train/                  |              |
|    approx_kl            | 0.0068374504 |
|    clip_fraction        | 0.0835       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.571       |
|    explained_variance   | 0.734        |
|    learning_rate        | 0.0003       |
|    loss                 | 16.4         |
|    n_updates            | 70           |
|    policy_gradient_loss | -0.00912     |
|    value_loss           | 44.5         |
------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 381         |
|    iterations           | 9           |
|    time_elapsed         | 48          |
|    total_timesteps      | 18432       |
| train/                  |             |
|    approx_kl            | 0.005816878 |
|    clip_fraction        | 0.0488      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.554      |
|    explained_variance   | 0.449       |
|    learning_rate        | 0.0003      |
|    loss                 | 28.8        |
|    n_updates            | 80          |
|    policy_gradient_loss | -0.00556    |
|    value_loss           | 83.1        |
-----------------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 382         |
|    iterations           | 10          |
|    time_elapsed         | 53          |
|    total_timesteps      | 20480       |
| train/                  |             |
|    approx_kl            | 0.008002328 |
|    clip_fraction        | 0.0686      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.569      |
|    explained_variance   | 0.7         |
|    learning_rate        | 0.0003      |
|    loss                 | 25.5        |
|    n_updates            | 90          |
|    policy_gradient_loss | -0.00599    |
|    value_loss           | 64.9        |
-----------------------------------------
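The train/ metrics above come from PPO's clipped surrogate objective: the probability ratio between the new and old policy is clipped to 1 ± clip_range (0.2 in the log), and clip_fraction reports how often that clipping fires. A minimal pure-Python sketch of the per-sample objective (our own illustrative function, not Stable Baselines3 code):

```python
def ppo_clipped_objective(ratio, advantage, clip_range=0.2):
    """PPO's per-sample clipped surrogate (Schulman et al., 2017):
    take the minimum of the unclipped and clipped policy terms."""
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped * advantage)

# A ratio far above 1 + clip_range contributes no extra gradient signal:
print(ppo_clipped_objective(1.5, 1.0))   # 1.2 - clipped at 1 + 0.2
print(ppo_clipped_objective(1.05, 1.0))  # 1.05 - inside the clip range
```

Taking the minimum makes the update pessimistic: large policy changes are never rewarded beyond the clip boundary, which is what keeps approx_kl small in the log above.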


# save the model
model.save('ppo model')

Model Testing


evaluate_policy(model, env, n_eval_episodes=10, render=True)

(200.0, 0.0)

  • Returns the mean and standard deviation of the score over the 10 evaluation episodes; 200 is the maximum score in CartPole-v0, so the trained model solves every episode.
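The returned tuple holds the mean and standard deviation of the episode rewards, which we can reproduce in plain Python for ten perfect 200-step episodes (a sketch of what the numbers mean; SB3 computes them with NumPy internally):

```python
import statistics

# Ten evaluation episodes, each reaching the 200-step cap of CartPole-v0
scores = [200.0] * 10
mean_reward = statistics.mean(scores)
std_reward = statistics.pstdev(scores)   # population std, as with np.std
print((mean_reward, std_reward))         # (200.0, 0.0)
```

A standard deviation of 0.0 tells us the policy is perfectly consistent, not just good on average.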

env.close()

Now we will test the trained model with an explicit episode loop, mirroring the random-action loop above.

for episode in range(1, 11):
    score = 0
    obs = env.reset()
    done = False
    
    while not done:
        env.render()
        action, _ = model.predict(obs)
        obs, reward, done, info = env.step(action)
        score += reward
        
    print('Episode:', episode, 'Score:', score)
env.close()


Episode: 1 Score: [200.]
Episode: 2 Score: [200.]
Episode: 3 Score: [200.]
Episode: 4 Score: [200.]
Episode: 5 Score: [200.]
Episode: 6 Score: [200.]
Episode: 7 Score: [200.]
Episode: 8 Score: [200.]
Episode: 9 Score: [200.]
Episode: 10 Score: [200.]

  • Performance has increased significantly after training: every episode reaches the maximum score. The scores print as one-element arrays because the vectorized environment returns one reward per wrapped environment.
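The bracketed scores come from the DummyVecEnv wrapper: step() returns arrays with one entry per wrapped environment, so `score += reward` turns `score` into a one-element array. A small NumPy sketch of the same accumulation (illustrative only, no environment involved):

```python
import numpy as np

# A DummyVecEnv holding one environment returns rewards as length-1
# arrays, so accumulating them turns the Python float into an array.
score = 0
for _ in range(200):
    reward = np.array([1.0])   # one reward per wrapped environment
    score = score + reward
print(score)                   # [200.]
```

Indexing with `score[0]` (or calling `float(score)`) would print the plain 200.0 seen in the random-action loop earlier.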


Final Thoughts


  • Processing a large number of episodes can take a lot of time and system resources.

  • You may try other simulation environments, such as Atari games, space-shooter games, etc.

  • You can create multiple environments and process them simultaneously.



In this project tutorial, we have explored the CartPole balance problem using the OpenAI Gym module as a reinforcement learning project, and obtained very good results after training the model.


Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
