• Hackers Realm

CartPole Balance using Python | OpenAI Gym | Reinforcement Learning Project Tutorial

Updated: May 30, 2023

The CartPole balance problem is the classic inverted pendulum problem: a cart moves along a horizontal axis, and the objective is to keep the pole mounted on the cart balanced. Instead of applying control theory, the goal here is to solve it in Python through controlled trial and error, also known as reinforcement learning.

In this tutorial, we will use the OpenAI Gym module as a reinforcement learning tool to process and evaluate the Cartpole simulation.


Project Description

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 12 degrees from vertical (the limit used in Gym's CartPole implementation), when the cart moves more than 2.4 units from the center, or, for CartPole-v0, after 200 timesteps.
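The termination rule and the per-step reward can be sketched in a few lines of plain Python. This is only an illustration, not Gym's implementation: the names `is_terminal` and `episode_return` are hypothetical, the default limits (12 degrees and 2.4 units) follow Gym's CartPole source, and the sketch assumes the step that ends the episode still earns its +1.

```python
import math

# Hypothetical helper names; the default limits follow Gym's CartPole source.
ANGLE_LIMIT_RAD = math.radians(12)  # pole angle limit, in radians
POSITION_LIMIT = 2.4                # cart position limit, in track units


def is_terminal(cart_position, pole_angle_rad,
                angle_limit=ANGLE_LIMIT_RAD, position_limit=POSITION_LIMIT):
    """Return True when the pole has tipped too far or the cart has left the track."""
    return abs(pole_angle_rad) > angle_limit or abs(cart_position) > position_limit


def episode_return(trajectory):
    """Sum +1 per timestep over (position, angle) pairs until the first terminal one."""
    total = 0.0
    for cart_position, pole_angle_rad in trajectory:
        total += 1.0  # assumption: the terminating step is rewarded as well
        if is_terminal(cart_position, pole_angle_rad):
            break
    return total


# The third state exceeds the angle limit, so the episode is worth 3 points.
print(episode_return([(0.0, 0.0), (0.1, 0.05), (0.2, 0.3), (0.3, 0.4)]))
```

The score printed for each episode later in this tutorial is exactly this kind of running total.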

```# install modules
!pip install gym stable_baselines3```
• Installs Gym and Stable-Baselines3

• Stable-Baselines3 is built on PyTorch, so PyTorch must be available (pip normally installs it as a dependency). TensorFlow is not required.

Import Modules

```import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy```
• PPO - Proximal Policy Optimization algorithm to train AI policies in challenging environments.

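• DummyVecEnv - Vectorized environment wrapper that steps one or more environments one after another in a single process and returns the results as a batch.

• evaluate_policy - Pre-defined helper that runs the current model for a number of episodes and returns the mean reward and its standard deviation.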
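To make the idea of a vectorized environment concrete, here is a toy sketch in plain Python. It is not the Stable-Baselines3 `DummyVecEnv` code; `CountdownEnv` and `ToyVecEnv` are hypothetical classes that only mimic the general pattern: environments are built from factory functions, stepped one after another in the same process, returned as one batch, and reset automatically when an episode finishes.

```python
class CountdownEnv:
    """Hypothetical stand-in environment: the episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the observation is just the step counter

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return self.t, 1.0, done, {}  # obs, reward, done, info


class ToyVecEnv:
    """Steps a list of environments sequentially and returns batched results."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]  # each factory builds one environment

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            o, r, d, info = env.step(action)
            if d:
                o = env.reset()  # auto-reset so the batch never stalls
            obs.append(o)
            rewards.append(r)
            dones.append(d)
            infos.append(info)
        return obs, rewards, dones, infos


vec_env = ToyVecEnv([lambda: CountdownEnv(2), lambda: CountdownEnv(3)])
first_obs = vec_env.reset()
step1 = vec_env.step([0, 0])
step2 = vec_env.step([0, 0])  # the first environment finishes and is reset here
```

Because every value comes back as a batch, even a single wrapped environment returns one-element collections rather than plain scalars.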

Environment Testing with Random Actions

```env_name = 'CartPole-v0'
env = gym.make(env_name)```
• Creation of the simulation environment

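```for episode in range(1, 11):
    score = 0
    state = env.reset()  # classic Gym API (gym < 0.26): reset() returns only the observation
    done = False

    while not done:
        env.render()
        action = env.action_space.sample()  # random action: 0 or 1
        # classic Gym API: step() returns four values
        n_state, reward, done, info = env.step(action)
        score += reward

    print('Episode:', episode, 'Score:', score)
env.close()```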

Episode: 1 Score: 19.0
Episode: 2 Score: 17.0
Episode: 3 Score: 77.0
Episode: 4 Score: 36.0
Episode: 5 Score: 17.0
Episode: 6 Score: 22.0
Episode: 7 Score: 18.0
Episode: 8 Score: 25.0
Episode: 9 Score: 11.0
Episode: 10 Score: 15.0

• Displays the CartPole simulation in a window while the episodes run.

• The loop runs for 10 episodes; each episode ends when the environment returns done = True.

• score - running total of the reward points in the episode

• state = env.reset() - resets the environment to its initial state

• env.render() - renders the current frame of the environment

• env.action_space - Discrete space with two actions: 0 pushes the cart to the left and 1 pushes it to the right. sample() returns one of the two at random.
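As a baseline for later comparison, the ten random-action scores printed above can be summarised with a quick calculation. The variable names are illustrative only; the numbers are copied from the output above.

```python
from statistics import fmean

# Scores copied from the random-action run shown above.
random_scores = [19.0, 17.0, 77.0, 36.0, 17.0, 22.0, 18.0, 25.0, 11.0, 15.0]

mean_score = fmean(random_scores)  # average episode length under random actions
best_score = max(random_scores)
worst_score = min(random_scores)

print(f"mean={mean_score:.1f} best={best_score:.0f} worst={worst_score:.0f}")
```

CartPole-v0 caps an episode at 200 steps, so random actions average roughly an eighth of the maximum score.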

Model Training

```# build the environment inside the factory so the lambda does not rely on the outer `env` name
env = DummyVecEnv([lambda: gym.make(env_name)])
model = PPO('MlpPolicy', env, verbose=1)```

Using cuda device

• Creation of the environment for training, wrapped in DummyVecEnv

• 'MlpPolicy' - The policy is a multilayer perceptron (a fully connected neural network) that maps observations to actions

• verbose=1 - Prints training information to the console
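For intuition about what PPO optimises, the clipped surrogate objective from the PPO algorithm can be written for a single sample as below. This is a plain-Python sketch of the textbook formula, not Stable-Baselines3's code; `clipped_surrogate` is a hypothetical name, and the default of 0.2 matches the `clip_range` value reported in the training log.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO clipped objective for one sample.

    ratio     : new_policy_prob / old_policy_prob for the action taken
    advantage : estimated advantage of that action
    """
    # Keep the probability ratio inside [1 - clip_range, 1 + clip_range].
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


capped_gain = clipped_surrogate(1.5, 2.0)   # ratio above the band, positive advantage
pessimistic = clipped_surrogate(0.5, -1.0)  # ratio below the band, negative advantage
unclipped = clipped_surrogate(1.1, 2.0)     # ratio inside the band
```

With a ratio of 1.5 and a positive advantage the ratio is cut to 1.2, so the objective stops rewarding larger policy changes; inside the band (ratio 1.1) the objective is simply the ratio times the advantage.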

`model.learn(total_timesteps=20000)`
```-----------------------------
| time/              |      |
|    fps             | 352  |
|    iterations      | 1    |
|    time_elapsed    | 5    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 360         |
|    iterations           | 2           |
|    time_elapsed         | 11          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.009253612 |
|    clip_fraction        | 0.112       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | -0.00415    |
|    learning_rate        | 0.0003      |
|    loss                 | 5.15        |
|    value_loss           | 50.7        |
-----------------------------------------```

```-----------------------------------------
| time/                   |             |
|    fps                  | 369         |
|    iterations           | 3           |
|    time_elapsed         | 16          |
|    total_timesteps      | 6144        |
| train/                  |             |
|    approx_kl            | 0.011285827 |
|    clip_fraction        | 0.071       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.666      |
|    explained_variance   | 0.059       |
|    learning_rate        | 0.0003      |
|    loss                 | 13.8        |
|    value_loss           | 36.5        |
-----------------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 373         |
|    iterations           | 4           |
|    time_elapsed         | 21          |
|    total_timesteps      | 8192        |
| train/                  |             |
|    approx_kl            | 0.008387755 |
|    clip_fraction        | 0.0782      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.631      |
|    explained_variance   | 0.22        |
|    learning_rate        | 0.0003      |
|    loss                 | 12.9        |
|    value_loss           | 52          |
-----------------------------------------```

```-----------------------------------------
| time/                   |             |
|    fps                  | 375         |
|    iterations           | 5           |
|    time_elapsed         | 27          |
|    total_timesteps      | 10240       |
| train/                  |             |
|    approx_kl            | 0.009724656 |
|    clip_fraction        | 0.0705      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.605      |
|    explained_variance   | 0.31        |
|    learning_rate        | 0.0003      |
|    loss                 | 18.7        |
|    value_loss           | 58.7        |
-----------------------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 377          |
|    iterations           | 6            |
|    time_elapsed         | 32           |
|    total_timesteps      | 12288        |
| train/                  |              |
|    approx_kl            | 0.0058702496 |
|    clip_fraction        | 0.0446       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.591       |
|    explained_variance   | 0.303        |
|    learning_rate        | 0.0003       |
|    loss                 | 23.8         |
|    value_loss           | 71.7         |
------------------------------------------```

```------------------------------------------
| time/                   |              |
|    fps                  | 378          |
|    iterations           | 7            |
|    time_elapsed         | 37           |
|    total_timesteps      | 14336        |
| train/                  |              |
|    approx_kl            | 0.0020627829 |
|    clip_fraction        | 0.012        |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.59        |
|    explained_variance   | 0.271        |
|    learning_rate        | 0.0003       |
|    loss                 | 9.21         |
|    value_loss           | 68.9         |
------------------------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 379          |
|    iterations           | 8            |
|    time_elapsed         | 43           |
|    total_timesteps      | 16384        |
| train/                  |              |
|    approx_kl            | 0.0068374504 |
|    clip_fraction        | 0.0835       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.571       |
|    explained_variance   | 0.734        |
|    learning_rate        | 0.0003       |
|    loss                 | 16.4         |
|    value_loss           | 44.5         |
------------------------------------------```

```-----------------------------------------
| time/                   |             |
|    fps                  | 381         |
|    iterations           | 9           |
|    time_elapsed         | 48          |
|    total_timesteps      | 18432       |
| train/                  |             |
|    approx_kl            | 0.005816878 |
|    clip_fraction        | 0.0488      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.554      |
|    explained_variance   | 0.449       |
|    learning_rate        | 0.0003      |
|    loss                 | 28.8        |
|    value_loss           | 83.1        |
-----------------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 382         |
|    iterations           | 10          |
|    time_elapsed         | 53          |
|    total_timesteps      | 20480       |
| train/                  |             |
|    approx_kl            | 0.008002328 |
|    clip_fraction        | 0.0686      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.569      |
|    explained_variance   | 0.7         |
|    learning_rate        | 0.0003      |
|    loss                 | 25.5        |
|    value_loss           | 64.9        |
-----------------------------------------```
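The log ends at 20480 timesteps although `total_timesteps=20000` was requested. A likely explanation, assuming PPO's default rollout length of `n_steps=2048` and a single environment, is that training proceeds in whole rollouts and stops after the first one that reaches the target. The arithmetic below only illustrates that assumption; it matches the step of 2048 between iterations in the log.

```python
import math

requested_timesteps = 20000  # value passed to model.learn()
n_steps = 2048               # assumed: PPO default rollout length per environment
n_envs = 1                   # one environment wrapped in DummyVecEnv

steps_per_iteration = n_steps * n_envs
iterations = math.ceil(requested_timesteps / steps_per_iteration)
actual_timesteps = iterations * steps_per_iteration

print(iterations, actual_timesteps)  # 10 20480
```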

```# save the model
model.save('ppo model')```

Model Testing

`evaluate_policy(model, env, n_eval_episodes=10, render=True)`

(200.0, 0.0)

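• evaluate_policy returns the mean reward and its standard deviation over the evaluation episodes. (200.0, 0.0) means every one of the 10 episodes reached 200, the maximum score in CartPole-v0.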
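The tuple `(200.0, 0.0)` holds the mean and the standard deviation of the episode rewards. A small sketch of that summary, assuming all ten evaluation episodes scored 200 (the only way to get a mean of 200 with zero spread when 200 is the cap):

```python
from statistics import fmean, pstdev

# Assumed per-episode rewards: ten episodes, each at the 200-step cap.
episode_rewards = [200.0] * 10

mean_reward = fmean(episode_rewards)
std_reward = pstdev(episode_rewards)  # population standard deviation

print((mean_reward, std_reward))  # (200.0, 0.0)
```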

`env.close()`

Alternatively, we can run the trained model manually, step by step, in the same way as the random-action test.

```for episode in range(1, 11):
    score = 0
    obs = env.reset()
    done = False

    while not done:
        env.render()
        action, _ = model.predict(obs)  # choose the action with the trained policy
        # env is a DummyVecEnv, so reward and done are one-element arrays
        obs, reward, done, info = env.step(action)
        score += reward

    print('Episode:', episode, 'Score:', score)
env.close()```

Episode: 1 Score: [200.]
Episode: 2 Score: [200.]
Episode: 3 Score: [200.]
Episode: 4 Score: [200.]
Episode: 5 Score: [200.]
Episode: 6 Score: [200.]
Episode: 7 Score: [200.]
Episode: 8 Score: [200.]
Episode: 9 Score: [200.]
Episode: 10 Score: [200.]

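• Performance has increased significantly after model training: every episode reaches the maximum score of 200, compared with scores between 11 and 77 for random actions. The scores print as [200.] because the vectorized environment returns rewards as arrays.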

Final Thoughts

• Processing a large number of episodes can take a lot of time and system resources.

• You may try other environments such as Atari games, space shooting games, etc.

• You can create multiple environments and process them simultaneously.

In this project tutorial, we have explored the Cartpole balance problem using the OpenAI Gym module as a reinforcement learning project. We have obtained very good results after processing and training the model.

