• Hackers Realm

CartPole Balance using Python | OpenAI Gym | Reinforcement Learning Project Tutorial

The CartPole balance problem is the classic inverted pendulum problem: a cart moves along a horizontal track, and the objective is to keep a pole balanced upright on the cart. Instead of applying control theory, the goal here is to solve it with controlled trial and error, also known as reinforcement learning.


In this tutorial, we will use the OpenAI Gym module, together with Stable Baselines3, as a reinforcement learning toolkit to simulate, train, and evaluate an agent on the CartPole environment.



You can watch the video-based tutorial with a step-by-step explanation down below.



Project Description


A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by pushing the cart left or right with a fixed force. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.
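Before turning to Gym, it helps to see what the simulator is actually integrating. Below is a rough pure-Python sketch of one physics step, with constants and Euler update following Gym's CartPole source (the classic dynamics of Barto, Sutton & Anderson); `cartpole_step` is our own illustrative name, not part of Gym's API.

```python
import math

# Physical constants as used by Gym's CartPole implementation
GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
TOTAL_MASS = MASS_CART + MASS_POLE
LENGTH = 0.5                       # half the pole's length
POLEMASS_LENGTH = MASS_POLE * LENGTH
FORCE_MAG = 10.0
TAU = 0.02                         # seconds between state updates

def cartpole_step(state, action):
    """One Euler integration step. state = (x, x_dot, theta, theta_dot);
    action 1 pushes the cart right, action 0 pushes it left."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    temp = (force + POLEMASS_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / TOTAL_MASS))
    x_acc = temp - POLEMASS_LENGTH * theta_acc * cos_t / TOTAL_MASS
    return (x + TAU * x_dot, x_dot + TAU * x_acc,
            theta + TAU * theta_dot, theta_dot + TAU * theta_acc)

# Pushing right from the upright state accelerates the cart to the right
# and tips the pole to the left (negative angle).
s = cartpole_step((0.0, 0.0, 0.0, 0.0), 1)
print(s[1] > 0, s[3] < 0)   # True True
```

An episode terminates once `abs(x) > 2.4` or the pole angle exceeds the threshold, which is why a random policy rarely survives more than a few dozen of these 0.02-second steps.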



# install modules
!pip install gym stable_baselines3
  • Necessary installation to use the Gym module

  • Stable Baselines3 requires PyTorch, so install it beforehand (unlike Stable Baselines 2, it does not depend on TensorFlow).


Import Modules


import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
  • PPO - Proximal Policy Optimization, an algorithm for training AI policies in challenging environments.

  • DummyVecEnv - a vectorized environment wrapper; it runs its environments sequentially in a single process and provides the batched interface Stable Baselines3 expects.

  • evaluate_policy - a pre-defined evaluation helper to test the current model.



Environment Testing with Random Actions


env_name = 'CartPole-v0'
env = gym.make(env_name)
  • Creation of the simulation environment


for episode in range(1, 11):
    score = 0
    state = env.reset()
    done = False
    
    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score += reward
        
    print('Episode:', episode, 'Score:', score)
env.close()


Episode: 1 Score: 19.0
Episode: 2 Score: 17.0
Episode: 3 Score: 77.0
Episode: 4 Score: 36.0
Episode: 5 Score: 17.0
Episode: 6 Score: 22.0
Episode: 7 Score: 18.0
Episode: 8 Score: 25.0
Episode: 9 Score: 11.0
Episode: 10 Score: 15.0

  • Displays a window rendering each CartPole episode.

  • The loop stops after 10 episodes; each episode ends when the pole falls or the cart leaves the track.

  • score - running total of the reward received in the episode

  • state = env.reset() - resets the environment to its initial state

  • env.render() - renders the environment

  • env.action_space.sample() - the action space holds two discrete values, 0 and 1 (push left or push right); sample() returns one of them at random.



Model Training

env = gym.make(env_name)
env = DummyVecEnv([lambda: env])
model = PPO('MlpPolicy', env, verbose=1)

Using cuda device

  • Creation of the environment for training, wrapped in a DummyVecEnv

  • 'MlpPolicy' - a multi-layer perceptron policy that determines how observations are mapped to actions

  • verbose=1 - display all the training information in the console



model.learn(total_timesteps=20000)
-----------------------------
| time/              |      |
|    fps             | 352  |
|    iterations      | 1    |
|    time_elapsed    | 5    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 360         |
|    iterations           | 2           |
|    time_elapsed         | 11          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.009253612 |
|    clip_fraction        | 0.112       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | -0.00415    |
|    learning_rate        | 0.0003      |
|    loss                 | 5.15        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0178     |
|    value_loss           | 50.7        |
-----------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 369         |
|    iterations           | 3           |
|    time_elapsed         | 16          |
|    total_timesteps      | 6144        |
| train/                  |             |
|    approx_kl            | 0.011285827 |
|    clip_fraction        | 0.071       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.666      |
|    explained_variance   | 0.059       |
|    learning_rate        | 0.0003      |
|    loss                 | 13.8        |
|    n_updates            | 20          |
|    policy_gradient_loss | -0.0164     |
|    value_loss           | 36.5        |
-----------------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 373         |
|    iterations           | 4           |
|    time_elapsed         | 21          |
|    total_timesteps      | 8192        |
| train/                  |             |
|    approx_kl            | 0.008387755 |
|    clip_fraction        | 0.0782      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.631      |
|    explained_variance   | 0.22        |
|    learning_rate        | 0.0003      |
|    loss                 | 12.9        |
|    n_updates            | 30          |
|    policy_gradient_loss | -0.0171     |
|    value_loss           | 52          |
-----------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 375         |
|    iterations           | 5           |
|    time_elapsed         | 27          |
|    total_timesteps      | 10240       |
| train/                  |             |
|    approx_kl            | 0.009724656 |
|    clip_fraction        | 0.0705      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.605      |
|    explained_variance   | 0.31        |
|    learning_rate        | 0.0003      |
|    loss                 | 18.7        |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.0153     |
|    value_loss           | 58.7        |
-----------------------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 377          |
|    iterations           | 6            |
|    time_elapsed         | 32           |
|    total_timesteps      | 12288        |
| train/                  |              |
|    approx_kl            | 0.0058702496 |
|    clip_fraction        | 0.0446       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.591       |
|    explained_variance   | 0.303        |
|    learning_rate        | 0.0003       |
|    loss                 | 23.8         |
|    n_updates            | 50           |
|    policy_gradient_loss | -0.00995     |
|    value_loss           | 71.7         |
------------------------------------------


------------------------------------------
| time/                   |              |
|    fps                  | 378          |
|    iterations           | 7            |
|    time_elapsed         | 37           |
|    total_timesteps      | 14336        |
| train/                  |              |
|    approx_kl            | 0.0020627829 |
|    clip_fraction        | 0.012        |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.59        |
|    explained_variance   | 0.271        |
|    learning_rate        | 0.0003       |
|    loss                 | 9.21         |
|    n_updates            | 60           |
|    policy_gradient_loss | -0.00347     |
|    value_loss           | 68.9         |
------------------------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 379          |
|    iterations           | 8            |
|    time_elapsed         | 43           |
|    total_timesteps      | 16384        |
| train/                  |              |
|    approx_kl            | 0.0068374504 |
|    clip_fraction        | 0.0835       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.571       |
|    explained_variance   | 0.734        |
|    learning_rate        | 0.0003       |
|    loss                 | 16.4         |
|    n_updates            | 70           |
|    policy_gradient_loss | -0.00912     |
|    value_loss           | 44.5         |
------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 381         |
|    iterations           | 9           |
|    time_elapsed         | 48          |
|    total_timesteps      | 18432       |
| train/                  |             |
|    approx_kl            | 0.005816878 |
|    clip_fraction        | 0.0488      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.554      |
|    explained_variance   | 0.449       |
|    learning_rate        | 0.0003      |
|    loss                 | 28.8        |
|    n_updates            | 80          |
|    policy_gradient_loss | -0.00556    |
|    value_loss           | 83.1        |
-----------------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 382         |
|    iterations           | 10          |
|    time_elapsed         | 53          |
|    total_timesteps      | 20480       |
| train/                  |             |
|    approx_kl            | 0.008002328 |
|    clip_fraction        | 0.0686      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.569      |
|    explained_variance   | 0.7         |
|    learning_rate        | 0.0003      |
|    loss                 | 25.5        |
|    n_updates            | 90          |
|    policy_gradient_loss | -0.00599    |
|    value_loss           | 64.9        |
-----------------------------------------
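The train/ metrics above come from PPO's clipped surrogate objective: the probability ratio between the new and old policy is clipped to 1 ± clip_range (0.2 in the log), and clip_fraction reports how often that clipping fires. A minimal pure-Python sketch of the per-sample objective (our own illustrative function, not Stable Baselines3 code):

```python
def ppo_clipped_objective(ratio, advantage, clip_range=0.2):
    """PPO's per-sample clipped surrogate (Schulman et al., 2017):
    take the minimum of the unclipped and clipped policy terms."""
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped * advantage)

# A ratio far above 1 + clip_range contributes no extra gradient signal:
print(ppo_clipped_objective(1.5, 1.0))   # 1.2 - clipped at 1 + 0.2
print(ppo_clipped_objective(1.05, 1.0))  # 1.05 - inside the clip range
```

Taking the minimum makes the update pessimistic: large policy changes are never rewarded beyond the clip boundary, which is what keeps approx_kl small in the log above.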


# save the model
model.save('ppo model')

Model Testing


evaluate_policy(model, env, n_eval_episodes=10, render=True)

(200.0, 0.0)

  • Returns the mean and standard deviation of the score over the 10 evaluation episodes; 200 is the maximum score in CartPole-v0, so the trained model solves every episode.
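The returned tuple holds the mean and standard deviation of the episode rewards, which we can reproduce in plain Python for ten perfect 200-step episodes (a sketch of what the numbers mean; SB3 computes them with NumPy internally):

```python
import statistics

# Ten evaluation episodes, each reaching the 200-step cap of CartPole-v0
scores = [200.0] * 10
mean_reward = statistics.mean(scores)
std_reward = statistics.pstdev(scores)   # population std, as with np.std
print((mean_reward, std_reward))         # (200.0, 0.0)
```

A standard deviation of 0.0 tells us the policy is perfectly consistent, not just good on average.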

env.close()

Now we will test the trained model with an explicit episode loop, mirroring the random-action loop above.

for episode in range(1, 11):
    score = 0
    obs = env.reset()
    done = False
    
    while not done:
        env.render()
        action, _ = model.predict(obs)
        obs, reward, done, info = env.step(action)
        score += reward
        
    print('Episode:', episode, 'Score:', score)
env.close()


Episode: 1 Score: [200.]
Episode: 2 Score: [200.]
Episode: 3 Score: [200.]
Episode: 4 Score: [200.]
Episode: 5 Score: [200.]
Episode: 6 Score: [200.]
Episode: 7 Score: [200.]
Episode: 8 Score: [200.]
Episode: 9 Score: [200.]
Episode: 10 Score: [200.]

  • Performance has increased significantly after training: every episode reaches the maximum score. The scores print as one-element arrays because the vectorized environment returns one reward per wrapped environment.
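The bracketed scores come from the DummyVecEnv wrapper: step() returns arrays with one entry per wrapped environment, so `score += reward` turns `score` into a one-element array. A small NumPy sketch of the same accumulation (illustrative only, no environment involved):

```python
import numpy as np

# A DummyVecEnv holding one environment returns rewards as length-1
# arrays, so accumulating them turns the Python float into an array.
score = 0
for _ in range(200):
    reward = np.array([1.0])   # one reward per wrapped environment
    score = score + reward
print(score)                   # [200.]
```

Indexing with `score[0]` (or calling `float(score)`) would print the plain 200.0 seen in the random-action loop earlier.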


Final Thoughts


  • Processing a large number of episodes can take a lot of time and system resources.

  • You may try other simulation environments, such as Atari games, space-shooter games, etc.

  • You can create multiple environments and process them simultaneously.



In this project tutorial, we have explored the CartPole balance problem using the OpenAI Gym module as a reinforcement learning project, and obtained very good results after training the model.


Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
