1 of 16

Lunar Lander problem using a Deep Q-Learning Neural Network

Srinivas Rahul Sapireddy

2021 Fall Hack-A-Roo

2 of 16

Problem Statement

Solving the Lunar Lander problem provided by OpenAI Gym using a Deep Q-Learning Neural Network.

3 of 16

DQN Algorithm

  • A deep Q-learning neural network is used to select the required action at each step in the OpenAI Gym environment (action selection sketched below).
  • Observation space: coordinates of the lander.
  • Action space: the 4 actions the lunar lander can take in each frame, including taking no action at all.
  • A Double DQN extension is also used to evaluate the original network; it mitigates the overestimation of action values.
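
A minimal sketch of the per-step action selection this describes, assuming a PyTorch Q-network and an ε-greedy exploration policy (the function and variable names are illustrative, not the project's actual code):

    import random
    import torch

    def select_action(q_network, state, epsilon, n_actions=4):
        """Epsilon-greedy selection over the 4 Lunar Lander actions (illustrative sketch)."""
        if random.random() < epsilon:
            # Explore: pick any of the 4 actions (including "do nothing") at random
            return random.randrange(n_actions)
        # Exploit: pick the action with the highest predicted Q-value
        with torch.no_grad():
            state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
            return int(q_network(state_t).argmax(dim=1).item())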

4 of 16

Action Space
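
The action space can be inspected directly from the Gym environment; per the LunarLander-v2 documentation there are 4 discrete actions (a small illustrative snippet):

    import gym

    env = gym.make("LunarLander-v2")
    print(env.action_space)   # Discrete(4)
    # 0 = do nothing, 1 = fire left engine, 2 = fire main (bottom) engine, 3 = fire right engine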

5 of 16

Data

  • Generate data using the environment for simulation.
  • Observation space includes coordinates of the lander.
  • Action space includes 4 actions.

Environment Source: https://gym.openai.com/envs/LunarLander-v2/
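
A minimal sketch of generating transition data from the environment, using the classic Gym API (random actions here only to illustrate the observation/action/reward interface; the project itself acts with the learned policy):

    import gym

    env = gym.make("LunarLander-v2")
    state = env.reset()                        # 8-dimensional observation
    transitions = []

    for _ in range(1000):
        action = env.action_space.sample()     # one of the 4 discrete actions
        next_state, reward, done, info = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state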

6 of 16

Code and Tools Used

Requires Python 3 or above.

Code tested in the Spyder IDE from Anaconda.

Run train.py to see the code in action.

Source Code: GitHub repository

7 of 16

Model

  • Hyperparameters
  • Neural Network Architecture

Hyperparameter    Value
episodes          1000
buffer_size       100000
batch_size        64
gamma             0.99
learning rate     1e-3
tau               1e-3
steps             4
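
The same hyperparameters written as a Python configuration (values copied from the table; the dictionary itself and the comment interpreting "steps" are illustrative assumptions, not the project's actual code):

    HYPERPARAMS = {
        "episodes": 1000,        # maximum number of training episodes
        "buffer_size": 100000,   # replay memory capacity
        "batch_size": 64,        # transitions sampled per learning step
        "gamma": 0.99,           # discount factor
        "learning_rate": 1e-3,   # optimizer step size
        "tau": 1e-3,             # soft-update rate for the target network
        "steps": 4,              # assumed: environment steps between learning updates
    }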

8 of 16

Q-Learning

  • Used to learn an optimal action-selection policy.
  • Decides which action to take next.
  • The agent learns a function that predicts the reward of taking an action in a given state; a neural network approximates this function (loss sketched below).
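
A sketch of how a neural network approximates this function in DQN: the online network is regressed toward the one-step temporal-difference target (PyTorch, batch form; names are illustrative):

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_network, target_network, batch, gamma=0.99):
        """Regress Q(s, a) toward r + gamma * max_a' Q_target(s', a')."""
        states, actions, rewards, next_states, dones = batch
        q_sa = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            max_next_q = target_network(next_states).max(dim=1).values
            target = rewards + gamma * max_next_q * (1.0 - dones)
        return F.mse_loss(q_sa, target)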

9 of 16

Observation Space

  • State Space – [x-position, y-position, x-velocity, y-velocity, lander angle, angular velocity, right leg grounded, left leg grounded]
  • Environment Variables

1. State – current state of the environment (8-dimensional state space).

2. Action – the agent acts based on the current state.

3. Reward – if the lander crashes or comes to rest, the episode is considered complete and a final reward is received.

  • Max episode count – 2000
  • Stopping criterion – training stops once the average reward over the last 100 episodes reaches 200 (sketched below).
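
A small sketch of that stopping criterion (solved when the mean reward over the last 100 episodes reaches 200):

    import numpy as np

    def solved(episode_rewards, window=100, threshold=200.0):
        """True once the average reward over the last `window` episodes reaches the threshold."""
        if len(episode_rewards) < window:
            return False
        return float(np.mean(episode_rewards[-window:])) >= threshold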

10 of 16

Neural Network Model

[Network diagram: the 8-dimensional observation passes through three weight layers (W1, W2, W3) to 4 action outputs: Do Nothing, Fire Bottom, Fire Left, Fire Right.]
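
A minimal sketch of such a network in PyTorch; the hidden-layer width is an assumption, since the diagram only specifies the 8-dimensional input, three weight layers, and 4 action outputs:

    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps an 8-dimensional observation to Q-values for the 4 actions."""
        def __init__(self, state_dim=8, n_actions=4, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden),   # W1
                nn.ReLU(),
                nn.Linear(hidden, hidden),      # W2
                nn.ReLU(),
                nn.Linear(hidden, n_actions),   # W3 -> one Q-value per action
            )

        def forward(self, x):
            return self.net(x)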

11 of 16

Results

Rewards obtained at each training episode with early stopping

Blue line – reward value per episode

Orange line – rolling mean over the last 1000 episodes

The reward becomes positive after roughly 300 episodes.

12 of 16

Knowledge Gained

Training the agent takes a considerable amount of time.

The neural network model was improved so that the agent solves the environment in less time.

Different parameter values were used to observe how training changes over time.

With a replay memory of length 100000, the lunar lander solves the problem in less time.

13 of 16

Extension – Double DQN

Here the existing network is extended with two function approximators trained on different samples: one selects the best action and the other evaluates the value of that action (sketched below). Since the two approximators have seen different samples, it is less likely that they overestimate the same action; hence the name Double Q-Learning.
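
A sketch of how the Double DQN target differs from plain DQN: the online network selects the greedy action and the target network evaluates it (illustrative PyTorch code, not the project's actual implementation):

    import torch

    def double_dqn_target(q_network, target_network, rewards, next_states, dones, gamma=0.99):
        """Online network picks the argmax action; target network supplies its value."""
        with torch.no_grad():
            best_actions = q_network(next_states).argmax(dim=1, keepdim=True)        # selection
            next_q = target_network(next_states).gather(1, best_actions).squeeze(1)  # evaluation
            return rewards + gamma * next_q * (1.0 - dones)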

14 of 16

Conclusion

  • The experiments conducted as part of this project solved the Lunar Lander environment using the DQN algorithm with soft updates (sketched below). It was observed that the soft-update parameter τ and the ε-greedy decay parameter have a significant impact on learning performance.
  • One major disadvantage of environments like Lunar Lander is that they ignore the weight of the fuel used and are a 2D simulation of a 3D environment.
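
A sketch of the soft update controlled by τ that the first bullet refers to, blending the online weights into the target network a little at a time (illustrative code):

    def soft_update(online_net, target_net, tau=1e-3):
        """target <- tau * online + (1 - tau) * target, applied parameter-wise."""
        for online_p, target_p in zip(online_net.parameters(), target_net.parameters()):
            target_p.data.copy_(tau * online_p.data + (1.0 - tau) * target_p.data)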

Reference: https://gym.openai.com/docs/

15 of 16

Future Work

  • A simulator with the OpenAI Gym interface for reinforcement learning that supports 3D environment simulation.

16 of 16

Team Name: STARLANDER: The Millennium Falcon