Namespace bitrl

Detailed Description

todo write docs

Implements the Gridworld environment from the book Deep Reinforcement Learning in Action by Manning publications. You can find the original environment here: https://github.com/DeepReinforcementLearning/DeepReinforcementLearningInAction

Description

The Acrobot environment is based on Sutton’s work in “Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding” and Sutton and Barto’s book. The system consists of two links connected linearly to form a chain, with one end of the chain fixed. The joint between the two links is actuated. The goal is to apply torques on the actuated joint to swing the free end of the linear chain above a given height while starting from the initial state of hanging downwards.

When rendered, the system appears as two blue links connected by two green joints. The joint between the two links is actuated. The goal is to swing the free end of the outer link up to the target height (the black horizontal line above the system) by applying torque on the actuator.

Action Space

The action is discrete, deterministic, and represents the torque applied on the actuated joint between the two links.

Num   Action                                   Unit
0     apply -1 torque to the actuated joint    torque (N m)
1     apply 0 torque to the actuated joint     torque (N m)
2     apply 1 torque to the actuated joint     torque (N m)

Observation Space

The observation is an ndarray with shape (6,) that provides information about the two rotational joint angles as well as their angular velocities:

Num   Observation                   Min                     Max
0     Cosine of theta1              -1                      1
1     Sine of theta1                -1                      1
2     Cosine of theta2              -1                      1
3     Sine of theta2                -1                      1
4     Angular velocity of theta1    ~ -12.567 (-4 * pi)     ~ 12.567 (4 * pi)
5     Angular velocity of theta2    ~ -28.274 (-9 * pi)     ~ 28.274 (9 * pi)

where

  • theta1 is the angle of the first joint, where an angle of 0 indicates the first link is pointing directly downwards.

  • theta2 is relative to the angle of the first link. An angle of 0 corresponds to having the same angle between the two links.

The angular velocities of theta1 and theta2 are bounded at ±4π, and ±9π rad/s respectively. A state of [1, 0, 1, 0, ..., ...] indicates that both links are pointing downwards.
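The mapping from the underlying state to the observation can be sketched as follows. This is a minimal illustration based only on the table above; the function name acrobot_observation is hypothetical and is not part of the library.

```python
import math

def acrobot_observation(theta1, theta2, dtheta1, dtheta2):
    """Map the 4-value underlying state to the 6-value observation.

    Hypothetical helper: the ordering follows the observation table
    (cos/sin of theta1, cos/sin of theta2, then both angular velocities).
    """
    return [
        math.cos(theta1), math.sin(theta1),
        math.cos(theta2), math.sin(theta2),
        dtheta1, dtheta2,
    ]

# Both links hanging straight down and at rest.
obs = acrobot_observation(0.0, 0.0, 0.0, 0.0)
```

With both angles at zero the first four entries are 1, 0, 1, 0, which matches the "both links pointing downwards" state described above.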

Rewards

The goal is to have the free end reach a designated target height in as few steps as possible, and as such all steps that do not reach the goal incur a reward of -1. Achieving the target height results in termination with a reward of 0. The reward threshold is -100.

Starting State

Each parameter in the underlying state (theta1, theta2, and the two angular velocities) is initialized uniformly between -0.1 and 0.1. This means both links are pointing downwards with some initial stochasticity.
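As a rough sketch of this initialization, the snippet below draws the four underlying state values with Python's standard random module. The helper name sample_initial_state and the ordering of the returned values are assumptions made for illustration; the actual environment uses its own random number generator.

```python
import random

def sample_initial_state(rng):
    """Draw theta1, theta2 and both angular velocities from U(-0.1, 0.1).

    Hypothetical helper; the value ordering is an assumption.
    """
    return [rng.uniform(-0.1, 0.1) for _ in range(4)]

rng = random.Random(0)  # seeded only to make the sketch repeatable
state = sample_initial_state(rng)
```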

Episode End

The episode ends if one of the following occurs:

  1. Termination: The free end reaches the target height, which is constructed as: -cos(theta1) - cos(theta2 + theta1) > 1.0

  2. Truncation: Episode length is greater than 500 (200 for v0)
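The termination condition above can be checked directly from the two joint angles. The following is a small sketch of that test; the function name acrobot_terminated is hypothetical.

```python
import math

def acrobot_terminated(theta1, theta2):
    """Return True when the free end is above the target height.

    Implements the condition -cos(theta1) - cos(theta2 + theta1) > 1.0
    quoted in the list above.
    """
    return -math.cos(theta1) - math.cos(theta2 + theta1) > 1.0

hanging = acrobot_terminated(0.0, 0.0)      # -1 - 1 = -2, not above the line
upright = acrobot_terminated(math.pi, 0.0)  # 1 + 1 = 2, above the line
```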

Arguments

No additional arguments are currently supported.

env = gym.make('Acrobot-v1')

By default, the dynamics of the acrobot follow those described in Sutton and Barto’s book Reinforcement Learning: An Introduction. However, a book_or_nips parameter can be modified to change the pendulum dynamics to those described in the original NeurIPS paper.

# To change the dynamics as described above
env.env.book_or_nips = 'nips'

See the following note and the implementation for details:

The dynamics equations were missing some terms in the NIPS paper which are present in the book. R. Sutton confirmed in personal correspondence that the experimental results shown in the paper and the book were generated with the equations shown in the book. However, there is the option to run the domain with the paper equations by setting book_or_nips = 'nips'.

Version History

  • v1: Maximum number of steps increased from 200 to 500. The observation space for v0 provided direct readings of theta1 and theta2 in radians, having a range of [-pi, pi]. The v1 observation space as described here provides the sine and cosine of each angle instead.

  • v0: Initial versions release (1.0.0) (removed from gym for v1)

References

CartPole environment. The original environment is described here: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py The state variables are:

Observation:

Type: Box(4)

Num   Observation              Min                    Max
0     Cart Position            -2.4                   2.4
1     Cart Velocity            -Inf                   Inf
2     Pole Angle               -0.209 rad (-12 deg)   0.209 rad (12 deg)
3     Pole Angular Velocity    -Inf                   Inf

Actions:

Type: Discrete(2)

Num   Action
0     Push cart to the left
1     Push cart to the right

Note: The amount by which the velocity is reduced or increased is not fixed; it depends on the angle the pole is pointing. This is because the center of gravity of the pole increases the amount of energy needed to move the cart underneath it.

Reward:

Reward is 1 for every step taken, including the termination step

Starting State:

All observations are assigned a uniform random value in [-0.05..0.05]

Episode Termination:

  • Pole Angle is more than 12 degrees.

  • Cart Position is more than 2.4 (center of the cart reaches the edge of the display).

  • Episode length is greater than 200.

Solved Requirements:

Considered solved when the average return is greater than or equal to 195.0 over 100 consecutive trials.
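The episode-end and solved conditions for CartPole can be sketched as two small predicates. Both function names are hypothetical, and treating the limits as symmetric around zero is an assumption drawn from the observation bounds listed above.

```python
import math

ANGLE_LIMIT = 12 * math.pi / 180  # 12 degrees in radians (about 0.209)
POSITION_LIMIT = 2.4
MAX_STEPS = 200

def cartpole_done(cart_position, pole_angle, steps):
    """True when any of the three episode-end conditions holds."""
    return (
        abs(pole_angle) > ANGLE_LIMIT
        or abs(cart_position) > POSITION_LIMIT
        or steps > MAX_STEPS
    )

def is_solved(episode_returns):
    """True when the mean return of the last 100 episodes is >= 195.0."""
    if len(episode_returns) < 100:
        return False
    return sum(episode_returns[-100:]) / 100 >= 195.0
```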

Pendulum environment. The original environment is described here: https://github.com/openai/gym/blob/master/gym/envs/classic_control/pendulum.py The state variables are:

Description

The inverted pendulum swingup problem is based on the classic problem in control theory. The system consists of a pendulum attached at one end to a fixed point, and the other end being free. The pendulum starts in a random position and the goal is to apply torque on the free end to swing it into an upright position, with its center of gravity right above the fixed point.

The diagram below specifies the coordinate system used for the implementation of the pendulum’s dynamic equations.

[Figure: Pendulum Coordinate System (./diagrams/pendulum.png)]

  • x-y: cartesian coordinates of the pendulum’s end in meters.

  • theta: angle in radians.

  • tau: torque in N m. Defined as positive counter-clockwise.

Action Space

The action is an ndarray with shape (1,) representing the torque applied to the free end of the pendulum.

Num   Action   Min    Max
0     Torque   -2.0   2.0

Observation Space

The observation is an ndarray with shape (3,) representing the x-y coordinates of the pendulum’s free end and its angular velocity.

Num   Observation        Min    Max
0     x = cos(theta)     -1.0   1.0
1     y = sin(theta)     -1.0   1.0
2     Angular Velocity   -8.0   8.0

Rewards

The reward function is defined as:

$r = -(\theta^2 + 0.1 \cdot \dot{\theta}^2 + 0.001 \cdot \tau^2)$

where $\theta$ is the pendulum’s angle normalized between $[-\pi, \pi]$ (with 0 being the upright position), $\dot{\theta}$ is its angular velocity, and $\tau$ is the applied torque. Based on the above equation, the minimum reward that can be obtained is $-(\pi^2 + 0.1 \cdot 8^2 + 0.001 \cdot 2^2) = -16.2736044$, while the maximum reward is zero (pendulum is upright with zero velocity and no torque applied).
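The reward can be sketched in a few lines of Python. The function names are hypothetical, and the angle normalization shown here is one common way of wrapping an angle into [-pi, pi]; it is an assumption rather than a statement about the library's own code.

```python
import math

def normalize_angle(theta):
    """Wrap an angle in radians into the interval [-pi, pi)."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi

def pendulum_reward(theta, theta_dt, torque):
    """Negative quadratic cost on angle, angular velocity and torque."""
    th = normalize_angle(theta)
    return -(th ** 2 + 0.1 * theta_dt ** 2 + 0.001 * torque ** 2)

# Worst case: hanging straight down at maximum speed with maximum torque.
worst = pendulum_reward(math.pi, 8.0, 2.0)
# Best case: upright, at rest, no torque.
best = pendulum_reward(0.0, 0.0, 0.0)
```

The worst-case value agrees with the minimum reward of about -16.27 stated above, and the best case is zero.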

Starting State

The starting state is a random angle in [-pi, pi] and a random angular velocity in [-1,1].

Episode Truncation

The episode truncates at 200 time steps.

Arguments

  • g: acceleration of gravity measured in (m s^-2) used to calculate the pendulum dynamics. The default value is g = 10.0.

gym.make('Pendulum-v1', g=9.81)

Version History

  • v1: Simplify the math equations, no difference in behavior.

  • v0: Initial versions release (1.0.0).

Vector Acrobot environment. This class simply wraps copies of the Acrobot class. See: https://github.com/pockerman/rlenvs_from_cpp/blob/master/src/rlenvs/envs/gymnasium/classic_control/acrobot_env.h for more information

Base class for Gymnasium vector environments. See: https://gymnasium.farama.org/api/vector/sync_vector_env/

BlackJack environment https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/toy_text/blackjack.py

This is a simple implementation of the Gridworld Cliff reinforcement learning task.

Description

The board is a 4x12 matrix, with (using NumPy matrix indexing):

  • [3, 0] as the start at bottom-left

  • [3, 11] as the goal at bottom-right

  • [3, 1..10] as the cliff at bottom-center

If the agent steps on the cliff it returns to the start. An episode terminates when the agent reaches the goal.

Actions

There are 4 discrete deterministic actions:

  • 0: move up

  • 1: move right

  • 2: move down

  • 3: move left

Observations

There are 3x12 + 1 possible states. The agent cannot be on the cliff, nor at the goal (as reaching it ends the episode). What remains is every position in the first 3 rows plus the bottom-left cell. The observation is simply the current position encoded as a flattened index.
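The state count can be verified with a short sketch. Row-major flattening (row * number of columns + column) is assumed here, since the text only says "flattened index"; the helper name to_index is hypothetical.

```python
N_ROWS, N_COLS = 4, 12

def to_index(row, col):
    """Flatten a (row, col) board position, assuming row-major order."""
    return row * N_COLS + col

start = to_index(3, 0)
goal = to_index(3, 11)
cliff = {to_index(3, col) for col in range(1, 11)}

# States the agent can actually occupy: everything except cliff and goal.
reachable = [
    s for s in range(N_ROWS * N_COLS) if s not in cliff and s != goal
]
```

The list has 37 entries, which is the 3x12 + 1 figure given above.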

Reward

Each time step incurs -1 reward, and stepping into the cliff incurs -100 reward.
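Putting the board layout, actions and rewards together, a single transition might look like the sketch below. Keeping the agent in place when a move would leave the board is an assumption, as the text does not describe border behaviour; the function name step is also only illustrative.

```python
N_ROWS, N_COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
# Action number to (row delta, column delta): up, right, down, left.
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

def step(position, action):
    """Return (next_position, reward, terminated) for one move."""
    d_row, d_col = MOVES[action]
    # Assumption: moves off the board are clipped to the border.
    row = min(max(position[0] + d_row, 0), N_ROWS - 1)
    col = min(max(position[1] + d_col, 0), N_COLS - 1)
    if row == 3 and 1 <= col <= 10:
        # Stepped on the cliff: large penalty and back to the start.
        return START, -100, False
    return (row, col), -1, (row, col) == GOAL
```

For example, moving right from the start lands on the cliff and returns the agent to the start with a reward of -100, while moving down from [2, 11] reaches the goal and ends the episode.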

Wrapper to the FrozenLake OpenAI-Gym environment. The original environment can be found at: https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py Frozen lake involves crossing a frozen lake from Start (S) to Goal (G) without falling into any Holes (H). The agent may not always move in the intended direction due to the slippery nature of the frozen lake.

The agent takes a 1-element vector for actions. The action space is (dir), where dir decides the direction to move in, which can be:

  • 0: LEFT

  • 1: DOWN

  • 2: RIGHT

  • 3: UP

The observation is a value representing the agent’s current position as current_row * nrows + current_col.

Reward schedule:

  • Reach goal(G): +1

  • Reach hole(H): 0

Arguments

gym.make('FrozenLake-v0', desc=None, map_name="4x4", is_slippery=True)

  • desc: Used to specify a custom map for frozen lake. For example, desc=["SFFF", "FHFH", "FFFH", "HFFG"].

  • map_name: ID to use any of the preloaded maps.

    "4x4": ["SFFF", "FHFH", "FFFH", "HFFG"]

    "8x8": ["SFFFFFFF", "FFFFFFFF", "FFFHFFFF", "FFFFFHFF", "FFFHFFFF", "FHHFFFHF", "FHFFHFHF", "FFFHFFFG"]

  • is_slippery: True/False. If True, the agent moves in the intended direction with probability 1/3, and otherwise moves in one of the two perpendicular directions, each with probability 1/3. For example, if action is left and is_slippery is True, then:

  • P(move left)=1/3

  • P(move up)=1/3

  • P(move down)=1/3
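The slippery behaviour can be sketched as a distribution over the direction actually taken. With the action numbering used above (0: LEFT, 1: DOWN, 2: RIGHT, 3: UP), the two perpendicular directions of an action are its neighbours modulo 4. The function name slippery_distribution is hypothetical.

```python
from fractions import Fraction

ACTION_NAMES = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}

def slippery_distribution(action, is_slippery=True):
    """Map each direction actually taken to its probability."""
    if not is_slippery:
        return {action: Fraction(1)}
    # Intended direction plus the two perpendicular ones, 1/3 each.
    return {(action + d) % 4: Fraction(1, 3) for d in (-1, 0, 1)}

dist = slippery_distribution(0)  # intended action: LEFT
named = {ACTION_NAMES[a]: p for a, p in dist.items()}
```

For the LEFT action this yields LEFT, UP and DOWN with probability 1/3 each, matching the example above.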
