.. _namespace_bitrl:

Namespace bitrl
===============

.. contents:: Contents
   :local:
   :backlinks: none

Detailed Description
--------------------
Implements the Gridworld environment from the book *Deep Reinforcement Learning in Action* (Manning Publications). The original environment is available at: https://github.com/DeepReinforcementLearning/DeepReinforcementLearningInAction

Description
~~~~~~~~~~~

The Acrobot environment is based on Sutton's work in "Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding" and Sutton and Barto's book. The system consists of two links connected linearly to form a chain, with one end of the chain fixed. The joint between the two links is actuated. The goal is to apply torques on the actuated joint to swing the free end of the linear chain above a given height while starting from the initial state of hanging downwards.

In the rendered environment, the system appears as two blue links connected by two green joints. The joint between the two links is actuated. The goal is to swing the free end of the outer link up to the target height (the black horizontal line above the system) by applying torque on the actuator.
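
Action Space
~~~~~~~~~~~~

The action is discrete, deterministic, and represents the torque applied on the actuated joint between the two links.

=== ===================================== ============
Num Action                                Unit
=== ===================================== ============
0   apply -1 torque to the actuated joint torque (N m)
1   apply 0 torque to the actuated joint  torque (N m)
2   apply 1 torque to the actuated joint  torque (N m)
=== ===================================== ============
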
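
Observation Space
~~~~~~~~~~~~~~~~~

The observation is a ``ndarray`` with shape ``(6,)`` that provides information about the two rotational joint angles as well as their angular velocities:

=== ============================== =================== =================
Num Observation                    Min                 Max
=== ============================== =================== =================
0   Cosine of ``theta1``           -1                  1
1   Sine of ``theta1``             -1                  1
2   Cosine of ``theta2``           -1                  1
3   Sine of ``theta2``             -1                  1
4   Angular velocity of ``theta1`` ~ -12.566 (-4 * pi) ~ 12.566 (4 * pi)
5   Angular velocity of ``theta2`` ~ -28.274 (-9 * pi) ~ 28.274 (9 * pi)
=== ============================== =================== =================
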
where

- ``theta1`` is the angle of the first joint, where an angle of 0 indicates that the first link is pointing directly downwards.
- ``theta2`` is *relative to the angle of the first link*. An angle of 0 corresponds to having the same angle between the two links.

The angular velocities of ``theta1`` and ``theta2`` are bounded at ±4π and ±9π rad/s respectively. A state of ``[1, 0, 1, 0, ..., ...]`` indicates that both links are pointing downwards.

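
The mapping from the underlying state to the six-element observation can be sketched as follows. This is only an illustration of the table above, not the library's implementation: the helper name ``acrobot_observation`` is hypothetical, and a plain Python list stands in for the ``ndarray``.

```python
import math


def acrobot_observation(theta1, theta2, theta1_dot, theta2_dot):
    """Hypothetical helper: build the 6-element observation from the raw state."""
    return [
        math.cos(theta1),  # 0: cosine of theta1
        math.sin(theta1),  # 1: sine of theta1
        math.cos(theta2),  # 2: cosine of theta2
        math.sin(theta2),  # 3: sine of theta2
        theta1_dot,        # 4: angular velocity of theta1
        theta2_dot,        # 5: angular velocity of theta2
    ]


# Both links hanging straight down and at rest.
obs = acrobot_observation(0.0, 0.0, 0.0, 0.0)
print(obs)  # [1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

The first four entries match the ``[1, 0, 1, 0, ..., ...]`` pattern quoted above for two links pointing downwards.
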

Rewards
~~~~~~~

The goal is to have the free end reach a designated target height in as few steps as possible, and as such all steps that do not reach the goal incur a reward of -1. Achieving the target height results in termination with a reward of 0. The reward threshold is -100.

Starting State
~~~~~~~~~~~~~~

Each parameter in the underlying state (``theta1``, ``theta2``, and the two angular velocities) is initialized uniformly between -0.1 and 0.1. This means both links are pointing downwards with some initial stochasticity.
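
A minimal sketch of this initialization, using only the Python standard library. The function name ``sample_initial_state`` and the optional ``rng`` argument are assumptions made for illustration; the actual environment may draw its random numbers differently.

```python
import random


def sample_initial_state(rng=None):
    """Hypothetical helper: draw (theta1, theta2, theta1_dot, theta2_dot)
    independently and uniformly from [-0.1, 0.1]."""
    rng = rng or random.Random()
    return tuple(rng.uniform(-0.1, 0.1) for _ in range(4))


# A seeded generator is used here only to make the sketch repeatable.
state = sample_initial_state(random.Random(0))
print(state)
```
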

Episode End
~~~~~~~~~~~

The episode ends if one of the following occurs:

1. Termination: The free end reaches the target height, which is constructed as: ``-cos(theta1) - cos(theta2 + theta1) > 1.0``
2. Truncation: Episode length is greater than 500 (200 for v0)

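
The termination test and the reward rule described above can be sketched together as follows. The names ``reached_target``, ``step_outcome`` and ``MAX_STEPS`` are hypothetical, and the exact step at which the time limit fires is an assumption taken from the wording "greater than 500".

```python
import math

MAX_STEPS = 500  # v1 limit quoted above (200 for v0)


def reached_target(theta1, theta2):
    """Termination condition: the free end is above the target height."""
    return -math.cos(theta1) - math.cos(theta2 + theta1) > 1.0


def step_outcome(theta1, theta2, step_count):
    """Return (reward, terminated, truncated) for the state reached after a step."""
    terminated = reached_target(theta1, theta2)
    truncated = not terminated and step_count > MAX_STEPS
    reward = 0.0 if terminated else -1.0
    return reward, terminated, truncated


print(step_outcome(0.0, 0.0, 1))       # (-1.0, False, False): hanging down
print(step_outcome(math.pi, 0.0, 10))  # (0.0, True, False): fully upright
```
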

Arguments
~~~~~~~~~

No additional arguments are currently supported.

.. code-block:: python

   import gym

   env = gym.make('Acrobot-v1')

By default, the dynamics of the acrobot follow those described in Sutton and Barto's book Reinforcement Learning: An Introduction. However, a ``book_or_nips`` parameter can be modified to change the pendulum dynamics to those described in the original NeurIPS paper.

.. code-block:: python

   # To change the dynamics as described above
   env.env.book_or_nips = 'nips'

See the following note and the implementation for details:

   The dynamics equations were missing some terms in the NIPS paper which are present in the book. R. Sutton confirmed in personal correspondence that the experimental results shown in the paper and the book were generated with the equations shown in the book. However, there is the option to run the domain with the paper equations by setting ``book_or_nips = 'nips'``.


Version History
~~~~~~~~~~~~~~~

- v1: Maximum number of steps increased from 200 to 500. The observation space for v0 provided direct readings of ``theta1`` and ``theta2`` in radians, having a range of ``[-pi, pi]``. The v1 observation space as described here provides the sine and cosine of each angle instead.
- v0: Initial version release (1.0.0) (removed from gym for v1)


References
~~~~~~~~~~

- Sutton, R. S. (1996). Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. In D. Touretzky, M. C. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing Systems (Vol. 8). MIT Press. https://proceedings.neurips.cc/paper/1995/file/8f1d43620bc6bb580df6e80b0dc05c48-Paper.pdf
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press.

CartPole environment. The original environment is described here: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py

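The state variables are:

Observation:
   Type: Box(4)

   === ===================== ==================== ==================
   Num Observation           Min                  Max
   === ===================== ==================== ==================
   0   Cart Position         -2.4                 2.4
   1   Cart Velocity         -Inf                 Inf
   2   Pole Angle            -0.209 rad (-12 deg) 0.209 rad (12 deg)
   3   Pole Angular Velocity -Inf                 Inf
   === ===================== ==================== ==================
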
Actions:
   Type: Discrete(2)

   === ======================
   Num Action
   === ======================
   0   Push cart to the left
   1   Push cart to the right
   === ======================

   Note: The amount by which the velocity is reduced or increased is not fixed; it depends on the angle at which the pole is pointing. This is because the center of gravity of the pole increases the amount of energy needed to move the cart underneath it.

Reward:
   Reward is 1 for every step taken, including the termination step.

Starting State:
   All observations are assigned a uniform random value in [-0.05, 0.05].

Episode Termination:
   - Pole Angle is more than 12 degrees.
   - Cart Position is more than 2.4 (center of the cart reaches the edge of the display).
   - Episode length is greater than 200.

Solved Requirements:
   Considered solved when the average return is greater than or equal to 195.0 over 100 consecutive trials.

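
The termination conditions and the "solved" criterion above can be sketched as two small checks. The function and constant names are hypothetical, and applying the angle and position limits symmetrically (via ``abs``) is an assumption based on the Min/Max columns of the observation table; this is not the library's implementation.

```python
import math

ANGLE_LIMIT = math.radians(12)  # about 0.209 rad
POSITION_LIMIT = 2.4
MAX_STEPS = 200


def episode_over(cart_position, pole_angle, step_count):
    """True if any of the three termination conditions holds."""
    return (
        abs(pole_angle) > ANGLE_LIMIT
        or abs(cart_position) > POSITION_LIMIT
        or step_count > MAX_STEPS
    )


def is_solved(episode_returns, window=100, threshold=195.0):
    """True if the mean return over the last `window` episodes meets the threshold."""
    if len(episode_returns) < window:
        return False
    recent = episode_returns[-window:]
    return sum(recent) / window >= threshold


print(episode_over(0.0, 0.1, 50))  # False: within all limits
print(episode_over(2.5, 0.0, 50))  # True: cart left the track
print(is_solved([200.0] * 100))    # True
print(is_solved([200.0] * 99))     # False: fewer than 100 episodes
```
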
Pendulum environment. The original environment is described here: https://github.com/openai/gym/blob/master/gym/envs/classic_control/pendulum.py

The environment and its state variables are described below.

Description
~~~~~~~~~~~

The inverted pendulum swingup problem is based on the classic problem in control theory.
The system consists of a pendulum attached at one end to a fixed point, and the other end being free.
The pendulum starts in a random position and the goal is to apply torque on the free end to swing it
into an upright position, with its center of gravity right above the fixed point.

The coordinate system used for the implementation of the pendulum's dynamic equations is defined as follows:

- ``x-y``: Cartesian coordinates of the pendulum's end in meters.
- ``theta``: angle in radians.
- ``tau``: torque in N m. Defined as positive counter-clockwise.


Action Space
~~~~~~~~~~~~

The action is a ``ndarray`` with shape ``(1,)`` representing the torque applied to the free end of the pendulum.

=== ====== ==== ===
Num Action Min  Max
=== ====== ==== ===
0   Torque -2.0 2.0
=== ====== ==== ===

Observation Space
~~~~~~~~~~~~~~~~~

The observation is a ``ndarray`` with shape ``(3,)`` representing the x-y coordinates of the pendulum's free end and its angular velocity.

=== ================ ==== ===
Num Observation      Min  Max
=== ================ ==== ===
0   x = cos(theta)   -1.0 1.0
1   y = sin(theta)   -1.0 1.0
2   Angular Velocity -8.0 8.0
=== ================ ==== ===

Rewards
~~~~~~~

The reward function is defined as:

.. math::

   r = -\left(\theta^2 + 0.1\,\dot{\theta}^2 + 0.001\,\tau^2\right)

where :math:`\theta` is the pendulum's angle normalized between :math:`[-\pi, \pi]` (with 0 being in the upright position), :math:`\dot{\theta}` is the angular velocity (``theta_dt``) and :math:`\tau` is the applied torque. Based on the above equation, the minimum reward that can be obtained is :math:`-(\pi^2 + 0.1 \cdot 8^2 + 0.001 \cdot 2^2) = -16.2736044`, while the maximum reward is zero (pendulum is upright with zero velocity and no torque applied).

Starting State
~~~~~~~~~~~~~~

The starting state is a random angle in *[-pi, pi]* and a random angular velocity in *[-1, 1]*.

Episode Truncation
~~~~~~~~~~~~~~~~~~

The episode truncates at 200 time steps.

Arguments
~~~~~~~~~

- ``g``: acceleration of gravity, measured in m/s², used to calculate the pendulum dynamics. The default value is ``g = 10.0``.

.. code-block:: python

   gym.make('Pendulum-v1', g=9.81)

Version History
~~~~~~~~~~~~~~~

- v1: Simplify the math equations, no difference in behavior.
- v0: Initial version release (1.0.0).

Vector Acrobot environment. This class simply wraps copies of the Acrobot class. See https://github.com/pockerman/rlenvs_from_cpp/blob/master/src/rlenvs/envs/gymnasium/classic_control/acrobot_env.h for more information.

Base class for Gymnasium vector environments. See: https://gymnasium.farama.org/api/vector/sync_vector_env/

BlackJack environment: https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/toy_text/blackjack.py

This is a simple implementation of the Gridworld Cliff reinforcement learning task.

Description
~~~~~~~~~~~

The board is a 4x12 matrix, with (using NumPy matrix indexing):

- ``[3, 0]`` as the start at bottom-left
- ``[3, 11]`` as the goal at bottom-right
- ``[3, 1..10]`` as the cliff at bottom-center

If the agent steps on the cliff it returns to the start. An episode terminates when the agent reaches the goal.

Actions
~~~~~~~

There are 4 discrete deterministic actions:

- 0: move up
- 1: move right
- 2: move down
- 3: move left

Observations
~~~~~~~~~~~~

There are 3x12 + 1 possible states. The agent cannot be at the cliff, nor at the goal (as this results in the end of the episode). What remains are all the positions of the first 3 rows plus the bottom-left cell. The observation is simply the current position encoded as a flattened index.

Reward
~~~~~~

Each time step incurs -1 reward, and stepping into the cliff incurs -100 reward.

Wrapper to the FrozenLake OpenAI-Gym environment. The original environment can be found at: https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py

Frozen lake involves crossing a frozen lake from Start(S) to Goal(G) without falling into any Holes(H). The agent may not always move in the intended direction due to the slippery nature of the frozen lake.

The agent takes a 1-element vector for actions. The action space is ``(dir)``, where ``dir`` decides the direction to move in, which can be:

- 0: LEFT
- 1: DOWN
- 2: RIGHT
- 3: UP

The observation is a value representing the agent's current position as ``current_row * nrows + current_col``.

Reward schedule:

- Reach goal(G): +1
- Reach hole(H): 0

Arguments
~~~~~~~~~

.. code-block:: python

   gym.make('FrozenLake-v0', desc=None, map_name="4x4", is_slippery=True)

``desc``: Used to specify a custom map for frozen lake. For example, ``desc=["SFFF", "FHFH", "FFFH", "HFFG"]``.

``map_name``: ID to use any of the preloaded maps.

::

   "4x4": [
       "SFFF",
       "FHFH",
       "FFFH",
       "HFFG"
   ]

   "8x8": [
       "SFFFFFFF",
       "FFFFFFFF",
       "FFFHFFFF",
       "FFFFFHFF",
       "FFFHFFFF",
       "FHHFFFHF",
       "FHFFHFHF",
       "FFFHFFFG",
   ]

``is_slippery``: True/False. If True, the agent moves in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3.

For example, if action is left and ``is_slippery`` is True, then:

- P(move left)=1/3
- P(move up)=1/3
- P(move down)=1/3

Namespaces
----------

- :ref:`namespace_bitrl__boards`
- :ref:`namespace_bitrl__consts`
- :ref:`namespace_bitrl__dynamics`
- :ref:`namespace_bitrl__envs`
- :ref:`namespace_bitrl__network`
- :ref:`namespace_bitrl__rigid_bodies`
- :ref:`namespace_bitrl__sensors`
- :ref:`namespace_bitrl__utils`

Classes
-------

- :ref:`exhale_struct_structbitrl_1_1ActiveBoundaryObject`
- :ref:`exhale_struct_structbitrl_1_1IntegralRange`
- :ref:`exhale_struct_structbitrl_1_1IsActive`
- :ref:`exhale_struct_structbitrl_1_1NotNull`
- :ref:`exhale_struct_structbitrl_1_1Null`
- :ref:`exhale_struct_structbitrl_1_1RealRange`
- :ref:`exhale_struct_structbitrl_1_1TimeStepEnumUtils`
- :ref:`exhale_class_classbitrl_1_1FilteredIterator`
- :ref:`exhale_class_classbitrl_1_1TimeStep`
- :ref:`exhale_class_classbitrl_1_1VectorTimeStep`

Enums
-----

- :ref:`exhale_enum_namespacebitrl_1ac5500d0e4543f0036e7d0d63c3c22b26`
- :ref:`exhale_enum_namespacebitrl_1a951d92c04fcc4301b396dbf4ecdcba39`

Functions
---------

- :ref:`exhale_function_namespacebitrl_1a00ff4d27cf5c7e1eca4701f1d8fb7cff`
- :ref:`exhale_function_namespacebitrl_1a29cb40c2c2b76f135b58277f7ad73609`
- :ref:`exhale_function_namespacebitrl_1aa57e26e1939ecd74d7cf9ed2e05f32a0`
- :ref:`exhale_function_namespacebitrl_1a28a7e3c53ab670d73a596975c1ce857d`
- :ref:`exhale_function_namespacebitrl_1afc22377f36b642800f9288d986c1e99c`
- :ref:`exhale_function_namespacebitrl_1af99d77c0cd18554f0d561b5333ce3b46`

Typedefs
--------

- :ref:`exhale_typedef_namespacebitrl_1ad6b11d89451cb9d4c3c3ea1c82d10c6d`
- :ref:`exhale_typedef_namespacebitrl_1ad1fd5d32de6eedc600d2bbb687bad2c4`
- :ref:`exhale_typedef_namespacebitrl_1a459269f8ebaff2925a3b572da8aafb93`
- :ref:`exhale_typedef_namespacebitrl_1ac98298f6802b19997411b413d2a37315`
- :ref:`exhale_typedef_namespacebitrl_1a47cffc659ff83ce16839d32aeff0ca4b`
- :ref:`exhale_typedef_namespacebitrl_1a19def504f2054a18db8447a6fd9f1e19`
- :ref:`exhale_typedef_namespacebitrl_1ad3836bc7b4da4d780de3e614383ad323`
- :ref:`exhale_typedef_namespacebitrl_1a13455a029619534924ad95d9abd789c0`
- :ref:`exhale_typedef_namespacebitrl_1a0cac0d7ccc6723231e71a79a903283e6`
- :ref:`exhale_typedef_namespacebitrl_1a2cfbf420e06429f24937971c0d91d51c`
- :ref:`exhale_typedef_namespacebitrl_1abb71863f92b5b11823ba504b4ed2da76`
- :ref:`exhale_typedef_namespacebitrl_1a2f4504e90084ab8d017fb11d685b01bb`
- :ref:`exhale_typedef_namespacebitrl_1a3bcb757c2300ab4a3f3a7dc6a76a1f27`
- :ref:`exhale_typedef_namespacebitrl_1a6da8b4ebf67cd8fcb142253d07ec4d62`
- :ref:`exhale_typedef_namespacebitrl_1a178223fb0601bbeef6e5c683a0aa4e6c`
- :ref:`exhale_typedef_namespacebitrl_1a9f50696a8a0f5598cda4d9cbd3d2fd59`
- :ref:`exhale_typedef_namespacebitrl_1adff4d2ecda6aac2cec3fc93c217913cd`
- :ref:`exhale_typedef_namespacebitrl_1a209a89e3400e2a55d582bda7ccbec7de`
- :ref:`exhale_typedef_namespacebitrl_1acb84417cb7029b5e754045082d48d385`
- :ref:`exhale_typedef_namespacebitrl_1a84c96b350ee9feed7262273146a5ad11`