Investigating the Impact of Choice on Deep Reinforcement Learning for Space Controls (2024)

Nathaniel Hamilton§
Parallax Advanced Research, Beavercreek, OH, USA
nathaniel.hamilton@parallaxresearch.org

Kyle Dunlap§
Parallax Advanced Research, Beavercreek, OH, USA
kyle.dunlap@parallaxresearch.org

Kerianne L. Hobbs
Autonomy Capability Team (ACT3), Air Force Research Laboratory, Wright-Patterson Air Force Base, USA
kerianne.hobbs@us.af.mil

Abstract

For many space applications, traditional control methods are often used during operation. However, as the number of space assets continues to grow, autonomous operation can enable rapid development of control methods for different space-related tasks. One method of developing autonomous control is Reinforcement Learning (RL), which has become increasingly popular after demonstrating promising performance and success across many complex tasks. While it is common for RL agents to learn bounded continuous control values, this may not be realistic or practical for many space tasks that traditionally prefer an on/off approach to control. This paper analyzes the use of discrete action spaces, where the agent must choose from a predefined list of actions. The experiments explore how the number of choices provided to the agents affects their measured performance during and after training. This analysis is conducted for an inspection task, where the agent must circumnavigate an object to inspect points on its surface, and a docking task, where the agent must move into proximity of another spacecraft and “dock” with a low relative speed. A common objective of both tasks, and of most space tasks in general, is to minimize fuel usage, which motivates the agent to regularly choose an action that uses no fuel. Our results show that a limited number of discrete choices leads to optimal performance for the inspection task, while continuous control leads to optimal performance for the docking task.

Index Terms:

Deep Reinforcement Learning, Aerospace Control, Ablation Study

§ These authors contributed equally to this work.

I Introduction

Autonomous spacecraft operation is critical as the number of space assets grows and operations become more complex. For On-orbit Servicing, Assembly, and Manufacturing (OSAM) missions, tasks such as inspection and docking enable operators to assess, plan for, and execute different objectives. While these tasks are traditionally executed using classical control methods, this requires constant monitoring and adjustment by human operators, which becomes challenging or even impossible as the complexity of the task increases. To this end, the importance of developing high-performing autonomy is growing.

Reinforcement Learning (RL) is a fast-growing field for developing high-performing autonomy, spurred by success in agents that learn to beat human experts in games like Go [1] and StarCraft [2]. RL is promising for spacecraft operations due to its ability to react in real time to changing mission objectives and environment uncertainty [3, 4]. Previous works demonstrate using RL to develop waypoints for an inspection mission [5, 6] and to inspect an uncooperative space object [7]. Additionally, RL has been used for similar docking problems, including a six degree-of-freedom docking task [8], avoiding collisions during docking [9], and guidance for docking [10]. Despite these successes, for RL solutions to be used in the real world, they must control the spacecraft in a way that is acceptable to human operators.

Spacecraft control designers and operators typically prefer the spacecraft to choose from a set of discrete actions, where the thrusters are either fully on or off. In general, this follows Pontryagin’s maximum principle [11], which minimizes a cost function to find an optimal trajectory from one state to another. In this case, the cost is fuel use. In contrast, it is common for RL agents to operate in a continuous control space at a specified frequency, where control values can be any value within a certain range. Transitioning from a continuous control space to a discrete one can result in choppy control outputs with poor performance when the discretization is coarse, or an oversized policy that takes too long to train when the discretization is fine [12].

In this paper, we compare RL agents trained using continuous control and this classical control principle to determine their advantages and identify special cases. Our experiments focus on two spacecraft tasks: inspection (viewing the surface of another vehicle) and docking (approaching and joining with another vehicle). This paper builds on previous work done using RL to solve the inspection task with illumination [13] and the docking task [3]. For the same docking task, the effect of Run Time Assurance during RL training was analyzed [14, 15]. For a 2D version of the docking task, LQR control was compared to a bang-bang controller [16], similar to the agent with three discrete choices that will be analyzed in this paper.

The main contributions of this work include answering the following questions.

Q 1. Will increasing the likelihood of choosing “no thrust” improve fuel efficiency?

Fuel efficiency is critical in space missions because fuel is a limited resource that must last beyond any single task. The most effective method to minimize fuel use is for the agent to choose “no thrust”. To this end, we explore two different ways of increasing the likelihood of choosing “no thrust”: (1) transitioning from a continuous to a discrete action space, and (2) decreasing the action space magnitude, so that the continuous range of values is smaller. The results are found in Section V-A.

Q 2. Does a smaller action magnitude or finer granularity matter more at different operating ranges?

The inspection and docking tasks require different operating ranges. For inspection, the agent must circumnavigate the chief, and staying farther away provides better coverage of the surface. In contrast, the agent must move close to the chief in order to complete the docking task. To this end, we explore how the operating range impacts the need for either smaller action magnitudes to choose from or finer granularity of choices. The results are found in Section V-B.

Q 3. Is there an optimal balance between discrete and continuous actions?

While RL agents often perform better with continuous actions, discrete actions would likely make them more acceptable for real-world use. To this end, we explore whether a balance can be found between discrete and continuous control that provides an optimal solution suitable for both RL training and real-world operation. The results are found in Section V-C.

II Deep Reinforcement Learning

Reinforcement Learning (RL) is a form of machine learning in which an agent acts in an environment, learns through experience, and increases its performance based on rewarded behavior. Deep Reinforcement Learning (DRL) is a newer branch of RL in which a neural network is used to approximate the behavior function, i.e. the policy $\pi_\phi$. The agent uses a Neural Network Controller (NNC) trained by the RL algorithm to take actions in the environment, which can be comprised of any dynamical system, from Atari simulations [17, 18] to complex robotics scenarios [19, 20, 21, 22, 23, 24].

Reinforcement learning is based on the reward hypothesis that all goals can be described by the maximization of expected return, i.e. the cumulative reward [25]. During training, the agent chooses an action, $\boldsymbol{u}_{NN}$, based on the input observation, $\boldsymbol{o}$. The action is then executed in the environment, updating the internal state, $\boldsymbol{s}$, according to the plant dynamics. The updated state, $\boldsymbol{s}$, is then assigned a scalar reward, $r$, and transformed into the next observation vector. The process of executing an action and receiving a reward and next observation is referred to as a timestep. Relevant values, like the input observation, action, and reward, are collected as a data tuple, i.e. sample, by the RL algorithm to update the current NNC policy, $\pi_\phi$, to an improved policy, $\pi_\phi^*$. How often these updates are done depends on the RL algorithm.

In this work, we focus solely on Proximal Policy Optimization (PPO) as our DRL algorithm of choice. PPO has demonstrated success in the space domain for multiple tasks and excels in finding optimal policies across many other domains [26, 15, 3, 14, 13]. Additionally, PPO works for both discrete and continuous action spaces, allowing us to test both types of action spaces without switching algorithms. For RL, the action space is typically defined as either discrete or continuous. For a discrete action space, the agent has a finite set of choices for the action. For a continuous action space, the agent can choose any value for the action within a given range, and can therefore be thought of as having infinite choices. In general, discrete action spaces tend to be used for simple tasks while continuous action spaces tend to be used for more complex tasks.
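To make the distinction concrete, the sketch below shows how the two kinds of action space might be declared for a three-axis thrust command. The Gymnasium-style `spaces` API and the per-axis `MultiDiscrete` encoding are assumptions made here for illustration; the paper does not specify its implementation.

```python
import numpy as np
from gymnasium import spaces

u_max = 1.0  # thrust bound per axis, N

# Continuous action space: any thrust in [-u_max, u_max] on each of the 3 axes.
continuous_space = spaces.Box(low=-u_max, high=u_max, shape=(3,), dtype=np.float32)

# Discrete action space: each axis picks from a finite list of thrust values,
# here the 3-choice case [-u_max, 0, u_max].
choices = np.array([-u_max, 0.0, u_max])
discrete_space = spaces.MultiDiscrete([len(choices)] * 3)

# A discrete policy outputs one index per axis; map indices back to thrust values.
idx = discrete_space.sample()   # e.g. array([2, 0, 1])
thrust = choices[idx]           # e.g. array([ 1., -1.,  0.])
```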

Due to the stochastic nature of RL, the agent typically selects random action values at the beginning of the training process. It then learns from this experience, and selects the actions that maximize the reward function. By using discrete actions, it becomes much easier for the agent to randomly select specific discrete actions, and the agent can quickly learn that these actions are useful. This motivates Q1, where we aim to increase the likelihood of “no thrust.”

III Space Environments

[Fig. 1: Hill’s frame, centered on the chief spacecraft.]

This paper considers two spacecraft tasks: inspection and docking. Both tasks involve a passive “chief” and an active “deputy” spacecraft, where the agent controls the deputy and the chief remains stationary. Both tasks are modeled using the Clohessy-Wiltshire equations [27] in Hill’s frame [28], a linearized relative motion reference frame centered on the chief spacecraft, which is assumed to be in a circular orbit around the Earth. As shown in Fig. 1, the origin of Hill’s frame, $\mathcal{O}_H$, is located at the chief’s center of mass, the unit vector $\hat{x}$ points away from the center of the Earth, $\hat{y}$ points in the direction of motion of the chief, and $\hat{z}$ is normal to $\hat{x}$ and $\hat{y}$. The relative motion dynamics between the deputy and chief are,

$\dot{\boldsymbol{s}} = A\boldsymbol{s} + B\boldsymbol{u}$,  (1)

where $\boldsymbol{s}$ is the state vector $\boldsymbol{s} = [x, y, z, \dot{x}, \dot{y}, \dot{z}]^T \in \mathbb{R}^6$, $\boldsymbol{u}$ is the control vector, i.e. action, $\boldsymbol{u} = [F_x, F_y, F_z]^T \in [-u_{\rm max}, u_{\rm max}]^3$, and,

$A = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 3n^2 & 0 & 0 & 0 & 2n & 0 \\ 0 & 0 & 0 & -2n & 0 & 0 \\ 0 & 0 & -n^2 & 0 & 0 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ \frac{1}{m} & 0 & 0 \\ 0 & \frac{1}{m} & 0 \\ 0 & 0 & \frac{1}{m} \end{bmatrix}.$  (2)

Here, $n = 0.001027$ rad/s is the mean motion of the chief’s orbit, $m = 12$ kg is the mass of the deputy, $F$ is the force exerted by the thrusters along each axis, and $u_{\rm max}$ is a constant value varied in the experiments. Both spacecraft are modeled as point masses.
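As a minimal illustration of how Eqs. (1)-(2) can be propagated over one control interval, the sketch below uses an exact discretization of the linear dynamics with a zero-order hold on the thrust. The integrator choice and the `step` helper are assumptions for illustration; the paper does not state how the dynamics are integrated.

```python
import numpy as np
from scipy.linalg import expm

n = 0.001027  # mean motion of the chief's orbit, rad/s
m = 12.0      # deputy mass, kg

A = np.array([
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [3 * n**2, 0, 0, 0, 2 * n, 0],
    [0, 0, 0, -2 * n, 0, 0],
    [0, 0, -n**2, 0, 0, 0],
])
B = np.vstack([np.zeros((3, 3)), np.eye(3) / m])

def step(s, u, dt=10.0):
    """Propagate the state s one timestep under constant thrust u (zero-order hold)."""
    # Exact discretization of s_dot = A s + B u via an augmented matrix exponential.
    M = np.zeros((9, 9))
    M[:6, :6] = A
    M[:6, 6:] = B
    Phi = expm(M * dt)
    return Phi[:6, :6] @ s + Phi[:6, 6:] @ u

# Example: deputy 100 m ahead of the chief, coasting with no thrust.
s0 = np.array([0.0, 100.0, 0.0, 0.0, 0.0, 0.0])
s1 = step(s0, np.zeros(3))
```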

III-A Inspection

For the inspection task, introduced in [13], the agent’s goal is to navigate the deputy around the chief spacecraft to inspect its entire surface. In this case, the chief is modeled as a sphere of 99 inspectable points distributed uniformly across the surface. The attitude of the deputy is not modeled, because it is assumed that the deputy is always pointed towards the chief. In order for a point to be inspected, it must be within the field of view of the deputy (not obstructed by the near side of the sphere) and illuminated by the Sun. Illumination is determined using a binary ray tracing technique, where the Sun rotates in the $\hat{x}$-$\hat{y}$ plane in Hill’s frame at the same rate as the mean motion of the chief’s orbit, $n$.

While the main objective of the task is to inspect all points, a secondary objective is to minimize fuel use. This is considered in terms of $\Delta V$, where,

$\Delta V = \frac{|F_x| + |F_y| + |F_z|}{m} \Delta t.$  (3)

For this task, $\Delta t = 10$ seconds.
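A worked example of Eq. (3) for one timestep, using the nominal mass and timestep given above; the `delta_v` helper name is ours for illustration.

```python
def delta_v(u, m=12.0, dt=10.0):
    """Delta-V (m/s) spent in one timestep for thrust u = [Fx, Fy, Fz] in N (Eq. 3)."""
    return (abs(u[0]) + abs(u[1]) + abs(u[2])) / m * dt

# Full thrust on all three axes with u_max = 0.1 N for one 10 s step:
print(delta_v([0.1, 0.1, 0.1]))  # approximately 0.25 m/s
```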

III-A1 Initial and Terminal Conditions

Each episode is randomly initialized given the following parameters. First, the Sun is initialized at a random angle with respect to the $\hat{x}$ axis so that $\theta_{\rm Sun} \in [0, 2\pi]$ rad. Next, the deputy’s position is sampled from a uniform distribution for the parameters: radius $d \in [50, 100]$ m, azimuth angle $\theta_a \in [0, 2\pi]$ rad, and elevation angle $\theta_e \in [-\pi/2, \pi/2]$ rad. The position is then computed as,

x=dcos(θa)cos(θe),y=dsin(θa)cos(θe),z=dsin(θe).formulae-sequence𝑥𝑑subscript𝜃𝑎subscript𝜃𝑒formulae-sequence𝑦𝑑subscript𝜃𝑎subscript𝜃𝑒𝑧𝑑subscript𝜃𝑒\begin{gathered}x=d\cos(\theta_{a})\cos(\theta_{e}),\\y=d\sin(\theta_{a})\cos(\theta_{e}),\\z=d\sin(\theta_{e}).\\\end{gathered}start_ROW start_CELL italic_x = italic_d roman_cos ( italic_θ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ) roman_cos ( italic_θ start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ) , end_CELL end_ROW start_ROW start_CELL italic_y = italic_d roman_sin ( italic_θ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ) roman_cos ( italic_θ start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ) , end_CELL end_ROW start_ROW start_CELL italic_z = italic_d roman_sin ( italic_θ start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ) . end_CELL end_ROW(4)

If the deputy’s initialized position results in pointing within 30 degrees of the Sun, the position is negated such that the deputy then points away from the Sun and towards illuminated points. This prevents unsafe and unrealistic initialization, as sensors can burn out when pointed directly at the Sun. Finally, the deputy’s velocity is similarly sampled from a velocity magnitude $\|\boldsymbol{v}\| \in [0, 0.3]$ m/s, azimuth angle $\theta_a \in [0, 2\pi]$ rad, and elevation angle $\theta_e \in [-\pi/2, \pi/2]$ rad, and the velocity is computed using the same technique as Eq. 4.
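The sketch below illustrates this initialization for the inspection task. The sampling ranges follow the text and Eq. (4), while the exact Sun-avoidance test (checking the angle between the chief-pointing direction and the Sun vector) is our interpretation, and the helper names are ours.

```python
import numpy as np

rng = np.random.default_rng()

def spherical_to_cartesian(mag, theta_a, theta_e):
    """Eq. (4): convert magnitude, azimuth, and elevation to a Cartesian vector."""
    return mag * np.array([
        np.cos(theta_a) * np.cos(theta_e),
        np.sin(theta_a) * np.cos(theta_e),
        np.sin(theta_e),
    ])

def sample_inspection_initial_position(theta_sun):
    d = rng.uniform(50.0, 100.0)                  # radius, m
    theta_a = rng.uniform(0.0, 2 * np.pi)         # azimuth, rad
    theta_e = rng.uniform(-np.pi / 2, np.pi / 2)  # elevation, rad
    p = spherical_to_cartesian(d, theta_a, theta_e)

    # Sun-avoidance check (our interpretation): the deputy points at the chief
    # along -p; if that direction is within 30 deg of the Sun, negate the position.
    sun_dir = np.array([np.cos(theta_sun), np.sin(theta_sun), 0.0])
    if np.dot(-p, sun_dir) / np.linalg.norm(p) > np.cos(np.radians(30.0)):
        p = -p
    return p
```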

An episode is terminated under the following conditions:

  • the deputy inspects all 99 points,

  • the deputy crashes into the chief (enters within a minimum relative distance of 15 m, where the chief and deputy have radii of 10 m and 5 m, respectively),

  • the deputy exceeds a maximum relative distance of 800 m from the chief, and/or

  • the simulation exceeds 1223 timesteps (the time for the Sun to appear to orbit the chief twice, or 3.4 hrs).

III-A2 Observations

The environment is partially observable, using sensors to condense full state information into manageable components of the observation space. At each timestep, the agent receives an observation comprised of the following components. The first component is the deputy’s current position in Hill’s frame, where each element is divided by 100 to ensure most values fall in the range $[-1, 1]$. The second component is the deputy’s current velocity in Hill’s frame, where each element is multiplied by 2 to ensure most values fall in the range $[-1, 1]$. The third component is the angle describing the Sun’s position with respect to the $\hat{x}$ axis, $\theta_{\rm Sun}$. The fourth component is the total number of points that have been inspected so far during the episode, $n_p$, divided by 100. The final component is a unit vector pointing towards the nearest cluster of uninspected points, where clusters are determined using k-means clustering. The resulting observation is thus $\boldsymbol{o} = [x, y, z, \dot{x}, \dot{y}, \dot{z}, n_p, \theta_{\rm Sun}, x_{UPS}, y_{UPS}, z_{UPS}]$.
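A minimal sketch of assembling this observation, using the scaling factors stated above; the function name and argument conventions are ours, and the element order follows the printed observation vector.

```python
import numpy as np

def inspection_observation(p, v, n_points, theta_sun, cluster_unit_vector):
    """Assemble the normalized inspection observation described above."""
    return np.concatenate([
        np.asarray(p) / 100.0,            # position, scaled so most values lie in [-1, 1]
        np.asarray(v) * 2.0,              # velocity, scaled so most values lie in [-1, 1]
        [n_points / 100.0],               # inspected-point count, scaled
        [theta_sun],                      # Sun angle w.r.t. the x-axis, rad
        np.asarray(cluster_unit_vector),  # unit vector toward nearest uninspected cluster
    ]).astype(np.float32)
```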

III-A3 Reward Function

The reward function consists of the following three elements. (The reward function was defined in [13], and the authors determined that the specified configuration produces the desired behavior; an exploration of the reward function is outside the scope of this work.) First, a reward of $+0.1$ is given for every new point that the deputy inspects at each timestep. Second, a negative reward is given that is proportional to the $\Delta V$ used at each timestep. This is given as $-w\Delta V$, where $w$ is a scalar multiplier that changes during training to help the agent first learn to inspect all points and then minimize fuel usage. At the beginning of training, $w = 0.001$. If the mean percentage of inspected points for the previous training iteration exceeds 90%, $w$ is increased by 0.00005, and if this percentage drops below 80% for the previous iteration, $w$ is decreased by the same amount. $w$ is enforced to always be in the range $[0.001, 0.1]$. Finally, a reward of $-1$ is given if the deputy collides with the chief and ends the episode. This is the only sparse reward given to the agent. For evaluation, a constant value of $w = 0.1$ is used, while all other rewards remain the same.
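The adaptive $\Delta V$ penalty weight can be summarized as a small update rule applied once per training iteration. This is a sketch of the schedule described above, not the authors’ code.

```python
def update_delta_v_weight(w, mean_inspected_pct):
    """One training-iteration update of the Delta-V penalty weight w."""
    if mean_inspected_pct > 90.0:
        w += 0.00005
    elif mean_inspected_pct < 80.0:
        w -= 0.00005
    return min(max(w, 0.001), 0.1)  # keep w within [0.001, 0.1]

w = 0.001  # value at the beginning of training (w = 0.1 is used for evaluation)
```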

III-B Docking

For the docking task, the agent’s goal is to navigate the deputy spacecraft to within a docking radius, $d_d = 10$ m, of the chief at a relative speed below a maximum docking speed, $\nu_0 = 0.2$ m/s. Secondary objectives for the task are to minimize fuel use and to adhere to a distance-dependent speed limit defined as,

𝒗ν0+ν1(𝒑dd),norm𝒗subscript𝜈0subscript𝜈1norm𝒑subscript𝑑𝑑\|\boldsymbol{v}\|\leq\nu_{0}+\nu_{1}(\|\boldsymbol{p}\|-d_{d}),∥ bold_italic_v ∥ ≤ italic_ν start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( ∥ bold_italic_p ∥ - italic_d start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) ,(5)

where $\|\boldsymbol{p}\|$ and $\|\boldsymbol{v}\|$ are the magnitudes of the deputy’s position and velocity, and $\nu_1 = 2n$ rad/s is the slope of the speed limit. The distance-dependent speed limit requires the deputy to slow down as it approaches the chief to dock safely, and the values were chosen based on their relation to elliptical natural motion trajectories [29].
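A short sketch of the distance-dependent speed limit in Eq. (5) and the corresponding violation check, using the constants defined above; the helper names are ours.

```python
import numpy as np

N_MEAN_MOTION = 0.001027   # mean motion of the chief's orbit, rad/s
NU_0 = 0.2                 # maximum docking speed, m/s
NU_1 = 2 * N_MEAN_MOTION   # slope of the speed limit
D_DOCK = 10.0              # docking radius, m

def speed_limit(p):
    """Distance-dependent speed limit (Eq. 5) at position p."""
    return NU_0 + NU_1 * (np.linalg.norm(p) - D_DOCK)

def violates_speed_limit(p, v):
    return np.linalg.norm(v) > speed_limit(p)
```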

III-B1 Initial and Terminal Conditions

Each episode is randomly initialized given the following parameters. First, the deputy’s position is sampled from a radius $d \in [100, 150]$ m, azimuth angle $\theta_a \in [0, 2\pi]$ rad, and elevation angle $\theta_e \in [-\pi/2, \pi/2]$ rad, where the position is computed according to Eq. 4. Second, the velocity is sampled from a maximum velocity magnitude $\|\boldsymbol{v}\| \in [0, 0.8] \cdot \|\boldsymbol{v}\|_{\rm max}$ (where $\|\boldsymbol{v}\|_{\rm max}$ is determined by Eq. 5 given the current position), azimuth angle $\theta_a \in [0, 2\pi]$ rad, and elevation angle $\theta_e \in [-\pi/2, \pi/2]$ rad, where the velocity is computed according to Eq. 4.

An episode is terminated under the following conditions:

  • the deputy successfully docks with the chief ($\|\boldsymbol{p}\| \leq d_d$, $\|\boldsymbol{v}\| \leq \nu_0$),

  • the deputy crashes into the chief ($\|\boldsymbol{p}\| \leq d_d$, $\|\boldsymbol{v}\| > \nu_0$),

  • the deputy exceeds a maximum relative distance of 800 m from the chief, and/or

  • the simulation exceeds 2000 timesteps ($\Delta t$ = 1 second).

III-B2 Observations

Similar to the inspection environment, the docking environment’s observation is broken up into components. The first and second components are the deputy’s position and velocity, divided by 100 and 0.5, respectively. The third component is the deputy’s current velocity magnitude, $\|\boldsymbol{v}\|$, and the fourth component is the maximum velocity given the current position according to Eq. 5, $\|\boldsymbol{v}\|_{\rm max}$. Thus, the observation is $\boldsymbol{o} = [x, y, z, \dot{x}, \dot{y}, \dot{z}, \|\boldsymbol{v}\|, \|\boldsymbol{v}\|_{\rm max}]$.

III-B3 Reward Function

The reward function consists of the following six elements. (The reward function was defined in [3], and the authors determined that the specified configuration produces the desired behavior; an exploration of the reward function is outside the scope of this work.) First, a distance change reward is used to encourage the deputy to move towards the chief at each timestep. This reward is given by,

r=2(ea𝒑0ea𝒑1),𝑟2superscript𝑒𝑎subscriptnorm𝒑0superscript𝑒𝑎subscriptnorm𝒑1r=2*(e^{-a\|\boldsymbol{p}\|_{0}}-e^{-a\|\boldsymbol{p}\|_{-1}}),italic_r = 2 ∗ ( italic_e start_POSTSUPERSCRIPT - italic_a ∥ bold_italic_p ∥ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT - italic_e start_POSTSUPERSCRIPT - italic_a ∥ bold_italic_p ∥ start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ,(6)

where $a = \log(2)/100$, $\|\boldsymbol{p}\|_0$ is the deputy’s current distance from the chief, and $\|\boldsymbol{p}\|_{-1}$ is its distance at the previous timestep. Second, a negative reward of $-0.01$ is multiplied by the $\Delta V$ used by the deputy at each timestep. Unlike the inspection task, this reward remains constant throughout training. Third, if the deputy violates the distance-dependent speed limit at the current timestep, a negative reward of $-0.01$ is multiplied by the magnitude of the violation (that is, $-0.01(\|\boldsymbol{v}\| - \|\boldsymbol{v}\|_{\rm max})$). Fourth, a negative reward of $-0.01$ is given at each timestep to encourage the agent to complete the task as quickly as possible. Fifth, a sparse reward of $+1$ is given if the agent successfully completes the task. Finally, a sparse reward of $-1$ is given if the agent crashes into the chief.
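The dense portion of this reward can be summarized as below; this is a sketch of Eq. (6) plus the $\Delta V$, speed-limit, and time penalties described above, with helper names of our choosing (the sparse $\pm 1$ terminal rewards are omitted).

```python
import numpy as np

A_DECAY = np.log(2) / 100.0  # the constant a in Eq. (6)

def distance_change_reward(dist_now, dist_prev):
    """Eq. (6): positive when the deputy has moved closer to the chief."""
    return 2.0 * (np.exp(-A_DECAY * dist_now) - np.exp(-A_DECAY * dist_prev))

def docking_step_reward(dist_now, dist_prev, delta_v, speed, speed_limit):
    r = distance_change_reward(dist_now, dist_prev)
    r -= 0.01 * delta_v                        # fuel-use penalty
    r -= 0.01 * max(0.0, speed - speed_limit)  # speed-limit violation penalty
    r -= 0.01                                  # per-timestep time penalty
    return r
```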

IV Experiments

A common objective of both of these tasks (and most space tasks in general) is to minimize the use of $\Delta V$. If the agent always chooses a value of zero for all controls, it will use zero m/s of $\Delta V$. However, in this case it is unlikely that the agent will be able to successfully complete the task, and therefore a balance must be found between maximizing task completion and minimizing $\Delta V$. With continuous actions, it is very difficult for an agent to choose an exact value of zero for control, and it therefore often uses a small amount of $\Delta V$ at every timestep, as seen in [13]. On the other hand, discrete actions allow an agent to easily choose zero.

Several experiments are run for both the inspection and docking environments to determine how choice affects the learning process. First, a baseline configuration is trained with continuous actions, where the agent can choose any value for $\boldsymbol{u} \in [-u_{\rm max}, u_{\rm max}]$. Next, several configurations are trained with discrete actions, where the number of choices is varied. In each case, the action values are evenly spaced over the interval $[-u_{\rm max}, u_{\rm max}]$. For example, 3 choices for the agent are $[-u_{\rm max}, 0, u_{\rm max}]$ and 5 choices are $[-u_{\rm max}, -u_{\rm max}/2, 0, u_{\rm max}/2, u_{\rm max}]$. Experiments are run for 3, 5, 7, 9, 11, 21, 31, 41, 51, and 101 choices. The number of choices is always odd such that zero is an option. This set of experiments is repeated for values of $u_{\rm max} = 1.0$ N and $u_{\rm max} = 0.1$ N, to determine whether the magnitude of the action choices affects the results.
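A small sketch of how the evenly spaced discrete action sets can be generated for each configuration; `numpy.linspace` reproduces the examples above, though the authors’ tooling is not specified.

```python
import numpy as np

def action_choices(n_choices, u_max):
    """Evenly spaced thrust choices over [-u_max, u_max]; odd n keeps 0 N available."""
    assert n_choices % 2 == 1
    return np.linspace(-u_max, u_max, n_choices)

# Discrete configurations swept in the experiments, for both thrust magnitudes.
configs = {
    (n, u_max): action_choices(n, u_max)
    for u_max in (1.0, 0.1)
    for n in (3, 5, 7, 9, 11, 21, 31, 41, 51, 101)
}

print(configs[(5, 1.0)])  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```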

For the docking environment, two additional configurations are trained: 5 discrete choices $[-1.0, -0.1, 0.0, 0.1, 1.0]$ (referred to as 1.0/0.1), and 9 discrete choices $[-1.0, -0.1, -0.01, -0.001, 0.0, 0.001, 0.01, 0.1, 1.0]$ (referred to as 1.0/../0.001). These configurations are designed to give the agent finer control at small magnitudes, and the rationale is discussed further in Section V-C.

For each configuration, 10 different agents are trained with different random seeds (which are held constant across all training configurations). Each agent is trained over 5 million timesteps. The policies are periodically evaluated during training, approximately every 500,000 timesteps, to record their performance according to several metrics. (The training curves are not shown in the results, but are included in the Appendix.) The common metrics for both the inspection and docking environments are: average $\Delta V$ used per episode, average percentage of successful episodes, average total reward per episode, and average episode length. For the inspection environment, the average number of inspected points is also considered, and for the docking environment, the average number of timesteps where the speed limit is violated and the average final speed are both considered.

Each of the 10 policies is evaluated over a set of 10 random test cases, where the same test cases are used every time the policy is evaluated. We record and present the InterQuartile Mean (IQM) for each metric. (IQM sorts the recorded metric data, discards the bottom and top 25%, and calculates the mean score on the remaining middle 50%; it interpolates between mean and median across runs for a more robust measure of performance [30].) The IQM is used as it is a better representation of what we can expect to see in future studies, since it is not unduly affected by outliers and has a smaller uncertainty even with a handful of runs [30]. At the conclusion of training, the final trained policies are again evaluated deterministically for 100 random test cases to better understand the behavior of the trained agents.
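For reference, a minimal IQM computation consistent with the description above is sketched below; the handling of sample counts not divisible by four is one simple convention, and [30] describes the authoritative procedure.

```python
import numpy as np

def interquartile_mean(values):
    """IQM: discard the bottom and top 25% of values, then average the middle 50%."""
    x = np.sort(np.asarray(values, dtype=float))
    k = len(x) // 4                 # simple convention for counts not divisible by four
    return x[k:len(x) - k].mean()

print(interquartile_mean([1, 2, 3, 4, 100]))  # 3.0 (the outlier 100 is discarded)
```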

[Fig. 2: $\Delta V$ used by the final policies trained with continuous and discrete action spaces: (a) inspection, $u_{\rm max} = 1.0$ N; (b) inspection, $u_{\rm max} = 0.1$ N; (c) docking, $u_{\rm max} = 1.0$ N; (d) docking, $u_{\rm max} = 0.1$ N.]

V Results and Discussion

In this section, we answer the questions posed in the introduction by analyzing the overarching trends found in our experiments. In the interest of being concise but detailed, we include here selected results that highlight the trends we found and provide all the results in the Appendix.

V-A Will increasing the likelihood of choosing “no thrust” improve fuel efficiency?

Answer: Yes. Increasing the likelihood of selecting “no thrust” as an action greatly reduces $\Delta V$ use.

TABLE I: Performance of the final policies trained with continuous actions in the inspection environment.

$u_{\rm max}$ | Total Reward | Inspected Points | Success Rate | $\Delta V$ (m/s) | Episode Length (steps)
1.0 N | 7.8198 ± 0.5292 | 95.81 ± 4.6808 | 0.448 ± 0.4973 | 13.0222 ± 2.0929 | 323.98 ± 36.636
0.1 N | 8.7324 ± 0.199 | 99.0 ± 0.0 | 1.0 ± 0.0 | 10.8143 ± 1.7896 | 333.496 ± 13.6857

TABLE II: Performance of the final policies trained with continuous actions in the docking environment.

$u_{\rm max}$ | Total Reward | Success Rate | $\Delta V$ (m/s) | Violation (%) | Final Speed (m/s) | Episode Length (steps)
1.0 N | 1.4105 ± 0.5279 | 0.57 ± 0.4951 | 13.2319 ± 1.7153 | 0.0 ± 0.0 | 0.0141 ± 0.0043 | 1780.944 ± 237.6564
0.1 N | 1.8289 ± 0.5193 | 0.842 ± 0.3647 | 11.6234 ± 1.3619 | 0.0 ± 0.0 | 0.0131 ± 0.0074 | 1497.154 ± 343.938

To answer this question, we employed two methods for increasing the likelihood of selecting “no thrust” (i.e. $\boldsymbol{u} = [0.0, 0.0, 0.0]$ N): (1) transitioning from a continuous to a discrete action space, and (2) decreasing the action space magnitude so the continuous range is smaller.

V-A1 Continuous to Discrete Action Space

Transitioning from a continuous action space to a discrete one increases the likelihood of selecting “no thrust” by making it an explicit choice. With a continuous action space, each directional thrust can be any value between $\pm u_{\rm max}$, so the likelihood of all directional thrusts randomly being exactly 0.0 is very low, although there are many combinations where all thrust values are near 0.0.

With a discrete action space, it is straightforward for the agent to select exactly 0.0. However, depending on the number of choices available, it can become more difficult for the agent to choose zero thrust. For the agent with three discrete choices, there is a 1 in $3^3$ chance that a uniformly random action does not thrust at all (due to the three control inputs), while for the agent with 101 discrete choices, there is a 1 in $101^3$ chance. Therefore, it is easier for agents with fewer discrete actions to explicitly choose “no thrust.”
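The arithmetic behind this comparison, assuming each of the three control inputs is selected uniformly at random before any learning has occurred:

```python
# Chance that a uniformly random discrete policy selects zero thrust on all three
# axes in a single timestep, before any learning.
for n_choices in (3, 101):
    p_no_thrust = (1.0 / n_choices) ** 3
    print(n_choices, p_no_thrust)   # 3 -> ~3.7e-2, 101 -> ~9.7e-7
```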

In Fig. 2 we compare the $\Delta V$ used by the final policies trained in continuous and discrete action spaces. For the inspection environment, our results show that transitioning to a discrete action space generally reduces $\Delta V$ use. Interestingly, we see in Fig. 2(a) that adding more discrete choices reduces $\Delta V$ use until the agent has 31 choices or more. This shows that while it is harder for the agent to choose zero thrust with more choices, more choices also enable the agent to select thrusts closer to zero, reducing the over-corrections caused by a coarse discretization of the action space.

For the docking environment, our results show that transitioning to a discrete action space generally results in a large increase in $\Delta V$ use. The blocks representing the continuous configurations’ $\Delta V$ use in Fig. 2(c) and (d) are centered around 13.23 m/s and 11.62 m/s, respectively. Increasing the granularity of choices did not help, instead trending towards larger $\Delta V$ use. However, reducing the number of discrete choices clearly reduces $\Delta V$ use, as it is easier to choose “no thrust.”

V-A2 Decreasing the Action Space Magnitude

As mentioned earlier, it is difficult to select “no thrust” with a continuous action space, but there are many combinations of “near zero” available. By reducing the magnitude of the action space (i.e. decreasing $u_{\rm max}$), we increase the likelihood of choosing those “near zero” actions for both our continuous and discrete configurations. Additionally, by reducing $u_{\rm max}$, we decrease the maximum fuel use for any given timestep, which should also result in a reduction in fuel use.

Our results in Fig. 2 show that reducing $u_{\rm max}$ from 1 N to 0.1 N generally reduces the amount of $\Delta V$ used, in some cases by more than 300 m/s in the docking environment (discrete 41, 51, and 101). Therefore, to reduce $\Delta V$ use, our results show it is best to reduce $u_{\rm max}$.

To better highlight how reducing $u_{\rm max}$ impacts agents with continuous actions, Table I and Table II show the performance of the final policies across all metrics for the inspection and docking environments, respectively. Table I shows that as $u_{\rm max}$ is decreased from 1.0 N to 0.1 N for the inspection environment, the total reward, inspected points, and success rate all increase while $\Delta V$ decreases. Table II similarly shows that as $u_{\rm max}$ is decreased in the docking environment, the total reward and success rate increase while $\Delta V$ decreases. Both cases show that decreasing the action space magnitude enables better performance.

V-B Does a smaller action magnitude or finer granularity matter more at different operating ranges?

Answer: It depends on the task. For the inspection task, smaller action magnitude is more important. For the docking task, finer granularity is more important.

[Fig. 3: frequency of each action value selected by the final policies. Fig. 4: total reward of the final inspection policies. Fig. 5: success rate of the final docking policies.]

To answer this question, we first analyze which actions the agents take most often. Fig. 3 shows the percentage of times each action is taken for the different experiments. For the inspection environment with $u_{\rm max} = 1.0$ N, close to 100% of all actions taken are zero or very close to zero. However, when $u_{\rm max} = 0.1$ N, the agents often choose actions that are either zero or $\pm u_{\rm max}$. For the docking environment with $u_{\rm max} = 1.0$ N, while many actions taken are close to zero, there is a clear bell curve centered on zero control. When $u_{\rm max} = 0.1$ N, this bell curve becomes more apparent, but the agents still tend to choose actions that are close to zero. These results suggest that action magnitude is more important in solving the inspection task, while granularity is more important in solving the docking task.

V-B1 Inspection

For the inspection task, the total reward of the final policies for each configuration is shown in Fig. 4. This represents the agent’s ability to balance the objectives of inspecting all points and reducing $\Delta V$ use. For the agents trained with $u_{\rm max} = 1.0$ N, all agents trained with 41 or fewer choices achieve similar final reward. Recall from Fig. 2(a) that the lowest $\Delta V$ for the inspection environment with $u_{\rm max} = 1.0$ N occurs for the agent trained with 21 discrete actions. Notably, this is the configuration with the fewest choices in which the agent can choose $u = 0.1$ N.

For the agents trained with $u_{\rm max} = 0.1$ N, reward tends to decrease slightly as the number of choices increases. The agent trained with three choices achieves the highest reward, and Fig. 2(b) shows that this configuration also results in the lowest $\Delta V$ use. These results show that using a smaller action magnitude is much more important than increasing the granularity of choices for the inspection task. This result is intuitive, as the agent does not need to make precise adjustments to its trajectory to complete the task, and can instead follow a general path to orbit the chief and inspect all points.

V-B2 Docking

For the docking task, the success rate of the final policies for each configuration is shown in Fig. 5. For the agents trained with $u_{\rm max} = 1.0$ N, the agents with more choices have the highest success rates. For the agents trained with $u_{\rm max} = 0.1$ N, outside of the agents with 3 and 51 choices, all other configurations result in similar success rates. Along with Fig. 2(c) and (d), this shows that more choice and finer granularity leads to higher $\Delta V$ use, but is necessary to successfully complete the task.

However, there is one notable exception to this trend: the agents trained with continuous actions (the highest granularity). This configuration uses by far the least $\Delta V$ of all agents that achieved at least a 50% success rate. In particular, the agent trained with continuous actions and $u_{\rm max} = 0.1$ N achieves the best balance of high success with low $\Delta V$. These results show that increasing the granularity of choices is much more important than using a smaller action magnitude for the docking task. This result is intuitive because the agent must be able to make precise adjustments to its trajectory as it approaches and docks with the chief.

V-C Is there an optimal balance between discrete and continuous actions?

Answer: No, for these tasks it is better to choose either discrete or continuous actions.

[Fig. 6: example trajectories of trained agents for the inspection task. Fig. 7: example trajectories of trained agents for the docking task.]

To answer this question, we consider the behavior of the trained agents. Fig. 6 shows example trajectories of trained agents in the inspection task. Ideally, these agents will circumnavigate the chief along a smooth trajectory. In Fig. 6(a), it can be seen that this is easily accomplished when the agent uses continuous actions, as it can constantly make small refinements to keep the trajectory smooth. On the other hand, when the agent can only use three discrete actions with $u_{\rm max} = 1.0$ N, as shown in Fig. 6(b), the trajectory becomes far less smooth. The agent jerks back and forth as it attempts to adjust its trajectory.

To balance continuous and discrete actions, the number of discrete choices can be increased to allow the agent to make smaller adjustments to its trajectory. As seen in Fig. 6(c), having 101 choices makes the agent’s trajectory much smoother. However, as answered by Q2, this comes at a cost of performance. The optimal performance for the inspection task came with three discrete actions and $u_{\rm max} = 0.1$ N. In Fig. 6(d), it can be seen that this configuration also results in a much smoother trajectory, as the smaller $u_{\rm max}$ allows the agent to make smaller adjustments. Therefore, the optimal behavior can also be achieved using discrete actions for the inspection task. This also follows Fig. 3(b), where the actions most commonly used are zero and $\pm u_{\rm max}$.

For the docking task, ideally the agent will slow down and approach the chief along a smooth trajectory. Fig. 7 shows example trajectories of trained agents in the docking task, where results similar to the inspection task can be seen. Continuous actions allow for the smoothest trajectory, three discrete actions with $u_{\rm max} = 1.0$ N produce the least smooth trajectory, and 101 discrete actions or three discrete actions with $u_{\rm max} = 0.1$ N allow for smoother trajectories while still using discrete actions. However, there is still a considerable amount of “chattering” in the control, where the agent frequently switches between multiple control values as it attempts to refine its trajectory.

To best balance continuous and discrete actions, we analyze the behavior shown in Fig. 3(d) and attempt to provide action choices for the agent that mimic a bell curve. These experiments are the 1.0/0.1 configuration and the 1.0/../0.001 configuration. These configurations allow the agent to use actions with high magnitudes when it is far from the chief, but small magnitudes as it gets closer to the chief. From Fig. 2(d) and Fig. 5, it can be seen that the 1.0/../0.001 configuration achieves much lower $\Delta V$ than the 1.0/0.1 configuration, with both achieving 100% success. These configurations also achieve lower $\Delta V$ than most discrete action experiments, but still do not perform as well as the agent trained with continuous actions. As shown in Fig. 7(e) and Fig. 7(f), these configurations also do not produce trajectories as smooth as the agent with continuous actions, and there is still frequent chattering in the control. Therefore, despite attempting to balance discrete and continuous actions, the optimal behavior for the docking task is still achieved using continuous actions.

VI Conclusions and Future Work

In this paper, we trained 480 unique agents to investigate how choice impacts the learning process for space control systems. In conclusion, the results show that (Q1) increasing the likelihood of selecting “no thrust” as an action greatly reduces $\Delta V$ use. Either making zero thrust a more likely choice or reducing the action magnitude so that choices are closer to zero significantly reduces the $\Delta V$ used by the agent. Next, our results indicate that (Q2) whether to increase the granularity of choices or adjust the action magnitude for optimal performance is highly dependent on the task. For the inspection task, selecting an appropriate action magnitude is more important than increasing the granularity of choices; the optimal configuration was three discrete actions with $u_{\rm max} = 0.1$ N. For the docking task, the opposite is true, and the optimal configuration was continuous actions with $u_{\rm max} = 0.1$ N. This makes sense considering the operating range of the tasks: the agent can complete the inspection task by orbiting the chief at a larger relative distance, while it must complete the docking task by making small adjustments to its trajectory as it approaches the docking region. Finally, our results show that (Q3) there is not an optimal balance between discrete and continuous actions, and it is better to choose one or the other. When attempting to balance discrete and continuous actions for the docking environment by providing actions with decreasing magnitude, this configuration performed better than most discrete action configurations, but it still did not perform as well as agents with continuous actions.

In future work, we want to consider more complex six degree-of-freedom dynamics, where the agent can also control its orientation. We also want to explore more complex discrete action choices, including adding a time period for the thrust selection to better replicate a scheduled burn.

Acknowledgements

This research was sponsored by the Air Force Research Laboratory under the Safe Trusted Autonomy for Responsible Spacecraft (STARS) Seedlings for Disruptive Capabilities Program. The views expressed are those of the authors and do not reflect the official guidance or position of the United States Government, the Department of Defense, or of the United States Air Force. This work has been approved for public release: distribution unlimited. Case Number AFRL-2024-0298.

References

  • [1] D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484–489, Jan. 2016.
  • [2] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver, "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature, vol. 575, pp. 350–354, Oct. 2019.
  • [3] U. J. Ravaioli, J. Cunningham, J. McCarroll, V. Gangal, K. Dunlap, and K. L. Hobbs, "Safe reinforcement learning benchmark environments for aerospace control systems," in 2022 IEEE Aerospace Conference (50100), pp. 1–18, IEEE, 2022.
  • [4] N. Hamilton, P. Musau, D. M. Lopez, and T. T. Johnson, "Zero-shot policy transfer in autonomous racing: reinforcement learning vs imitation learning," in Proceedings of the 1st IEEE International Conference on Assured Autonomy, pp. 11–20, 2022.
  • [5] H. H. Lei, M. Shubert, N. Damron, K. Lang, and S. Phillips, "Deep reinforcement learning for multi-agent autonomous satellite inspection," AAS Guidance Navigation and Control Conference, 2022.
  • [6] J. Aurand, H. Lei, S. Cutlip, K. Lang, and S. Phillips, "Exposure-based multi-agent inspection of a tumbling target using deep reinforcement learning," AAS Guidance Navigation and Control Conference, 2023.
  • [7] A. Brandonisio, M. Lavagna, and D. Guzzetti, "Reinforcement learning for uncooperative space objects smart imaging path-planning," The Journal of the Astronautical Sciences, vol. 68, pp. 1145–1169, Dec. 2021.
  • [8] C. E. Oestreich, R. Linares, and R. Gondhalekar, "Autonomous six-degree-of-freedom spacecraft docking with rotating targets via reinforcement learning," Journal of Aerospace Information Systems, vol. 18, no. 7, pp. 417–428, 2021.
  • [9] J. Broida and R. Linares, "Spacecraft rendezvous guidance in cluttered environments via reinforcement learning," in 29th AAS/AIAA Space Flight Mechanics Meeting, pp. 1–15, 2019.
  • [10] K. Hovell and S. Ulrich, "Deep reinforcement learning for spacecraft proximity operations guidance," Journal of Spacecraft and Rockets, vol. 58, no. 2, pp. 254–264, 2021.
  • [11] R. E. Kopp, "Pontryagin maximum principle," in Mathematics in Science and Engineering, vol. 5, pp. 255–279, Elsevier, 1962.
  • [12] K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, no. 1, pp. 219–245, 2000.
  • [13] D. van Wijk, K. Dunlap, M. Majji, and K. Hobbs, "Deep reinforcement learning for autonomous spacecraft inspection using illumination," AAS/AIAA Astrodynamics Specialist Conference, Big Sky, Montana, 2023.
  • [14] K. Dunlap, M. Mote, K. Delsing, and K. L. Hobbs, "Run time assured reinforcement learning for safe satellite docking," in 2022 AIAA SciTech Forum, pp. 1–20, 2022.
  • [15] N. Hamilton, K. Dunlap, T. T. Johnson, and K. L. Hobbs, "Ablation study of how run time assurance impacts the training and performance of reinforcement learning agents," in 2023 IEEE 9th International Conference on Space Mission Challenges for Information Technology (SMC-IT), pp. 45–55, IEEE, 2023.
  • [16] K. Dunlap and K. Cohen, "Hybrid fuzzy-LQR control for time optimal spacecraft docking," in North American Fuzzy Information Processing Society Annual Conference, pp. 52–62, Springer, 2022.
  • [17] N. Hamilton, L. Schlemmer, C. Menart, C. Waddington, T. Jenkins, and T. T. Johnson, "Sonic to Knuckles: evaluations on transfer reinforcement learning," in Unmanned Systems Technology XXII, vol. 11425, p. 114250J, International Society for Optics and Photonics, 2020.
  • [18] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu, "Safe reinforcement learning via shielding," in Thirty-Second AAAI Conference on Artificial Intelligence, pp. 2669–2678, 2018.
  • [19] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," 2016.
  • [20] J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, and C. J. Tomlin, "A general safety framework for learning-based control in uncertain robotic systems," IEEE Transactions on Automatic Control, vol. 64, no. 7, pp. 2737–2752, 2018.
  • [21] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, "Deep reinforcement learning that matters," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, pp. 3207–3214, 2018.
  • [22] H. Mania, A. Guy, and B. Recht, "Simple random search of static linear policies is competitive for reinforcement learning," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 1805–1814, 2018.
  • [23] K. Jang, E. Vinitsky, B. Chalaki, B. Remer, L. Beaver, A. A. Malikopoulos, and A. Bayen, "Simulation to scaled city: zero-shot policy transfer for traffic control via autonomous vehicles," in Proceedings of the 10th ACM/IEEE International Conference on Cyber-Physical Systems, pp. 291–300, 2019.
  • [24] N. Bernini, M. Bessa, R. Delmas, A. Gold, E. Goubault, R. Pennec, S. Putot, and F. Sillion, "A few lessons learned in reinforcement learning for quadcopter attitude control," in Proceedings of the 24th International Conference on Hybrid Systems: Computation and Control, (New York, NY, USA), pp. 1–11, Association for Computing Machinery, 2021.
  • [25] D. Silver, "Lectures on reinforcement learning." URL: https://www.davidsilver.uk/teaching/, 2015.
  • [26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
  • [27] W. Clohessy and R. Wiltshire, "Terminal guidance system for satellite rendezvous," Journal of the Aerospace Sciences, vol. 27, no. 9, pp. 653–658, 1960.
  • [28] G. W. Hill, "Researches in the lunar theory," American Journal of Mathematics, vol. 1, no. 1, pp. 5–26, 1878.
  • [29] K. Dunlap, M. Mote, K. Delsing, and K. L. Hobbs, "Run time assured reinforcement learning for safe satellite docking," Journal of Aerospace Information Systems, vol. 20, no. 1, pp. 25–36, 2023.
  • [30] R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. Bellemare, "Deep reinforcement learning at the edge of the statistical precipice," Advances in Neural Information Processing Systems, vol. 34, 2021.

Appendix A Sample Complexity Figures

Sample complexity is a metric that indicates how "fast" an agent trains, measured by periodically evaluating performance throughout training. If an agent has better sample complexity (i.e., trains "faster"), then its sample complexity curve reaches better values after fewer timesteps. For a metric like reward, better sample complexity appears closer to the top left of the plot, while for a metric like $\Delta V$ use, better sample complexity appears closer to the bottom left corner of the plot.
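A curve of this kind can be produced simply by plotting each periodic evaluation result against the number of training timesteps at which it was recorded. The sketch below uses hypothetical checkpoint values, not data from our experiments.

```python
import matplotlib.pyplot as plt

# Hypothetical evaluation checkpoints: (training timesteps, evaluation reward)
timesteps = [0, 1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000]
reward = [0.3, 4.1, 6.5, 7.9, 8.4, 8.7]

plt.plot(timesteps, reward, marker="o")
plt.xlabel("Training timesteps")
plt.ylabel("Evaluation reward")
plt.title("Sample complexity (better curves rise toward the top left)")
plt.show()
```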


In Fig. 8 and Fig. 9 we show the sample complexity for agents trained in the inspection environment with $u_{\rm max} = 1.0$ N and $u_{\rm max} = 0.1$ N, respectively.


In Fig. 10 and Fig. 11 we show the sample complexity for agents trained in the docking environment with $u_{\rm max} = 1.0$ N and $u_{\rm max} = 0.1$ N, respectively.

Appendix B Final Policy Comparison Tables

Experiment | Total Reward | Inspected Points | Success Rate | $\Delta V$ (m/s) | Episode Length (steps)
Continuous | 7.8198 ± 0.5292 | 95.81 ± 4.6808 | 0.448 ± 0.4973 | 13.0222 ± 2.0929 | 323.98 ± 36.636
Discrete - 101 | 7.0945 ± 0.4483 | 90.176 ± 7.2419 | 0.176 ± 0.3808 | 15.2489 ± 2.8737 | 271.516 ± 43.964
Discrete - 51 | 7.7759 ± 0.528 | 93.38 ± 7.5001 | 0.466 ± 0.4988 | 12.0699 ± 2.7292 | 292.3 ± 42.679
Discrete - 41 | 8.7482 ± 0.5727 | 96.102 ± 4.2047 | 0.42 ± 0.4936 | 5.662 ± 1.1497 | 325.44 ± 42.3286
Discrete - 31 | 8.4412 ± 0.7315 | 94.64 ± 5.9031 | 0.434 ± 0.4956 | 6.1217 ± 1.5649 | 301.928 ± 42.8778
Discrete - 21 | 8.8244 ± 0.5655 | 93.968 ± 6.2552 | 0.382 ± 0.4859 | 4.7953 ± 0.6288 | 294.198 ± 42.7977
Discrete - 11 | 8.8792 ± 0.4944 | 94.936 ± 5.5771 | 0.436 ± 0.4959 | 5.1757 ± 0.6842 | 300.084 ± 36.9724
Discrete - 9 | 8.6939 ± 0.6077 | 92.804 ± 6.6629 | 0.28 ± 0.449 | 5.015 ± 0.6076 | 285.528 ± 42.9038
Discrete - 7 | 8.7802 ± 0.4923 | 94.894 ± 5.8569 | 0.498 ± 0.5 | 5.6317 ± 0.8868 | 295.616 ± 37.309
Discrete - 5 | 9.0643 ± 0.1962 | 98.306 ± 1.1766 | 0.63 ± 0.4828 | 6.2467 ± 0.872 | 330.052 ± 25.9433
Discrete - 3 | 8.7449 ± 0.3727 | 96.388 ± 4.0853 | 0.466 ± 0.4988 | 7.1767 ± 0.9963 | 309.28 ± 32.9718

In Table III we show the final policy results for agents trained with $u_{\rm max} = 1.0$ N in the inspection environment. The table shows the Interquartile Mean (IQM) and standard deviation for each metric.
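The IQM discards the lowest and highest 25% of samples and averages the remaining middle 50%, following the robust-evaluation recommendations of [30]. A minimal sketch of how it could be computed is shown below; the helper name and example values are illustrative, not taken from the experiments.

```python
import numpy as np
from scipy import stats

def iqm(samples) -> float:
    """Interquartile mean: a 25% trimmed mean, i.e. the mean of the
    middle 50% of the samples."""
    return stats.trim_mean(np.asarray(samples), proportiontocut=0.25)

# Hypothetical per-trial Delta-V results (m/s); the IQM is robust to the outlier.
print(iqm([4.9, 5.1, 5.3, 5.6, 6.0, 6.4, 7.2, 12.8]))
```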

Experiment | Total Reward | Inspected Points | Success Rate | $\Delta V$ (m/s) | Episode Length (steps)
Continuous | 8.7324 ± 0.199 | 99.0 ± 0.0 | 1.0 ± 0.0 | 10.8143 ± 1.7896 | 333.496 ± 13.6857
Discrete - 101 | 8.3228 ± 0.3276 | 89.098 ± 3.8216 | 0.0 ± 0.0 | 5.7558 ± 0.6629 | 312.282 ± 30.5927
Discrete - 51 | 8.6179 ± 0.3202 | 92.516 ± 4.0042 | 0.0 ± 0.0 | 5.8669 ± 0.6587 | 322.662 ± 31.1315
Discrete - 41 | 8.4393 ± 0.3339 | 90.492 ± 4.1859 | 0.0 ± 0.0 | 5.8385 ± 0.7675 | 319.386 ± 32.385
Discrete - 31 | 8.6998 ± 0.2798 | 94.136 ± 3.4741 | 0.072 ± 0.2585 | 6.5631 ± 0.637 | 343.786 ± 30.6773
Discrete - 21 | 8.4793 ± 0.424 | 90.526 ± 5.07 | 0.0 ± 0.0 | 5.3385 ± 0.6996 | 324.068 ± 39.9499
Discrete - 11 | 8.7238 ± 0.3049 | 94.034 ± 3.8777 | 0.092 ± 0.289 | 6.1293 ± 0.7255 | 335.342 ± 26.6045
Discrete - 9 | 8.5916 ± 0.3967 | 92.73 ± 4.8807 | 0.07 ± 0.2551 | 6.0518 ± 0.7598 | 334.318 ± 34.7523
Discrete - 7 | 8.4511 ± 0.4318 | 91.23 ± 5.6037 | 0.04 ± 0.196 | 6.0729 ± 0.9819 | 321.638 ± 40.5397
Discrete - 5 | 8.6838 ± 0.2979 | 94.988 ± 3.8595 | 0.226 ± 0.4182 | 7.2322 ± 0.741 | 344.036 ± 35.2377
Discrete - 3 | 8.9796 ± 0.3325 | 96.698 ± 3.3898 | 0.458 ± 0.4982 | 5.1738 ± 0.946 | 334.572 ± 42.9696

In Table IV we show the final policy results for agents trained with $u_{\rm max} = 0.1$ N in the inspection environment. The table shows the IQM and standard deviation for each metric.

Experiment | Total Reward | Success Rate | $\Delta V$ (m/s) | Violation (%) | Final Speed (m/s) | Episode Length (steps)
Continuous | 1.4105 ± 0.5279 | 0.57 ± 0.4951 | 13.2319 ± 1.7153 | 0.0 ± 0.0 | 0.0141 ± 0.0043 | 1780.944 ± 237.6564
Discrete 1.0/../0.001 | 1.9307 ± 0.5696 | 1.0 ± 0.0 | 43.01 ± 16.9203 | 1.5736 ± 2.5145 | 0.0886 ± 0.0159 | 876.132 ± 184.738
Discrete 1.0/0.1 | 2.0274 ± 0.5583 | 1.0 ± 0.0 | 90.128 ± 27.5016 | 0.8247 ± 1.4835 | 0.1027 ± 0.0177 | 747.38 ± 117.5253
Discrete - 101 | 2.1488 ± 0.2312 | 1.0 ± 0.0 | 293.3862 ± 29.6104 | 1.6089 ± 2.4567 | 0.0861 ± 0.0164 | 671.596 ± 54.1947
Discrete - 51 | 1.7891 ± 0.5505 | 0.978 ± 0.1467 | 391.2823 ± 75.8659 | 0.3771 ± 0.6783 | 0.0613 ± 0.0152 | 927.566 ± 278.6582
Discrete - 41 | 2.107 ± 0.2374 | 1.0 ± 0.0 | 308.5834 ± 49.6744 | 0.4602 ± 0.751 | 0.0753 ± 0.014 | 728.69 ± 104.8544
Discrete - 31 | 2.1907 ± 0.2392 | 1.0 ± 0.0 | 251.3246 ± 44.1849 | 0.1856 ± 0.4052 | 0.0733 ± 0.0164 | 774.518 ± 154.2326
Discrete - 21 | 1.5185 ± 0.6643 | 0.802 ± 0.3985 | 269.6558 ± 45.6941 | 0.7576 ± 1.3076 | 0.0527 ± 0.0149 | 1220.52 ± 432.1194
Discrete - 11 | 1.2417 ± 0.5921 | 0.52 ± 0.4996 | 50.0763 ± 66.8249 | 0.1531 ± 0.4143 | 0.0531 ± 0.0125 | 1610.08 ± 408.8315
Discrete - 9 | 0.884 ± 0.3165 | 0.324 ± 0.468 | 24.6879 ± 34.5683 | 0.4601 ± 1.0115 | 0.0535 ± 0.0132 | 1717.72 ± 401.7791
Discrete - 7 | 0.7117 ± 0.1693 | 0.0 ± 0.0 | 10.5039 ± 1.4224 | 0.632 ± 1.1258 | 0.0429 ± 0.0084 | 2000.0 ± 0.0
Discrete - 5 | 0.6349 ± 0.1275 | 0.0 ± 0.0 | 9.5683 ± 1.3141 | 0.0419 ± 0.1765 | 0.0426 ± 0.0081 | 2000.0 ± 0.0
Discrete - 3 | 0.6934 ± 0.1322 | 0.0 ± 0.0 | 9.5583 ± 1.3253 | 0.0787 ± 0.2588 | 0.0574 ± 0.0099 | 2000.0 ± 0.0

In Table V we show the final policy results for agents trained with $u_{\rm max} = 1.0$ N in the docking environment. The table shows the IQM and standard deviation for each metric.

Experiment | Total Reward | Success Rate | $\Delta V$ (m/s) | Violation (%) | Final Speed (m/s) | Episode Length (steps)
Continuous | 1.8289 ± 0.5193 | 0.842 ± 0.3647 | 11.6234 ± 1.3619 | 0.0 ± 0.0 | 0.0131 ± 0.0074 | 1497.154 ± 343.938
Discrete - 101 | 1.6612 ± 0.689 | 0.72 ± 0.449 | 98.9843 ± 23.4398 | 0.6285 ± 1.3201 | 0.023 ± 0.0107 | 1212.808 ± 475.4788
Discrete - 51 | 0.7218 ± 0.2168 | 0.044 ± 0.2051 | 111.8182 ± 21.1845 | 1.3326 ± 1.6915 | 0.0097 ± 0.0066 | 1948.042 ± 157.2884
Discrete - 41 | 1.8428 ± 0.7146 | 0.776 ± 0.4169 | 86.7219 ± 7.7169 | 0.1063 ± 0.3616 | 0.0316 ± 0.0157 | 1124.558 ± 464.4369
Discrete - 31 | 1.4687 ± 0.6669 | 0.604 ± 0.4891 | 91.2584 ± 10.8726 | 0.4691 ± 1.0746 | 0.0306 ± 0.0161 | 1321.878 ± 559.5975
Discrete - 21 | 1.7366 ± 0.6367 | 0.838 ± 0.3685 | 104.1883 ± 12.1358 | 0.3216 ± 0.8018 | 0.0226 ± 0.0095 | 1268.57 ± 397.6955
Discrete - 11 | 1.7699 ± 0.6552 | 0.928 ± 0.2585 | 86.4941 ± 12.5111 | 1.1875 ± 2.3687 | 0.0628 ± 0.0254 | 912.868 ± 214.4565
Discrete - 9 | 1.4149 ± 0.8378 | 0.648 ± 0.4776 | 106.863 ± 25.6372 | 0.2353 ± 0.6563 | 0.0468 ± 0.028 | 1242.01 ± 490.6625
Discrete - 7 | 1.8318 ± 0.7563 | 0.782 ± 0.4129 | 95.567 ± 18.1873 | 0.0 ± 0.0 | 0.0663 ± 0.037 | 1128.162 ± 489.9392
Discrete - 5 | 2.0283 ± 0.6071 | 1.0 ± 0.0 | 77.0818 ± 14.2023 | 0.0508 ± 0.247 | 0.0923 ± 0.0247 | 886.076 ± 215.382
Discrete - 3 | 0.6064 ± 0.167 | 0.0 ± 0.0 | 15.0623 ± 7.5436 | 0.0 ± 0.0 | 0.0279 ± 0.0108 | 2000.0 ± 0.0

In Table VI we show the final policy results for agents trained with $u_{\rm max} = 0.1$ N in the docking environment. The table shows the IQM and standard deviation for each metric.

Appendix C Additional Final Policy Comparison Figures


In Fig. 12 and Fig. 13 we show the final policy performance with respect to the number of inspected points, success rate, and episode length for agents trained in the inspection environment with $u_{\rm max} = 1.0$ N and $u_{\rm max} = 0.1$ N, respectively. Comparisons of the $\Delta V$ use and reward are shown in Fig. 2 and Fig. 4, respectively.


In Fig. 14 and Fig. 15 we show the final policy performance with respect to the total reward, constraint violation percentage, final speed, and episode length for agents trained in the docking environment with $u_{\rm max} = 1.0$ N and $u_{\rm max} = 0.1$ N, respectively. Comparisons of the $\Delta V$ use and success rate are shown in Fig. 2 and Fig. 5, respectively.
