We've discovered the surprising power of the Linear Quadratic Regulator (LQR), but it also has its limits. LQR has a basin of attraction around the point we've linearized the dynamics around. If the cart-pole is in a state outside of this basin, the system becomes unstable and the controller goes mad. To save our pendulum from these horrible nonlinear dynamics, we need to swing it back to our stable basin using some sort of swing-up algorithm. That's where trajectory optimization comes into the picture.

I came across trajectory optimization after reading Daniel Piedrahita's article about cart-pole control. To learn more about this topic and more I've been watching Dr. Russ Tedrake's lectures from the course MIT 6.832 - Underactuated Robotics. I've also watched Dr. Matthew Kelly's introductory video to trajectory optimization and read his paper on how to implement direct collocation. All of these sources are fantastic and I highly recommend you take a look at them!

LQR is a closed-loop control algorithm that spans the entirety of state space. Put simply, anywhere you are, the controller can manage to get you back to a stable fixed point. However, this only works for linear dynamics, which is why nonlinear systems have to be linearized first. Trajectory optimization is an open-loop solution to control which can be useful for specific tasks, such as creating a swing-up trajectory. Instead of having a solution for the entire state space, trajectory optimization only creates a trajectory that we need to follow. After we're back in our basin of attraction, we can switch on LQR again!

Trajectory optimization, as the name suggests, uses optimization to minimize the cost of control for a specific trajectory. An optimization problem works by trying to minimize the value of a function called the objective function. This function is dependent on decision variables that need to fulfill some constraints. The decision variables for these types of problems are often the state and control signals, *x* and *u*, and how long the trajectory takes to finish. Using notation from Matthew Kelly, trajectory optimization can be described in what's known as the Lagrange form:
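For readers who want the equations in text form, here is my own transcription of the Lagrange form (1.1)-(1.9), following Kelly's notation (the exact symbols may differ slightly from his paper):

```
\begin{aligned}
\min_{t_0,\,t_F,\,x(t),\,u(t)} \quad & \int_{t_0}^{t_F} w\big(\tau, x(\tau), u(\tau)\big)\,d\tau && (1.1)\\
\text{s.t.}\quad & \dot{x}(t) = f\big(t, x(t), u(t)\big) && (1.2)\\
& h\big(t, x(t), u(t)\big) \le 0 && (1.3)\\
& g\big(t_0, t_F, x(t_0), x(t_F)\big) \le 0 && (1.4)\\
& x_{\mathrm{low}} \le x(t) \le x_{\mathrm{upp}} && (1.5)\\
& u_{\mathrm{low}} \le u(t) \le u_{\mathrm{upp}} && (1.6)\\
& x_{0,\mathrm{low}} \le x(t_0) \le x_{0,\mathrm{upp}} && (1.7)\\
& x_{F,\mathrm{low}} \le x(t_F) \le x_{F,\mathrm{upp}} && (1.8)\\
& t_{\mathrm{low}} \le t_0 < t_F \le t_{\mathrm{upp}} && (1.9)
\end{aligned}
```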

The objective function (1.1) minimizes the integral of *w*, the cost integrand, along the trajectory. This problem is subject to the system dynamics (1.2) but also constraints along the path (1.3). This could for example be that a car avoids pedestrians on its way to the goal. There are also boundary constraints (1.4) that constrain the initial and final states. The state and control signal also have bounds (1.5)(1.6), for instance, max speeds, voltages, etc. Finally, the initial and final states and time have some limits (1.7)(1.8)(1.9).

One trajectory optimization algorithm is called direct collocation. The direct part means that the problem is discretized, often into what's called a nonlinear program, or NLP, which is what we will do. Converting a trajectory optimization problem into an NLP problem is known as transcription, and direct collocation falls under the umbrella of so-called direct transcription methods.

Nonlinear programs can be generalized as (1.10) an objective function *J* of some decision variables *z* with some equality constraints *f*, inequality constraints *g* and bounds on *z*. The dynamics constraint (1.2) is an equality constraint while the others are inequality constraints. An easy objective function to implement is the integral of the control signal squared (1.11). We can do this if we regard the control signal as a scalar function. This makes sense for the cart-pole where the control is the motor voltage. This objective function, therefore, tries to create a smooth trajectory with a minimized control input.
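In symbols (again my transcription), the NLP (1.10) and the objective (1.11) look like:

```
\begin{aligned}
\min_{z}\ J(z) \quad \text{s.t.} \quad f(z) = 0,\quad g(z) \ge 0,\quad z_{\mathrm{low}} \le z \le z_{\mathrm{upp}} && (1.10)\\
J = \int_{t_0}^{t_F} u(\tau)^2\, d\tau && (1.11)
\end{aligned}
```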

We still haven't discretized our NLP. To do this, there are again several options to choose from. An easy method to implement is called trapezoidal collocation. We begin by discretizing the decision variables into values at discrete points in time. These are called collocation points and there can be an arbitrary number *N* of them, separated in time by the time step *h*. Here *f* denotes the dynamics.

I won't delve into the proof that trapezoidal collocation accurately approximates the continuous dynamics. Instead, the discretized objective function has been provided below (1.12) as well as the discretized equality constraint on the dynamics (1.13).
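In my transcription, the discretized objective (1.12) and the trapezoidal dynamics constraint (1.13) read:

```
\begin{aligned}
\int_{t_0}^{t_F} u(\tau)^2\, d\tau \;\approx\; \sum_{k=0}^{N-2} \frac{h}{2}\left(u_k^2 + u_{k+1}^2\right) && (1.12)\\
x_{k+1} - x_k = \frac{h}{2}\left(f_{k+1} + f_k\right), \qquad f_k = f(t_k, x_k, u_k) && (1.13)
\end{aligned}
```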

If we add in the constraints on the initial and final state as well as their bounds and put this into a solver, this should produce a solution! All that's missing is an initial guess of the trajectory for the solver to optimize. An easy guess to make is just to interpolate from the initial state to the final state with the control signal being zero, and this dumb guess is often enough for the solver to come up with an actual solution!

After your solver has come up with a solution for the optimization problem, it spits out a sequence of length *N* of the optimal state and control signal. These *N* collocation points need to be converted back into a continuous function. The control signal can be easily converted using piecewise linear interpolation. Mathematically, the functions between the points can be constructed using the equation below (1.14).
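With τ = t − t_k on the segment between points k and k+1, the linear interpolation (1.14) can be written (my transcription of Kelly's formula):

```
u(t) \approx u_k + \frac{\tau}{h}\left(u_{k+1} - u_k\right), \qquad \tau = t - t_k \qquad (1.14)
```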

The state is a bit harder: it's converted using quadratic interpolation. Instead of connecting the collocation points with lines, points are connected by quadratic functions. Interpolation using low-degree polynomial functions is called spline interpolation. These functions can be constructed using the equation below (1.15).
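The quadratic segment (1.15) comes from integrating the linearly interpolated dynamics; in my transcription of Kelly's formula:

```
x(t) \approx x_k + f_k\,\tau + \frac{\tau^2}{2h}\left(f_{k+1} - f_k\right), \qquad \tau = t - t_k \qquad (1.15)
```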

Now we have all the tools we need to start coding! We will create a class that takes in the dynamics, constraints and the number of collocation points `N`. The libraries used are `numpy` and parts of `scipy`: the solver `minimize` and the cubic interpolator `CubicSpline`.

```
import numpy as np
from scipy.optimize import minimize
from scipy.interpolate import CubicSpline

class DirectCollocation():
    def __init__(self, N: int, dynamics, N_states: int, N_controls: int,
                 state_lower_bound, state_upper_bound,
                 control_lower_bound, control_upper_bound, tolerance=1e-6):
        self.N = N
        self.dynamics = dynamics
        self.N_states = N_states
        self.N_controls = N_controls
        self.state_lower_bound = state_lower_bound.T[0]
        self.state_upper_bound = state_upper_bound.T[0]
        self.control_lower_bound = control_lower_bound.T[0]
        self.control_upper_bound = control_upper_bound.T[0]
        self.tolerance = tolerance
```

The constraints consist of the lower and upper bounds of the state and control variables. These are flipped to row vectors for easy manipulation. Besides those parameters, there's also the number of state variables `N_states`, the number of control variables `N_controls` and the `tolerance`. `N_states` and `N_controls` are useful for calculations. The `tolerance` specifies how accurate the solver should be; a lower `tolerance` means higher accuracy.

Next up is the method `make_guess` which makes the initial guess. This takes in the `initial_state` and the desired `final_state` and makes a linear interpolation between those two. The initial guess of the control is zero through the entire trajectory.

```
    def make_guess(self, initial_state, final_state):
        self.initial_state = initial_state.T[0]
        self.final_state = final_state.T[0]
        self.initial_variables = np.hstack([
            np.hstack([
                np.linspace(initial, final, self.N)
                for initial, final in zip(self.initial_state, self.final_state)
            ]),
            np.zeros(self.N_controls*self.N),
        ])
        return self.initial_variables

    def variables_to_state_control(self, variables):
        state = variables[:self.N_states*self.N]
        control = variables[self.N_states*self.N:(self.N_states+self.N_controls)*self.N]
        return state, control
```

I also threw in the method `variables_to_state_control` since this is important when it comes to using a solver. The solver can only use data in the form of a row vector, so all of the decision variables (state and control) need to be packed into this single vector. The initial guess, `self.initial_variables`, is created in this way, and `variables_to_state_control` takes in the decision variables from the solver and unpacks them into the state and control.

Up next is the `objective_function`. As the math (1.11) implies, this spits out the quadratic cost of the control signal, which the solver tries to minimize.

```
    def objective_function(self, variables):
        state, control = self.variables_to_state_control(variables)
        result = 0
        for i in range(self.N_controls):
            for k in range(self.N-1):
                result += (self.h/2)*(control[self.N*i+k]**2 + control[self.N*i+k+1]**2)
        return result
```

The constraints can be categorized as either equality constraints or inequality constraints. In code, these are packed up similarly to the decision variables as row vectors. For the equality constraints, the solver ensures that the elements of this vector are always zero. The equality constraints are created in the method `eq_constraints`. These consist of the constraints on the initial and final state and the dynamic constraints along the trajectory.

```
    def eq_constraints(self, variables):
        state, control = self.variables_to_state_control(variables)
        constraints = []
        for k in range(self.N_states):
            constraints.append(state[self.N*k] - self.initial_state[k])
            constraints.append(state[self.N*k+self.N-1] - self.final_state[k])
        state_current = state[0::self.N]
        control_current = control[0::self.N]
        dynamics_current = self.dynamics(np.vstack(state_current), np.vstack(control_current)).T[0]
        for k in range(self.N-1):
            state_next = state[k+1::self.N]
            control_next = control[k+1::self.N]
            dynamics_next = self.dynamics(np.vstack(state_next), np.vstack(control_next)).T[0]
            dynamic_constraint = state_next - state_current - self.h/2*(dynamics_next + dynamics_current)
            constraints.extend(list(dynamic_constraint))
            state_current = state_next
            control_current = control_next
            dynamics_current = dynamics_next
        return constraints
```

The inequality constraints are created in the method `ineq_constraints`. The solver ensures that the elements of this vector are never negative. For now, these are the upper and lower bounds of the variables.

```
    def ineq_constraints(self, variables):
        state, control = self.variables_to_state_control(variables)
        constraints = []
        for k in range(self.N):
            current_state = state[k::self.N]
            current_control = control[k::self.N]
            constraints.extend(list(np.array(current_state) - self.state_lower_bound))
            constraints.extend(list(self.state_upper_bound - np.array(current_state)))
            constraints.extend(list(np.array(current_control) - self.control_lower_bound))
            constraints.extend(list(self.control_upper_bound - np.array(current_control)))
        return constraints
```

Finally, the optimal control is solved using `make_controller`. This method takes in the duration of the trajectory, the number of points to interpolate `N_spline`, the `initial_state` and the desired `final_state`. It first creates an initial guess of the trajectory. The constraints are packed into a dictionary that the solver can process. The solver itself uses Sequential Least Squares Programming (SLSQP) and takes in the objective function, initial guess, constraints and tolerance and spits out a solution of both the state and control signals. If `N_spline` is defined, then the state and control are splined and interpolated respectively.

```
    def make_controller(self, time, initial_state, final_state, N_spline=None):
        self.time = time
        self.N_spline = N_spline
        self.h = self.time/self.N
        self.make_guess(initial_state, final_state)
        constraints = [
            {"type": "eq", "fun": self.eq_constraints},
            {"type": "ineq", "fun": self.ineq_constraints},
        ]
        sol = minimize(
            fun=self.objective_function,
            x0=self.initial_variables,
            method="SLSQP",
            constraints=constraints,
            tol=self.tolerance
        )
        state, control = self.variables_to_state_control(sol.x)
        N_time = np.linspace(0, time, self.N)
        if N_spline:
            N_spline_time = np.linspace(0, time, N_spline)
            state_spline = np.vstack([
                CubicSpline(N_time, s_row)(N_spline_time)
                for s_row in state.reshape((self.N_states, self.N))
            ])
            control_spline = np.vstack([
                np.interp(N_spline_time, N_time, c_row)
                for c_row in control.reshape((self.N_controls, self.N))
            ])
        else:
            state_spline, control_spline = np.vstack(state), np.vstack(control)
        return state_spline, control_spline
```

Using our code for direct collocation, we can now plan a swing-up trajectory for a single pendulum. We then use this control signal, and when the pendulum is close to the upper fixed point, we switch on our LQR controller. This was simulated using 41 collocation points and a trajectory duration of 1 second. And... go!

Let's try it with two pendulums! To make our LQR way more stable, we'll make the outermost pendulum way lighter than the other one. Let's set the duration to 2 seconds and use 61 collocation points.

So, as you can see above, the trajectory planner doesn't always work out, but why is that? Well, the devil's in the details.

The integrator of the direct collocation algorithm is slightly different from RK4. The small differences in the dynamics, the fact that there isn't a collocation point for every time step, and the chaotic nature of the double pendulum together make the trajectory unstable.

Another issue is the calculation time. The single pendulum took 17 seconds to calculate. The double pendulum however took a whopping 7.5 minutes!

There are ways of fixing this. For the stability issues, we can use a time-varying LQR controller that stabilizes the dynamics along the trajectory. We can also use Model-Predictive Control, or MPC, which calculates a new trajectory for every time step to compensate for the inaccuracies. This however requires a fast calculation time which can be accomplished by using a faster solver and/or programming language. But for now, this is promising!

I'll try to look into Kalman filters again. It would also be interesting to find a better solver and make a faster trajectory planner so that MPC can be implemented. Making a time-varying LQR controller would also be cool!

Regarding the physical cart-pole, the motor driver I have wasn't powerful enough so I've ordered a new one from AliExpress which should arrive shortly.

By the way, if you want to have a deeper look into the code, visit the GitHub repository! https://github.com/sackariaslunman/inverted-pendulum

Controlling the cart-pole will not be easy; the system has non-linear dynamics. Worse yet, when you add more poles to the cart, the dynamics also become chaotic. Most basic controllers (that I have a sliver of knowledge in designing) operate on linear dynamics. To use a linear controller on a non-linear system you need to approximate the dynamics in a way that makes them linear.

To do this we can make some assumptions about our system, starting with that when we start our controller, we let go of the pole in the upright position, i.e. the angle *θ* is 0. We then assume that our controller will be able to control this well enough that the cart's velocity *dx* and the angular velocity *dθ* are 0. We also assume that the cart starts at some arbitrary position *x* which we also set to 0.

Using our new initial conditions it is now possible to approximate the linear dynamics at this specific point, where *x*, *dx*, and *dθ* are all 0, using a method called linearization. Since we used *x* to denote the cart's position I will denote the full state as a bold **x**. In its non-linear form:

Using linearization, the approximated linear dynamics can be described using a matrix *A* that describes how each variable changes based on itself and the other variables in the full state. Each variable also changes based on **u**, which is described in the *B* matrix.

As you can see, it's quite a handful to do by hand, especially when you have many states; the number of elements has complexity O(*n*^2) where *n* is the number of states. Basically, the *A* and *B* matrices together describe the "tangent" of how the state variables change at a specific state **x0** and control signal **u0**.
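If deriving the matrices by hand is too painful, the Jacobians can also be approximated numerically with finite differences. Here's a minimal sketch of mine; the `dynamics(x, u)` signature is an assumption, chosen to mirror the simulation code later on:

```
import numpy as np

def linearize(dynamics, x0, u0, eps=1e-6):
    # Central-difference Jacobians of dx/dt = dynamics(x, u) around (x0, u0):
    # A[i, j] = d f_i / d x_j  and  B[i, j] = d f_i / d u_j
    n, m = len(x0), len(u0)
    A = np.zeros((n, n))
    B = np.zeros((n, m))
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        A[:, j] = (dynamics(x0 + dx, u0) - dynamics(x0 - dx, u0)) / (2*eps)
    for j in range(m):
        du = np.zeros(m)
        du[j] = eps
        B[:, j] = (dynamics(x0, u0 + du) - dynamics(x0, u0 - du)) / (2*eps)
    return A, B
```

Around the upright fixed point you'd call it with both the state and the control set to zero.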

In our case, **x0** should be chosen to be where the pole is balanced, i.e. the upright position where all variables are 0. Likewise, when the pole is balanced the control signal shouldn't need to fire, so **u0** = 0.

And this is just the case with one pole! Luckily, I've already calculated these matrices for an arbitrary number of poles. I'm not going to show the final equations since they are disgustingly huge.

So, how do we control a system that is described using a linear state space model? That's where the Full-State Feedback, or FSFB, regulator comes in. It is ridiculously simple; way simpler than PID controllers if you've played around with those. I watched Christopher Lum's lecture on FSFB control and I can happily recommend the entire series of lectures.

To make the most basic type of FSFB regulator, we need to assume that every variable in our state is observable. The observable variables can be described as a subset of **x** called **y**:

In our case, *C* becomes the identity matrix of size *n* by *n* where *n* is the number of variables in our state. For our system, the observable state does not depend on the control, so *D* becomes 0, or more specifically, a column vector of size *n* by 1 where all elements are 0.

Finally, using this knowledge we can design our regulator. Since we can observe every variable at each time step, if we can set a proportional gain for each variable, and the system is controllable, then we can control the entire state. Yes, that's it. Just a proportional gain that we will call *K*. We will design our control **u** to be equal to -*K***x**.

Optimal control relies heavily on the concepts of system controllability and observability. It's not the goal of this article to explain those concepts but I've provided some links. You can also watch Christopher Lum's lectures.

After modeling our FSFB regulator we are left with one question: how do we optimize the gain *K*? This is where LQR, or the Linear-quadratic regulator, comes in. LQR tries to optimize *K* based on how *"costly"* the different variables in our state are. For our cart-pole, a costly variable might be the pole angle *θ*. We might want to make this 0 as fast as possible and worry about the other variables later. Using LQR, we can specify that *θ* is more costly than the others and then optimize our controller based on these criteria.
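For reference, the quadratic cost that LQR minimizes can be written in its standard form (my transcription):

```
J = \int_{0}^{\infty} \left( \mathbf{x}^\top Q\, \mathbf{x} + \mathbf{u}^\top R\, \mathbf{u} \right) dt
```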

*K* is optimized using the above equation. The costs of the different state variables are specified on the diagonal of the diagonal matrix *Q*. The costs of the control signals are likewise specified in the *R* matrix in the same manner.

As you can see, *K* is not included in the equation. To solve for *K* you use something called the continuous-time algebraic Riccati equation or CARE. You can look it up if you want but in Python, it can be solved by using SciPy and NumPy.

```
import numpy as np
from scipy.linalg import solve_continuous_are
S = solve_continuous_are(A, B, Q, R) # S is the solution for CARE
K = np.linalg.inv(R) @ B.T @ S # how to calculate K using S
```

And that's it, two rows of code! Pretty simple if you ask me.
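As a sanity check, here are the same two rows applied to a double integrator, a toy system of my choosing where the answer is known in closed form (K = [1, √3] for Q = I, R = 1):

```
import numpy as np
from scipy.linalg import solve_continuous_are

# Double integrator: state [position, velocity], control is an acceleration
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)          # equal cost on position and velocity
R = np.array([[1.0]])  # cost on control effort

S = solve_continuous_are(A, B, Q, R)  # S is the solution for CARE
K = np.linalg.inv(R) @ B.T @ S        # optimal gain, here [[1, sqrt(3)]]

# The closed-loop matrix A - B @ K has eigenvalues with negative real
# parts, which is exactly what "the regulator stabilizes the system" means
closed_loop_eigs = np.linalg.eigvals(A - B @ K)
```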

To implement controllers in Python there's already a library that you can use called python-control which I highly recommend. However, since I want to learn all of this from scratch, I've opted for making my own Full-State Feedback regulator class called `FSFB`. You can have a look at the code below.

```
from scipy.signal import cont2discrete
import numpy as np

class FSFB:
    def __init__(self, A, B, C, D, dt: float):
        self.A = A
        self.B = B
        self.C = C
        self.D = D
        self.dt = dt
        self.discretize()

    def discretize(self):
        dlti = cont2discrete((self.A, self.B, self.C, self.D), self.dt)
        self.A_d = np.array(dlti[0])
        self.B_d = np.array(dlti[1])
```

What I haven't yet explained is discretization and the `discretize` method. Since our computers work with zeros and ones, we have to adapt our linear model to work in discrete time rather than continuous time. The function `cont2discrete` takes in `A` and `B` (as well as `C` and `D`, but they stay the same after discretization) and calculates their discrete counterparts `A_d` and `B_d` based on the time step `dt`.
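To see what discretization actually produces, here's `cont2discrete` applied to a double integrator, a toy example of mine where the zero-order-hold result has a known closed form:

```
import numpy as np
from scipy.signal import cont2discrete

# Double integrator: a system whose discretization is known exactly
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
C = np.eye(2)
D = np.zeros((2, 1))
dt = 0.02

A_d, B_d, C_d, D_d, _ = cont2discrete((A, B, C, D), dt)
# With the default zero-order hold this gives exactly:
#   A_d = [[1, dt], [0, 1]]  and  B_d = [[dt**2/2], [dt]]
```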

I've also made a subclass of `FSFB` called `LQR` that implements a method, `calculate_K_lqr`, which calculates the gain *K* using *CARE* and its discrete counterpart using *DARE*. There's also the method `calculate_K_r` which calculates the gain `K_r` that lets the controller follow a reference signal `r`, which I won't go into detail about in this article. Finally, the feedback methods take in the current state and reference signal and calculate the next control signal.

```
from scipy.linalg import solve_continuous_are, solve_discrete_are

class LQR(FSFB):
    def calculate_K_lqr(self, Q, R):
        self.Q = Q
        self.R = R
        S = solve_continuous_are(self.A, self.B, Q, R)
        S_d = solve_discrete_are(self.A_d, self.B_d, Q, R)
        self.K_lqr = np.linalg.inv(R) @ self.B.T @ S
        self.K_lqr_d = np.linalg.inv(R + self.B_d.T @ S_d @ self.B_d) @ (self.B_d.T @ S_d @ self.A_d)

    def calculate_K_r(self):
        K_r = np.true_divide(1, self.D + self.C @ np.linalg.inv(-self.A + self.B @ self.K_lqr) @ self.B)
        K_r[K_r == np.inf] = 0
        K_r = np.nan_to_num(K_r)
        self.K_r = K_r.T
        K_r_d = np.true_divide(1, self.D + self.C @ np.linalg.inv(np.eye(self.A_d.shape[0]) - self.A_d + self.B_d @ self.K_lqr_d) @ self.B_d)
        K_r_d[K_r_d == np.inf] = 0
        K_r_d = np.nan_to_num(K_r_d)
        self.K_r_d = K_r_d.T

    def feedback(self, state, r):
        u = self.K_r @ r - self.K_lqr @ state
        return u

    def feedback_d(self, state, r):
        u_d = self.K_r_d @ r - self.K_lqr_d @ state
        return u_d
```

Finally! Using the environment that we set up in the last article and our new `LQR` class we can now regulate the cart-pole. I'm using a *Q* with ones on the diagonal and an *R* containing a single element 0.1 for *Va*. The initial conditions are *x* = -0.4, *dx* = 0, *θ* = 45° and *dθ* = 0. The reference signal, or goal, is *x* = 0.1, *dx* = 0, *θ* = 0° and *dθ* = 0. And... go!

Woohoo! Let's add another pole. Adding another pole makes the system even more unstable, so the starting angles for both poles will be 30° to ensure that our controller can regulate the system. And... go!

LQR works great if you can measure all state variables accurately. In real life, this is almost certainly impossible. There's always some noise in the system dynamics and measurements, and often you can't measure the entire state. That's where the Kalman filter and LQG control come in which we will have a look at next time.

We are also regulating an unrealistic system. Right now, the motor draws upwards of 2400 Watts: not ideal. To simulate a more realistic system, a good idea could be to do some system identification on the physical model when that has been built. But for now, we'll have to stick with the current system parameters.

I'll look into implementing a Kalman filter and combining it with LQR control to create an LQG controller. I've also read Daniel Piedrahita's article where he talks about regulating a cart-pole using exciting stuff like LQR-Trees and trajectory planning which I will have a look at later on!

Also, the motor driver for the physical model has arrived, so I'll get busy building that too!

So, the old model with a `Cart`, `Pole` and `Motor` class worked well with the forward Euler method, where integrating is easy. However, for higher-quality integration schemes, it helps to have the full state of your system and just one function that differentiates it; otherwise, it gets messy quickly.

The new model is just a single class called `CartPoleSystem` which represents the entire system: cart, poles and motor. The constructor takes in a cart and motor, represented as tuples, and a list of poles, represented as a list of tuples. It also takes in the gravitational acceleration `g` and the integration scheme represented as a string. We will return to the different kinds of integrators and how we implement them.

```
from __future__ import annotations
import numpy as np
from numpy import sin, cos, pi

color = tuple[int, int, int]

class CartPoleSystem:
    def __init__(
        self,
        cart: tuple[float, float, float, float, float, color],
        motor: tuple[float, float, float, float, float, float, float, color],
        poles: list[tuple[float, float, float, float, color]],
        g: float,
        integrator: str = "rk4"
    ):
        self.cart = np.array(cart[:-1], dtype=np.float32)
        x0, m, u_c, min_x, max_x, cart_color = cart
        self.m = m
        self.u_c = u_c
        self.min_x = min_x
        self.max_x = max_x
        self.cart_color = cart_color

        self.motor = np.array(motor[:-1], dtype=np.float32)
        Ra, Jm, Bm, K, r, min_Va, max_Va, motor_color = motor
        self.Ra = Ra
        self.Jm = Jm
        self.Bm = Bm
        self.K = K
        self.r = r
        self.min_Va = min_Va
        self.max_Va = max_Va
        self.motor_color = motor_color

        self.num_poles = len(poles)
        self.poles = np.array([pole[:-1] for pole in poles], dtype=np.float32)
        self.pole_colors = [pole[-1] for pole in poles]
        self.M = self.m + sum(mp for (_, mp, _, _) in self.poles)
        self.g = g
        self.integrator = integrator
        self.reset(self.get_initial_state())

    def reset(self, initial_state):
        self.state = np.array([
            initial_state
        ])
        self.d_state = np.array([
            np.zeros(len(initial_state))
        ])

    def get_initial_state(self):
        return np.hstack([
            np.array([self.cart[0], 0, 0]),
            np.hstack([[angle0, 0] for angle0 in self.poles.T[0]])
        ])
```

The constructor saves all of the named parameters and the cart, motor and poles as NumPy arrays. The `get_initial_state` method gets the initial state based on the parameters passed to the constructor. This state is then passed to the `reset` method which adds it as the first row of the `state` matrix. We also have the `d_state` matrix which stores the calculated derivative of the state at each time step. The full state with *n* poles is as follows:
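In my transcription, reading the layout off `get_initial_state` — cart position and velocity, the motor angle, then an angle and angular velocity per pole:

```
\mathbf{x} = \begin{bmatrix} x & \dot{x} & \theta_m & \theta_1 & \dot{\theta}_1 & \cdots & \theta_n & \dot{\theta}_n \end{bmatrix}^\top
```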

To get the next state, we need to differentiate the current state. The `differentiate` method does pretty much what the `update` methods of the previous codebase did, except for the full state in just one function.

```
    def differentiate(self, state, u):
        Va = u[0]
        sum1 = sum(m*sin(state[3+k*2])*cos(state[3+k*2]) for k, (_, m, _, _) in enumerate(self.poles))
        sum2 = sum(m*(l/2)*state[3+k*2+1]**2*sin(state[3+k*2]) for k, (_, m, l, _) in enumerate(self.poles))
        sum3 = sum((u_p*state[3+k*2+1]*cos(state[3+k*2]))/(l/2) for k, (_, _, l, u_p) in enumerate(self.poles))
        sum4 = sum(m*cos(state[3+k*2])**2 for k, (_, m, _, _) in enumerate(self.poles))
        d_x = state[1]
        dd_x = (self.g*sum1 - (7/3)*((1/self.r**2)*((self.K/self.Ra)*(Va*self.r - self.K*d_x) - self.Bm*d_x) + sum2 - self.u_c*d_x) - sum3)/(sum4 - (7/3)*(self.M + self.Jm/(self.r**2)))
        d_theta_m = d_x / self.r
        dd_thetas = np.hstack([
            [state[3+k*2+1], (3/(7*l/2)*(self.g*sin(state[3+k*2]) - dd_x*cos(state[3+k*2]) - u_p*state[3+k*2+1]/(m*l/2)))]
            for k, (_, m, l, u_p) in enumerate(self.poles)
        ])
        return np.hstack([d_x, dd_x, d_theta_m, dd_thetas])
```

The code is messy and could be refactored, but for now, it works. The differentiated state is returned in the following format:
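Matching the `return` of `differentiate`, that format should be (my transcription):

```
\dot{\mathbf{x}} = \begin{bmatrix} \dot{x} & \ddot{x} & \dot{\theta}_m & \dot{\theta}_1 & \ddot{\theta}_1 & \cdots & \dot{\theta}_n & \ddot{\theta}_n \end{bmatrix}^\top
```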

The system is brought forward in time with the `step` method. This method gets the current state and then differentiates and integrates it using either the forward Euler method or RK4 (I promise I will explain what this is later), which spits out the next state. It then clamps the next state to within its minimum/maximum boundaries and appends it, along with the current state's derivative, to the state matrices.

```
    def get_state(self, t=-1):
        return self.state[t]

    def step(self, dt: float, Va: float):
        state = self.get_state()
        next_state, d_state = None, None
        if self.integrator == "fe":
            next_state, d_state = fe_step(dt, self.differentiate, state, [Va])
        else:
            next_state, d_state = rk4_step(dt, self.differentiate, state, [Va])
        next_state = self.clamp(next_state)
        self.update(next_state, d_state)
        return next_state

    def clamp(self, state):
        x = state[0]
        if x > self.max_x:
            x = self.max_x
        elif x < self.min_x:
            x = self.min_x
        state[0] = x
        for k in range(self.num_poles):
            state[3+k*2] %= 2*pi
        return state

    def update(self, next_state, d_state):
        self.state = np.vstack([self.state, next_state])
        self.d_state = np.vstack([self.d_state, d_state])
```

Besides these important methods, we also have two helper methods that will be used later in the environment class.

```
    def max_height(self) -> float:
        return sum(l for _, _, l, _ in self.poles)

    def end_height(self) -> float:
        state = self.get_state()
        return sum(l*cos(state[3+k*2]) for k, (_, _, l, _) in enumerate(self.poles))
```

What exactly is numerical integration? Without going too heavy into math lingo, a numerical integrator is a function that receives a state and how to calculate its derivative and then tries to predict what the next state will be. We already touched on this with the forward Euler method which is a numerical integrator, just not a very good one. To brush up on the topic I can recommend watching a series of lectures by professor Steve Brunton on Differential Equations and Dynamical Systems that delves into numerical integration.

The problem with the forward Euler method, or FE, is that it tends to overshoot, and the overshooting only increases with a bigger time step. This can even lead to the simulated system becoming unstable when the time step increases over a certain threshold. This threshold depends on the system and how "dynamic" it is.
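You can see this on the classic test equation dx/dt = -λx, where forward Euler is only stable for dt < 2/λ; a quick sketch (a toy system of mine, not the cart-pole):

```
def fe_simulate(lam, dt, x0=1.0, steps=100):
    # Forward Euler on dx/dt = -lam*x: each step multiplies x by (1 - lam*dt),
    # so the simulation only decays when |1 - lam*dt| < 1, i.e. dt < 2/lam
    x = x0
    for _ in range(steps):
        x = x + dt*(-lam*x)
    return x

stable = fe_simulate(lam=10.0, dt=0.05)    # dt < 2/lam = 0.2: decays toward 0
unstable = fe_simulate(lam=10.0, dt=0.25)  # dt > 2/lam: oscillates and blows up
```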

The Runge-Kutta methods are methods for numerical integration that try to be more accurate than FE at the same time step. Brunton has a great explanation of the 2nd-order and 4th-order Runge-Kutta methods. In my simulation I'm using the 4th-order Runge-Kutta method, or RK4. I won't go into details about why it works; it has to do with cancelling error terms in the Taylor series expansion of the solution.

My implementations of FE and RK4 both receive a time step, the differentiation function, the state and the control signal. They then calculate the next state using the derivative of the current state.

```
def fe_step(dt: float, differentiate, x0, u):
    d_x = differentiate(x0, u)
    x = x0 + dt * d_x
    return x, d_x

def rk4_step(dt: float, differentiate, x0, u):
    f1 = differentiate(x0, u)
    f2 = differentiate(x0 + (dt/2) * f1, u)
    f3 = differentiate(x0 + (dt/2) * f2, u)
    f4 = differentiate(x0 + dt * f3, u)
    # weighted average of the four slopes, so d_x is a proper derivative estimate
    d_x = (f1 + 2*f2 + 2*f3 + f4) / 6
    x = x0 + dt * d_x
    return x, d_x
```

RK4 is great, but it's not enough for chaotic systems like our cart-pole system. It let me increase the time step from 0.02 to 0.025 without the system going unstable, but that was it. Chaotic systems typically need more powerful and tailor-made integration schemes. I might come back and find a better integrator than RK4.

OpenAI has an open-source package called Gym that is used for testing one's machine learning models on some predefined dynamic models. Gym already has a cart-pole environment, but this is an idealized version with only a single pole and no friction or motor. However, Gym also lets you define your own environment which is what we will do.

The environment is defined as a class called `CartPolesEnv` and is a subclass of Gym's `Env` class. The constructor takes in a `CartPoleSystem`, the time step `dt` and `g`. It initializes the screen, its size and the font that will be used for the rendering. Then, more importantly, it initializes the action space or control space. This object defines the minimum and maximum values, as well as the data type and shape, of our control signals, which in our case is the input voltage `Va`. We do the same for our observation space, which is our full state. The class `Box` is used for creating spaces of continuous signals. If discrete signals are preferred, there's also the `Discrete` class.

```
from numpy import radians, pi, sin, cos
from random import uniform
import gym
from gym import spaces
import numpy as np
from cartpolesystem import CartPoleSystem
import pygame
from time import perf_counter

class CartPolesEnv(gym.Env):
    def __init__(
        self,
        system: CartPoleSystem,
        dt: float,
        g: float,
    ):
        super(CartPolesEnv, self).__init__()
        self.system = system
        self.max_height = system.max_height()
        self.dt = dt
        self.g = g

        self.screen: pygame.surface.Surface | None = None
        self.font: pygame.font.Font
        self.i: int
        self.width = 1200
        self.height = 700
        self.size = self.width, self.height

        self.action_space = spaces.Box(
            low=system.min_Va,
            high=system.max_Va,
            shape=(1,),
            dtype=np.float32
        )
        num_of_poles = system.num_poles
        obs_lower_bound = np.hstack([
            np.array([system.min_x, np.finfo(np.float32).min]),
            np.tile(np.array([0.0, np.finfo(np.float32).min]), num_of_poles)
        ])
        obs_upper_bound = np.hstack([
            np.array([system.max_x, np.finfo(np.float32).max]),
            np.tile(np.array([2*pi, np.finfo(np.float32).max]), num_of_poles)
        ])
        self.observation_space = spaces.Box(
            low=obs_lower_bound,
            high=obs_upper_bound,
            shape=(2+2*num_of_poles,),
            dtype=np.float32
        )
```

An environment needs a `reset` method that starts a new episode. To train general control models, it helps if the new initial state is somewhat random. Therefore, the cart will start at a random position on the track and every pole gets an angle between -15 and 15 degrees. It also closes the current rendered window and returns the initial state. The observed state excludes the third element, the motor angle, since it's not important for balancing the poles.

```
    def get_state(self):
        state = np.delete(self.system.get_state(), 2, axis=0)
        return state

    def reset(self):
        self.close()
        initial_state = [uniform(self.system.min_x, self.system.max_x)*0.8, 0, 0]
        for _ in range(self.system.num_poles):
            initial_state.extend([radians(uniform(-15, 15)), 0])
        self.system.reset(initial_state)
        return self.get_state(), {"Msg": "Reset env"}
```

An environment also needs a `step` method. This method should reward the model if it's doing well and penalize it if it's doing badly. It should also terminate the episode if the model does something unwanted. Since I haven't started doing reinforcement learning (RL) on the model, I haven't fleshed out the reward and termination conditions. Right now, the episode ends if the cart goes beyond the track. There are no rewards, only a penalty if the episode is terminated. How to effectively reward an RL model is really tricky; we will definitely return to this method and make some improvements.
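As a taste of what a fleshed-out reward might look like, here's a hypothetical shaped reward. The function name and weights are mine, not part of the environment yet; it assumes the observed-state layout `[x, d_x, theta_1, d_theta_1, ...]` used by `get_state`. It rewards upright poles and penalizes drifting from the center and large control signals.

```
import numpy as np

def shaped_reward(state, Va, max_x, max_Va):
    # Hypothetical reward sketch, not yet part of CartPolesEnv.
    # Assumed observed-state layout: [x, d_x, theta_1, d_theta_1, ...]
    x = state[0]
    angles = state[2::2]
    upright = np.sum(np.cos(angles))    # +1 per pole pointing straight up
    centering = -0.1 * (x / max_x)**2   # stay near the middle of the track
    effort = -0.01 * (Va / max_Va)**2   # don't burn the motor
    return float(upright + centering + effort)

# Single pole, upright, centered, no input voltage
print(shaped_reward(np.array([0.0, 0.0, 0.0, 0.0]), 0.0, 0.8, 12.0))  # → 1.0
```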

```
    def step(self, action: float):
        reward = 0
        Va = action
        state = self.system.step(self.dt, Va)
        x = state[0]
        y = self.system.end_height()

        terminated = bool(
            x >= self.system.max_x or
            x <= self.system.min_x
        )
        if terminated:
            reward -= 10
        return self.get_state(), reward, terminated, False, {}
```

Last up we have the `render` method. This is the same PyGame code from last time, plus some optional text lines that can be rendered. There's also a `close` method which closes the PyGame window.

```
    def si_to_pixels(self, x: float):
        return int(x * 500)

    def render(self, optional_text: list[str] = []):
        if not self.screen:
            pygame.init()
            self.screen = pygame.display.set_mode(self.size)
            self.font = pygame.font.Font('freesansbold.ttf', 20)
            self.start_time = perf_counter()
            self.i = 0

        self.screen.fill(Colors.gray)
        state = self.system.get_state()

        x0 = self.si_to_pixels(state[0]) + self.width//2
        y0 = self.height//2
        pygame.draw.rect(self.screen, self.system.cart_color, (x0, y0, 20, 10))

        max_x = self.width//2 + self.si_to_pixels(self.system.max_x)
        min_x = self.width//2 + self.si_to_pixels(self.system.min_x)
        pygame.draw.rect(self.screen, self.system.cart_color, (min_x-10, y0, 10, 10))
        pygame.draw.rect(self.screen, self.system.cart_color, (max_x+20, y0, 10, 10))

        motor_x0 = min_x-100
        theta_m = state[2]
        motor_sin = self.si_to_pixels(sin(-theta_m)*0.05)
        motor_cos = self.si_to_pixels(cos(-theta_m)*0.05)

        pygame.draw.polygon(self.screen, self.system.motor_color, [
            (motor_x0+motor_sin, y0+motor_cos),
            (motor_x0+motor_cos, y0-motor_sin),
            (motor_x0-motor_sin, y0-motor_cos),
            (motor_x0-motor_cos, y0+motor_sin),
        ])

        x0 += 10
        for k, ((_, _, l, _), color) in enumerate(zip(self.system.poles, self.system.pole_colors)):
            x1 = x0 + self.si_to_pixels(l * sin(state[3+k*2]))
            y1 = y0 + self.si_to_pixels(-l * cos(state[3+k*2]))
            pygame.draw.line(self.screen, color, (x0, y0), (x1, y1), 10)
            x0 = x1
            y0 = y1

        texts = [
            f"Time: {round(self.i*self.dt,2)} s",
            "",
            "Cart:",
            f"Position: {round(state[0],2)} m",
            f"Velocity: {round(state[1],2)} m/s",
            "",
            "Motor:",
            f"Angle: {round(theta_m,2)} rad",
        ]
        texts.extend(optional_text)

        for k, text_k in enumerate(texts):
            text = self.font.render(text_k, True, Colors.black, Colors.gray)
            text_rect = text.get_rect()
            self.screen.blit(text, (0, 20*k, text_rect.width, text_rect.height))

        pygame.display.flip()
        self.i += 1

    def close(self):
        pygame.quit()
        self.screen = None

color = tuple[int, int, int]

class Colors:
    red: color = (255, 0, 0)
    green: color = (0, 255, 0)
    blue: color = (0, 0, 255)
    purple: color = (255, 0, 255)
    cyan: color = (0, 255, 255)
    yellow: color = (255, 255, 0)
    black: color = (0, 0, 0)
    white: color = (255, 255, 255)
    gray: color = (100, 100, 100)
```

Since RL isn't my strong suit I will begin balancing the pole using classic control theory. Specifically, I will try out full-state feedback control or FSFB using LQR. I have already linearized the nonlinear model for *n* poles so the rest should be easy (hopefully).

Also, most of the physical parts have arrived. The track is assembled so right now I'm waiting on the driver for the DC motor. I also have to decide what sensors to use. I was thinking about only using a camera, but it might be too inconvenient. Encoders on the motor and pendulum joints could be the way forward. We'll have to see...

I'm currently trying to build a physical model of this system and balance it using different control methods. To do this the *proper* way, I'll first need to model the system to optimize my controller and simulate the controlled system to verify that it works as intended, so let's do that.

Forget about the coding; first, we need to model the system. The Cart-Pole system is what's known as a nonlinear dynamic system. To begin visualizing the system in our heads, let's imagine a cart and just one pole.

The cart can move along the horizontal axis, or x-axis. It cannot move along the vertical axis, or y-axis. The pendulum is attached to the middle of the cart and can rotate freely in the xy-plane. The angle between the pendulum and the y-axis is a function of time and will be denoted as *θ*. Likewise, the cart's position on the x-axis is *x*. The pole has a length of *l* and a mass of *m* that exerts a gravitational force. There's also an input force *f* which our controller uses to balance the pole.

Not all forces are written out. Two of the missing forces are the friction between the cart and the track, which is proportional to the cart's velocity, and the friction in the pole's rotation joint.

The system can be mathematically described by several different methods, such as Newton's laws of motion or the Lagrangian. Since this is a popular and well-documented model in mechanics, I will save myself some work by using Colin D. Green's solution for the equations [1]. The equations describe a generalized cart-pole system with multiple poles indexed by *i*.
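The equation images didn't survive in this copy, so here is the pole equation — equation (2) below — reconstructed from how it's implemented later in `Pole.update` (with the half-length ℓ_i = l_i/2; treat the exact constants as implementation-specific):

```
\ddot{\theta}_i = \frac{3}{7\,\ell_i}\left(g\sin\theta_i - \ddot{x}\cos\theta_i - \frac{\mu_{p,i}\,\dot{\theta}_i}{m_i\,\ell_i}\right), \qquad \ell_i = \frac{l_i}{2}
```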

Now we can add as many poles as we please, but the model is not complete yet. The physical cart will be fastened to a timing belt that will be driven by a DC motor. The DC motor sits on the x-axis and can be described by two equations.

Since the inductance of small DC motors is very small, it has been excluded from our equations.

Now, our DC motor's output shaft is coupled to a timing belt on which our cart sits. Therefore, the velocity and acceleration of the cart are proportional to the angular velocity and angular acceleration of the output shaft, respectively, by the radius of the output shaft, *r*. Likewise, the output torque *TL* is proportional to the input force *f* at the cart by *r*. Using this information, we can reformulate equations (3) and (4) in terms of derivatives of *x*, *f* and *r*.
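The motor equation images are also missing here, but in standard DC-motor notation (armature resistance R_a, motor constant K, rotor inertia J_m, viscous friction B_m) the relations with the inductance neglected are, as a sketch consistent with the simulation code below:

```
i_a = \frac{V_a - K\omega}{R_a}, \qquad K i_a = J_m\dot{\omega} + B_m\omega + T_L
```

and substituting ω = ẋ/r and T_L = f·r gives the force the motor applies to the cart:

```
f = \frac{1}{r^2}\left(\frac{K}{R_a}\left(V_a r - K\dot{x}\right) - B_m\dot{x} - J_m\ddot{x}\right)
```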

Now we can get the final equation for the cart's acceleration by inserting equations (5) and (6) into (1).
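Since the equation image is gone, here is the resulting expression for the cart's acceleration reconstructed from `Cart.update` below (with s_i = sin θ_i, c_i = cos θ_i, ℓ_i = l_i/2, and M the total mass of cart plus poles):

```
\ddot{x} = \frac{g\sum_i m_i s_i c_i - \frac{7}{3}\left(\frac{1}{r^2}\left(\frac{K}{R_a}\left(V_a r - K\dot{x}\right) - B_m\dot{x}\right) + \sum_i m_i \ell_i \dot{\theta}_i^2 s_i - \mu_c\dot{x}\right) - \sum_i \frac{\mu_{p,i}\dot{\theta}_i c_i}{\ell_i}}{\sum_i m_i c_i^2 - \frac{7}{3}\left(M + \frac{J_m}{r^2}\right)}
```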

With the power of these two equations, we can now begin coding the system in Python. Heads up, the coding was quick and dirty and needs to be refactored; I just wanted to get a simulator out quickly.

Let's begin with the pole. Let's make it a class whose constructor takes in the mass `m`, initial angle `angle0`, length `l`, coefficient of friction `u_p` and an optional child pole `child` which is attached to the end of this one. It also takes in a color that will be used in the visual part of the simulator. Then we initialize the states of the pole as a dictionary `state` where the keys are the named variables and the values are lists of the variable values through time. We also have some imports for typing and for not having to type `math.sin` and `math.cos`.

```
from __future__ import annotations
from math import sin, cos

class Pole:
    def __init__(
        self,
        mass: float,
        angle0: float,
        length: float,
        pole_friction: float,
        color: tuple[int, int, int],
        child: Pole | None
    ):
        self.m = mass
        self.l = length
        self.u_p = pole_friction
        self.color = color
        self.state = {
            "theta": [angle0],
            "d_theta": [0],
        }
        self.child = child
```

Next up is an update method that updates the states of the system. The arguments are the time step `dt`, the acceleration of the cart `dd_x` and the gravitational acceleration `g`. The algorithm calculates the angular acceleration of the pole `dd_theta` by use of equation (2) and uses the forward Euler method to calculate the angular velocity `d_theta` and finally the angle `theta`. We then append the new states to the state lists.

```
    def update(self, dt: float, dd_x: float, g: float):
        theta = self.angle()
        d_theta = self.angular_velocity()
        dd_theta = 3/(7*self.lh())*(g*sin(theta)-dd_x*cos(theta)-self.u_p*d_theta/(self.m*self.lh()))
        d_theta = d_theta + dd_theta * dt
        theta = theta + d_theta * dt
        self.state["d_theta"].append(d_theta)
        self.state["theta"].append(theta)
```

We also make use of some helper methods. One of these is the `__iter__()` method, which puts the pole and all of its attached poles into a list that can be iterated through. We also have two methods that return the angle or angular velocity at a specific time step, with the default being the last value. Finally, the `lh()` method returns the half-length of the pole.

```
    def lh(self):
        return self.l/2

    def __iter__(self):
        poles = [self]
        pole = self.child
        while pole:
            poles.append(pole)
            pole = pole.child
        return iter(poles)

    def angle(self, t=None):
        if t is None:
            t = len(self.state["theta"])-1
        return self.state["theta"][t]

    def angular_velocity(self, t=None):
        if t is None:
            t = len(self.state["d_theta"])-1
        return self.state["d_theta"][t]

    def __str__(self):
        return f"Pole:\n Mass: {self.m}\n Length: {self.l}\n Friction: {self.u_p}\n Child: {self.child is not None}"
```

Moving on, we have the cart. Similarly, it has a constructor that takes in all parameters and initializes the state. Two of these parameters are a minimum position `min_x` and a maximum position `max_x`, which are the boundaries of the track. It also takes in a pole, or rather a chain of poles, and a DC motor which we haven't written yet.

```
class Cart:
    def __init__(
        self,
        mass: float,
        cart_friction: float,
        x0: float,
        min_x: float,
        max_x: float,
        color: tuple[int, int, int],
        motor: DCMotor,
        pole: Pole
    ):
        self.m = mass
        self.u_c = cart_friction
        self.min_x = min_x
        self.max_x = max_x
        self.color = color
        self.motor = motor
        self.state = {
            "x": [x0],
            "d_x": [0],
            "dd_x": [0]
        }
        self.pole = pole

    def __iter__(self):
        return iter(self.pole)

    def total_mass(self):
        M = self.m
        for pole in self:
            M += pole.m
        return M

    def x(self, t=None):
        if t is None:
            t = len(self.state["x"])-1
        return self.state["x"][t]

    def velocity(self, t=None):
        if t is None:
            t = len(self.state["d_x"])-1
        return self.state["d_x"][t]

    def acceleration(self, t=None):
        if t is None:
            t = len(self.state["dd_x"])-1
        return self.state["dd_x"][t]
```

The update method of the cart calculates `dd_x` by use of equation (7) and takes `dt`, the input voltage `Va` and `g` as arguments. With the forward Euler method, we calculate the velocity `d_x` and position `x`, but if the position is beyond the boundaries `min_x` or `max_x`, we clamp it to the nearest boundary. It then iterates through the chain of poles and updates every one of them. Finally, we also update the state of the motor.

```
    def update(self, dt: float, Va: float, g: float):
        x = self.x()
        d_x = self.velocity()
        M = self.total_mass()

        sum1 = sum([pole.m*sin(pole.angle())*cos(pole.angle())
            for pole in self])
        sum2 = sum([pole.m*pole.lh()*pole.angular_velocity()**2
            *sin(pole.angle()) for pole in self])
        sum3 = sum([pole.u_p*pole.angular_velocity()
            *cos(pole.angle())/pole.lh() for pole in self])
        sum4 = sum([pole.m*cos(pole.angle())**2 for pole in self])

        dd_x = (g*sum1-7/3*((1/self.motor.r**2)*(self.motor.K/self.motor.Ra
            *(Va*self.motor.r-self.motor.K*d_x)-self.motor.Bm*d_x)+sum2
            -self.u_c*d_x)-sum3)/(sum4-7/3*(M+self.motor.Jm/self.motor.r**2))

        d_x = d_x + dd_x * dt
        x = self.clamp(x + d_x * dt)

        self.state["dd_x"].append(dd_x)
        self.state["d_x"].append(d_x)
        self.state["x"].append(x)

        for pole in self:
            pole.update(dt, dd_x, g)
        self.motor.update(dt, d_x)

    def clamp(self, x: float):
        if x > self.max_x:
            return self.max_x
        elif x < self.min_x:
            return self.min_x
        return x
```

Lastly, the DC motor. I won't go into more detail than to say it has the simplest code. Since we are only interested in its shaft angle `omega` and angular velocity `d_omega` at the moment, most of the state updating happens in the cart, and therefore the update method for the motor is very concise.

```
class DCMotor:
    def __init__(
        self,
        resistance: float,
        K: float,
        inertia: float,
        friction: float,
        radius: float,
        color: tuple[int, int, int]
    ):
        self.Ra = resistance
        self.K = K
        self.Jm = inertia
        self.Bm = friction
        self.color = color
        self.r = radius
        self.state = {
            "omega": [0],
            "d_omega": [0]
        }

    def angle(self, t=None):
        if t is None:
            t = len(self.state["omega"])-1
        return self.state["omega"][t]

    def angular_velocity(self, t=None):
        if t is None:
            t = len(self.state["d_omega"])-1
        return self.state["d_omega"][t]

    def update(self, dt, d_x):
        d_omega = d_x / self.r
        omega = self.angle() + d_omega * dt
        self.state["d_omega"].append(d_omega)
        self.state["omega"].append(omega)
```

All we have to do now is initialize the parameters of our system, i.e. the cart, motor and poles, and give them some initial states.

```
cart = Cart(0.5, 0.05, 0, -0.8, 0.8, colors.red,
    DCMotor(0.05, 0.5, 0.05, 0.01, 0.05, colors.black),
    Pole(0.2, 10/180*pi, 0.2, 0.005, colors.green,
        Pole(0.2, 5/180*pi, 0.15, 0.005, colors.blue,
            Pole(0.2, 15/180*pi, 0.10, 0.005, colors.purple, None)
        )
    )
)
```

Here we've created a triple-pole system. Each pole has a mass of 0.2 kg and a coefficient of friction of 0.005 at its joint. The poles decrease in length from 0.2 m to 0.15 m to 0.1 m and are colored green, blue and purple. They have initial angles of 10, 5 and 15 degrees, converted to radians.

The cart is red and has a mass of 0.5 kg and a coefficient of friction between it and the track of 0.05. It starts at x=0 on the track and can travel from x=-0.8 m to x=0.8 m. The track, therefore, has a length of 1.6 m.

To simulate the system, we simply have to run a loop where we update the cart. To visualize this, I've made a bad, stupid, ugly simulator in PyGame. I'm not going to go over this code, but you are free to try it out.

```
import sys, pygame
from time import perf_counter
from math import pi, sin, cos
from cartpole import Cart, Pole, DCMotor

pygame.init()
size = width, height = 1200, 600
screen = pygame.display.set_mode(size)

class Colors:
    red = (255,0,0)
    green = (0,255,0)
    blue = (0,0,255)
    purple = (255,0,255)
    cyan = (0,255,255)
    yellow = (255,255,0)
    black = (0,0,0)
    white = (255,255,255)
    gray = (100,100,100)

colors = Colors

dt = 0.001
g = 9.81

cart = Cart(0.5, 0.05, 0, -0.8, 0.8, colors.red,
    DCMotor(0.05, 0.5, 0.05, 0.01, 0.05, colors.black),
    Pole(0.2, 10/180*pi, 0.2, 0.005, colors.green,
        Pole(0.2, 5/180*pi, 0.15, 0.005, colors.blue,
            Pole(0.2, 15/180*pi, 0.10, 0.005, colors.purple, None)
        )
    )
)

def si_to_pixels(x: float):
    return int(x * 500)

last_update = 0
start_time = perf_counter()
i = 0
font = pygame.font.Font('freesansbold.ttf', 20)

while True:
    current_time = perf_counter()-start_time
    if current_time > dt + last_update:
        last_update = current_time
    else:
        continue

    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            sys.exit()

    screen.fill(colors.gray)
    cart.update(dt, 12*cos(i*dt*2), g)

    x0 = si_to_pixels(cart.x()) + width//2
    y0 = height//2
    pygame.draw.rect(screen, cart.color, (x0, y0, 20, 10))

    max_x = width//2 + si_to_pixels(cart.max_x)
    min_x = width//2 + si_to_pixels(cart.min_x)
    pygame.draw.rect(screen, cart.color, (min_x-10, y0, 10, 10))
    pygame.draw.rect(screen, cart.color, (max_x+20, y0, 10, 10))

    motor_x0 = min_x-100
    motor_sin = si_to_pixels(sin(-cart.motor.angle())*0.05)
    motor_cos = si_to_pixels(cos(-cart.motor.angle())*0.05)

    pygame.draw.polygon(screen, cart.motor.color, [
        (motor_x0+motor_sin, y0+motor_cos),
        (motor_x0+motor_cos, y0-motor_sin),
        (motor_x0-motor_sin, y0-motor_cos),
        (motor_x0-motor_cos, y0+motor_sin),
    ])

    x0 += 10
    for pole in cart:
        x1 = x0 + si_to_pixels(pole.l * sin(pole.angle()))
        y1 = y0 + si_to_pixels(-pole.l * cos(pole.angle()))
        pygame.draw.line(screen, pole.color, (x0, y0), (x1, y1), 10)
        x0 = x1
        y0 = y1

    texts = [
        f"Time: {round(i*dt,2)} s",
        "",
        "Cart:",
        f"Position: {round(cart.x(),2)} m",
        f"Velocity: {round(cart.velocity(),2)} m/s",
        f"Acceleration: {round(cart.acceleration(),2)} m/s^2",
        "",
        "Motor:",
        f"Angle: {round(cart.motor.angle(),2)} rad",
        f"Angular velocity: {round(cart.motor.angular_velocity(),2)} rad/s",
    ]

    for k, text_k in enumerate(texts):
        text = font.render(text_k, True, colors.black, colors.gray)
        text_rect = text.get_rect()
        screen.blit(text, (0, text_rect.height*k, text_rect.width, text_rect.height))

    pygame.display.flip()
    i += 1
```

Concisely, we update the state of the system every 1 ms and then draw the different subsystems. The boundaries are marked by red squares, and the motor is a square that spins in place. We also write out some data in the top-left corner. The input voltage `12*cos(i*dt*2)` is a cosine function of time that makes the cart move backwards and forwards repeatedly.

And there we go, all done!

A big flaw in the simulator is that we are using the forward Euler method for numerical integration. This isn't as accurate as other methods, especially with a larger time step. In the future, I'll probably use a Runge-Kutta method instead, but I need to do some research first. A time step that is too big also makes the simulation go unstable; for my simulator, this happens with a time step greater than around 0.03 s.
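The instability is easy to reproduce on a toy problem. For an undamped oscillator (here a linearized pendulum, my own example with made-up constants), forward Euler adds a little energy every step, so at a fixed step size the amplitude grows without bound:

```
# Linearized pendulum: dd_theta = -(g/l) * theta (undamped)
g, l = 9.81, 0.2
dt = 0.03                      # the kind of step size where trouble starts
theta, d_theta = 0.1, 0.0

first_peak, last_peak = 0.0, 0.0
for i in range(2000):
    dd_theta = -(g / l) * theta
    # True forward Euler: both states advanced from the OLD values
    theta, d_theta = theta + d_theta * dt, d_theta + dd_theta * dt
    if i < 100:
        first_peak = max(first_peak, abs(theta))
    if i >= 1900:
        last_peak = max(last_peak, abs(theta))

print(first_peak, last_peak)   # the amplitude blows up over time
```

For this linear system, each forward Euler step multiplies the oscillation amplitude by √(1 + (ω·dt)²) with ω = √(g/l), which is always greater than 1 — so a smaller `dt` only slows the blow-up down, it never removes it.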

I've ordered all of the parts for the real physical model that I'm building. Until they arrive, I will continue developing the model and simulator. The model is incomplete as it does not include any gear ratio between the motor and timing belt.

The simulator is still very ugly and badly written. I plan to remake the simulator and convert it into an OpenAI gym environment so that it is easy to use for training a reinforcement learning model. Next up is doing just that and perhaps writing a simple controller for balancing a single pole.

[1] Green, Colin D. (2020). *Equations of Motion for the Cart and Pole Control Task*.