Enhancing deep reinforcement learning with integral action to control tokamak safety factor

Recent advances in the use of Artificial Intelligence to control complex systems make it suitable for profile plasma control. In this work, we propose an algorithm based on Deep Reinforcement Learning to control the safety factor profile with a feedback design. For this purpose, we first derive a device-specific control-oriented model with fast simulation time. Then, in order to enhance robustness with respect to external disturbances and model errors, we include an error time integrator into the controller. A cascade of the kinetic and magnetic models with the error time integrator is used in the learning procedure of the feedback controller. Finally, to illustrate the e ffi ciency of the proposed design procedure, the obtained controller is tested in a reference plasma simulator, the Raptor simulator.


Introduction
Because of the high uncertainties in the measurements and estimations of plasma profiles, as well as in the modelling of kinetic and magnetic dynamics, robust feedback control is crucial to obtain high-performance operations of tokamak reactors.In tokamak reactors, the safety factor has been found to be strictly related to Magnetohydrodynamic (MHD) activities [1]: therefore, controlling the safety factor to the desired profile becomes an essential step towards obtaining long-time discharges [2].In this article, we will consider the safety factor profile control problem during the so-called flat-top phase.
In the plasma control literature, it is common to interchangeably speak about the current profile, safety factor q (and its inverse ι-profile), magnetic flux gradient profile, and magnetic flux profile.From an operative point of view, a lot of contributions have been made by the plasma physics and control communities working together.For instance, an overview of the plasma control in the Tore Supra tokamak can be found in [3].More specifically, in [4] are shown experimental results using proportional feedback for the control of the internal inductance on Tore Supra, while in [5,6] the authors show the results on the DIII-D and JET tokamaks obtained by using optimal control applied on data-driven models.Subsequently, different strategies using first-principles-driven models have been developed from the control community.
The main challenge in model-based safety factor profile control for advanced mode operations is the derivation of dynamical models that are complex enough to retain the main physical properties and simple enough to be used for feedback design.A first attempt to model and numerically simulate the plasma profile evolution has been made in [7].Then, a control-oriented model describing the magnetic flux gradient and temperature evolution has been proposed in [8].Control strategies have been developed using a linear finite-dimensional approximation of the original PDE describing the magnetic flux dynamics [9][10][11].An optimal controller designed on a nonlinear model obtained by Galerkin approximation has been proposed in [12], while a backstepping controller has been designed on the nonlinear model obtained by finite differences in [13].Model predictive control strategies for the safety factor profile control have been presented in [14].Nonlinear robust safety factor profile control is developed in [15].Control algorithms for simultaneous control of magnetic and kinetic parameters of tokamak plasmas using finite-dimensional approximation are presented in [16][17][18][19].
Recently, much effort has been spent in designing controllers directly on the PDE model of magnetic and temperature diffusion.In [20], a sum-of-square polynomial technique has been used to construct a Lyapunov function to stabilize the closed-loop system, whereas in [21] the same technique has been used to optimize the bootstrap current.Furthermore, a Lyapunov-based controller has been designed in [22,23].A Lyapunov-based control technique for the kinetic and magnetic profiles designed on the linearization of the original equations has been proposed in [24].
The seminal work [25] provided a powerful control-oriented fast plasma transport simulator (Raptor) that helps the feedback design study and implementation for kinetic and magnetic profiles control for the TCV tokamak.This enabled the design, simulation and implementation of multiple controllers [26,27].For a broader overview of emerging and current challenges in tokamak control, we refer the reader to [28,29].
With the rise of Deep Neural Networks (DNNs) as tools for function approximation, the machine learning community took a step towards control problems of physical systems.Recently, Deep Reinforcement Learning (DRL) methods proved to be effective in solving complex nonlinear control problems [30,31].Guided by an optimization objective, DRL algorithms train a DNN to produce a sequence of almost-optimal inputs (or actions).This sequence of inputs is called policy.These algorithms are data-driven, as training evolves according to the interactions with the environment.One major advantage of DRL algorithms is their direct applicability to a large family of complex systems, especially in the case of model-free approaches, e.g.[32][33][34].The environment to be controlled is typically considered a black box.In order to estimate the future performances of the policy without knowing the environment, many DRL algorithms exploit an actor-critic structure.This family of methods exploits two or more DNNs (see e.g.[35]).The former is used for approximating the policy, while the latter predicts its performance by estimating the sum of future rewards.This model-agnostic approach enabled DRL algorithms to be applied on a wide variety of complex tasks and, most recently, also in the field of nuclear fusion [36].
In this paper, we propose to design a dynamic DNN controller complemented by a time integral of the error.Such an addition is valuable because, according to systems and control theory, the addition of an integrator in the feedback loop is known to solve the problems of constant reference regulation and constant disturbance rejection [37].This control strategy has indeed been shown to be effective for the control of some classes of linear and nonlinear Ordinary Differential Equations (ODE) [38], as well as some classes of linear and nonlinear PDEs [39].In [40], the authors presented a method to solve the regulation problem of linear systems in closed-loop with a neural network controller and an integrator.More recently, control design by RL has been combined with Model Predictive Control to solve the tracking problem of surface vessels [41].
However, few works have been so far dedicated to controlling PDEs using DRL.Because of the spatial differential operators present in PDEs, the state inherits some spatial regularity properties, such as local smoothness.This intrinsic property can be exploited to design specific RL algorithms that are able to deal with very large state spaces [42], e.g. the regularized fitted q-iteration (RFQI) algorithm.Furthermore, this method has been successfully applied to the control design of a multidimensional nonlinear problem such as a heating, ventilating and air conditioning system [42,43], but only with a finite amount of possible actions.The case in which the action space is infinite-dimensional has been investigated in [44].More recently, a Proximal Policy Optimization (PPO) algorithm has been used to design the controller for congested freeway traffic [45].DRL algorithms have already been used in the context of nuclear fusion control.In [46], the authors proposed a DRL technique to control the safety factor during the ramp-up phase, while in [47] the authors developed a DRL algorithm to train a feed-forward controller for the kinetic profiles.Recently, a DRL controller has been proposed in [48] for plasma shape control.A recent publication that aligns with our proposed work is [49], where the authors present an RL-based algorithm for simultaneous control of the 2 Reward at j R j Accumulated reward at j safety factor and normalized beta in the JT-60SA tokamak.In contrast to their approach, our contribution focuses on emphasizing the utilization of the time integral of the error as a parameter to be fed into the trained neural network.Furthermore, after presenting our control architecture and a novel training procedure, we claim that our proposed control strategy compensates for the disparities between the training model and real-world conditions.The contribution of this paper hinges on an original model of the kinetic and magnetic plasma dynamics: the resulting simulator is fast enough to be used for learning purposes by a DRL algorithm.In order to make the latter robust to constant model uncertainties and disturbances, we embed the controller with an error time integrator.To illustrate the robustness of the proposed controller, we test it on a different model than the one used for training, i.e. the Raptor simulator.Several authors proposed the inclusion of an error time integrator in the control loop, which is fundamental in practical operations to compensate for discrepancies between the model employed for control design and the actual plant.Integral action-based strategies appeared for instance in [50], with a modification of an LQR (referred to as LQI), and in [51], where a Lyapunov control strategy is designed based on the linearized partial differential equation (PDE).The main limitation of these works is that they require the knowledge of a linearized model around the desired equilibrium point.Furthermore, such approaches typically limit the applicability of the resulting controller to a small region of initial conditions around the equilibrium.In contrast, our work is based on a feedback design strategy that uses neural networks (NNs) trained on an arbitrarily large region of the state space, thereby yielding a controller applicable from any plant's initialization.Furthermore, we leverage on modelfree reinforcement learning algorithms, thereby circumventing the necessity for explicit knowledge of the system's dynamics in proximity to the desired operational point.
The paper is organized as follows: In Section 2 we propose some preliminaries on Reinforcement Learning control and its use in combination with integral action.In Section 3 the plasma model is written as a Markov decision process and the training algorithm is presented.The simulation results of the obtained controller on the Raptor simulator are shown in Section 4. Finally, some concluding remarks and comments on future works are given in Section 5.In the appendix section, additional context and clarification are provided for the presented material.Specifically, Appendix A introduces the model of the tokamak's kinetic and magnetic dynamics, Appendix B outlines the simulation algorithm, and Appendix C offers a simple example of implementing reinforcement learning control with integral action.

Background on Reinforcement Learning control
In this section, we briefly introduce the essential Reinforcement Learning concepts.A thorough discussion can be found in [52].Justified by Bellman's principle of optimality, Reinforcement Learning aims at optimally solving a problem by learning a sequence of maximum-reward actions (called policy).The policy optimization is driven by value functions.The state-value function V π (s j ) corresponds to the expected total discounted reward R j starting from state s j at the j time instance and then following the policy.We remark that the value function profoundly depends on the policy π.If the agent uses a given policy π to select an action from the state s j , the value function is given by where E [•] stands for the expectation.The optimal policy is the policy that corresponds to the maximum value V ⋆ (s j ) of the value function Dynamic Programming methods search for the optimal policy using the former equation, but they require knowledge of the model.To enable the concept of model-free Reinforcement Learning, it is necessary to introduce the stateaction value function or Q-function.The Q-function corresponds to the expected total discounted reward when the action a j is taken in state s j , and then the policy π is followed henceforth.Therefore, the Q-function is given by The optimal Q-function is given by and stands for the expected total discounted reward when the agent picks possible non-optimal action a j in s j , and then behaves optimally henceforth.The relation between the optimal Q and V function is expressed by If the optimal Q-function is known, then the optimal action a ⋆ j can be extracted by choosing the action a j that maximizes Q ⋆ (s j , a j ) a ⋆ j = arg max This is the reason why the knowledge of the Q function enables model-free RL.Finally, the advantage function measures how advantageous a certain action a j is with respect to the one drawn from the policy In RL, as well as in dynamic programming, the action is chosen through a policy that has the objective of maximizing the expected total discounted reward.In the present application, we have a system model with continuous state and action spaces, which inherently comprise an infinite number of elements.Due to this infinite nature, it becomes infeasible to fully explore the whole state and action spaces [52].Consequently, tabular dynamic programming methods cannot be employed to apply a reinforcement learning algorithm to our fusion control problem.Instead, the typical solution to tackle the continuous time nature of state and action spaces is to employ deep function approximations and estimate the value function and the policy.In our application, we chose to use the PPO algorithm, which is based on Trust Region Policy Optimization (TRPO).These algorithms use an actor-critic approach, where the actor is in charge of improving the policy based on the value function that is estimated by the critic.The actor and the critic correspond to function approximators parametrized by ϕ and θ, respectively.In particular, in our application case, these function approximators are selected to be DNNs and ϕ, θ are vectors collecting weights and biases.
The Critic's role is to evaluate the current policy prescribed by the actor.At each iteration of an episode p, the tuple (s p, j , a p, j , r p, j+1 , s p, j+1 ) is stored in a buffer B p .Each episode is of length J and each episode-related buffer B p is stored in a general buffer B. After a certain number of episodes, the parametrized value function V ϕ is updated to minimize the following loss function where Ê represents the estimated expectation and can be implemented as the mean where R p, j is the total discounted reward at iteration j of the p − th episode and N sample is the total number of samples.The critic parameters ϕ are updated numerically via gradient descent.
The Actor's role is to use the information from the Critic to update the current policy.To understand the PPO algorithm we first need to understand the optimization objective of Policy Gradient (PG) methods, defined as follows that can be implemented as (log π θ (a p, j | s p, j ) Â(a p, j , s p, j )) (11) where π θ (a p, j | s p, j ) is computed using the current neural network and Â(a j , s j ) using the Generalised Advantage Estimator (GEA) using the critic value estimator.It can be proven that (10) drives the policy in a gradient ascent fashion, with respect to the objective function [53].The policy π θ is the current neural network that gives the probability of picking a j when the environment gives a certain state observation s j .If for a certain couple (a j , s j ) the advantage is positive, the policy gradient updates θ to augment the probability of taking the action a j when the environment is in s j .Unfortunately, with the previously defined loss function, the parameter θ will often be updated far from the previous The minimum of the clipped and unclipped objective ensures the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective.With this scheme, we only ignore the change in probability ratio when it would improve the objective, and we include it when it deteriorates the objective.

Control design
In this section, we propose the design of a dynamic controller enhanced with knowledge about the integral of the error.Inspired by classical control-theoretic solutions (PIs and PIDs) and recent results on total stability [56,57], we embed the stabilization of a discrete-time integrator in the control objective.As such, we consider the problem of stabilizing the extended cascade system composed of the plant and the integrator, as in Fig. 1.The idea is to embed "memory" in the agent by providing some time-related information.Then, if the integrator's state is stabilized, we guarantee a zero-tracking error with respect to the (constant) reference.Note that such a property is due to the specific structure of the integrator dynamics.Actually, an alternative for embedding memory into the agent is to use Recursive Neural Networks (RNNs), e.g.[58,59].Yet, aside from the increased complexity in their training due to back-propagation through time, it would not be possible to guarantee perfect asymptotic tracking and (constant) disturbance rejection.Indeed, it is not possible to predict the dynamics information that will be stored in the RNN's latent space and its evolution.As stated above, the proposition of adding an integral state to the controller stems from regulation theory.To explain the need for an integral state, a toy example is considered in Appendix C. Fig. 1: General overview of a control scheme including a time integrator.In the context of this work, the controller is a DNN trained using a DRL algorithm.For more information regarding the control structure, we refer to Fig. 4 and Fig. 6.

Integral state for magnetic flux control
Consider ψ(R, Z) the poloidal flux of the magnetic field B(R, Z) passing through a disc centred at the toroidal axis at height Z and with surface S = πR 2 where R is the large plasma radius.Let the magnetic flux be defined as and its dynamics can be expressed by the following reaction-diffusion equation [8] where x = ρ/a ρ identifies the normalized spatial variable, D(x, t) and G(x, t) are diffusion parameters, while S (x, t) is the source term and are defined in (A.2).In this study, we consider that the magnetic flux is controlled by two Electron Cyclotron Current Drive (ECCD) systems, each characterized by its input power P eccd,i , where i ∈ 1, 2. Our case of study resembles the scenario presented in [51], where the two inputs are applied to the same spatial point.Specifically, the first antenna P eccd,1 has a positive effect on z j , while the second antenna P eccd,2 has a negative effect on it.Thus, from a theoretical point of view, the two inputs can be treated as a single input.The term ρ is the toroidal flux coefficient indexing the magnetic surfaces, defined as ρ = (2ϕ/B ϕ0 ) 1 2 , where ϕ is the toroidal magnetic flux and B ϕ0 is the value of the toroidal magnetic flux at the plasma center.The spatial index belongs to the interval ρ ∈ [0, a ρ ] where a ρ is the minor plasma radius corresponding to the Last Closed Flux Surface (LCFS).An important quantity of plasma control in tokamak devices is the safety factor.This distributed variable measures the toroidal over poloidal turns of a field line passing through a point (R, Z) in a toroidal plane.Since the magnetic field lines are assumed to be equal in the same magnetic surface, we can define the safety factor for each magnetic surface indexed by x.In particular, the safety factor is defined as the quotient between the toroidal and poloidal gradient, that using the previous definition of the toroidal magnetic flux, can be defined as where ϕ(x, t) is the toroidal flux defined in A.11. Another important quantity in plasma analysis is the ι-profile, which is also referred to as "rotational transform" The ι-profile is a more natural control variable since it proportionally depends on the poloidal flux gradient.Additional details about the model are discussed in Appendix A. The following equation gives the plasma thermal energy dynamics τ th = e −5.7466 P 0.0214 oh (1 + P eccd,1 ) 0.0426 (1 where, while P OH identifies the ohmic power and is defined in (A.12).Although temperature regulation is not the primary objective of this work, we still consider thermal dynamics in the model.The reason for including these dynamics is to account for the delay introduced by temperature diffusion.Rather than using a pure delay term, we chose to represent the temperature diffusion by an appropriate dynamic equation for more accurate modelling.At the same time, we made the decision not to incorporate the full distributed thermal diffusion model in order to minimize simulation time and achieve a sufficiently short training time for the RL algorithm.For more details regarding the temperature reconstruction using thermal energy and Ohmic power, we refer to Appendix B.
By utilizing an implicit-explicit time discretization and a fixed-step spatial discretization for equation (17), and an implicit-explicit time discretization for equation (20), we obtain the difference equation in the state variable Here, N is the number of elements used in the spatial discretization of equation (17).Hence, at iteration j, the vector ψ j ∈ R N denotes the discretized magnetic flux in N spatial points, while W th, j represents the thermal energy.
The relation between a j and P eccd,i and further information on the simulation algorithm employed in this study can be found in Appendix B.
In this work, the objective is to regulate the ι-profile ι(x, t) defined in (19) to a desired ι ⋆ (x).As ι depends on the magnetic flux gradient ∂ψ ∂x (x, t), the control objective can be reformulated as the regulation to a desired magnetic flux gradient profile ∂ψ ⋆ ∂x (x).Therefore, we define the discretized version of the magnetic flux gradient where Then, we select one element of the magnetic flux gradient to define the output y s = S ∂ψ ∂x , where S ∈ R 1×N is the selection matrix with a single element equal to one and zero elsewhere.From now on, we assume that the system's parameters are known to belong to a certain range of values.This means that we do not know the exact parameter's value, but we only have an estimation.Consider k n a generic system's parameter.According to the previous assumption, we know that k n ∈ [k n , kn ].We define the vector K as the collection of all the system's parameters , where N k is the total number of parameters.We denote by f K (z j , a j ) the magnetic flux dynamics with parameters K By simulating this system with constant input a ⋆ , it is possible to extract the magnetic flux gradient at witch the system stabilizes ∂ψ ∂x ⋆ .In particular, the equilibrium magnetic flux gradient is related to the equation's set of parameters K and the applied constant input a ⋆ .It is worth to remark that equilibrium magnetic flux's gradients obtained with the same input a ⋆ and different set of parameters K 1 and K 2 may be different, as shown in Fig. 2. In real applications, an integral action is vital since we have to compensate for the uncertainties coming from the modelling procedure.At the same time, the integral action allows compensation for the mismatch between Raptor and the training model.The use of the integral action allows the regulation of the integrated quantity to zero.Having at our disposal one input (the two antennas act in opposite directions on the magnetic flux dynamics, therefore they can be treated as a single input), in case of unreachable equilibrium, using the integral action we can regulate to zero the error y s, j − y ⋆ s .We define the discrete-time integrator state dynamics as We define the extended state as and therefore the extended dynamics can be defined as Since we do not know a priori the equilibrium of the integral state (since it depends on the unknown parameters K), we fix the integral state reference to zero ε ⋆ = 0.

Magnetic flux and Temperature dynamics plus integral state as a Markov Decision Process
In this section, we express the tokamak magnetic flux and temperature dynamical model together with the integral state in Markov Decision Process settings.At each time step j, the environment conditions are described by the state vector s j ∈ S, while the action corresponding to the power of the ECCD antennas can be picked from the action space a j ∈ A. An action a j is applied to the environment in state s j at a time j, which evolves to the state s j+1 according to the state transition probability P(s j+1 | s j , a j ).In other words, P(s j+1 | s j , a j ) represents the probability of ending in state s j+1 when we apply an action a j in state s j .In our case, since we have the model of the system, the state transition probability boils down to deterministic dynamics where δ(•) is the Dirac delta function.When applying the action a j , the agent receives the new state s j+1 together with a scalar reward r j+1 = r(s j+1 , a j ), as shown in Fig. 3.We select the reward such that Fig. 3: General representation of the interaction between the environment (system to be controlled) and the agent (controller) in the RL context. where where the notation I α×α stands for an identity matrix of dimensions α × α.The Q matrix is built to have a different cost at the diagonal entry corresponding to the error integrated by the integral state.It possible to notice that the third term in ( 29) can be rewritten as Therefore, the presence of this term in the reward allows to penalize the integral state variations.The policy π defines the mapping from the state space to the action space, which the agent modifies during the learning phase, and will be used once the learning procedure is completed.The policy can be deterministic or stochastic.A stochastic policy draws actions from a random distribution, whose state-dependent momenta are learned by the agent.A common choice in DRL algorithms is to draw from Gaussian distributions.Hence, the policy DNN π(s j ) = [µ(s j ), σ(s j )] T is a function of the state s j that returns the mean µ and the variance σ.Then, the probability of selecting a j when the system is in the state s j is In the case of deterministic policy, we have that Hence, in the case of deterministic policies, the policy DNN is usually trained to provide directly the next action.As in the standard DRL framework, we optimize over a discounted objective.Hence, the total discounted reward from time j onward is defined as where γ ∈ [0, 1] is the discount factor.The tokamak's environment, together with the integral action, is sketched in Fig. 4.

Training algorithm
In this section, we describe the strategy used in the training algorithm in order to make the agent learn how to use the integral state to regulate the desired error to zero.The implementation of the learning algorithm is summarized in Algorithm 1. Firstly, the parameters K ∈ K are selected according to the tokamak configuration that we want The agent draws an action a j using the current stochastic policy π θ k Update the state s j+1 = g K (s j , a j ) Compute the reward r j+1 = r(s j+1 , a j ) using ( ∂ψ P ∂x ⋆ , a ⋆ ) end Compute the total discounted reward R j Compute the Advantage estimate A j from current critic V ϕ k using the "Generalized Advantage Estimate" algorithm Update actor θ k+1 minimizing (14) Update critic ϕ k+1 minimizing (9).end to control.In the following, we refer to N E and N S as the number of episodes and episode steps in the training, respectively.We define the steady-state output set O as At the beginning of every episode, a couple of steady-state output and input ( ∂ψ ∂x ⋆ , a ⋆ ) ∈ O is selected.After analysing numerous equilibrium positions using different plasma parameters, we observed that the variation of the gradient's equilibria was small with respect to plasma parameter variations.In particular, we noticed that by changing the plasma parameters, we could shift the ∂ψ ∂x ⋆ maximum value and either flatten or sharpen the shape of the equilibrium function around this point.To avoid the time-consuming task of identifying all possible steady-state outputs with different parameters selection, we observed that a similar outcome could be achieved by adding a Gaussian function perturbation to the calculated equilibrium position computed with parameters where ) are continuous uniform random variable selected in different intervals.We chose to use the Gaussian function since it is a function that approaches zero at x = 1, 0, as we aimed to preserve the value of ∂ψ ∂x ⋆ at the plasma's center and the LCFS.This decision was influenced by the fact that the magnetic flux gradient remains fixed at zero at the center, while the flux gradient at the LCFS is primarily dependent on the magnetic central location, which experiences comparatively minimal variations compared to other plasma parameters.
After the modification of the equilibrium's shape, we ask the agent to stabilize the system to this unreachable equilibrium, and therefore we expect the agent to give priority to the regulation of the integrated error at one point of the spatial domain.Then, the learning procedure is carried out following the PPO algorithm described in the previous section.It is worth emphasizing that during the learning procedure, a stochastic policy is employed to improve exploration, whereas in Section 4, a deterministic policy will be utilized at test time.The selection matrix S , for the definition of the point to be regulated, is selected to be 0 in all entries except for the 5 th position, which is set equal to 1.This means that we seek to regulate to zero the error at the x = 0.2 position.We select a position near the plasma center since it is in the interval of the spatial domain where the measures are more reliable.Moreover, we are sufficiently near the deposit of the ECCD current that is in x = 0. To design the training-based controller, we carry out four different controller training in order to give some hints on the definition of the α i parameters depending on the desired closed-loop performances.It is worth mentioning that only the ratio between the free parameters matters in the learning of the final controller.Therefore, we arbitrarily fix the value of the α 3 , α 4 and R parameters, letting varying only the α 1 and α 2 parameters.In particular, the common cost parameters for the four trainings are while the α 1 and α 2 parameters (α 1 , α 2 ) are selected in the the set of pairs {(0.1, 0.05), (1, 0.5), (10,5), (100, 50)}.In Fig. 5a, we show the the episode reward mean (as defined in ( 29)) of four trainings with different α 1 and α 2 .In Fig. 5b we show the time-varying profiles at two points of the spatial domain, namely x = 0 and x = 0.2, of the closed-loop simulations using the four obtained controllers.The simulations are performed on the same model with which the controllers have been trained, while we set an unreachable equilibrium position, in the form of (36), as desired set-point.A proper functioning of the integral action becomes necessary to obtain the convergence toward the equilibrium at the integrator location, i.e. at x = 0.2.Throughout the different simulations, we employ the same unreachable equilibrium to ensure comparability between the simulations.After analysing  It follows from the comparisons of all these cases that reducing the values of α 1 and α 2 with respect to α 4 provides a quick output regulation at the desired point.However, this comes at the expense of a larger magnitude of overshoot.Furthermore, due to the difference in N steps across the various cases, we can formulate some preliminary hypotheses (which necessitate more extensive investigation for validation) concerning the learning behaviour of the RL algorithm.Specifically, in Cases 1 and 4, the α i parameters "guide" the controller's learning towards distinct objectives: achieving rapid integral action and minimizing overshoot, respectively.In Cases 2 and 3, the α i parameters steer the controller towards a balance between these two attributes.We infer that this difference is accountable for the larger N steps in Case 2 and 3, where the controller is "asked" to learn two distinct attributes rather than just one.
In the next section, the controller trained with the parameters of Case 1 is tested in closed loop with the Raptor simulator.The reason behind the selection of the controller of Case 1 lies in its sufficiently fast integral action.However, if the priority is to have reduced overshoot at the expense of a slower integral action, an alternative would be to select the controller of Case 3. We remark that the output of the training procedure is a trained neural network that takes as input the state s j and the reference point s ⋆ K and returns the values of the input a j to be applied.

Control results
In this section, we show the closed-loop simulations of the RL feedback control law applied to the Raptor simulator.In the following simulations, the parameters are selected equal to the ones in (B.7)-(B.8).Target ι-profiles, corresponding to steady-state plasma configurations, are generated by applying a constant input for a sufficiently long time in the Raptor simulator.If these profiles are used in the feedback control law without changing the Raptor configurations, we can achieve perfect tracking of the target profiles.Nevertheless, it is difficult to get perfect tracking in practical applications because of the high system uncertainties and the limited degrees of freedom in the available actuators.During the simulations, the ramp-up phase lasts until t = 0.02 s bringing the central current from I p = 80 kA to I p = 120 kA, while the feedback controller is activated at t = 0.1 s.During the flat-top phase, four different target profiles are given to the controller: the first at t = 0.1 s, the second at t = 2.5 s, the third at t = 5 s, and the last at t = 7.5 s.Therefore, to test the robustness of the RL control feedback, we set up three different control scenarios: • 1 st scenario: RL feedback applied on the Raptor simulator (with and without anti-windup).Fig. 6: Graphical representation of the proposed control scheme applied to the Raptor simulator.At each time step the trained neural network ( f NN ) returns a number (u j ) between 0 and 1.This number is mapped to the ECCD powers to be applied to Raptor.At the same time as the input is applied to Raptor, the integral state is updated.
• 2 nd scenario: RL feedback applied on the Raptor simulator with input disturbance.
• 3 rd scenario: RL feedback applied on the Raptor simulator with input disturbance and different integration point (no retraining).
The training procedure is done using the linear integrator dynamic described by (25).In the first simulation, the test is done using the same integrator dynamics as in the training.A strong overshoot can be observed for the target equilibrium corresponding to a constant feed-forward near the input's limits.This overshot problem is due to the windup problem caused by the simultaneous presence of the integrator and the input saturation.To solve this problem, we implement an anti-windup scheme modifying the integrator linear dynamics (25) into We refer to [60] for a survey on anti-windup techniques for linear, nonlinear, discrete, and continuous time systems.In Fig. 6, we show the controller design for the Raptor simulator.Fig. 7 shows the ι-profile evolution at four locations of the spatial domain x ∈ {0, 0.1, 0.2, 0.35} during a simulation time of length T sim = 10 s.In Fig. 7, we compare the simulation results in case the controller is equipped or not with the anti-windup algorithm.We remark that for every ι-profile stabilization, the controller without the anti-windup presents a larger overshoot when the feed-forward input (connected to the required reference) is closer to the saturation.Using the linear integrator dynamics, the integrator state can continue to integrate (possibly in the wrong direction) when the input is saturated.This means that the integrator will need some additional time to come back to a value in the interval where the input is not saturated.Using the nonlinear integrator dynamics in (38), we prevent the integrator state to vary in the wrong direction in case of input saturation.Notice that the trajectories with the anti-windup implementation present a smaller overshoot for some profiles and no overshot in others.Fig. 8 shows the ι-profiles at different time instants t ∈ {2.5, 2.7, 2.9, 3.1, 3.3} and the feedback control input a j = P eccd,1 P eccd,2 during the simulation interval.Comparing Figures 8a and 8c, we can remark that the ι-profile in the case of anti-windup implementation is much closer to the desired profile than in the case Fig. 7: Rotational transform profiles (ι(x, t)) at different points of the spatial domain {0, 0.1, 0.2, 0.35}.The orange curves represent the temporal evolution of the profiles without the implementation of the anti-wind-up algorithm, while the blue curves depict the profiles with the utilization of the anti-wind-up algorithm.
of linear integrator dynamics.Nevertheless, in both cases, the perfect tracking of the desired ι-profile is eventually achieved for all given references.
The same controller including the anti-windup integrator as in the last section is evaluated in the second scenario.In this scenario, we introduce an input disturbance at time t = 5.3 s, implemented with a Neutral-Beam Injector with ρ dep = 0.4 and w cd = 0.4.Both parameters ρ dep , w cd are set differently from the ones of the ECCD antenna in (B.8).Therefore, when the input disturbance is activated, the magnetic flux profile is modified and it is not possible to obtain the perfect tracking of the ι-profile reference as in the previous scenario.Fig. 9 represents the ι trajectories in four points of the spatial domain.Starting from the disturbance application, we can remark that the error at the integration point x = 0.2 is regulated to zero, while the trajectories in the other locations are not regulated to zero.Fig. 10a shows that the ι-profile is stabilized to the desired profile at t = 5.3 s, when the disturbance is introduced.Then, the profile is stabilized into a shape that coincides with the desired shape at x = 0.2 and it is different at the other locations of the spatial domain.

rd scenario.
In this third scenario, we use the same anti-windup controller used in the previous scenarios.In this scenario, the integration position is changed from p i = 5 to p i = 3.This means that the position where we want to regulate the error to zero is x = 0.1 instead of x = 0.2.We remark that the neural network has not been retrained in a different integration position, therefore with this simulation, we want to test the robustness of our control algorithm with respect to the integration point.Fig. 11 shows that before the disturbance application, the controller is able to perfectly track the desired profiles.While after the disturbance application, the error is regulated to zero at x = 0.1 and not in the other points.In Fig. 12a we remark that after the disturbance application, the ι-profile is stabilized to a shape where the error is zero at x = 0.1 position.

Conclusions
In this article, we have developed a dynamic Deep Neural Network (DNN) controller enhanced with an integral action using a Deep Reinforcement Learning (DRL) algorithm to regulate the plasma safety factor in a tokamak.(c) Rotational transform profiles (ι(x, t)) at different time instants without the implementation of the anti-windup algorithm.
(d) Applied ECCD power control input in case the anti-windup algorithm is not implemented.
Fig. 8: ι-profile and applied input comparison between controller with and without anti-windup.
Firstly, we provided an overview of the current state-of-the-art in reinforcement learning control.Subsequently, we derived a simplified, finite-dimensional nonlinear discrete system from the original distributed magnetic and kinetic plasma equations.Next, a DRL training algorithm has been proposed to design the controller for the plasma dynamics in cascade with the temporal integrator of the error to be regulated.The proposed simulator has been implemented in Python to be able to train the DNN controller using state-of-the-art DRL algorithms.Finally, we evaluated the performance of the obtained controller on the Raptor simulator under standard conditions, in the presence of external disturbances and when the time integrator integrates an error that differs from the one used during training.
A possible extension to this work is the realization of a second loop of optimization that learns how to select the best α i parameters of the reward function to optimize a cost function as the one proposed in Section 3.3 of [61].
Another future research is to use the same RL algorithm to achieve simultaneous stabilization of the temperature and the safety factor profile.Additionally, an important extension would be to obtain at least local stability guarantees for the closed-loop system with the proposed dynamic DNN controller.

Appendix A. Safety factor and thermal energy control model
Consider the reaction-diffusion equation describing the magnetic flux dynamics The coefficients D(x, t), G(x, t), and S (x, t) can be computed by following [62, eq.III-34] as where η ∥ (ρ, t) is the resistivity, µ 0 is the permeability of the free space, F is the diamagnetic function, V(ρ, t) is the plasma volume, V ′ = ∂V ∂ρ is the volume spatial derivative while C 2 and C 3 are space varying parameters depending on the considered plasma geometry configuration.j ni (x, t) is the non-inductive current source and includes the bootstrap current j bs as well as the ECCD density currents j eccd j ni = j bs + j eccd . (A.3) In this work, the bootstrap currents are computed according to [25] where T e is the electronic temperature, T i (x, t) ≈ α T i (t)T e (x, t) is the ions temperature, n e is the electron density, α is a constant parameter while k bs , L 31 , L 32 , L 34 , R pe are space varying parameters depending on the electronic and ion temperatures and on the plasma geometric configuration.The ion-to-electron temperature ratio can be fixed to α T i = 0.7.The electron density can be approximated by (a) Rotational transform profiles (ι(x, t)) at different time instants after the disturbance application.
(b) Applied ECCD power control input before and after the disturbance application.
Fig. 10: ι-profiles and applied ECCD power (P ECCD,i ) for the two antennas in case of external disturbance application.
where ne is the electron line average density, that in our case has been considered to be constant ne = 1 × 10 −19 .An appropriate and effective choice used in control-oriented plasma-dynamics simulators is to approximate the current density by a weighted Gaussian [8].According to [25], the ECCD efficiency can be modelled heuristically as j eccd,i (ρ, t) = c cd,i e ρ 2 /0.5 2 T e n e e −4(ρ−ρ dep,i ) 2 /w 2 cd,i P eccd,i (t) (A.6) where w dep is the deposition width and ρ dep is the location of the peak of the deposition, while P eccd,i is the power associated with the i-th antenna.The parameter c cd is a machine-dependent parameter that can be chosen to scale the expression to the experimentally obtained current drive values.The total ECCD current is obtained as the sum of the different antennas, that in this work are considered to be two j eccd (ρ, t) = j eccd,1 (ρ, t) + j eccd,2 (ρ, t). (A.7) According to [63], the conductivity can be computed as The Spitzer conductivity σ sptz depends on the electron temperature and on the effective value of the plasma charge Z e f f .This last parameter may in general vary spatially, but it is chosen here to be a fixed quantity for the whole plasma Z e f f = 3.5.The neoclassical correction c neo depends on the electron and ion collisionality parameters as well as on Z e f f .Both σ sptz and c neo are space and time-varying.Specific boundary conditions have to be considered both at the center and on the LCFS of the plasma.At the plasma center, the spatial variation of the flux is zero ∂ψ ∂x (0, t) = 0, (A.9) while at the LCFS, we consider a Neumann boundary condition where I p is the total plasma current.The toroidal flux ϕ(x, t) is defined as the magnetic flux passing through a poloidal surface centered at R 0 and with normalized radius x.Assuming that the toroidal magnetic field remains constant, it is possible to obtain an explicit formula for the toroidal flux [8] ϕ(x, t) = .11)Fig. 11: Rotational transform profiles (ι(x, t)) at different points of the spatial domain {0, 0.1, 0.2, 0.35} for a time period of 10 seconds.A disturbance, implemented as a Neutral-Beam Injector placed in a different place from the ECCD antenna control input, is applied at time t = 5.3 (s).In this simulation the position at which we want to regulate the error to zero is x = 0.1, while the controller was trained to regulate the error at x = 0.2.
An instrumental quantity for the temperature dynamics is the ohmic power U pl j tor dx (A.12) In the previous equation, U pl identifies the toroidal loop voltage while j tor corresponds to the toroidal density and they can be computed as The temperature diffusion equation writes Because of the high uncertainty of the proposed temperature model and in order to effectively diminish the simulation time, we choose to use an empirical reduced-order model that approximates the actual temperature dynamics, similar to the one proposed in [64].This model is composed of an ordinary differential equation representing the evolution of the thermal energy W th and a Neural network that takes as input the total power and the thermal energy and returns the distributed temperature profile.The plasma thermal energy is defined as )n e T e dV (A.17)(a) Rotational transform profiles (ι(x, t)) at different time instants after the disturbance application.
(b) Applied ECCD power control input before and after the disturbance application.
Fig. 12: ι-profiles and applied ECCD power (P ECCD,i ) for the two antennas in case the point to be regulated is different from the one the controller has been trained with and external disturbance application.
where n i ≈ α ni n e (x, t) is the ions density, e is the electron charge and W e , W i are the electrons and ions energy, respectively.The density ratio can be approximately computed as α ni ≈ (7 − Z e f f )/6.The exponents in the τ th expression in (20) have been obtained by applying linear regression on data obtained from Raptor simulations.In particular, the collected data together with the applied open-loop input, are organised in the vectors X and Y as following Linear regression is then applied to the couple (X, Y) to obtain the power constant values k 0 , k 1 , k 2 , k 3 such that τ th = e k 0 P k 1 tot (1 + P eccd,1 ) k 2 (1 The temperature profile is obtained as the output of an artificial Neural Network T e (x, t) = f NN (P tot , W th ). (A.20) The neural network has been trained using a set of temperature profiles associated with the total power and the thermal energy obtained by some Raptor simulations.For both the τ th linear regression and the NN training, the Raptor simulations have been obtained by applying different constant open-loop inputs to the system and extracting the total power, the thermal energy, τ th and the temperature profiles.

Appendix B. Simulation Algorithm
Employing a combination of implicit-explicit time discretization and fixed-step spatial discretization, as outlined in [8, Appendix A], system (17) can be approximated by the difference equation where ψ j , S j ∈ R N are N-dimensional vectors of the magnetic flux and the source term at N different point of the spatial domain at the j time iteration.The matrices A j ∈ R N×N and B j ∈ R N×N depend on the plasma physical parameters and change at every iteration j.The time discretization step is fixed at δt = 0.01, with an implicit-explicit ratio of h = 0.45, and the total simulation time is referred to as T sim .The space domain is divided into N = 21 discretization elements, with a fixed spatial discretization step of δx i = 0.05.Similarly, the thermal energy dynamics can be approximated by the difference equation W th, j+1 = d j W th, j + s j P tot, j (B.2) 20 where W th, j , P tot, j ∈ R are the thermal energy and the total power at the j time iteration.d j and s j are coefficients depending on τ th, j .It is worth remarking that our study case is similar to the one considered in [51], where the two available inputs act on the same spatial point: the first antenna P eccd,1 acts positively on z j while the second P eccd,2 acts negatively.The two input powers have limited maximum power P eccd,i ∈ [P eccd,i , Peccd,i ].To control the magnetic flux gradient, the control action corresponds to the difference between the two antennas' power.Inversely, the control action for the temperature profile corresponds to the sum of the two antennas' power.Since in this work we are interested in magnetic control, in the following we define a function mapping from the desired power difference to the value of each antenna power.Firstly, we define the control input a j ∈ [0, 1] that is mapped to the desired difference α j ∈ [α, ᾱ] between the two ECCD powers applied at the j time iteration where α = P eccd,1 − Peccd,2 and ᾱ = Peccd,1 − P eccd,2 .Given a desired power difference α j , the control input powers are mapped to minimize their sum P eccd,1, j + P eccd,2, j .The mapping can be expressed by P eccd,1, j = P eccd,1 P eccd,2, j = P eccd,1 − α j if α j < P eccd,1 − P eccd,2 P eccd,1, j = α j + P eccd,2 P eccd,2, j = P eccd,2 if α j ≥ P eccd,1 − P eccd,2 .

(B.4)
After the spatial discretization of the magnetic flux dynamics and the temporal discretization of both the magnetic flux and thermal energy dynamics, we obtain the difference equation in the state variable The steps for the plasma magnetic flux and temperature simulation are listed in Algorithm 2. The constant parameters are fixed as follows In the current experiment, we assume that the two ECCD actuators, described by the injected current density in (A.6), have the following parameters It is worth noticing that the Ohmic power at time instant j + 1 is computed with the variables η ∥, j , T e, j , T i, j , u j , belonging to the time instant j, as well as ψ j+1 , belonging to the time instant j + 1.With this simulation procedure, it is not possible to only use variables belonging to the time instant j + 1 for the P OH, j+1 calculation because P OH, j+1 itself is needed to compute T e, j+1 and T i, j+1 .This is an intrinsic property of this simulation procedure, introduced in [8], that avoids the implementation of a fixed point iteration research to find all the states at the time step j + 1.The nonlinear components of the model are delayed by one sample while an implicit-explicit scheme is used for the linear components, thus avoiding the fixed-point iteration algorithm to obtain a faster simulation.
To test the simulator's precision with respect to a certain tokamak configuration, we compare the trajectories obtained with the application of some constant open-loop controls with the ones obtained through the application of the same controls with the Raptor simulator.The Raptor simulator is a real-time predictor of the Ψ and T e profiles used as an observer in the TCV control environment [25].The kinetic and magnetic profiles are obtained by simulating two coupled nonlinear reaction-diffusion PDEs.Therefore, the Raptor simulator provides fairly precise simulation results 21  where δt is the discretization time step.The control objective is to find a control law capable of stabilizing the system to a desired position x ⋆ .Hence, we aim at steering the system to z ⋆ = [x ⋆ 0] ⊤ .Moreover, we want such a stabilization property to be robust, namely, we want the controller to stabilize the system even in presence of (constant) disturbances.An example of such disturbances may be the imperfect knowledge of the system's parameters, e.g., the spring's constant k = k 0 + δk with δk ∈ [−∆, ∆].Hence, we model the true plant to be controlled as z j+1 = (A D + ÃD )z j + B D a j , (C.4) where z j = [x j ẋ j ] T and ÃD embeds the constant unknown spring variation δk.Instead of using a more conventional control approach (e.g.Lyapunov-based, forwarding, etc.) we now want to use an RL algorithm to learn the policy (controller) for the former system.To do that, let us define the optimal problem via the reward where a ⋆ = k 0 x ⋆ is the steady state input for obtaining the desired equilibrium with a spring constant k = k 0 .The training is performed on a system with unitary parameters m = c = 1 and spring constants k 0 = 1 and ∆ = 0.2.The cost matrices are defined as Q = 0.001I and R = 0.01.In this example, we use a 2 layers neural network with 32 nodes for both the actor and the critic.The learning rate is set equal to γ = 0.001.Training is performed with a 2 × 10 6 total number of steps, while each episode has 500 steps.The time step is set equal to δt = 0.1.The training is performed using 8 environments in parallel using the PPO algorithm.At each episode, the spring parameter variation is selected randomly in the interval [−∆, ∆].As such, the second equation implies x e = x ⋆ .Hence, x ⋆ is reached even in presence of constant unknown variation of the spring constant.We rewrite the open-loop system with state s j = [x j ẋ j ε j ] T as We define and the reward as r j = (s j − s ⋆ ) T Q e (s j − s ⋆ ) + R(a j − a ⋆ ) 2 (C.10)where Q e is the new extended state cost matrix and a ⋆ = k 0 x ⋆ as before.We perform the same training as the one done for the system without the integrator state.In this case we fix .17we show the extended system in closed-loop with the trained control law in case k = 1.We can appreciate that in this case, the position converges to the desired equilibrium.Therefore, we have trained a robust controller that makes use of an integrator state to be robust with respect to constant parameter variation.It is possible to show that the same controller is robust also with respect to constant disturbances.

Fig. 2 :
Fig.2: Examples of different poloidal magnetic flux equilibriums of the reaction-diffusion equation describing the magnetic flux dynamics.In black we show that with a certain set of parameters K 1 , changing the applied constant input (ECCD power), the poloidal magnetic flux equilibrium changes.The red profile represent a possible poloidal magnetic flux equilibrium with a different set of parameters K 2 .

Fig. 4 :••
Fig.4: Graphical representation of the tokamak environment together with the integral state in the RL settings.The action allows to compute the future state.The future state together with the actual input allow to compute the future reward.
Fig 5a and Fig 5b, we class the four different cases depending on the speed of the integral action, the overshoot magnitude and the number of steps N steps after which no substantial controller changes were observed: • Case 1: α 1 = 0.1, α 2 = 0.05.Big overshoot, fast regulation.N steps = 3.5 × 10 6 ; • Case 2: α 1 = 1, α 2 = 0.5.Medium overshoot, medium speed regulation.N steps = 4.5 × 10 6 ; (a) Reward mean of four different training of the proposed RL algorithm.(b) Magnetic flux trajectories at two points of the spatial domain (x = 0 and x = 0.2) of the system in closed-loop with the four controllers obtained with different α i parameters.

( a )
Rotational transform profiles (ι(x, t)) at different time instants with the implementation of the anti-windup algorithm.(b)Applied ECCD power control input in case of implementation of the antiwindup algorithm.

( a )
Trajectories of poloidal flux gradient (∂ψ/∂x) over time at various points of the normalized spatial index {0.05,0.35, 0.5, 0.75}.The dashed lines represent trajectories obtained from an open-loop Raptor simulation, while the solid lines depict trajectories resulting from the simulator proposed in this study.(b) Poloidal flux gradient (∂ψ/∂x) profiles at different time instants {0.1, 0.2, 1.5}.The dashed lines represent trajectories obtained from an open-loop Raptor simulation, while the solid lines depict trajectories resulting from the simulator proposed in this study.
Fig. C.14 shows the evolution of the episode reward mean over time.In Fig. C.15, we show the mass-spring-damper system in a closed loop with the trained control law in case k = 1 with

Fig. C. 14 :
Fig. C.14: Reward mean over steps of the mass-spring-damper controller training without the integral action.

Fig. C. 15 :
Fig. C.15: Control action and position of the mass-spring damper system in closed loop with a controller without integral state that has been trained on a model with a different spring coefficient.

Fig. C. 16 :
Fig. C.16: Reward mean over steps of the mass-spring-damper controller training with integral action.
C.16 is depicted the evolution of the episode reward mean.In Fig.C

Fig. C. 17 :
Fig. C.17: Control action, integral state and position of the mass-spring damper system in closed loop with a controller with integral state that has been trained on a model with a different spring coefficient.

Table 1 :
Table of symbols.

Table A .
2: Table of symbols and corresponding units.Rotational transform profiles (ι(x, t)) at different points of the spatial domain {0, 0.1, 0.2, 0.35} for a time period of 10 seconds.A disturbance, implemented as a Neutral-Beam Injector placed in a different place from the ECCD antenna control input, is applied at time t = 5.3 (s).
(ρ, t) is the electron thermal diffusivity, τ d is the time-varying damping modelling the losses and S T (ρ, t) is the source term.In our specific application, where we consider two ECCD inputs, we have S T (ρ, t) = S T,eccd,1 (ρ, t) + S T,eccd,2 (ρ, t).