An enhanced deep reinforcement learning algorithm for industrial process control

doi:10.16085/j.issn.1000-6613.2024-1289

Abstract

Deep reinforcement learning (DRL) algorithms have attracted considerable attention in industrial process control because they can learn and optimize control policies solely through agent-environment interaction, without relying on historical data or prior knowledge. Among DRL models, the twin delayed deep deterministic policy gradient (TD3) model effectively addresses the Q-value overestimation of the deep deterministic policy gradient (DDPG) model, which leads to suboptimal policies and poor robustness, and has established itself as a leading DRL model for industrial process control. However, the original TD3-based controller shows limitations in industrial processes with considerable policy fluctuations; in particular, Q-value underestimation can degrade control performance. To address these limitations, this study proposes an enhanced TD3 (ETD3) model for industrial process control. The ETD3 model first establishes an evaluation criterion to assess whether the actor network parameters are overestimated or underestimated, and adjusts the loss function fed to the critic networks according to the assessment. The fixed learning rate of the original TD3 model is then replaced by a triangular decay cycle learning rate to improve training convergence and control performance. Finally, the effectiveness of the ETD3 model is verified by applying the ETD3 controller to an industrial natural gas dehydration process under different disturbances.

Keywords: process control, deep reinforcement learning, twin delayed deep deterministic policy gradient (TD3) model, triangular decay cycle
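The triangular decay cycle learning rate mentioned in the abstract can be pictured with a small sketch. The exact schedule used in the paper is not given in this section, so the version below is only an assumption: it follows the common "triangular2" pattern, in which the learning rate ramps linearly between a lower and an upper bound and the triangle amplitude shrinks by a fixed factor after every cycle. The function name, parameter names, and default values are all hypothetical.

```python
def triangular_decay_lr(step, base_lr=1e-4, max_lr=1e-3, half_cycle=1000, decay=0.5):
    """Cyclical learning rate whose triangle amplitude shrinks every cycle.

    Assumed schedule (not taken from the paper): a linear ramp from base_lr up
    to the current peak and back down over 2 * half_cycle steps, with the
    amplitude multiplied by `decay` after each completed cycle.
    """
    cycle = step // (2 * half_cycle)             # index of the current cycle: 0, 1, 2, ...
    x = abs(step / half_cycle - 2 * cycle - 1)   # goes 1 -> 0 -> 1 within one cycle
    amplitude = (max_lr - base_lr) * decay ** cycle
    return base_lr + amplitude * max(0.0, 1.0 - x)


# Learning rate at the start, first peak, end of cycle 0, and second peak.
schedule = [triangular_decay_lr(s) for s in (0, 1000, 2000, 3000)]
```

With these assumed defaults the first peak reaches `max_lr`, and each later peak sits closer to `base_lr`, which is one way to combine early exploration of the learning-rate range with more stable late-stage training.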

CLC number:

TP3-05

ZHANG Jiaxin, DONG Lichun. An enhanced deep reinforcement learning algorithm for industrial process control[J]. Chemical Industry and Engineering Progress (化工进展), 2025, 44(10): 5563-5569.

Figures/Tables (10)

Fig. 1 Schematic of the actor-critic (AC) network

Table 1 Pseudocode of the ETD3 method

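Method 1: ETD3

Initialize the parameters of the two critic networks, $\theta_1^{Q}$ and $\theta_2^{Q}$, and of the actor network, $\theta^{\mu}$.

Initialize the maximum number of training episodes M, the maximum number of steps per episode N, and the episode termination condition Finish.

Initialize the replay buffer size $B = 50000$ and the soft target update coefficient $\tau = 0.005$; set the learning rate to a triangular decay cycle function.

Initialize the parameters of the two target critic networks, $\theta_1^{Q'}$ and $\theta_2^{Q'}$, and of the target actor network, $\theta^{\mu'}$.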

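for $t = 1$ to $T$ do:

Select an action from the actor network with exploration noise:

$a \sim \mu(s \mid \theta^{\mu}) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma)$

Compute the reward $r$ and observe the next state $s_{t+1}$.

Store the transition tuple $(s, a, r, s')$ in the replay buffer $B$.

if $t > M$ then:

Sample a mini-batch of $N$ transitions $(s, a, r, s')$ from $B$.

$a' \leftarrow \mu'(s') + \varepsilon$

$\varepsilon \sim \mathrm{clip}\left(\mathcal{N}(0, \sigma), -c, c\right)$

Compute the target value:

$y_t \leftarrow r(s, a) + \gamma \min_{i=1,2} Q_{\theta_i'}(s', a')$

Compute the loss function of each of the two critic networks:

$L(\theta_i^{Q}) = \mathbb{E}_{s, a \sim (s, a, r, s')}\left[\left(Q_{\theta_i}(s, a \mid \theta^{Q}) - r - Q_{\theta_i'}\,\varphi(s, a \mid \theta^{Q})\right)^{2}\right]$

Update the parameters of the two critic networks by gradient descent:

$\theta_1^{Q} \leftarrow \theta_1^{Q} - \alpha \nabla_{\theta_1^{Q}} L(\theta_1^{Q})$

$\theta_2^{Q} \leftarrow \theta_2^{Q} - \alpha \nabla_{\theta_2^{Q}} L(\theta_2^{Q})$

if $t \bmod d = 0$ then:

Compute the update objective of the actor network:

$\max_{\theta^{\mu}} \mathbb{E}_{s, a \sim (s, a, r, s')}\left[\left. Q(s, a) \right|_{a = \mu(s \mid \theta^{\mu})}\right]$

Update the target network parameters with a soft update:

$\theta_i^{Q'} \leftarrow \tau \theta_i^{Q} + (1 - \tau)\,\theta_i^{Q'}, \quad i = 1, 2$

$\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\,\theta^{\mu'}$

end if

end if

end for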
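Three steps of the pseudocode above are shared with standard TD3 and can be illustrated in isolation: the clipped noise added to the target action, the target value built from the smaller of the two target critic estimates, and the soft update of the target parameters. The following is a minimal sketch under simplifying assumptions, not the authors' implementation: actions and Q-values are scalars, parameters are flat lists of floats, and the values of `gamma`, `sigma`, and `c` are hypothetical (only $\tau = 0.005$ comes from the table). The ETD3-specific loss adjustment is not covered, since its form is not specified in this section.

```python
import random


def smoothed_target_action(mu_target, s_next, sigma=0.2, c=0.5, rng=random):
    """a' = mu'(s') + eps, with eps drawn from N(0, sigma) and clipped to [-c, c]."""
    eps = max(-c, min(c, rng.gauss(0.0, sigma)))
    return mu_target(s_next) + eps


def td3_target(r, q1_next, q2_next, gamma=0.99):
    """y = r + gamma * min(Q1'(s', a'), Q2'(s', a'))."""
    return r + gamma * min(q1_next, q2_next)


def soft_update(target_params, online_params, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta', applied element by element."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(online_params, target_params)]


# Hypothetical stand-ins for the target actor and the two target critics.
mu_target = lambda s: 0.5 * s
q1_target = lambda s, a: s + a
q2_target = lambda s, a: s + 0.9 * a

a_next = smoothed_target_action(mu_target, s_next=1.0)
y = td3_target(r=1.0, q1_next=q1_target(1.0, a_next), q2_next=q2_target(1.0, a_next))
new_target = soft_update(target_params=[0.0, 1.0], online_params=[1.0, 1.0])
```

Taking the minimum of the two critic estimates is what counteracts overestimation in TD3; it is also the step that can bias the target downward, which is the underestimation issue the abstract refers to.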