Data & Analytics

Controlling Crack Propagation Using Reinforcement Learning

This article presents the application of a reinforcement learning control framework based on the Deep Deterministic Policy Gradient. The crack propagation process is simulated in Abaqus, which is integrated with a reinforcement learning environment to control crack propagation in a brittle material. The real-world deployment of the proposed control framework is also discussed.

Magic glow in ground cracks, green glowing texture.
Source: klyaksun/Getty Images

Cracks determine transport, mechanical, and chemical properties. The ability to control crack propagation in a material is crucial for manipulating the spatiotemporal variations of material properties. In the domain of subsurface engineering, the control of subsurface fractures improves the production of natural gas and petroleum, the development of engineered geothermal systems, and the long-term storage of CO2 (Pyrak-Nolte et al. 2015).

Reinforcement learning (RL) is a machine-learning technique developed specifically for learning sequential decision-making and control strategies. Unlike supervised and unsupervised learning, RL learns from the experience obtained through dynamically interacting with the environment by taking actions, observing the state of the environment, and receiving rewards based on the action and state. By continuously interacting with an environment, the RL agent learns a control policy that maximizes the expected cumulative discounted future reward. The Deep Deterministic Policy Gradient (DDPG) algorithm is an off-policy reinforcement learning algorithm introduced by Lillicrap et al. (2015). This algorithm is suitable for control problems that have high-dimensional, continuous state and action spaces.

DDPG is a model-free, off-policy algorithm that combines ideas from the deterministic policy gradient method, deep Q-learning, and actor-critic techniques. It aims to maximize the action-value (Q-value) function that evaluates the performance of the control policy. The Q-value is the expected return (also known as the expected cumulative discounted reward) after taking a specific action in a specific state and, thereafter, following that policy. By implementing deep neural networks as function approximators in both the actor and critic models, the DDPG algorithm is able to handle control problems with high-dimensional, continuous state and action spaces and is thus suitable for the crack-growth control problem investigated in this study.
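As a quick illustration of the Q-value definition above, the following minimal sketch accumulates a hypothetical sequence of per-step rewards with a discount factor; the reward values and the discount factor are assumptions chosen for illustration, not numbers from the study.

```python
import numpy as np

# Hypothetical per-step rewards received after taking an action and then
# following the current policy; gamma is an assumed discount factor.
rewards = np.array([-2.0, -1.0, -0.5, 0.0])
gamma = 0.99

# Q-value = expected cumulative discounted reward.
q_value = sum(gamma**k * r for k, r in enumerate(rewards))
print(q_value)  # approximately -3.48
```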

The DDPG-based RL control framework has been successfully applied by the authors for controlling crack propagation in simple and complex synthetic environments (Jin and Misra 2022, 2023). In this article, the DDPG-based RL control framework is integrated with the Abaqus simulator, which simulates crack propagation, to demonstrate the application of the RL technique for controlling crack propagation in a brittle material.

Description of the Environment Simulated in Abaqus
Abaqus is used to simulate crack propagation in a thin, rectangular material sample with a width of 13 mm and a height of 25 mm under surface traction, as shown in Fig. 1. The material sample has a Young's modulus of 50.0 GPa and a Poisson's ratio of 0.3. The material is under a maximum principal stress of 30.0 MPa with a maximum displacement of 0.05 and a cohesive coefficient of 1×10−5. The left edge of the material is fixed with an encastre boundary condition, and surface tractions are applied at the top and bottom edges. An initial crack with a length of 3.0 mm is located at the middle of the left edge.
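For reference, the geometry and material inputs listed above can be collected in a small configuration object. This is only an illustrative sketch: the values come from the text, but the key names (and the Python form itself) are assumptions rather than the authors' actual input deck.

```python
# Illustrative container for the simulated sample; values come from the text,
# the key names and unit suffixes are assumptions.
MATERIAL = {
    "width_mm": 13.0,
    "height_mm": 25.0,
    "youngs_modulus_GPa": 50.0,
    "poissons_ratio": 0.3,
    "max_principal_stress_MPa": 30.0,
    "max_displacement": 0.05,
    "cohesive_coefficient": 1e-5,
    "initial_crack_length_mm": 3.0,  # centered on the fixed (encastre) left edge
}
```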

Crack_Fig1.JPG
Fig. 1—The training environment created in the Abaqus simulator to learn the control policy for controlling crack propagation under surface traction. The thin, rectangular material sample has a width of 13 mm and a height of 25 mm. The top and bottom edges of the material are under surface tractions. The crack propagates from an initial crack on the left edge to a randomly assigned goal point on the right edge. The RL agent learns to control the magnitudes and directions of the surface tractions to accomplish the desired control task.
Source: Yuteng Jin and Sid Misra

An RL agent has to control the directions and magnitudes of the surface tractions on the top and bottom edges such that the crack propagates from the initial crack on the left edge to a predetermined goal point on the right edge (Fig. 1). For purposes of training the RL agent, the goal point is randomly placed on the right edge for each training episode, giving the learning agent more opportunity to explore the action and solution spaces and develop a robust control strategy to accomplish the desired control task.

The state of the environment is defined in terms of the location of crack tip (xtip, ytip) on the left edge and the location of the goal point (xgoal, ygoal) on the right edge. The state space is constrained by the area of the 2D material. The RL agent learns to reach the single goal point by controlling the directions, θtop and θbottom, and the magnitudes, σtop and σbottom, of the surface tractions on the top and bottom edges. The action space of θtop and θbottom is constrained between [−45°, 45°], and that of σtop and σbottom is constrained between [200.0, 1000.0] MPa.
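A minimal sketch of these state and action bounds is shown below. The array layout and the clipping helper are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

# State: (x_tip, y_tip, x_goal, y_goal) in mm, bounded by the 13 mm x 25 mm sample
# (the goal always lies on the right edge, so x_goal is fixed at 13 mm).
state_low = np.array([0.0, 0.0, 13.0, 0.0])
state_high = np.array([13.0, 25.0, 13.0, 25.0])

# Action: (theta_top, theta_bottom, sigma_top, sigma_bottom);
# angles in degrees, traction magnitudes in MPa.
action_low = np.array([-45.0, -45.0, 200.0, 200.0])
action_high = np.array([45.0, 45.0, 1000.0, 1000.0])

def clip_action(action):
    """Keep an agent-proposed action inside the physical bounds."""
    return np.clip(action, action_low, action_high)
```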

Interactions of the DDPG-Based RL Framework With the Abaqus-Based Environment
Fig. 2 shows the integration of the DDPG-based RL control framework with the Abaqus environment. The RL agent has to learn to control the crack propagation from the initial crack to the goal point by modifying the surface tractions. The RL agent learns the control policy by interacting with the Abaqus-based environment. An Abaqus script file determines the state of the environment. The Abaqus script file is modified according to the action (load) generated by the RL agent. The script file is used to build an Abaqus model based on the latest state and the newly generated boundary condition (load) and to submit the simulation job to Abaqus. Once the job is done, the script file extracts the crack information (PHILSM data) from the Abaqus output database to create the result file, which captures the crack propagation based on the action of the RL agent and the prior state of the environment defined using Abaqus. Based on the result file, a Python script calculates the reward to be assigned to the action of the RL agent and the latest state of the environment, which is used to build a new Abaqus model for further updates of the control policy.
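The loop described above might look roughly like the following sketch. Everything here is an assumption made for illustration: the file names, the stub helpers, the reward form, and the exact command line (a batch run of the Abaqus/CAE kernel is typical, but the authors' scripts are not reproduced here).

```python
import subprocess
import numpy as np

GOAL = np.array([13.0, 12.5])  # hypothetical goal point on the right edge (mm)

def write_abaqus_script(path, state, action):
    """Stub: render an Abaqus/Python script that rebuilds the model with the
    latest state and the surface tractions chosen by the agent."""
    with open(path, "w") as f:
        f.write(f"# state={state.tolist()}, action={action.tolist()}\n")

def parse_philsm(result_file):
    """Stub: read the crack-tip position recovered from the PHILSM
    signed-distance output (assumed here to be x, y pairs, one per line)."""
    tip = np.loadtxt(result_file)[-1]   # last recorded tip position
    return tip, tip[0] >= 13.0          # has the crack reached the right edge?

def environment_step(state, action, script="crack_job.py",
                     result_file="crack_result.txt"):
    """One environment transition: build the model, run Abaqus, score the result."""
    write_abaqus_script(script, state, action)
    subprocess.run(["abaqus", "cae", f"noGUI={script}"], check=True)
    tip, done = parse_philsm(result_file)
    reward = -float(np.linalg.norm(tip - GOAL)) if done else 0.0
    next_state = np.concatenate([tip, GOAL])
    return next_state, reward, done
```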

Crack_Fig2.JPG
Fig. 2—Integration of the DDPG-based RL control framework with the Abaqus environment. The workflow shows the interactions between the RL agent and the environment that generate transitions into a new state based on the prior state. A reward is assigned based on the usefulness of the new state toward the desired control task.
Source: Yuteng Jin and Sid Misra

For successful development of reinforcement learning, the reward function computes a suitable magnitude and behavior of the reward based on the latest action taken and the current/subsequent state of the environment. An optimum reward function provides necessary information to the neural networks implemented in the RL scheme to boost the speed of learning while maintaining stability in the learning (i.e., balancing the exploitation and exploration tradeoff). A reward function that gives negative rewards encourages the agent to reach the goal quickly, but it may lead to early termination to avoid receiving high accumulated penalties. A reward function that gives positive rewards encourages the agent to keep going to accumulate the rewards, but it may cause the agent to move slowly toward the target to accumulate the reward.
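The tradeoff described above can be made concrete with two toy reward shapes. Both functions below are illustrative assumptions, not the reward actually used in the study.

```python
def negative_reward(distance_mm):
    """Penalizes the remaining distance to the goal: the best possible value
    is 0, so the agent is pushed to finish the task quickly."""
    return -distance_mm

def positive_reward(distance_mm, scale=25.0):
    """Rewards proximity instead: the agent keeps collecting reward, which can
    tempt it to move slowly toward the goal to accumulate more."""
    return max(0.0, scale - distance_mm)
```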

RL Agent
The RL agent is represented by a neural network. Four deep neural networks are used as the learning agents in the reinforcement learning framework: the actor network μ, the target actor network μʹ, the function approximator for the Q-value function, referred to as the critic network Q, and the target critic network Qʹ. In this paper, the neural networks were built in the Keras platform. The target critic/actor networks have the same architecture as the critic/actor networks, with the same initial weights. The actor network takes the state as input and then deterministically computes the specific action, while the critic network takes both the state and action as inputs and then computes a scalar Q-value (Jin and Misra 2022, 2023).
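A minimal Keras sketch of such an actor/critic pair is shown below. The layer sizes, activations, and input dimensions are assumptions chosen for illustration; the article does not specify the authors' architectures.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_actor(state_dim=4, action_dim=4):
    """Maps a state to a deterministic action; the tanh output would be
    rescaled to the physical angle/traction bounds elsewhere."""
    state_in = layers.Input(shape=(state_dim,))
    x = layers.Dense(256, activation="relu")(state_in)
    x = layers.Dense(256, activation="relu")(x)
    action_out = layers.Dense(action_dim, activation="tanh")(x)
    return tf.keras.Model(state_in, action_out)

def build_critic(state_dim=4, action_dim=4):
    """Maps a (state, action) pair to a scalar Q-value."""
    state_in = layers.Input(shape=(state_dim,))
    action_in = layers.Input(shape=(action_dim,))
    x = layers.Concatenate()([state_in, action_in])
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(256, activation="relu")(x)
    q_out = layers.Dense(1)(x)
    return tf.keras.Model([state_in, action_in], q_out)
```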

The actor network represents the current deterministic policy by mapping the state to the specific action deterministically. In this study, the weight updates for the critic and actor networks are accomplished in TensorFlow with the Adam optimizer. The initial weights of the target critic/actor networks are the same as those of the critic/actor networks. At each time step, both target networks are updated slowly (also referred to as a "soft" update) according to a rate τ, where τ ≪ 1. The use of soft updates ensures stability in learning by making the moving target yi change slowly.
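The soft (Polyak) update can be written in a few lines of Keras. This sketch assumes the networks are standard Keras models and uses the τ = 0.005 value from Table 1 as a default.

```python
def soft_update(target_net, source_net, tau=0.005):
    """Move the target network a small step toward the learned network so the
    bootstrapped target changes slowly: w_target <- tau*w + (1 - tau)*w_target."""
    new_weights = [
        tau * w + (1.0 - tau) * w_target
        for w, w_target in zip(source_net.get_weights(), target_net.get_weights())
    ]
    target_net.set_weights(new_weights)
```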

Table 1 summarizes the tuning parameters used during the training stage. The actor/critic network learning rates and the target-network update rate are tuned to achieve fast and stable training. The discount factor represents the importance of future rewards. It is set to 0.0 because it takes only one step to reach the final state. The capacity of the replay buffer defines the maximum number of interactions stored. It is set equal to the total number of training episodes so that all previous training experience is utilized. The minibatch size defines the number of interactions used to update the networks at each simulation time step. The control process can easily be scaled to multiple steps. The whole training process takes about 5 hours using an Intel Xeon E5-1650 v3 CPU with 6 cores and 32 GB of RAM.

Parameter                        Value
Actor network learning rate      0.0005
Critic network learning rate     0.01
Target networks update rate τ    0.005
Discount factor γ                0.0
Capacity of replay buffer R      500
Minibatch size N                 64
Total training episodes          500

Table 1—Tuning parameters used during the training stage. The parameters are properly tuned to achieve optimal training performance.
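For convenience, the Table 1 settings can be gathered into a single configuration object. The dictionary below is only an illustrative restatement of the table, and the key names are assumptions.

```python
HYPERPARAMS = {
    "actor_learning_rate": 0.0005,
    "critic_learning_rate": 0.01,
    "target_update_rate_tau": 0.005,
    "discount_factor_gamma": 0.0,   # one-step episodes, so future rewards are ignored
    "replay_buffer_capacity": 500,  # equal to the total number of training episodes
    "minibatch_size": 64,
    "total_training_episodes": 500,
}
```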

Fig. 3 shows an example of an Abaqus simulation of crack propagation during the training stage, where a simulated crack-growth path from an initial crack in a thin, rectangular material is shown. Based on the Abaqus parameter PHILSM (a signed distance function describing the crack surface), Python code is developed to postprocess the Abaqus simulation result. The postprocessed result is shown in Fig. 4. The reproduced crack path closely matches the actual Abaqus simulation result. Fig. 5 shows the reward history in the training stage. The reward is measured based on the distance between the goal point and the actual crack tip when the crack reaches the right edge of the material. A small negative reward means the distance is small, which indicates good control, whereas a large negative reward means the distance is large, which indicates poor control. The maximum possible reward the RL agent can obtain is 0. Initially, the RL agent did not learn a proper control strategy. After 400 training episodes, the RL agent was able to develop a good and stable control strategy. This figure demonstrates the capability of the DDPG-based RL framework to control crack propagation in the Abaqus simulator.
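Because PHILSM is a signed distance function, the crack path can be approximated by collecting the nodes where the field is close to zero. The helper below is a hedged sketch of that idea; the array names, the tolerance, and the sorting choice are assumptions, not the authors' postprocessing code.

```python
import numpy as np

def crack_path_from_philsm(node_xy, philsm, tol=0.05):
    """node_xy: (N, 2) nodal coordinates; philsm: (N,) signed distances to the
    crack surface. Nodes with |PHILSM| < tol are treated as lying on the crack,
    and sorting them by x reproduces a left-to-right crack path."""
    on_crack = np.abs(philsm) < tol
    path = node_xy[on_crack]
    return path[np.argsort(path[:, 0])]

# Hypothetical usage, once node_xy and philsm are extracted from the Abaqus
# output database:
#   path = crack_path_from_philsm(node_xy, philsm)
#   # plot path[:, 0] vs. path[:, 1] together with the goal point (cf. Fig. 4)
```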

Crack_Fig3.JPG
Fig. 3—Example of an Abaqus simulation of crack propagation during the training stage. The crack propagation is a function of the material properties and the surface tractions. One simulated crack-growth path from an initial crack in a thin, rectangular material is shown. For each training episode, the reinforcement learning scheme trains the learning agent to control the crack growth from the left edge to reach the randomly selected goal point on the right edge. The initial crack of length l0 is at the center of the left edge. The RL agent needs to learn to control the directions and magnitudes of the surface tractions so that the crack propagates to the goal point on the right edge.
Source: Yuteng Jin and Sid Misra
Crack_Fig4.JPG
Fig. 4—Post-processed Abaqus simulation result. The left plot shows the reproduced crack path, which closely matches the actual Abaqus simulation result. The red line represents the crack path, and the blue dot on the right edge represents the goal point. The right plot shows the action generated by the RL agent for the desired control. The crack reaches the goal point by controlling the directions and magnitudes of the top and bottom surface tractions.
Source: Yuteng Jin and Sid Misra
Crack_Fig5.JPG
Fig. 5—Reward history in the training stage. The reward is measured based on the distance between the goal point and the actual crack tip when the crack reaches the right edge of the material. Initially, the RL agent did not learn a proper control strategy; consequently, the rewards are large negative values, which represent negative feedback. After 400 training episodes, the RL agent was able to develop a good and stable control strategy, represented by small negative-valued rewards close to zero.
Source: Yuteng Jin and Sid Misra

Challenges in Real-World Deployment of RL Agents
In real-world deployment, continuous monitoring of the state can be achieved by performing continuous computed tomography scans on the material. The crack location can also be inferred using sonic waves. A flowchart of learning from a numerical environment (simulator), followed by evaluation and deployment on real-world materials, is presented in Fig. 6.

The RL agent needs to interact with the real-world material by tuning certain engineering parameters that ultimately control the crack propagation/growth. In such cases, the signal/feedback returned to the agent from the environment needs to be properly defined. In addition, the engineering parameters to be tuned by the RL agent need to be properly defined and integrated with the actions generated by the RL agent. To train for deployment in a real-world scenario, noise needs to be added in the training stage to simulate the randomness in the environment. The RL agent can handle this randomness because it receives real-time feedback from the environment and can react instantly to achieve stable control.
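One common way to inject such randomness during training is to perturb the deterministic action with noise before it is applied. The Gaussian scheme below is a generic sketch (DDPG implementations often use Gaussian or Ornstein-Uhlenbeck noise), not necessarily the scheme used by the authors.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_action(action, low, high, stddev=0.1):
    """Add zero-mean Gaussian noise, scaled to each component's range, to the
    deterministic action, then clip back to the physical bounds."""
    noise = rng.normal(0.0, stddev, size=action.shape) * (high - low)
    return np.clip(action + noise, low, high)
```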

Crack_Fig6.JPG
Fig. 6—A flowchart of learning from a numerical environment (simulator) followed by evaluation and deployment on a real-world material.
Source: Yuteng Jin and Sid Misra

The real-world challenges to be encountered during the deployment of an RL agent are the following:

  • The reliability of the simulator can be a bottleneck on the training performance. A realistic simulator is required to decrease the gap between simulator and real-world environment in order to maintain a robust control during the deployment of the agent to the real world.
  • The efficiency of the simulator can be another bottleneck on the training performance. The simulation time for each training episode should be short considering thousands of episodes are required for the agent to learn the optimal control policy. For a complex environment, the required training episodes can be even higher. Consequently, the computational cost and time can be a technical challenge.

It is also possible to learn from a real-world environment (using an experimental device instead of the simulator in Fig. 6), followed by evaluation and deployment on a real-world material. This is much more expensive than the previous approach. Learning from real-world materials will require up to 20,000 material samples, which will pose technical challenges in terms of speed, cost, and infrastructure. Nonetheless, sufficient evaluation on real-world materials will be needed before deployment. In such a case, the real-world challenges to be encountered are the following:

  • Since the training process can take thousands of episodes, depending on the complexity of the problem, training RL agents in a real-world environment can be costly and requires careful development of laboratory infrastructure. In most cases, it is not feasible to train the agent in the real-world environment.
  • The sensors and controllers should be accurate and agile enough to provide robust monitoring and control.

Conclusions
This article presents the application of a DDPG-based reinforcement learning control framework to a real-world control problem. The crack propagation process is simulated in Abaqus, which is integrated into the reinforcement learning environment. The proposed control framework demonstrates a powerful control capability, with the potential to be applied to real-world domains including geomechanics, civil engineering, hydraulic fracturing for producing hydrocarbons and geothermal energy, ceramics, structural health monitoring, and materials science, to name a few. The real-world deployment of the proposed control framework, along with potential challenges and solutions, is also discussed.


Acknowledgements
This research work is supported by the US Department of Energy, Office of Science, Office of Basic Energy Sciences, Chemical Sciences, Geosciences, and Biosciences Division under the Award Number DE-SC0020675.


References
Pyrak-Nolte, L.J., DePaolo, D.J., and Pietraß, T. 2015. Controlling Subsurface Fractures and Fluid Flow: A Basic Research Agenda. US DOE Office of Science, United States.

Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. 2015. Continuous Control With Deep Reinforcement Learning. arXiv preprint.

Jin, Y. and Misra, S. 2023. Controlling Fracture Propagation Using Deep Reinforcement Learning. Engineering Applications of Artificial Intelligence 122: 106075.

Jin, Y. and Misra, S. 2022. Controlling Mixed-Mode Fatigue Crack Growth Using Deep Reinforcement Learning. Applied Soft Computing 127: 109382.