Related Parallel Machine Online Scheduling Method Based on LSTM-PPO Reinforcement Learning

doi:10.3969/j.issn.1004-132X.2022.03.009

Abstract

To solve the online scheduling problem on related parallel machines with the objective of minimizing the total weighted completion time, an online scheduling method based on long short-term memory proximal policy optimization (LSTM-PPO) reinforcement learning was proposed. An LSTM-integrated agent was designed to record the historical changes of the workshop states and the corresponding scheduling policy, so that online scheduling decisions could be made from the state information. A workshop state matrix was designed to describe the problem constraints and the optimization objective. An additional machine-waiting action was introduced into the scheduling action space to enlarge the solution space. A reward function was designed that decomposes the optimization objective into step-by-step rewards for evaluating scheduling decisions. Finally, the model was updated and its parameters were globally optimized by the PPO algorithm. Experimental results show that the proposed method outperforms several existing heuristic rules. The method was also applied to production scheduling in an actual workshop, where it effectively reduced the total weighted completion time.

Keywords: related parallel machines, online scheduling, reinforcement learning, long short-term memory proximal policy optimization (LSTM-PPO)
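The abstract adds an extra machine-waiting action to the scheduling decisions so that an idle machine may deliberately stay idle. One common way to realize such an action space is assumed here and not taken from the paper. The policy has one output per slot of a fixed-size view of the job queue, plus one output for "wait". Slots that hold no job are masked out before the softmax. The names `action_mask` and `masked_softmax`, and the fixed slot count, are hypothetical.

```python
import math
from typing import List


def action_mask(num_waiting: int, num_slots: int) -> List[bool]:
    """Valid-action mask for one idle machine (hypothetical fixed-size queue view).

    Actions 0..num_slots-1 pick the job in that queue slot; action num_slots is the
    extra 'wait' action that leaves the machine idle.
    """
    job_actions = [slot < num_waiting for slot in range(num_slots)]
    return job_actions + [True]  # in this sketch, waiting is always allowed


def masked_softmax(logits: List[float], mask: List[bool]) -> List[float]:
    """Softmax restricted to valid actions; invalid actions get probability 0."""
    best = max(score for score, ok in zip(logits, mask) if ok)  # shift for stability
    exps = [math.exp(score - best) if ok else 0.0 for score, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    mask = action_mask(num_waiting=2, num_slots=3)
    print(mask)  # [True, True, False, True]
    # Slot 2 is empty, so its large logit is ignored and the other three actions share the mass.
    print(masked_softmax([0.0, 0.0, 5.0, 0.0], mask))
```

Because "wait" stays in the valid set, the learned policy can trade a short idle period for starting a heavier job that is about to be released. A purely greedy dispatch rule never makes that trade.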
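The abstract states that the reward function decomposes the optimization objective into step-by-step rewards, but it does not give the formula. The following minimal sketch shows one decomposition with that property; it is an assumption made for illustration. The events are time 0, the release times and the completion times. For each interval between consecutive events, the reward charges the elapsed time multiplied by the total weight of jobs not yet finished. Each job is charged for every interval before it completes, so the rewards sum exactly to the negative total weighted completion time, -sum_j w_j * C_j. The helper name `stepwise_rewards` is hypothetical.

```python
from typing import List


def stepwise_rewards(completion: List[float], weight: List[float],
                     release: List[float]) -> List[float]:
    """Split -sum_j w_j * C_j into one reward per interval between consecutive events.

    No job finishes strictly inside an interval, because every completion time is an
    event. The set of unfinished jobs is therefore fixed over the interval.
    """
    events = sorted(set([0.0] + list(release) + list(completion)))
    rewards = []
    for start, end in zip(events, events[1:]):
        unfinished_weight = sum(w for c, w in zip(completion, weight) if c > start)
        rewards.append(-(end - start) * unfinished_weight)
    return rewards


if __name__ == "__main__":
    r = stepwise_rewards(completion=[4.0, 1.0, 2.0], weight=[1.0, 3.0, 2.0],
                         release=[0.0, 0.0, 1.0])
    print(r)       # [-6.0, -3.0, -2.0]
    print(sum(r))  # -11.0, i.e. -(1*4 + 3*1 + 2*2)
```

In a truly online setting the agent cannot see jobs that have not been released yet. Charging only released, unfinished jobs instead yields the total weighted flow time, sum_j w_j (C_j - r_j). That quantity differs from the objective by the schedule-independent constant sum_j w_j r_j, so it has the same minimizers.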
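The model parameters are updated with the PPO algorithm. For reference, the standard clipped surrogate objective of PPO (Schulman et al., 2017) is sketched below in plain Python. The ratio r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t) is computed from log-probabilities. The objective averages min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) over the sampled steps. The abstract does not give the paper's complete loss (for example, any value-function or entropy terms) or its hyperparameters. The value eps = 0.2 is only a commonly used default, and `ppo_clip_objective` is a hypothetical helper name.

```python
import math
from typing import Sequence


def ppo_clip_objective(logp_new: Sequence[float], logp_old: Sequence[float],
                       advantages: Sequence[float], eps: float = 0.2) -> float:
    """Mean clipped surrogate objective L^CLIP of PPO (to be maximized)."""
    total = 0.0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)                        # pi_new / pi_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)   # clip to [1-eps, 1+eps]
        total += min(ratio * adv, clipped * adv)           # pessimistic (lower) bound
    return total / len(advantages)


if __name__ == "__main__":
    # Step 1: probability rose 0.5 -> 0.8 with advantage +1, so the gain is capped at 1.2.
    # Step 2: probability fell 0.5 -> 0.25 with advantage -1, so the clipped term -0.8 is kept.
    value = ppo_clip_objective([math.log(0.8), math.log(0.25)],
                               [math.log(0.5), math.log(0.5)],
                               [1.0, -1.0])
    print(value)  # about 0.2, the mean of 1.2 and -0.8
```

Taking the minimum removes any incentive to push the ratio outside [1 - eps, 1 + eps] when doing so would improve the objective. Each update therefore stays close to the policy that collected the scheduling episodes.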
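The experiments compare the method with several existing heuristic rules, which the abstract does not name. As an illustration only, the sketch below implements a WSPT-style online rule (weighted shortest processing time first) on related parallel machines. It assumes the usual related-machine model, in which a job with processing requirement p_j takes p_j / s_i time units on a machine of speed s_i. It also reports the total weighted completion time of the resulting schedule. `Job` and `online_wspt` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Job:
    release: float  # time the job becomes known and available
    size: float     # processing requirement; takes size / speed on a machine
    weight: float


def online_wspt(jobs: List[Job], speeds: List[float]) -> Tuple[List[float], float]:
    """Online WSPT-style list scheduling on related parallel machines.

    Whenever a machine is idle and released jobs are waiting, start the waiting job
    with the largest weight/size ratio on the fastest idle machine. Returns the
    completion times and the total weighted completion time.
    """
    free_at = [0.0] * len(speeds)     # time at which each machine becomes idle
    completion = [0.0] * len(jobs)
    pending = set(range(len(jobs)))   # jobs not yet started
    t = 0.0
    while pending:
        waiting = [j for j in sorted(pending) if jobs[j].release <= t]
        idle = [i for i in range(len(speeds)) if free_at[i] <= t]
        if waiting and idle:
            j = max(waiting, key=lambda k: jobs[k].weight / jobs[k].size)
            i = max(idle, key=lambda k: speeds[k])
            completion[j] = t + jobs[j].size / speeds[i]
            free_at[i] = completion[j]
            pending.remove(j)
            continue
        # Nothing can start now: jump to the next release or the next machine becoming idle.
        t = min([jobs[j].release for j in pending if jobs[j].release > t]
                + [f for f in free_at if f > t])
    total = sum(job.weight * c for job, c in zip(jobs, completion))
    return completion, total


if __name__ == "__main__":
    jobs = [Job(0.0, 4.0, 1.0), Job(0.0, 2.0, 3.0), Job(1.0, 2.0, 2.0)]
    print(online_wspt(jobs, speeds=[1.0, 2.0]))  # ([4.0, 1.0, 2.0], 11.0)
```

This rule never leaves a machine idle while a released job is waiting. Deliberate idling is exactly the kind of choice that the extra waiting action described in the abstract is meant to make available to the learned policy.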

CLC number: TH166

HE Junjie, ZHANG Jie, ZHANG Peng, WANG Junliang, ZHENG Peng, WANG Ming. Related Parallel Machine Online Scheduling Method Based on LSTM-PPO Reinforcement Learning[J]. China Mechanical Engineering, 2022, 33(03): 329-338.


Link to this article: http://www.cmemo.org.cn/CN/10.3969/j.issn.1004-132X.2022.03.009

http://www.cmemo.org.cn/CN/Y2022/V33/I03/329

