China Mechanical Engineering ›› 2022, Vol. 33 ›› Issue (01): 88-96. DOI: 10.3969/j.issn.1004-132X.2022.01.010


Quality Prediction of Automotive Parts for Imbalanced Datasets

LI Minbo1,2;DONG Weiwei1   

  1. Software School, Fudan University, Shanghai, 200433
  2. Shanghai Key Laboratory of Data Science, Fudan University, Shanghai, 200433
  • Online:2022-01-10 Published:2022-01-19

面向不平衡数据集的汽车零部件质量预测

李敏波1,2;董伟伟1   

  1. 复旦大学软件学院, 上海, 200433
  2. 复旦大学上海市数据科学重点实验室, 上海, 200433
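  • About the author: LI Minbo, male, born in 1970, associate professor, Ph.D. His research interests include industrial big data analytics and intelligent information processing for the Internet of Things. He has published more than 40 papers. Email: limb@fudan.edu.cn.
  • Funding:
    National Key Research and Development Program of China (2018YFB1703104);
    National Natural Science Foundation of China (61671157)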

Abstract: To address the imbalance between the numbers of qualified and unqualified automotive parts in product inspection datasets, an oversampling method for quality inspection data, MCDC-MF-SMOTE, was proposed based on density clustering and multi-process manufacturing features. First, density clustering was performed separately on the minority (unqualified) and majority (qualified) samples, and oversampling weights were then calculated from the multi-process manufacturing data and the sample distribution of the clusters. Synthetic samples were generated in the minority-class clusters according to the specified oversampling ratio and the cluster weights. The MCDC-MF-SMOTE method was used to generate a balanced quality inspection dataset of automotive parts. Random Forest was used to rank the importance of the manufacturing features and reduce the feature dimension. The LightGBM, XGBoost, SVM, and MNB classification models were integrated by Stacking to predict unqualified products. Experiments show that the method achieves higher stability and quality prediction performance. Compared with random sampling inspection, the detection rate of unqualified products is increased by approximately 63%.

Key words: data imbalance, quality prediction, density clustering, ensemble learning
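
The cluster-weighted oversampling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' MCDC-MF-SMOTE implementation: the abstract does not give the weight formula or the clustering settings, so the sketch assumes the minority-class clusters and their weights are already available and that every cluster is non-empty. The function names (`allocate_counts`, `interpolate`, `oversample_clusters`) are hypothetical, and the interpolation picks any two samples in a cluster rather than a sample and one of its k nearest neighbours as classic SMOTE does.

```python
import random


def allocate_counts(weights, total):
    """Split `total` synthetic samples across clusters in proportion to `weights`."""
    weight_sum = sum(weights)
    raw = [total * w / weight_sum for w in weights]
    counts = [int(r) for r in raw]
    # Hand out the samples lost to rounding, largest fractional part first.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[: total - sum(counts)]:
        counts[i] += 1
    return counts


def interpolate(a, b, rng):
    """Return a SMOTE-style synthetic point on the segment between a and b."""
    t = rng.random()
    return [x + t * (y - x) for x, y in zip(a, b)]


def oversample_clusters(clusters, weights, total, seed=0):
    """Generate `total` synthetic minority samples, cluster by cluster.

    `clusters` is a list of non-empty minority-class clusters (assumed given),
    each a list of feature vectors; `weights` holds one assumed weight per cluster.
    """
    rng = random.Random(seed)
    synthetic = []
    for members, n_new in zip(clusters, allocate_counts(weights, total)):
        for _ in range(n_new):
            if len(members) < 2:
                # A single-sample cluster cannot be interpolated; copy the sample.
                synthetic.append(list(members[0]))
                continue
            a, b = rng.sample(members, 2)
            synthetic.append(interpolate(a, b, rng))
    return synthetic


# Illustrative data only: two small minority clusters with assumed weights.
minority_clusters = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[10.0, 10.0], [12.0, 10.0], [11.0, 13.0]],
]
new_samples = oversample_clusters(minority_clusters, weights=[1, 4], total=10)
```

Because each synthetic point lies on a segment between two samples of the same cluster, generated data stay inside the region that cluster occupies, which is the motivation for clustering the minority class before oversampling.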

摘要: 针对汽车零部件质检数据存在合格品与不合格品数量不平衡的问题,提出了基于密度聚类与多工序制造特征的MCDC-MF-SMOTE质检数据过采样方法。先对少数类(不合格)与多数类(合格)样本分别进行密度聚类,再对多工序制造数据和类簇样本分布进行过采样权重计算;根据设定的过采样比和类簇权重,在少数类簇中进行过采样数据生成。使用MCDC-MF-SMOTE过采样方法生成汽车零部件质检的平衡数据集,并采用随机森林排序制造特征的重要性,对分类模型LightGBM、XGBoost、SVM和MNB进行Stacking集成来预测不合格品。与随机抽检相比,该方法对不合格产品的检出率提高了约63%。

关键词: 数据不平衡, 质量预测, 密度聚类, 集成学习

CLC Number: