Health Prognosis for Equipment Based on ACO-K-Means and MCS-SVM under Small Sample Noise Unbalanced Data

Abstract: To address residual life prognosis for manufacturing systems when sample data are scarce and unevenly distributed, this paper proposes a model for equipment health status analysis and life prognosis based on improved ant colony optimization K-Means (ACO-K-Means) and multi-classification Self-Adding SVM (MCS-SVM). First, the fuzzy data set is classified with a traditional SVM to obtain the initial classification results. Second, the K-Means algorithm improved by the ant colony algorithm clusters the initially classified data set to obtain health status labels for more states. A noise scale coefficient is established, and the data set distribution is optimized by introducing an unbalanced scale standard and an adaptive addition rule, which enriches the sample capacity of scarce labels under the influence of noise. On this basis, an SVM set is introduced according to the number of clusters to achieve multi-classification of the data set. Finally, simulation on state data from a Caterpillar hydraulic pump shows that the two improved algorithms can accurately analyze the health state and predict the lifetime of equipment under small, noisy, and unbalanced samples.


INTRODUCTION
With the continuous development of modern technology, the accuracy of equipment life assessment in industrial production keeps improving. Timely and effective assessment of equipment health status and prediction of remaining equipment life directly affect the work efficiency of enterprises. If the assessment of the equipment deviates, it is likely to cause huge economic losses and casualties [1]. Thus, the effective evaluation and health management of machinery and equipment has become the focus of a growing number of scholars.
Compared with traditional equipment life prediction schemes, equipment health assessment based on automatic machine learning algorithms has gradually become the mainstream of modern enterprise research. Machine learning on equipment data with existing status labels is called supervised learning [2]. Machine learning on unfamiliar data without status labels is called unsupervised learning [3]. For supervised learning, scholars have proposed various improvement schemes based on traditional machine learning. Among them, Garrido et al. [4] proposed an improved SVM based on PSO. By optimizing the dynamic weight, the SVM can process nonlinear data, and the efficient optimization of PSO is then used to determine the parameters that maximize the classification effect of the improved SVM. Guha et al. [5] proposed an optimization algorithm based on a variety of Bayesian combined neural networks and improved its prediction accuracy through the evaluation of parameters.
In the field of unsupervised learning, Huang et al. [6] established an optimization algorithm for feature knowledge transfer and achieved good results. Huang et al. [7] established a machine learning model based on independent forests to label data sets and achieved good results. Kwon et al. [8] proposed an unsupervised fault diagnosis method that realizes fault detection and location with a classification model trained on normal samples. McLeay et al. [9] adopted an unsupervised learning method for equipment fault detection and demonstrated its effectiveness.
Scholars have also studied unbalanced data. Among them, Li et al. [10] proposed a global information addition method that adds a few valuable sample points and achieved good results. Liu et al. [11] combined undersampling and oversampling for unbalanced data and used a voting random forest to enhance the classification effect. Liu et al. [12] proposed a rolling bearing life stage recognition method based on multi-classifier integration and weighted balanced distribution adaptation. Liu et al. [13] solved the problem that a single unit could not be tested due to unbalanced data in the field of box fault diagnosis and life prediction. Lv et al. [14] adopted the SMOTE algorithm to process unbalanced data and successfully predicted the health of an optical cable. Li et al. [15] proposed a unified framework and model for fusion prediction built on a denoising autoencoder and a deep CORAL network.
For noisy data, Peng et al. [16] established an improved combination model based on the support vector machine to predict the residual life of bearings. Wu et al. [17] established a support vector machine prediction model based on noise-reduced data. Yan et al. [18] proposed a life prediction method for hydraulic cylinders based on deep learning. The noisy data are reconstructed using the DAE algorithm, which avoids the problems the noise may cause, and the life of the hydraulic cylinder is successfully predicted. Bhouri et al. [19] and Maddu et al. [20] proposed processing methods for this kind of data but did not take into account the problems of unfamiliar and imbalanced samples.
Scholars have extensively researched supervised and unsupervised learning for handling noisy and unbalanced data. However, in practical industries, equipment life data obtained from testing are often incomplete: samples are insufficient, states are ambiguous, and the distribution is uneven. Existing studies focus on limited aspects of machine learning algorithm optimization, particularly in equipment life prediction. Few studies address imbalanced equipment health data, small sample sizes, and fuzzy labels together. To tackle these issues, this paper investigates incomplete data in industrial production and proposes a model for small, imbalanced, fuzzy, and noisy samples. The simulation stage calculates and analyzes sample health status and predicts equipment residual life based on root mean square (RMS) evaluation of equipment vibration. The aim is to simulate the characteristics of actual industrial data, enhance the efficiency of the K-Means algorithm using ACO, introduce sample addition rules and noise ratio rules to build an improved SVM set for classification, and ultimately achieve equipment status recognition and health prognosis.

HEALTH PROGNOSIS MODEL BASED ON IMPROVED ACO-K-MEANS AND MCS-SVM

Problem Description
In the context of intelligent industrial production systems, equipment performance has improved, reducing failure probability. However, interconnections among components can cause system breakdowns if a single component fails. Efficient fault detection and prediction with few and imbalanced samples are crucial in fault diagnosis and life prediction research. To tackle these issues, this paper uses various algorithms and models to enhance the classification performance on the samples. Fig. 1 illustrates the technical road map.

ACO-K-Means
In this paper, the core idea of improving K-Means is to limit the search range of the traditional algorithm using ACO, reducing complexity and enhancing clustering effectiveness. The improved algorithm employs fuzzy clustering, where the ant colony algorithm's search scope is determined based on the results of the initial fuzzy clustering. This approach reduces the time required for the ant colony to find the optimal solution. Additionally, the clustering effect is enhanced by optimizing the ant colony algorithm within the sample distribution area identified in the first fuzzy clustering.
The steps of the improved ant colony algorithm are as follows:
Step 1: Set the maximum number of iterations and initialize the iteration counter to 0. Initialize $\tau_{ij}$ and $\Delta\tau_{ij}$, and place m ants on the n vertices.
Step 2: Put the starting point of each ant k in its current solution set. Ant k moves to the next vertex j with probability $p_{ij}^{k}$, and that vertex is added to the current solution set, where k = 1, 2, ..., m.
Step 3: Calculate the objective function value $z_k$ of each ant, k = 1, 2, ..., m, and find the optimal solution. The objective function is given in Eq. (1), where $d_k$ is the distance from the current cluster center to all current sample points of the same class, calculated as the Euclidean distance.
Step 4: Update the track strength according to Eq. (2), where $\tau_{ij}$ is the pheromone concentration on path (i, j), $\Delta\tau_{ij}$ is the pheromone concentration increment, and $\rho$ is the persistence of the current track.
Step 5: Update the number of iterations. If the current number of iterations has not reached the maximum and all solutions found are different, return to Step 2. If the maximum number of iterations is reached or the same solution is found, output the optimal solution under Eq. (1), namely the cluster center.
Because actual industrial data contain noise, the labeled samples need further processing. Noise refers to samples of other classes that appear among samples of a given class. Noise easily blurs the boundaries between sample classes and degrades subsequent classifiers. To handle this, this paper sets a noise ratio α to analyze and screen each minority class. The noise ratio α is defined as follows.
The K-nearest neighbor idea is introduced into the noise ratio discrimination process. $N_M$ is the number of target-class samples among the K nearest neighbors, and $N_N$ is the number of non-target-class samples among the K nearest neighbors. A noise standard n is set, and x denotes a sample set.
Technical Gazette 31, 1(2024), 24-31
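The noise screening described above can be illustrated with a short sketch. The exact expression of α is not reproduced here, so the code assumes α = N_N / (N_M + N_N), that is, the fraction of the K nearest neighbors whose label differs from that of the sample; the function names `noise_ratio` and `flag_noise` and the toy data are hypothetical.

```python
import math

def noise_ratio(points, labels, index, k):
    """Fraction of the k nearest neighbours of points[index] with a different label.

    Assumed form (not taken from the paper): alpha = N_N / (N_M + N_N).
    """
    target = labels[index]
    # Distances from the chosen sample to every other sample, nearest first.
    dists = sorted(
        (math.dist(points[index], p), j)
        for j, p in enumerate(points)
        if j != index
    )
    neighbours = [j for _, j in dists[:k]]
    n_other = sum(1 for j in neighbours if labels[j] != target)  # N_N
    return n_other / len(neighbours)

def flag_noise(points, labels, k, n):
    """Indices of samples whose assumed noise ratio exceeds the noise standard n."""
    return [i for i in range(len(points))
            if noise_ratio(points, labels, i, k) > n]

# Toy data: two compact classes plus one "B" sample sitting inside class "A".
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
          (0.4, 0.6)]
labels = ["A"] * 5 + ["B"] * 5 + ["B"]
print(flag_noise(points, labels, k=4, n=0.5))  # [10]
```

Under this assumed form and K = 4, the ratio can only take the values 0, 0.25, 0.5, 0.75, and 1, so the noise standard effectively sets how many foreign neighbors a sample may have before it is screened out.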

MCS-SVM
Traditional SVM is a widely used binary classification algorithm. Its basic model is the linear classifier with the largest margin in the feature space. Its core idea is to find a hyperplane that divides the training set as required while maximizing the distance from the sample points to the hyperplane. The equation of the hyperplane is as follows:
$$w^{T}x + b = 0$$
where w is the normal vector of the hyperplane and b is the bias.
For unfamiliar samples with unknown labels, SVM cannot provide accurate classification results. Moreover, when the sample distribution is unbalanced, the algorithm struggles to meet the requirements. Hence, this paper proposes an improved algorithm that combines K-Means with SVM. By leveraging the clustering effect of K-Means and the excellent classification effect of SVM, the issue of unfamiliar samples can be avoided. To address the problem of uneven sample distribution, new sample rules are introduced to improve the distribution of the sample sets, thereby mitigating the impact of sample imbalance on the calculation results. Additionally, to handle multi-classification samples more effectively, SVM is introduced multiple times in this paper. Iterative rules are set to expand the classifier, which initially deals with two-class problems, to handle multi-classification problems.
When unbalanced samples are obtained, the first fuzzy classification of the samples is performed using SVM. It is assumed in this paper that the sample set contains at least two labels at this stage. If the obtained samples are completely unfamiliar, a K-Means clustering can be applied before the first fuzzy classification to obtain at least two labels. Based on the traditional SVM principle, the classification decision function of a linear SVM is as follows:
$$f(x) = \operatorname{sign}(w^{T}x + b)$$
where w is the normal vector and b is the bias of the separating hyperplane.
In the model, the sample points are first given a fuzzy classification following the above process. This first classification is relatively fuzzy and may be affected by noise samples. The error caused by noise is reduced by introducing slack variables, and the noise proportion coefficient introduced in the subsequent calculation further reduces the impact of noise.
For error analysis, the accuracy of the traditional algorithms and the accuracy of the improved algorithm are both computed on the original data, and the advantages and disadvantages of the proposed model are compared and analyzed.
With the ACO-K-Means algorithm, more equipment status labels are obtained, which avoids the disadvantage that traditional machine learning algorithms cannot quickly and effectively judge the equipment status when facing fuzzy samples. To obtain a better classification effect when the test set is input later, an unbalance analysis must be carried out on the clustered data. To ensure that the sample points of each state are relatively rich, this paper establishes a balanced proportion standard t and calculates the unbalance ratio β of the current sample,
where $X_{max}$ is the existing sample capacity under the label of the current target sample and $X_m$ is the sample size of the current target sample. The standard t can be set freely according to actual needs. If α > n and β ≥ t are satisfied at the same time, a new sample point is generated as follows:
$$x_{new} = x_i + \mathrm{rand}(0,1) \times (\hat{x} - x_i) \quad (9)$$
where $x_{new}$ is the new sample point, $\hat{x}$ is the sample point of the same class with the farthest Euclidean distance from the current cluster center, and $x_i$ is the cluster center of the sample points of the current category. This method avoids the error caused by unbalanced and noisy data in the subsequent state classification, so that the model can deal with the fuzzy and unbalanced state label data found in actual industry.
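The addition rule can be illustrated with a minimal sketch. It rests on two assumptions that are not confirmed by the text: the unbalance ratio is taken as β = X_m / X_max (size of the scarce class over the size of the largest class), and a new point is drawn on the segment between the cluster center and the farthest sample of the same class. The class sizes 12 and 22 are hypothetical values chosen only because they give β ≈ 0.545; the function names are illustrative.

```python
import math
import random

def unbalance_ratio(minority_size, majority_size):
    """Assumed form (not taken from the paper): beta = X_m / X_max."""
    return minority_size / majority_size

def add_samples(minority, majority_size, t, rng):
    """Append synthetic points to the scarce class until beta >= t."""
    samples = [tuple(p) for p in minority]
    dim = len(samples[0])
    while unbalance_ratio(len(samples), majority_size) < t:
        # Cluster centre of the current scarce class (recomputed each round,
        # which is a design choice of this sketch).
        centre = tuple(sum(p[d] for p in samples) / len(samples)
                       for d in range(dim))
        # Sample of the same class farthest from the centre.
        farthest = max(samples, key=lambda p: math.dist(p, centre))
        r = rng.random()  # rand(0, 1)
        new_point = tuple(c + r * (f - c) for c, f in zip(centre, farthest))
        samples.append(new_point)
    return samples

rng = random.Random(0)
scarce = [(float(i), float(i % 3)) for i in range(12)]  # hypothetical class
balanced = add_samples(scarce, majority_size=22, t=0.8, rng=rng)
print(len(scarce), "->", len(balanced))  # 12 -> 18
```

Because every new point is a convex combination of the center and an existing sample, the synthetic points stay inside the region already occupied by the class.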
After the sample set is labeled with the above method, an SVM group is introduced to judge the equipment health status of the test set. In actual industrial production, the health data of equipment usually follow a time series and show a certain growth trend rather than disorder. Assuming that the improved K-Means algorithm obtains n health states, n − 1 SVMs are imported from the SVM group. This step overcomes the disadvantage that a traditional SVM cannot handle multi-classification problems.
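One plausible way to arrange n − 1 binary classifiers over n ordered health states is a cascade, sketched below. The arrangement itself is an assumption, and the binary classifiers are represented by plain callables: the threshold rules in the example are hypothetical stand-ins for trained SVMs, and `cascade_predict` is an illustrative name.

```python
from typing import Callable, Sequence

def cascade_predict(x, states: Sequence[str],
                    classifiers: Sequence[Callable]) -> str:
    """Assign x to one of n ordered states using n - 1 binary classifiers.

    classifiers[i](x) returns True when x belongs to states[i]
    rather than to any later state in the ordering.
    """
    if len(classifiers) != len(states) - 1:
        raise ValueError("need exactly n - 1 binary classifiers for n states")
    for state, is_member in zip(states, classifiers):
        if is_member(x):
            return state
    # No classifier claimed the sample, so it falls into the last state.
    return states[-1]

states = ["good", "medium", "poor", "bad"]
# Hypothetical stand-ins: each rule thresholds a scalar health indicator.
stand_ins = [lambda v: v < 2.0, lambda v: v < 3.5, lambda v: v < 5.0]

print(cascade_predict(4.2, states, stand_ins))  # poor
```

With four states the cascade uses three classifiers, which matches the n − 1 count stated above.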

Health Prognosis Process Based on ACO-K-Means and MCS-SVM
The health prognosis process based on ACO-K-Means and MCS-SVM algorithm is as follows.
Step 1: Input equipment data.There are at least two kinds of equipment health status.The experimental data is divided into training set and test set according to 2:1.
Step 2: SVM is used for the first fuzzy classification of the dataset, and the first classification result is obtained.
Step 3: Following the principle of K-Means algorithm, the first fuzzy clustering algorithm is used to compress the search area of the optimization algorithm, and then the ant colony algorithm and K-Means algorithm are used to cluster the classified sample points to obtain the qualified equipment status label.
Step 4: Introduce the noise standard n and the balance proportion standard t, and calculate the noise ratio α and the unbalance ratio β of the current target sample. Add sample points to the clustered classes according to the established rules.
Step 5: Use the data set with known labels to classify the sample points, and introduce the SVM set. If the number of equipment status labels is n, the number of SVMs is n − 1. Complete the classification of sample points.
Step 6: Output the results, judge the equipment health status, fit the equipment health development trend and predict the future life of the equipment.
The model mitigates the impact of unfamiliar samples, sample imbalance, and sample noise while maintaining high computing speed with small sample sizes. It overcomes the shortcomings of traditional algorithms and the data defects left unaddressed by the studies reviewed in the introduction, and provides a theoretical basis for practical prediction in enterprises.

CASE STUDY

Data Source
In this paper, a hydraulic pump from the American company Caterpillar is used for the simulation experiment. During collection, the health status of the hydraulic pump is mainly reflected by the bearing vibration data. The strain degree of the equipment can be calculated by observing the vibration frequency of the hydraulic pump bearing. Experimental material of 20 to 80 mg was added to the hydraulic pump, and vibration data were collected every 10 minutes. According to the characteristics of the data, the collected data were divided into four stages: bad, poor, medium and good. The sample point distribution of each stage is determined by the improved K-Means algorithm, and the data index for evaluating each state is determined by the corresponding SVM. There is no failure risk in the medium and good states, and there is failure risk in the bad and poor states.

Health Status Identification
As noted above, the medium and good states carry no failure risk, while the bad and poor states carry failure risk. Tab. 1 lists the number of hydraulic pump samples with and without failure risk. About 2/3 of the data is used for training the SVM, and about 1/3 is used for testing. First, the collected data sets with and without failure risk are visualized. Because a large amount of vibration data was collected, this paper randomly selects vibration data in two directions for demonstration. The visualized data is shown in Fig. 2.
Blue is the sample distribution without failure risk, and red is the sample distribution with failure risk.
In actual industry, the data obtained often contain noise. In Fig. 2, several groups of noise are clearly visible. The influence of noise is ignored when SVM is introduced for the first time. SVM is applied to the sample set for the first time, and the result is shown in Fig. 3, where the blue line is the calculated classifier and the sample points circled in red are the support vectors. The distribution of sample points with and without failure risk is obtained from this first SVM classification. The figure also contains noise samples caused by manual measurement or instrument error. The noise scale coefficient is introduced in the subsequent experiments to process these noise sample points.
The improved K-Means algorithm is introduced based on the results of the first rough classification, and the clustering effect of the original algorithm is improved by combining it with the ant colony algorithm. According to the available information, the health status of the equipment in this example shows four stages: bad, poor, medium and good. Therefore, the number of classes in the final clustering result is 4. On this basis, the ant colony algorithm and the K-Means algorithm are introduced to cluster the sample points after the first rough classification.
To show the advantages of combining the ant colony algorithm with the K-Means algorithm, it is first compared with the traditional K-Means algorithm. Fig. 4 shows the clustering effect of the traditional K-Means algorithm, where each red "×" is a cluster center found by traditional K-Means.
The K-Means algorithm optimized by the ant colony algorithm is then used to cluster the data set. First, the sample set undergoes a first fuzzy clustering to determine the search range of the ant colony. To ensure the optimization effect and shorten the search time of the ant colony, this paper compresses the search range of the ant colony to the maximum distribution area of each class of sample points after the first fuzzy clustering, that is, the area bounded by the extreme values of the coordinates of the sample points in the same cluster.
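The compressed search range can be sketched as an axis-aligned box spanned by the coordinate extremes of one cluster. In the sketch below, candidate centers are then scored by the sum of Euclidean distances to the cluster's samples. Note that the search itself is a plain random search used as a stand-in for the ant colony search, and the function names and toy cluster are hypothetical.

```python
import math
import random

def bounding_box(points):
    """Coordinate-wise minima and maxima of one cluster's sample points."""
    dim = len(points[0])
    lows = tuple(min(p[d] for p in points) for d in range(dim))
    highs = tuple(max(p[d] for p in points) for d in range(dim))
    return lows, highs

def objective(centre, points):
    """Sum of Euclidean distances from a candidate centre to the samples."""
    return sum(math.dist(centre, p) for p in points)

def search_centre(points, n_candidates, rng):
    """Score random candidates inside the box and keep the best one.

    Random search stands in for the ant colony search in this sketch.
    """
    lows, highs = bounding_box(points)
    best, best_cost = None, float("inf")
    for _ in range(n_candidates):
        cand = tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        cost = objective(cand, points)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost

cluster = [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0), (1.5, 1.5)]  # toy cluster
print(bounding_box(cluster))  # ((0.0, 0.0), (2.0, 3.0))
centre, cost = search_centre(cluster, n_candidates=200, rng=random.Random(0))
```

Restricting candidates to the box means no evaluation is spent on regions that contain no samples of the cluster, which is the motivation given for compressing the search range.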
The extreme values of the sample point distribution of each health state are obtained by calculation and listed in Tab. 2. The ant colony algorithm is introduced into the search range according to the specified rules. The maximum number of iterations is set to 100, and the iteration counter is initialized to 0. At the same time, $\tau_{ij}$ and $\Delta\tau_{ij}$ are initialized, and 100 ants are placed on the 4 extreme vertices of each health state obtained from the first fuzzy classification. The cluster centers of the four health states found by the search are shown in Tab. 3.
To compare the clustering effect of the traditional K-Means algorithm and the K-Means algorithm optimized by the ant colony algorithm, this paper selects the Dunn index as the evaluation index. The Dunn index is defined as follows:
$$DI = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{k} \operatorname{diam}(C_k)}$$
where $d(C_i, C_j)$ is the distance between clusters $C_i$ and $C_j$ and $\operatorname{diam}(C_k)$ is the diameter of cluster $C_k$. The larger the Dunn index (DI), the better the clustering effect of the corresponding algorithm.
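As an illustration of the evaluation index, the sketch below computes one common variant of the Dunn index: the smallest distance between points of different clusters divided by the largest distance between points of the same cluster. Other variants (for example, using centroid distances) exist, and which one is used in this paper is not stated, so the choice here is an assumption; the toy clusters are hypothetical.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Dunn index for a list of clusters, each a list of coordinate tuples."""
    # Largest intra-cluster diameter (0.0 for a single-point cluster).
    max_diameter = max(
        max((math.dist(a, b) for a, b in combinations(c, 2)), default=0.0)
        for c in clusters
    )
    # Smallest distance between two points that lie in different clusters.
    min_separation = min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    return min_separation / max_diameter

tight = [[(0, 0), (0, 1)], [(3, 0), (3, 1)]]
loose = [[(0, 0), (0, 2)], [(3, 0), (3, 2)]]
print(dunn_index(tight))  # 3.0
print(dunn_index(loose))  # 1.5
```

The tighter clustering scores higher, in line with the statement that a larger DI indicates a better clustering effect. The function divides by the largest diameter, so it is undefined when every cluster contains a single point.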
By calculation, the Dunn index of the traditional K-Means algorithm and of the K-Means algorithm optimized by the ant colony algorithm proposed in this paper, together with the average Dunn index over the health states, is obtained. The values and comparison results are shown in Tab. 4.
The table shows that the K-Means algorithm improved with the ant colony algorithm achieves a better clustering effect while remaining fast.
On inspection, once the noise caused by human or instrument error is set aside, the sample points in the bad state are unbalanced. K = 4 is chosen by evaluation, and n = 0.1 is set. Eq. (3) to Eq. (5) are used to calculate the noise ratio of each sample point, which is then compared with 0.1.
Finally, three noise sample points were identified, and these sample points are ignored in the subsequent calculation.
The unbalance ratio of the target sample at this point is β = 0.545. With t = 0.8, sample points are added according to Eq. (9) until β ≥ t.
The SVM set is then introduced. At this point, the equipment health status presents four stages, so three SVMs are introduced. The final classification results are shown in Fig. 5 to Fig. 7.
Clustering algorithms are employed to label the samples before traditional machine learning algorithms are used for calculation. K-Means and improved K-Means are both used for clustering to control variables and facilitate comparison. The results indicate that the recognition accuracy of traditional machine learning algorithm combinations is lower than that of ACO-K-Means combined with the corresponding classification algorithms. The comparative analysis demonstrates that the proposed combination of ACO-K-Means and MCS-SVM exhibits superior recognition accuracy. The addition of new sample points helps mitigate errors resulting from sample imbalance and overfitting due to limited samples.
Regarding cost and algorithm complexity, the proposed model achieves calculation speeds comparable to the other algorithms listed in Tab. 5 while reaching higher accuracy and operating on relatively small data sets. Thus, the proposed model compares favorably in terms of both cost and algorithm complexity.

Health Prognosis Results
The health status of the hydraulic pump is mainly reflected by the RMS value (root mean square of vibration) calculated from its bearing vibration data. The RMS value can be calculated and analyzed to quickly determine the health status of the equipment and predict its remaining life, providing a reference for enterprises deciding whether to continue using the hydraulic pump. The RMS is calculated as follows:
$$\mathrm{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^{2}}$$
where $x_i$ is the i-th vibration sample and N is the number of samples.
In Fig. 8, the abscissa represents the test time of the equipment, and the ordinate represents the RMS value. When the RMS value approaches 5, the equipment enters the health decline period; the corresponding time point is 25. From this point on, the equipment is at risk of failure. However, at time point 10, an outlier is observed in the RMS distribution that does not match the operating trend of equipment in actual industry. Therefore, the data at this point should be disregarded when using the RMS distribution chart to predict the service life of the hydraulic pump. The subsequent analysis focuses on data points with RMS values greater than 5 to predict the remaining equipment life at this stage. Data points with RMS values less than 5 are not considered. The relevant equipment data for prediction are listed in the following table.
The data in the table are used to calculate the remaining life of the equipment. The maximum service time of the equipment is the 440th minute of the test. The remaining life is the difference between the maximum service time of the test and the current service time of the test. The calculated residual life and the corresponding data points are analyzed to fit the residual life curve of the equipment. First, ACO-K-Means-MCS-SVM is used to fit the change trend of the RMS of the equipment.
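The two quantities used in this subsection can be illustrated with a few lines of code: the RMS of a vibration record, and the remaining life as the gap between the maximum service time (the 440th minute) and the current service time. The function names, the sample signal, and the example time of 250 minutes are hypothetical; the threshold of 5 is the RMS level discussed above.

```python
import math

def rms(signal):
    """Root mean square of a sequence of vibration amplitudes."""
    if not signal:
        raise ValueError("signal must not be empty")
    return math.sqrt(sum(v * v for v in signal) / len(signal))

def remaining_life(current_minute, max_service_minute=440.0):
    """Remaining life as maximum service time minus current service time."""
    return max_service_minute - current_minute

def at_risk(rms_value, threshold=5.0):
    """True when the RMS value exceeds the failure-risk level."""
    return rms_value > threshold

print(rms([3.0, -3.0, 3.0, -3.0]))  # 3.0
print(remaining_life(250.0))        # 190.0
print(at_risk(5.2))                 # True
```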
The comparison between the fitted curve and the real values is shown in Fig. 9. In actual industry, it is impossible to eliminate all noise, so some abnormal values are retained in the simulation to keep the fitted curve realistic.
Second, the future trend of the equipment is predicted from the RMS and the detection time, as shown in Fig. 10.
The comparison shows that the predicted RUL after data point 10 coincides with the true value. Because of the rule for adding new sample points in ACO-K-Means-MCS-SVM, the predicted RUL before data point 10 has a certain error. However, as detection progresses, the error gradually decreases until it disappears.

Figure 10 Comparison between predicted RUL and true value of equipment
According to the above prediction results of RMS and RUL, the model proposed in this paper is still highly effective in life prediction.The model proposed in this paper has high practicability and can meet the requirements of enterprises for the accuracy of equipment condition analysis and life prediction in the case of small samples, unbalanced samples and fuzzy sample labels that are likely to occur in the actual industry.At the same time, the remaining service life of the equipment can provide the basis for the enterprise to replace the equipment in advance and avoid the economic losses caused by the equipment life judgment error in actual production.

CONCLUSION
In this paper, samples are classified into 'with failure risk' and 'without failure risk' using SVM fuzzy classification.An improved K-Means algorithm is then employed to label the fuzzy dataset for determining equipment health status.The ant colony algorithm is introduced to enhance the K-Means algorithm by reducing the search range through fuzzy clustering.It efficiently finds cluster centers meeting the conditions.To achieve multi-classification and mitigate errors caused by sample imbalance and noise, this paper introduces noise proportion coefficient, noise neglect rule, sample imbalance proportion coefficient, and a new rule.These rules prevent errors from noise points and unbalanced samples.Subsequently, the SVM set is utilized to judge the health status of future equipment health data.ACO-K-Means demonstrates superior clustering while MCS-SVM exhibits better classification accuracy.Moreover, the combined ACO-K-Means and MCS-SVM model effectively predicts the future development trend of equipment lifetime and provides a reference for equipment replacement.The proposed model performs well in datasets with small sample imbalance and fuzzy labels, while mitigating interference from noisy data.
This paper employs multiple models and algorithms to establish suitable models for handling noise and imbalanced samples, prioritizing efficiency and minimal resource consumption, resulting in high experimental accuracy.However, the performance of the proposed model is only validated through examples, lacking comprehensive evaluation across other datasets and metrics.Future research will focus on refining parameter selection and tuning methods, conducting extensive experimental validation, and testing the model's robustness with additional datasets.These efforts aim to further enhance the model's performance and interpretability.

Figure 1 Technology roadmap based on ACO-K-Means and MCS-SVM algorithm

Figure 2 Visualization of vibration data

Figure 4 Cluster effect of traditional K-Means algorithm

Figure 5 Classification diagram of good to medium status

Figure 7 Classification diagram of poor to bad status
The comparison results between the ACO-K-Means combined with MCS-SVM algorithm proposed in this paper and traditional machine learning algorithms are shown in Tab. 5. Because the original data set lacks status labels, traditional K-Means is used to label the sample set first when comparing with other machine learning algorithms. For two-dimensional classification problems, besides the SVM mentioned in this paper, the KNN algorithm is one of the best-performing machine learning algorithms for simple classification. Therefore, this paper combines traditional K-Means and ant colony optimization K-Means with SVM, KNN, and multi-class self-adding SVM, and compares their classification accuracy to demonstrate the superiority of the proposed algorithm.

Fig. 8 shows the fitted RMS change trend of the hydraulic pump.

Figure 8 Change trend of hydraulic pump RMS

Figure 9 Comparison between real RMS and fitting line of equipment
By comparing the real RMS of the equipment with the fitted trend line, it can be concluded that the model proposed in this paper can accurately predict the future RMS of the equipment. There are some abnormal values in the figure due to manual or instrument measurement error.

Table 1 Hydraulic pump data set distribution table

Table 2 Extreme values of activity areas of equipment health status samples

Table 3 Display table of K-Means clustering center optimized by ant colony algorithm

Table 4 Comparison of clustering effects between the traditional K-Means algorithm and the ant colony optimization K-Means algorithm in this paper

Table 5 Comparison of classification effects of various models

Table 6 RMS and time distribution of equipment under failure risk