Radar-based Hail-producing Storm Detection Using Positive Unlabeled Classification

Machine learning methods have been widely used in many fields of weather forecasting. However, some severe weather events, such as hailstorms, are difficult to record completely and accurately. The resulting inaccurate data sets degrade the performance of machine-learning-based forecasting models. In this paper, a weather-radar-based hail-producing storm detection method is proposed. The method uses the bagging class-weighted support vector machine to learn from partly labeled hail case data and the remaining unlabeled data, with features extracted from radar and sounding data. Real case data from three radars in North China are used for evaluation. Results suggest that the proposed method improves both the forecast accuracy and the forecast lead time compared with the commonly used radar parameter methods. In addition, the proposed method performs at least as well as the supervised learning model in all tested situations, and its advantage grows when the number of positive samples contaminating the unlabeled set is large.


INTRODUCTION
Hailstorms can cause severe damage to buildings, crops, vehicles, and other personal property. Because hailstorms are short-lived and small in spatial scale, their detection and nowcasting remain challenging. Currently, the most accurate hail forecasting methods rely on weather radars, which generate high-resolution volume data by scanning at multiple elevations.
Machine learning methods, which learn from data and then make decisions without being explicitly programmed, have proven effective in severe weather forecasting [14][15][16][17][18][19]. Nonlinear models such as the support vector machine (SVM) and random forest (RF) can learn decision boundaries for problems that are not linearly separable. Therefore, in theory, machine learning models could be used to distinguish hail storms from no-hail storms by using radar parameters as features.
However, in practice, the performance of machine learning models depends strongly on data quality. Some weather data sets are difficult or expensive to acquire, while others are not very accurate. Hail case data belong to this kind. In many countries, the most reliable hail case data come from hail reports manually recorded by meteorological observation stations. Some areas use hailpads to automate or semi-automate the recording process [20][21][22][23]. Recently, some novel methods for collecting weather cases have been proposed. For example, the NOAA National Severe Storms Laboratory uses a mobile phone application named "mPing" to collect crowdsourced weather reports [24], and data mining techniques have been used to crawl severe weather records from social networking sites such as Twitter [25]. Either way, the time and place of hail occurrence have a strong impact on whether a case is recorded correctly. Another problem is that many hail reports are recorded in various text formats, which are not easy to convert into the structured data needed by machine learning models.
Even if the hail case data are correct, matching the radar data to the corresponding weather cases is still very costly. Supervised learning in particular requires both the positive sample set and the negative sample set to be well labeled. In hail detection, however, hail cases are easily missed, so treating all samples without a hail record as negative samples contaminates the negative sample set with a large number of false negatives. This may substantially degrade classification performance.
Given this background, supervised classifiers may have limited performance in hail storm detection. Training a classification model for a specific geographical area requires a large amount of historical data collection and labeling work. Moreover, a classification model is generally not universal because climate and topography differ between geographical areas, so applying the same model to other regions still requires much work. Therefore, reducing the cost of data labeling and processing is critical to applying machine-learning-based models to operational weather forecasting.
Weakly supervised learning refers to a class of models that attempt to learn from weakly supervised data [26]. Weak supervision can be divided into three categories: incomplete, inexact, and inaccurate supervision. Incomplete supervision means that only a subset of the training set is labeled. In inexact supervision, only coarse-grained labels are given. In inaccurate supervision, the given labels are not always ground truth. Incomplete supervision in turn includes semi-supervised learning [27][28][29][30][31], active learning [32][33][34], and transfer learning [35,36].
In the task of hail storm classification, one can easily obtain part of the accurate case data from the hail reports, but accurately labeling the whole data set is costly. A feasible solution is to train a weakly supervised classifier on a data set in which only a subset of the positive samples is labeled. This is a typical positive unlabeled learning (PU learning) problem, one of the incomplete supervision settings. Unlike supervised learning, which uses a fully labeled positive training set P and a negative training set N, PU learning requires only a positive training set P, which contains the partly labeled hail case data, and an unlabeled set U, which contains all of the other data.
There are various categories of approaches to solving PU learning problems [37]: (i) approaches that identify likely negative data in the unlabeled set using heuristic methods and then perform supervised learning [38][39][40][41][42], (ii) approaches that regard the unlabeled set as the negative set but introduce biased weights into the classification model so that misclassified positive instances are penalized more than misclassified unlabeled instances [39][43][44][45], (iii) approaches that treat the PU learning problem as a one-class learning problem and learn from positive samples only [46][47][48][49], and (iv) approaches that use bootstrap methods to build aggregate classifiers based on positive and unlabeled samples [50,51].
In this paper, a machine-learning-based hail-producing storm detection method for hail forecasting is presented. The machine learning model is designed according to the characteristics of the samples and the problem. The features are extracted from radar and sounding parameters based on operational forecasting experience and convective physical processes. One of the state-of-the-art PU classification models, the bagging class-weighted SVM, is used as the classification model in order to alleviate the problem that hail cases cannot be fully recorded. The method is then validated on real historical data against the classic radar parameter methods and against a method using supervised classification.
The paper is structured as follows: Section 2 describes the data used in this paper and the proposed method. Section 3 presents and discusses the validation results. Section 4 briefly draws the main conclusions.

DATA AND METHODOLOGY

Data Sources
The data used in this paper include Doppler weather radar data, radiosonde sounding data, and severe weather observational data. Because the type of severe weather varies with region, topography, and season, and in order to keep these effects out of the model parameters, we focus the study on the convective seasons of the Beijing-Tianjin-Hebei region in North China. The radar data are generated by three single-polarization S-band radars deployed in Tianjin, Beijing, and Shijiazhuang. The radars perform a volume scan every six minutes, and each volume scan includes nine elevations. The resolution of the generated plan position indicator (PPI) image is 1 × 1 km. The sounding data come from the nearby radiosonde stations and are acquired twice a day, at 0000 UTC and 1200 UTC. Each case uses the latest sounding available before it, and the value at each grid point is obtained by bilinear interpolation.
The hail case data come from the hail reports of the manual observation stations provided by the China Meteorological Administration. The manual observation stations record hail cases based on visual observation of hailstones of any size. From the hail reports from 2011 to 2015, we extracted 146 hail cases that produced hailstones larger than 10 mm and occurred within the coverage of the radars. All of these cases have clear records of time and location. If a hail case is detected by two or more radars at the same time, only the nearest radar is used. The geographical locations of the radars and manual observation stations are shown in Fig. 1.

Figure 1
The geographical locations of the radars and observation stations. The yellow circles represent the scan ranges of the radars.

Data Handling
All the algorithms in this study operate on convective cells. To identify convective cells, we use a modified SCIT method [52], which uses a border following algorithm [53] to extract 2D components from PPI images instead of from radial images. After the convective cells are identified, a cell is labeled as a hail-producing cell if it is located above a manual observation station that reports hail during the recording period. Since the recorded time often lags behind the actual time of hail fall, and since the forecast lead time should be considered, we tracked each such cell backward to the time step at which the severe convective cell first appears and labeled all cells in the same hail process as positive samples. In total, 1521 convective cells are labeled as hail-producing samples.
Although one can train the PU classifier with the positive and unlabeled sets only, a refined negative set is still needed to evaluate the performance of the classifier. It is not feasible to regard all convective cells that are not recorded in the hail reports as negative samples, because the number of no-hail storms is too large and the hail reports miss some cases. We therefore prepared the negative set as follows. First, we identified all the convective cells in the base data between 0000 UTC and 1000 UTC. We chose this period because it is daytime in North China, when hail cases are less likely to be missed. Second, we kept only the convective cells whose maximum reflectivity is larger than 45 dBZ. According to the local historical cases, convective cells that do not meet this condition hardly ever produce hail. Third, we removed all the convective cells that were related to any hail report or were far away from any observation station.
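The three filtering steps above can be sketched as a simple predicate over cell records. This is only an illustration: the `Cell` record, its field names, and the 20 km station distance cutoff are hypothetical, since the text does not specify how "far away from any observation station" is quantified.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical record for one identified convective cell."""
    hour_utc: float               # scan time as hour of day, in UTC
    max_reflectivity_dbz: float   # maximum reflectivity of the cell
    linked_to_hail_report: bool   # True if the cell is related to any hail report
    km_to_nearest_station: float  # distance to the nearest manual station


def is_refined_negative(cell, max_station_km=20.0):
    """Apply the three filtering steps; max_station_km is an assumed cutoff."""
    in_daytime_window = 0.0 <= cell.hour_utc <= 10.0              # step 1
    strong_enough = cell.max_reflectivity_dbz > 45.0              # step 2
    unreported = not cell.linked_to_hail_report                   # step 3
    near_station = cell.km_to_nearest_station <= max_station_km   # step 3
    return in_daytime_window and strong_enough and unreported and near_station


cells = [
    Cell(3.5, 52.0, False, 8.0),   # passes every step
    Cell(14.0, 55.0, False, 5.0),  # outside 0000-1000 UTC
    Cell(6.0, 40.0, False, 5.0),   # reflectivity not above 45 dBZ
    Cell(6.0, 58.0, True, 5.0),    # related to a hail report
    Cell(6.0, 50.0, False, 60.0),  # too far from any station
]
negatives = [c for c in cells if is_refined_negative(c)]
print(len(negatives))
```

Keeping each condition as a named boolean makes it easy to report how many cells each step removes.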
After the above processing, we obtain a negative sample set that is as clean as possible. However, this data set is still too large. In addition, we want to test the model trained on an unlabeled sample set with a high ratio of positive samples to see its performance under extreme conditions. We therefore randomly extracted 13689 samples from it, nine times the number of hail samples. When training the PU classifier, we randomly mix positive samples into the negative sample set to obtain an artificially generated unlabeled sample set. Because the actual label of each sample is known, we can verify whether each sample is correctly classified.
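A minimal sketch of how such a simulated unlabeled set could be assembled is given below. It assumes that the contamination rate is the fraction of the unlabeled set made up of hidden positive samples; the text does not state the exact definition, and the function name and arguments are hypothetical.

```python
import random


def build_unlabeled_set(negatives, spare_positives, contamination, seed=0):
    """Mix hidden positives into the clean negatives.

    Assumption: contamination = hidden positives / size of the unlabeled set.
    Returns the shuffled samples and their true labels (1 = hail, 0 = no hail),
    which are kept only for evaluation and never shown to the PU classifier.
    """
    if not 0.0 <= contamination < 1.0:
        raise ValueError("contamination must be in [0, 1)")
    n_hidden = round(contamination * len(negatives) / (1.0 - contamination))
    if n_hidden > len(spare_positives):
        raise ValueError("not enough positive samples for this contamination")
    rng = random.Random(seed)
    hidden = rng.sample(spare_positives, n_hidden)
    mixed = [(x, 0) for x in negatives] + [(x, 1) for x in hidden]
    rng.shuffle(mixed)
    samples = [x for x, _ in mixed]
    true_labels = [y for _, y in mixed]
    return samples, true_labels


# Toy identifiers stand in for feature vectors.
samples, true_labels = build_unlabeled_set(
    negatives=list(range(900)),
    spare_positives=list(range(1000, 1200)),
    contamination=0.10,
)
print(len(samples), sum(true_labels))
```

Returning the true labels alongside the samples mirrors the point made above: the labels are hidden during training but available for verification.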

Features
Features are crucial for machine learning models. In this study, we divide the features used in the classification model into two groups: main features and auxiliary features. The main features are the classic radar parameters that can be used independently for hail detection, including the maximum radar reflectivity in a vertical column (Zmax), the Waldvogel parameter, VIL density, and SHI. As these parameters have been proven effective [7][54][55][56][57][58], we do not need to verify their importance as features for classification. Therefore, the values of the main features are input directly into the classification model after standardization. The main features are briefly introduced as follows. Zmax is the most straightforward criterion: it predicts the presence of hail if the maximum reflectivity in a vertical column exceeds a certain threshold.
The Waldvogel parameter was proposed in [2]. It predicts hail if the vertical distance between the height of the $R_W$ dBZ echo top, $H_{R_W}$, and the height of the melting layer, $H_0$, is greater than or equal to a threshold $H_T$:

$$H_{R_W} - H_0 \geq H_T \quad (1)$$

Initially, the reflectivity threshold $R_W$ is 45 dBZ and the height threshold $H_T$ is 1.4 km.
VIL density was proposed in [4] to improve the warning of severe hail on the basis of VIL. It is defined as VIL divided by the radar echo top height $H_{ET}$:

$$VIL_d = \frac{VIL}{H_{ET}} \quad (2)$$

One form of VIL is given by:

$$VIL = \sum_i 3.44 \times 10^{-6} \left[\frac{Z_i + Z_{i+1}}{2}\right]^{4/7} \Delta h \quad (3)$$

where $Z_i$ and $Z_{i+1}$ are the radar reflectivity values at the lower and upper portions of the sampled layer, and $\Delta h$ is the vertical thickness of the layer.
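SHI (severe hail index) is defined as:

$$SHI = 0.1 \int_{H_0}^{H_T} W_T(H)\, \dot{E}\, dH \quad (4)$$

where $H_0$ is the height of the melting layer, $H_T$ is the height of the storm top, $W_T(H)$ is the temperature-based weighting function, and $\dot{E}$ is the kinetic energy flux of the hailstones.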
Note that Zmax, the Waldvogel parameter, and VIL density are grid-based parameters, which are calculated at a point or in its neighborhood, whereas our algorithm is cell-oriented. They therefore need to be converted into cell-based parameters. Cell-based VIL is defined in [52] by vertically integrating a three-gate-averaged maximum reflectivity at each level through the depth of the storm. Based on it, [59] defines cell-based VILd as the ratio of the cell-based VIL to the storm top. The Zmax of a 3D convective cell is the 27-grid-averaged maximum reflectivity inside the convective cell, and the Waldvogel parameter of a convective cell is the vertical distance between the storm top and the melting layer.
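As an illustration of the radar parameters described above, the following sketch evaluates the Waldvogel criterion, VIL, and VIL density for a single vertical reflectivity profile. It is not the cell-based implementation used in this study. It assumes the commonly used VIL form with the coefficient 3.44e-6 and exponent 4/7, reflectivity given in dBZ, heights in the stated units, and a factor of 1000 to express VIL density in g m^-3. All function names are hypothetical.

```python
def waldvogel_warning(echo_top_45dbz_km, melting_level_km, threshold_km=1.4):
    """True if the 45 dBZ echo top exceeds the melting level by >= threshold."""
    return echo_top_45dbz_km - melting_level_km >= threshold_km


def vil_kg_m2(reflectivity_dbz, layer_thickness_m):
    """VIL for a profile of reflectivity (dBZ) at successive levels.

    layer_thickness_m[i] is the thickness between level i and level i + 1.
    """
    # Convert dBZ to linear reflectivity factor Z (mm^6 m^-3).
    z_linear = [10.0 ** (dbz / 10.0) for dbz in reflectivity_dbz]
    total = 0.0
    for z_low, z_high, dh in zip(z_linear[:-1], z_linear[1:], layer_thickness_m):
        total += 3.44e-6 * ((z_low + z_high) / 2.0) ** (4.0 / 7.0) * dh
    return total


def vil_density_g_m3(vil, echo_top_m):
    """VIL divided by echo top height, converted from kg m^-3 to g m^-3."""
    return 1000.0 * vil / echo_top_m


profile_dbz = [50.0, 50.0, 50.0]   # three levels, two layers
thickness_m = [1000.0, 1000.0]
vil = vil_kg_m2(profile_dbz, thickness_m)
print(round(vil, 2), round(vil_density_g_m3(vil, 8000.0), 2))
print(waldvogel_warning(5.6, 4.0), waldvogel_warning(5.0, 4.0))
```

The cell-based variants would replace the single-column profile with the three-gate-averaged maximum reflectivity at each level of the cell.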
The auxiliary features are a set of radar and sounding products that may be related to hail storms. These values cannot be used for hail detection independently but may improve the performance of machine learning models. The design and selection of auxiliary features are based on operational forecasting experience and convective physical processes, and also benefit from previous studies [10,11,15][60][61][62][63][64][65]. The radar products and sounding products introduced as auxiliary features are listed in Tab. 1.
However, more features do not necessarily lead to better classification results. Because the training sample set is not large, a complicated model with too many features may have high variance. A model with high variance overfits noisy or unrepresentative training data, which degrades its performance [67,68]. Since no negative sample sets are available in the actual situation, it is hard to select useful features carefully. A feasible solution is to employ an unsupervised dimensionality reduction method such as principal component analysis (PCA) and to use the principal components as features. We performed PCA on the dataset, and the contribution rates of the top 10 principal components are shown in Fig. 2. The cumulative contribution rate of PC1 to PC7 reaches 89.85%, so we choose the first seven principal components as features for the bagging CWSVM classifier. An overview of the features is shown in Fig. 3.
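The selection of principal components by cumulative contribution rate can be sketched with NumPy as follows. This is a generic PCA on standardized features, not the exact procedure of this study. The synthetic data, the function name, and the 0.89 target are assumptions chosen for illustration; the paper simply keeps the first seven components.

```python
import numpy as np


def pca_by_cumulative_variance(features, target=0.89):
    """Standardize, run PCA via SVD, keep the fewest PCs reaching `target`."""
    x = np.asarray(features, dtype=float)
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    _, singular_values, components = np.linalg.svd(x, full_matrices=False)
    # Contribution rate of each PC = its share of the total variance.
    ratios = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratios)
    n_keep = min(int(np.searchsorted(cumulative, target)) + 1, len(ratios))
    scores = x @ components[:n_keep].T
    return scores, ratios, n_keep


# Synthetic stand-in: 12 correlated "auxiliary features" driven by 3 factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 12))
features = latent @ mixing + 0.05 * rng.normal(size=(200, 12))

scores, ratios, n_keep = pca_by_cumulative_variance(features, target=0.89)
print(n_keep, np.round(np.cumsum(ratios)[:n_keep], 3))
```

Because PCA needs no labels, the same transform can be fitted on positive and unlabeled samples together, which matches the constraint noted above.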

Model
The task of hail-producing storm detection can be cast as a binary classification problem, with hail-producing cells as positive samples and no-hail cells as negative samples. However, because hail cases cannot be completely recorded, it is more suitable to construct a PU classifier trained on partially labeled hail-producing cells and the remaining unlabeled cells. In this study, we picked one of the state-of-the-art PU learning models, the bagging class-weighted SVM (CWSVM) [50], for this task, using the features described above. As mentioned before, four categories of approaches can be used to solve PU learning problems. The bagging CWSVM combines two of them: biased weights and bootstrap. In short, the bagging CWSVM uses the class-weighted SVM as the base classifier and then applies bootstrap aggregating (bagging) to further reduce the variance caused by the randomness in the negative samples.
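Compared with the classic SVM, the CWSVM penalizes the misclassification of each class using an independent weight [40,68]. In the context of PU learning, the penalty weight $C_P$ of misclassified positive samples is larger than the penalty weight $C_U$ of misclassified unlabeled samples, because the unlabeled set that is assumed to be negative also contains positive data. The optimization problem is then:

$$\min_{\alpha,\, \xi,\, b}\ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) + C_P \sum_{i \in P} \xi_i + C_U \sum_{i \in U} \xi_i$$

$$\text{s.t.}\quad y_i \Big( \sum_{j=1}^{N} \alpha_j y_j K(x_i, x_j) + b \Big) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, N \quad (5)$$

with $\alpha \in \mathbb{R}^N$ the support values, $y \in \{-1, +1\}^N$ the label vector, $K(\cdot,\cdot)$ the kernel function, $b$ the bias term, and $\xi \in \mathbb{R}^N$ the slack variables.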
Bagging is an ensemble meta-algorithm for improving stability and accuracy, and it is often applied to high-variance models [69]. Bagging trains each sub-classifier on a subset drawn from the training set uniformly and with replacement, and then combines the predictions of the sub-classifiers. In PU learning, the bagging CWSVM draws a subset from U and combines it with the whole of P to form the training set of each CWSVM. Because U is "contaminated" by positive samples, each subsample contains a different portion of "contamination", which induces a large variability among the sub-classifiers. For this reason, bagging can improve the overall performance of PU learning.
The parameters to be tuned in the bagging CWSVM include the number of samples K drawn each time from the unlabeled set, the number of classifiers T used for bagging, and the penalty weights $C_P$ and $C_U$. Commonly, in PU learning, the penalty weights $C_P$ and $C_U$ are set so that the total penalty is equal for the two classes [70,71]:

$$C_P\, n_P = C_U\, K \quad (6)$$

where $n_P$ is the size of P. Since the ratio $n_P / K$ is fixed, only $C_P$ needs to be tuned.
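The following is a minimal sketch of the bagging CWSVM idea using scikit-learn, not the implementation used in this study. It assumes that per-class penalties can be expressed through the `class_weight` argument of `sklearn.svm.SVC` (which scales `C` per class), that the sub-classifiers are combined by averaging their decision values, and that $C_U$ follows the equal-total-penalty rule. The function name and the toy data are hypothetical, and the toy values of K, T, and $C_P$ are placeholders.

```python
import numpy as np
from sklearn.svm import SVC


def bagging_cwsvm_scores(x_pos, x_unl, x_test, n_draw, n_models, c_pos, seed=0):
    """Average decision values of T class-weighted SVMs (K = n_draw, T = n_models)."""
    rng = np.random.default_rng(seed)
    n_pos = len(x_pos)
    c_unl = c_pos * n_pos / n_draw  # equal total penalty: C_U * K = C_P * n_P
    labels = np.concatenate([np.ones(n_pos, dtype=int),
                             -np.ones(n_draw, dtype=int)])
    total = np.zeros(len(x_test))
    for _ in range(n_models):
        # Bootstrap K unlabeled samples (with replacement); keep all of P.
        idx = rng.choice(len(x_unl), size=n_draw, replace=True)
        x_train = np.vstack([x_pos, x_unl[idx]])
        model = SVC(kernel="rbf", gamma="auto", C=1.0,
                    class_weight={1: c_pos, -1: c_unl})
        model.fit(x_train, labels)
        total += model.decision_function(x_test)
    return total / n_models


# Toy data: two Gaussian blobs; 10% of the unlabeled set is hidden positives.
rng = np.random.default_rng(1)
x_pos = rng.normal(loc=2.0, size=(40, 2))
x_unl = np.vstack([rng.normal(loc=-2.0, size=(180, 2)),
                   rng.normal(loc=2.0, size=(20, 2))])
x_test = np.array([[2.0, 2.0], [-2.0, -2.0]])
scores = bagging_cwsvm_scores(x_pos, x_unl, x_test,
                              n_draw=80, n_models=10, c_pos=1.0)
print(scores)
```

Here `gamma="auto"` sets the RBF gamma to the reciprocal of the number of features. A higher averaged score indicates a sample that more sub-classifiers place on the positive side.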

EXPERIMENTS AND RESULTS
We conducted a series of experiments to answer the following questions: (i) How much does the PU learning method improve on the traditional radar parameter methods? (ii) How does the performance change with the ratio of hail samples contaminating the unlabeled set? (iii) What happens if a supervised classifier is used directly for positive and unlabeled classification? In other words, do we really need PU classification? (iv) What is the forecast lead time of the PU learning method?

Experiment Setup
The construction of the datasets in this study, summarized in Fig. 4, is somewhat more complicated than that used for evaluating supervised learning. As mentioned before, we prepared a refined no-hail sample set, and a certain proportion of positive samples must be used for contamination to construct the simulated unlabeled set. The labeled hail sample set is therefore divided into three parts: one for training, one for testing, and one for contaminating. In this step, we split the hail samples by case, because the features of hail samples from the same case may be similar; keeping each case within a single part prevents such similar samples from appearing in both training and testing and gives a fairer estimate of generalization. By random selection, 100 of the 146 hail cases are used for training and contaminating, and the rest are used for testing. Accordingly, 68% of the no-hail cases are added to the training set.
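A case-level split of this kind can be sketched as follows. The function name and the toy case identifiers are hypothetical; only the numbers 146 and 100 come from the text.

```python
import random


def split_by_case(cell_case_ids, n_train_cases, seed=0):
    """Split cell indices so that all cells of one case land in the same part."""
    case_ids = sorted(set(cell_case_ids))
    rng = random.Random(seed)
    train_cases = set(rng.sample(case_ids, n_train_cases))
    train_idx = [i for i, c in enumerate(cell_case_ids) if c in train_cases]
    test_idx = [i for i, c in enumerate(cell_case_ids) if c not in train_cases]
    return train_idx, test_idx, train_cases


# Toy data: 146 cases with between 1 and 5 cells each.
cell_case_ids = [case for case in range(146) for _ in range(1 + case % 5)]
train_idx, test_idx, train_cases = split_by_case(cell_case_ids, n_train_cases=100)
test_cases = {cell_case_ids[i] for i in test_idx}
print(len(train_cases), len(test_cases), len(train_idx), len(test_idx))
```

Sampling case identifiers rather than cell indices is what guarantees that no hail process contributes cells to both parts.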

Figure 4
The schematic representation of dataset construction. CV refers to cross-validation.
Part of the samples from the 100 cases is used to contaminate the unlabeled set, and this selection is made by cell. Since we also want to test the performance under different contamination ratios, the number of cells used for contamination differs between experiments. However, to keep the results of all experiments comparable, the samples in the positive set remain the same. There are a total of 1027 convective cells in the 100 hail cases used for training, and 540 of them are taken as positive samples.
The hyperparameters of the bagging CWSVM model comprise the number of classifiers T used for bagging, the number of samples K drawn from the unlabeled set, the positive class penalty weight $C_P$, and the other hyperparameters inherited from the SVM. In theory, the performance is monotonically non-decreasing in T. Although the training time increases with T, we set it to a large value, 200, since we focus only on performance. In addition, we found in our preliminary study that the gamma and the kernel type of the SVM have little effect on the final results. Therefore, we set gamma to the reciprocal of the number of features, which is a conventional choice, and use the radial basis function as the kernel function. Consequently, two hyperparameters remain to be determined. Because the number of samples is small, we also use the training set to tune them instead of using an independent validation set. During tuning, the contamination ratio of the training set is fixed at 10%, and the training set is split into four folds. We then conduct a grid search using 4-fold cross-validation to find the optimal parameter combination. The grid search shows that the optimal choice is $C_P$ = 100 and K = 2000.
After the datasets and hyperparameters are obtained, a series of experiments with different contamination rates is conducted. First, the proposed method is compared with three radar parameter methods: the Waldvogel parameter, VIL density, and SHI. The warning thresholds used in the radar parameter methods are obtained by statistics on the same training set, as shown in Fig. 5; the threshold that achieves the highest CSI is selected. Then, in order to demonstrate whether PU learning is necessary, we compare it with the classic two-class SVM. Finally, the forecast lead time is compared to see whether the proposed method can forecast earlier than the traditional methods.
The metrics used for evaluation include the area under the ROC curve (AUC), probability of detection (POD), false alarm ratio (FAR), and critical success index (CSI). AUC measures the entire two-dimensional area underneath the receiver operating characteristic (ROC) curve and thus provides an aggregate measure of performance across all possible classification thresholds. It is a commonly used metric for evaluating classifiers. The POD, FAR, and CSI are respectively defined as:

$$POD = \frac{TP}{TP + FN} \quad (7)$$

$$FAR = \frac{FP}{TP + FP} \quad (8)$$

and

$$CSI = \frac{TP}{TP + FN + FP} \quad (9)$$

where TP is the number of true positives, that is, the detected events; FN is the number of false negatives, that is, the missed events; and FP is the number of false positives, that is, the falsely alarmed non-events. These three metrics are commonly used in evaluating weather forecasting methods.
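The three contingency-table metrics, together with the selection of a warning threshold by maximum CSI described in the experiment setup, can be sketched as follows. FAR is computed here as FP / (TP + FP), which is an assumption consistent with CSI and with the use of precision in the performance diagram. The function names, the candidate thresholds, and the toy values are hypothetical.

```python
def contingency(y_true, y_pred):
    """Return (TP, FN, FP) for binary truth and prediction sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    return tp, fn, fp


def pod(tp, fn, fp):
    return tp / (tp + fn) if tp + fn else 0.0


def far(tp, fn, fp):
    return fp / (tp + fp) if tp + fp else 0.0


def csi(tp, fn, fp):
    return tp / (tp + fn + fp) if tp + fn + fp else 0.0


def best_csi_threshold(values, y_true, candidates):
    """Pick the warning threshold (warn if value >= threshold) with highest CSI."""
    def score(threshold):
        predictions = [v >= threshold for v in values]
        return csi(*contingency(y_true, predictions))
    return max(candidates, key=score)


# Toy radar-parameter values for six cells and their true hail labels.
values = [10, 20, 30, 40, 50, 60]
y_true = [0, 0, 1, 0, 1, 1]
threshold = best_csi_threshold(values, y_true, candidates=[15, 25, 35, 45, 55])
counts = contingency(y_true, [v >= threshold for v in values])
print(threshold, pod(*counts), far(*counts), csi(*counts))
```

In this toy case the threshold 25 gives three hits, one false alarm, and no misses, so it has the highest CSI among the candidates.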

Experiment Results
The performance diagram in Fig. 6 shows the POD and precision of the traditional radar parameter methods and of the proposed PU learning method trained on positive sets and on unlabeled sets with different contamination ratios. Among the three traditional radar parameters, the Waldvogel parameter and the SHI perform similarly, and both perform better than the VIL density. The proposed bagging CWSVM models trained on unlabeled sets with contamination rates of up to 10% significantly outperform the radar parameter methods, and the performance improves as the contamination rate decreases. The ROC curves in Fig. 7 show the same results.
The bar plots in Fig. 8 show the POD, FAR, CSI, and AUC of the different models. The warning thresholds of the three traditional radar parameter methods are selected using the same training set without contamination. The figure shows that the proposed model both detects more positive samples and produces fewer false alarms than the traditional methods. The contamination rate has little influence on the FAR, but contaminating the unlabeled set with more positive samples lowers the POD. When the contamination rate increases to 10%, the POD of the PU learning model drops to the same level as that of the Waldvogel parameter method and the VIL density method.
The changes in POD, FAR, CSI, and AUC of the bagging CWSVM and the SVM over different contamination rates are shown in Fig. 9. All the metrics of the bagging CWSVM except FAR are always equal to or better than those of the SVM, and the higher the contamination rate, the larger the difference between the two methods. When the contamination rate is 0%, the bagging CWSVM can be treated as a binary supervised learning model, and its performance is at the same level as that of the SVM, because its base classifier is also an SVM and bagging does not reduce the performance. Therefore, the bagging CWSVM can be used in any situation without worrying about whether the negative set is contaminated, and PU learning is necessary when the negative sample sets are not guaranteed to be clean.
The forecast lead time of SHI and of the bagging CWSVM with different contamination rates for the 46 test cases is shown in Fig. 10. Since the forecast lead times of the three traditional radar parameter methods differ little, only SHI is used for comparison. The proposed method improves the forecast lead time to some extent, although a high contamination rate may reduce this advantage. When there are no contaminated samples in the unlabeled set, the bagging CWSVM method forecasts each case 6 to 12 minutes earlier. When the contamination rate is 10%, the forecast lead time of the proposed method is at the same level as that of SHI.
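The text does not spell out how the forecast lead time is computed. The sketch below assumes that it is the interval between the first volume scan at which a method issues a warning for a tracked cell and the scan nearest to the reported hail time, so that it is a multiple of the six-minute scan interval. The function name and the example warning sequences are hypothetical.

```python
SCAN_INTERVAL_MIN = 6  # one volume scan every six minutes


def lead_time_minutes(warnings, report_index):
    """Minutes between the first warning and the reported hail time.

    warnings: per-scan booleans for one tracked case, ordered in time.
    report_index: index of the scan matching the reported hail time.
    Returns None if no warning is issued up to the report.
    """
    for i, warned in enumerate(warnings[: report_index + 1]):
        if warned:
            return (report_index - i) * SCAN_INTERVAL_MIN
    return None


# Hypothetical case with hail reported at scan index 6.
shi_warnings = [False, False, False, False, True, True, True]
pu_warnings = [False, False, False, True, True, True, True]
shi_lead = lead_time_minutes(shi_warnings, report_index=6)
pu_lead = lead_time_minutes(pu_warnings, report_index=6)
print(shi_lead, pu_lead, pu_lead - shi_lead)
```

Under this assumed definition, a gain of one or two scans corresponds to the 6 to 12 minutes mentioned above.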

CONCLUSION
In this paper, a radar-based hail-producing storm detection method based on positive unlabeled learning is proposed. The features used in the model are based on weather radar parameters and sounding parameters. Four radar parameters are input directly into the classifier, and the others are used after dimensionality reduction by PCA. The PU classifier used in this study is the bagging CWSVM, which trains many binary classifiers to discriminate the known positive examples from random subsamples of the unlabeled set and averages their predictions.
Real weather radar data from three radars deployed in North China were used to evaluate the proposed method. The results show that the proposed method outperforms all three radar parameter methods and improves the forecast lead time when the contamination rate in the unlabeled set is less than 10%. The comparison with the SVM demonstrates that the proposed method is not inferior to the supervised learning model at any tested contamination rate, and that its advantage becomes more substantial as the contamination rate increases. The proposed method is therefore well suited to hail-producing storm detection and potentially to other severe weather forecasting tasks. It can significantly reduce the amount of work required for modeling and makes it feasible to train a dedicated model for each region.
The model can be further improved. On the one hand, we used only radar parameters and radiosonde parameters as features in this work. Other values, such as products of numerical weather prediction, could also be used in the model. On the other hand, we did much data cleaning to make sure that the positive samples are correct, but hail reports are not always correct in practice. In that case, more robust PU learning models such as that of [51] are a better choice.