Unsupervised Text Topic-Related Gene Extraction for Large Unbalanced Datasets

Traditional unsupervised feature extraction algorithms commonly assume that the distribution of the different clusters in a dataset is balanced. However, when topic keywords are extracted from a large number of unmarked, unbalanced text datasets, feature selection is guided by the calculation of similarities among features. As a result, the selected features cannot truly reflect the information of the original dataset, which affects the subsequent performance of classifiers. To solve this problem, a new method of extracting unsupervised text topic-related genes is proposed in this paper. Firstly, a sample cluster group is obtained by factor analysis and a density peak algorithm, based on which the dataset is marked. Then, considering the influence of the unbalanced distribution of sample clusters on feature selection, the CHI statistical matrix feature selection method, which combines average local density and information entropy, is used to strengthen the features of low-density small-sample clusters. Finally, a related gene extraction method based on the exploration of high-order relevance in multidimensional statistical data is described, which uses independent component analysis to enhance the generalisability of the selected features. In this way, unsupervised text topic-related genes can be extracted from large unbalanced datasets. The experimental results suggest that the proposed method is better than existing methods at extracting text subject terms from low-density small-sample clusters, that it reaches good performance with fewer features, and that it has better feature dimension-reduction ability.


INTRODUCTION
As society gradually enters the age of "big data" [1], increasing amounts of information are available from webpages, microblogs, forums, multimedia files, etc. [2]. Meanwhile, the time available to read and process information is decreasing, so efficient and accurate information analysis is becoming an effective means of understanding large datasets and discovering value. Such analysis is applicable to public opinion monitoring and early warning on the internet, such as filtration of harmful information from networks, emotion analysis, personalized recommendations for products [3], etc. Moreover, during data processing, it is generally necessary to process a lot of data that is redundant or has uncorrelated features, which significantly reduces the efficiency of learning algorithms. Feature extraction is therefore a critical link in machine learning and data mining, as it has a direct impact on model building, analysis efficiency and accuracy.
At present, feature extraction may be classified as supervised or unsupervised [4]. In text content analysis, regardless of the class, a vector space model [5] is required to express the text in a vector space consisting of a certain quantity of feature words. This causes two inevitable issues in practical applications. (1) The distribution of sample categories (clusters) in the dataset is not balanced. Various measures have been used for feature subset quality evaluation, including independent correlation analysis [6], similarity analysis [7], distance-based measures such as Euclidean distance and Mahalanobis distance [8], and the most widely used techniques, the information entropy-based mutual information and information gain [9]. All of these techniques identify similarities between the sample categories (clusters) in the dataset. When most of the identified features come from the "big class", which contains most of the samples, and none or very few features come from the "small class", the most distinguishing features cannot be selected. The resulting subsets cannot accurately reflect the information of the entire sample space, which reduces the ability of subsequent learning methods to solve practical problems. (2) The subject to be processed becomes more complicated and the data dimensions increase rapidly due to the very large size of some datasets. Analyses of ultrahigh-dimensional datasets have high memory and computational requirements [10]. In high-dimensional feature spaces, feature points are strongly dependent on one another, which causes high redundancy and even noise. Hence, the generalisability of the features produced by traditional methods deteriorates sharply, and "empty space" arises in high-dimensional data space, making it difficult to solve multi-element density estimation problems.
It is increasingly important to extract the substantive characteristics of things from complicated information; i.e., to determine mutually independent and potentially hidden information, remove high-order redundancy, extract the genetic data of complete and independent subjects, and improve feature generalisability.
To overcome the defect that traditional feature extraction methods cannot truly reflect the information of the original dataset when the dataset is imbalanced, this paper proposes an unsupervised text topic-related gene extraction method (UTTGE). The contributions of this paper are as follows:
• The UTTGE method combines the factor analysis method with the density peak algorithm. Factor analysis is used to find the optimal low-dimensional basis describing the original high-dimensional vector space, which makes it possible for the density peak algorithm [11] to quickly find the sample clusters of large-scale datasets.
• The UTTGE method introduces average local density and information entropy [12] into the definition of feature item weighting, so as to construct the feature items' discrimination matrix for sample categories (clusters), which can eliminate the defects of traditional feature selection methods on uneven sample sets.
• The UTTGE method uses Independent Component Analysis (ICA) [13] for the topic mining tasks of unsupervised texts. By analyzing the high-order dependence between multi-dimensional statistics, it finds hidden information components that are mutually independent, and accurately selects the optimal feature subset that comprehensively and truly reflects the texts' topic information in imbalanced large-scale datasets. In this way, the classification and recognition of texts are improved.
The remainder of this paper is organised as follows. Section 2 reviews relevant work by Chinese and other scholars. Section 3 proposes a compatible, new, unsupervised, text topic-related gene extraction method based on an unsupervised clustering method for unbalanced datasets and text topic-related genes. Section 4 provides the experimental results and compares the new method's performance with other similar methods. Finally, we conclude our paper in Section 5.

RELATED WORK
At present, Chinese and other scholars have performed some research on the analysis of unbalanced datasets. Such issues are mainly solved in two ways: 1) optimisation of existing feature dimension-reduction methods and 2) rebalancing of the sample class distribution and improvement of classification algorithms. The core idea of class distribution rebalancing is data resampling. The more common resampling techniques include oversampling and undersampling. Chawla et al. [14] provided the synthetic minority over-sampling technique (SMOTE) and improved the generalisability of the oversampling method by artificial synthesis of small-class samples. However, this method requires a long training time and increases the possibility of sample redundancy. Chen et al. [15] provided a step-by-step optimisation-based anti-random undersampling algorithm. This algorithm can remove noise and repeated information from training samples and make the classifier more suitable for small samples. In contrast, improvement of the classification algorithm is not based on changing the class distribution of the original unbalanced dataset but on identifying small-class samples by making the classifier more sensitive to them. Fang et al. [16] provided a method for detecting internet spam using the SMOTE oversampling method together with the random forest classification algorithm. Li et al. [17] provided an improved kernel density estimation-based data classification algorithm, but the space information of the method is still defined as the distance between the detection point and the class centre, which inevitably reduces the method's robustness. In addition to these methods, many studies have provided improvements to the classification algorithm, e.g. boosting [18], FCM-KFDA [19], AdaBoost-SVM [20], etc. The feature subsets selected in these methods are more optimised, but these methods generally have low efficiency for large, high-dimensional datasets.
For treatment of the unbalanced issue, there is little research on the first aspect, feature dimension reduction. However, it is an effective way of solving unbalanced issues and provides powerful support for solving a series of issues arising from high-dimensional data. In [21], traditional feature selection methods were classified into unilateral and bilateral methods. The positive class features (sample of feature words belonging to a certain class), and the combined positive and negative features (sample of feature words not belonging to a certain class) are selected using unilateral and bilateral methods, respectively, and a combination framework is established according to feature selection effectiveness to obtain an optimised feature subset. However, this method still relies on traditional feature selection methods and is unsatisfactory for the selection of features in unbalanced datasets. Khoshgoftaar et al. [22] provided an iterative feature selection model to select optimal feature subsets, in which the data features are ranked by the clustering results obtained by an iterative process. However, in the model, the selection of iterative functions and the number of iterations have large impacts on problem solving, and the performance of the model is limited to a certain extent. Through an unsupervised feature dimension-reduction model, this paper intends to minimise the information loss that occurs during dimension reduction and present a data subset that is closer to the original data. Current mainstream methods include PCA (Principal Component Analysis), mutual information-based methods, MDS, ISOMAP, and manifold-based methods (LLE, LE, LPP, NPE, etc.). Lin et al. [23] provided a direct, unsupervised, orthogonal locality-preserving algorithm.
This algorithm resolves the matrix using a Laplacian matrix and may directly extract a projection matrix from the original space of the high-dimensional sample to solve the issue of small samples in the unsupervised identification analysis algorithm. Xu et al. [24] provided a mutual information-based unsupervised feature selection method (UFS-MI). In this method, the UmRMR criterion is selected after comprehensive consideration of feature relevancy and redundancy to evaluate feature importance. Zhu et al. [25] provided an unsupervised feature selection model for regularised self-representation (RSR), in which each feature may be represented as a linear combination of relevant features in a low-dimensional space, and the l2 norm is regularised to select the representative features and ensure robustness. Li et al. [26] provided a strongly robust unsupervised feature selection algorithm (RUFS), which uses the l2,1 norm minimization method to deal with redundancy and noise in label learning and feature selection. This method provides an unsupervised feature selection method for unbalanced datasets. In unsupervised environments, according to changes in the cluster size and using the same features of different clusters, this method assigns weights according to a feature importance function to adjust for the unbalanced nature of the data distribution. Alibeigi et al. [27] provided an unsupervised feature selection method for unbalanced datasets. In unsupervised environments, the probability density of different feature spaces is used to analyse the distribution of each data feature. The data distribution relationship is used for feature selection. However, this method does not take into account the characteristics of the data distribution, which have a great impact on classification performance.

THE PROPOSED METHOD
3.1 Basic Framework
In unsupervised environments, to extract the feature information from unbalanced datasets, it is necessary to determine how to: 1) build models for unlabelled high-dimensional data; 2) effectively measure feature similarity; 3) reduce feature dimensions and effectively reduce redundancy and 4) ensure rapid acquisition of the optimal feature subset.
In this paper, consideration is first given to the dimensionality problems that arise when density peak clustering is performed on unmarked high-dimensional data. Dimension reduction is performed on the high-dimensional vectors with the factor analysis method. The density peak clustering algorithm is then guided by the neighbourhood similarity of the sample points to achieve the clustering and automatic marking of the unmarked text set. Next, a weight is introduced to improve the calculation of the existing χ² statistic, a CHI statistical matrix is constructed for the features and sample classes (clusters), and a low-dimensional embedded space is built while maintaining the amount of original feature information. Finally, the topic genes are extracted with the independent component analysis method. Fig. 1 shows the framework of the unsupervised text topic-related gene extraction method (UTTGE). The following provides details on the three main steps of the UTTGE method.

Density Peak-Based Text Clustering Method
Generally, much informational content lacks effective labelling due to poor processing and arrangement. However, if one needs to perform exploratory classification and marking of such information in an unsupervised sample, clustering, an unguided learning method, should be used. When the sample size is very large in an actual clustering process, the computational load is likely to exceed the capacity of the computer. Therefore, before clustering, it is necessary to perform dimension reduction for a certain class of variables in the sample. In this paper, we first analyze the characteristic variables of the sample using factor analysis, and then use the fast search and discovery of density peaks algorithm to cluster the samples according to the obtained factors.

Factor Analysis of Sample Features
Suppose the sample set X includes n samples, x1, x2, …, xn. Each sample xi consists of m feature indexes, and is recorded as X = (xij)n×m = (X1, X2, …, Xm).
(1) Before the factor analysis, the degree of correlation among X1, X2, …, Xm is assessed with the Kaiser-Meyer-Olkin (KMO) measure [28] to determine whether factor analysis is appropriate. The KMO value lies within (0, 1). The closer the KMO value is to 0, the weaker the correlation among X1, X2, …, Xm; the closer it is to 1, the stronger the correlation. Generally, factor analysis is considered to be of practical significance when the KMO value is > 0.5.
(2) The covariance matrix Σ = (hij)m×m is calculated for X1, X2, …, Xm. The characteristic roots λ1 ≥ λ2 ≥ ··· ≥ λp ≥ 0 of the covariance matrix are obtained from the characteristic equation |Σ − λI| = 0, and the corresponding unit feature vectors are T1, T2, …, Tp.
(3) According to the solution principle for the actual problem, the first u characteristic roots and feature vectors are taken such that the sum of these characteristic roots is > 85% of the sum of all characteristic roots. This determines the number of common factors.
(4) The factor loading matrix
$$A = (a_{ij})_{m \times u} = \left(\sqrt{\lambda_1}\,T_1,\ \sqrt{\lambda_2}\,T_2,\ \ldots,\ \sqrt{\lambda_u}\,T_u\right)$$
is calculated from the characteristic roots and feature vectors of Σ. If the loadings of each factor do not differ significantly across the feature indexes, the factor loading matrix must be rotated, generally by varimax rotation, to obtain the rotated factor loading matrix A'. Using the row vectors of A', each feature index Xi is associated with the factor p ∈ {1, …, u} on which it has the maximum load bip. The sample set X is thereby simplified into the finite sample set X^Δ comprising n samples, where each sample xi consists of u feature index factors. The feature index matrix constructed for the n samples is
$$X^* = (x^*_{ij})_{n \times u} \qquad (1)$$
where $x^*_{ij}$ is the j-th feature index factor of the i-th sample.
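Steps (2) to (4) can be illustrated with a minimal NumPy sketch. It assumes the principal-component form of the loadings (column j equal to the square root of λj times Tj), omits the KMO check and the varimax rotation, and uses hypothetical function names (`principal_factor_loadings`, `assign_features_to_factors`) that do not come from the paper.

```python
import numpy as np

def principal_factor_loadings(X, var_threshold=0.85):
    """Return (loadings A, factor count u) for an n x m sample matrix X (m >= 2)."""
    Xc = X - X.mean(axis=0)                      # centre each feature index
    cov = np.cov(Xc, rowvar=False)               # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder to descending
    eigvals = np.clip(eigvals[order], 0.0, None) # guard against tiny negative values
    eigvecs = eigvecs[:, order]
    share = np.cumsum(eigvals) / eigvals.sum()   # cumulative share of the roots
    # smallest u whose cumulative share is strictly greater than the threshold
    u = min(int(np.searchsorted(share, var_threshold, side="right")) + 1, len(eigvals))
    A = eigvecs[:, :u] * np.sqrt(eigvals[:u])    # column j is sqrt(lambda_j) * T_j
    return A, u

def assign_features_to_factors(A):
    """Index of the factor on which each feature has its largest absolute loading."""
    return np.argmax(np.abs(A), axis=1)
```

In this sketch the unrotated loadings are used directly for the assignment; applying a varimax rotation to `A` first would follow the text more closely.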

Density Peak Search Discovery-Based Text Clustering Algorithm
Rodriguez et al. [29] provided a clustering algorithm based on the fast search and discovery of density peaks, which can automatically discover cluster centres and efficiently cluster data of arbitrary shape. For each data point xi, the algorithm computes its local density ρi and the distance δi from xi to the closest data point xj whose local density is larger than that of xi. The clustering decision graph is built by calculating ρi and δi for every data point xi. Data points with relatively high ρi and δi are marked as cluster centres, and each remaining data point is assigned to the cluster of its nearest data point with higher density. In the algorithm, ρi and δi are defined as follows:
$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases} \qquad (2)$$
$$\delta_i = \min_{j:\ \rho_j > \rho_i} d_{ij} \qquad (3)$$
where dij is the distance between data points xi and xj, and dc is the cut-off distance (hyperparameter). For the data point with the highest local density, δi is conventionally taken as the maximum of dij over all j.
To better remove interference from noisy samples and provide true and reasonable clustering results, we use the feature index matrix from Section 3.2.1 as the algorithm input. We then calculate the sample similarity with the adjusted cosine similarity to redefine the variable dij in the clustering algorithm, and select the cut-off distance dc with the selection method provided in [29]: taking each data point xi as the centre of a circle of radius dc, dc is chosen so that the number of points counted by ρi is, on average, about |X| × 2%. The similarity Sim(i, j) between vectors $x_i^*$ and $x_j^*$ is given by Eq. (4). Thus, we divide the dataset X into C clusters. The algorithm is described below:
Input: feature index matrix $X^*$ of n samples
Output: C sample clusters
Step 1: calculate the distance Sim(i, j) between any two data points $x_i^*$ and $x_j^*$ using Eq. (4).
Step 2: calculate the local density $\rho_i^*$ of every data point $x_i^*$ in $X^*$ and the distance $\delta_i^*$ between this point and the nearest point with higher local density.
Step 3: use $\rho_i^*$ as the horizontal axis and $\delta_i^*$ as the vertical axis to draw a decision graph.
Step 4: according to the decision graph, mark the points with high $\rho_i^*$ and $\delta_i^*$ values as the cluster centres, and mark points where $\rho_i^*$ is relatively low but $\delta_i^*$ is relatively high as noise points.
Step 5: distribute the remaining points to obtain C cluster partitions of the n samples.
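The five steps can be sketched in NumPy as follows. This is only an illustration under several assumptions: the function takes a precomputed symmetric distance matrix `d` (so the similarity of Eq. (4) is not reproduced here), it uses the cut-off kernel of [29] for the local density, it replaces the visual reading of the decision graph with the common heuristic of picking the points with the largest product ρ·δ, and it does not flag noise points. The name `density_peak_cluster` and the parameter `neighbour_share` are hypothetical.

```python
import numpy as np

def density_peak_cluster(d, n_clusters, neighbour_share=0.02):
    """Cluster from a symmetric distance matrix d; returns (labels, rho, delta)."""
    n = d.shape[0]
    # Cut-off distance d_c: a quantile of the pairwise distances, so that each
    # point has on average about neighbour_share * (n - 1) neighbours within d_c.
    dc = np.quantile(d[np.triu_indices(n, k=1)], neighbour_share)
    rho = (d < dc).sum(axis=1) - 1               # local density, self excluded
    order = np.argsort(-rho, kind="stable")      # indices by decreasing density
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    delta[order[0]] = d[order[0]].max()          # convention for the densest point
    for k in range(1, n):
        i, higher = order[k], order[:k]          # ties are broken by the ordering
        j = higher[np.argmin(d[i, higher])]
        delta[i], nearest_higher[i] = d[i, j], j
    gamma = rho * delta                          # stand-in for reading the decision graph
    gamma[order[0]] = np.inf                     # the densest point is always a centre
    centres = np.argsort(-gamma, kind="stable")[:n_clusters]
    labels = np.full(n, -1)
    labels[centres] = np.arange(n_clusters)
    for i in order:                              # propagate labels down the density order
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels, rho, delta
```

Because labels are propagated in order of decreasing density, every point's nearest higher-density neighbour is already labelled when the point is visited, so a single pass suffices.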

Feature Selection Method for Unbalanced Datasets
3.3.1 χ²-Value Based on Text Feature Distribution Matrix
Different feature words differ greatly in their importance and in their ability to express text topics. The CHI method ranks feature importance according to χ²-values and, generally, features below a certain χ²-value contain little or no information for distinguishing sample classes (clusters). However, this understanding assumes that the sample classes (clusters) in the dataset are balanced or nearly balanced. When the classes (clusters) are unbalanced, the influence of the class (cluster) distribution and of feature word frequency on classification is not considered, so the traditional CHI method has obvious defects for unbalanced datasets. To avoid these deficiencies of χ²-values, the specific distribution of the features in each sample class (cluster) must be considered comprehensively, so that sample class (cluster) imbalance and feature selection are addressed together. In this paper, existing χ²-values are blended with information entropy and average local density to establish a new, weighted χ²-value matrix, which can better solve the problem of feature selection in unbalanced datasets. Correcting the distribution of the features in the sample classes (clusters) to a certain extent not only shows the actual feature distribution clearly, but also significantly improves the performance of the CHI statistical selection method.
To account for the differences between sample classes (clusters), the feature t and the sample class (cluster) ci are simultaneously weighted, and the weighted χ²-value is denoted Wχ²(t, ci) in this study. The traditional feature selection method corresponds to W = 1. If a larger weight is assigned to a small class (cluster), the χ²-value of the small class (cluster) is increased, and the chance of selecting its features is increased so as to improve the classification accuracy of the small class (cluster). However, if oversized weights are assigned to the χ²-values of the small class (cluster), the selection of features in the large class (cluster) may be affected. Therefore, the weight setting is especially important, and the weight is defined through the information entropy of feature t and the sample class (cluster) ci in this study; that is to say, Wχ²(t, ci) is expressed as in Eq. (5), where p(t|ci) is the probability of feature t occurring in sample class (cluster) ci, p(ci) is the probability of occurrence of the sample class (cluster) ci, and p(t, ci) is the joint occurrence probability of feature t and sample class (cluster) ci. The statistical matrix K is established from the weighted χ²-values. The rows and columns in K, respectively, are the weighted probability distributions of the feature in different classes (clusters) and the same class (cluster). On this basis, the feature selection can avoid defects resulting from further consideration of the feature or the sample class (cluster).
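To make the structure of the matrix K concrete, the sketch below computes the classical χ² statistic from a term/cluster contingency table and multiplies it by a per-entry weight. The weight itself is left as an input: the entropy and average-local-density weight of this paper is not reproduced, and with `weights=None` the sketch reduces to the traditional case W = 1. The function names and the input layout (`term_counts[t][i]` as the number of documents in cluster i containing term t) are assumptions made for illustration.

```python
def chi_square(a, b, c, d):
    """Classical chi-square for a 2x2 term/cluster table.

    a: documents in the cluster that contain the term
    b: documents outside the cluster that contain the term
    c: documents in the cluster that lack the term
    d: documents outside the cluster that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def weighted_chi_matrix(term_counts, cluster_sizes, weights=None):
    """Build K[t][i] = W[t][i] * chi2(t, c_i); W defaults to 1 (traditional CHI)."""
    total_docs = sum(cluster_sizes)
    K = []
    for t, row in enumerate(term_counts):
        docs_with_term = sum(row)
        k_row = []
        for i, a in enumerate(row):
            b = docs_with_term - a               # term present, other clusters
            c = cluster_sizes[i] - a             # term absent, this cluster
            d = total_docs - a - b - c           # term absent, other clusters
            w = 1.0 if weights is None else weights[t][i]
            k_row.append(w * chi_square(a, b, c, d))
        K.append(k_row)
    return K
```

Passing a larger weight for a small cluster raises its entries in K, which is the mechanism the text describes for giving small-cluster features a better chance of selection.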

Algorithm description
Input: weighted text χ²-value matrix K.
Output: text feature subset T
Algorithm steps: The time complexity of the algorithm is decided mainly by Step (2) and is O(n×m), where n is the number of features and m is the number of classes (clusters). Additionally, according to the specific algorithm steps, the space complexity of the algorithm is O(n).

Genetic Extraction Model for Text Topics
The purpose of the ICA algorithm is to calculate a separation matrix and obtain a group of mutually independent random variables. In this paper, the negentropy-based fast fixed-point algorithm (FastICA) [30] is used to find the mutually independent implicit topic information components by analysing the high-order statistical correlations in the multidimensional data, and to extract independent genetic features while removing the high-order redundancy of the components.

Negentropy-Based Fast Fixed Point Algorithm
Definition 2: if the density function of the random variable y is $p_y(x)$, its differential entropy is defined as follows:
$$H(y) = -\int p_y(x) \log p_y(x)\, dx \qquad (6)$$
Definition 3: the negentropy J is defined below:
$$J(y) = H(y^*) - H(y) \qquad (7)$$
where $y^*$ is a Gaussian random vector with the same correlation (covariance) matrix as y.
It is very difficult to calculate the negentropy directly, so it must be approximated. The typical method for negentropy approximation uses higher-order cumulants and polynomial density expansions. The corresponding approximation is given below:
$$J(y) \approx \frac{1}{12} E\{y^3\}^2 + \frac{1}{48}\, \mathrm{kurt}(y)^2 \qquad (8)$$
where kurt(y) is the kurtosis of y. However, this estimation method is not robust. Therefore, in practice, the expectation of a non-quadratic function G is used, and the corresponding approximate form is as follows:
$$J(y) \propto \left[ E\{G(y)\} - E\{G(\nu)\} \right]^2 \qquad (9)$$
where ν is a standard Gaussian variable, and the function G may be selected from $G_1(y) = \frac{1}{a} \log \cosh(ay)$ or $G_2(y) = -\exp(-0.5\,y^2)$, where a ranges within 1 ≤ a ≤ 2 and is generally 1.
The FastICA algorithm finds a unit-length vector w that maximises the non-Gaussianity of the corresponding projection $w^T z$. The non-Gaussianity is measured by the negentropy approximation $J(w^T z)$, as defined in Eq. (9).
Description of the basic algorithmic form:
① Centre the data so that its mean is 0;
② Whiten the data to obtain z;
③ Select the estimated number m of independent components, and set i = 1;
④ Select an initial vector wi (randomly) with unit norm;
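For a single component, the standard one-unit fixed-point iteration of FastICA with the log cosh contrast can be sketched as below. This is the textbook update from the FastICA literature rather than a transcription of this paper's remaining steps, it omits the deflation (orthogonalisation against previously found vectors) needed for several components, and the function name, tolerance and iteration limit are assumptions.

```python
import numpy as np

def fastica_one_unit(z, a=1.0, max_iter=200, tol=1e-8, seed=0):
    """Estimate one unit vector w from whitened data z (shape: dims x samples)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=z.shape[0])
    w /= np.linalg.norm(w)                       # random initial vector with unit norm
    for _ in range(max_iter):
        wz = w @ z                               # current projection w^T z
        g = np.tanh(a * wz)                      # g = G1', with G1(y) = (1/a) log cosh(a y)
        g_prime = a * (1.0 - g ** 2)             # derivative of g
        w_new = (z * g).mean(axis=1) - g_prime.mean() * w   # fixed-point update
        w_new /= np.linalg.norm(w_new)           # project back to the unit sphere
        converged = abs(abs(w_new @ w) - 1.0) < tol         # same direction up to sign
        w = w_new
        if converged:
            break
    return w
```

Convergence is tested on the absolute value of the inner product because w and −w define the same component.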

Description of Genetic Extraction Algorithm for Text Topics
Obtain the text feature subset T = t1, t2, …, tp of dataset X by the algorithm provided in Section 3.3.2.
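(1) Centralisation of the feature subset. Calculate the mean vector of the text feature subset T = t1, t2, …, tp and subtract it from T to obtain the centred data T'.
(2) Whitening. Calculate the covariance matrix of the centred data and decompose it as
$$C_t = \mathbb{E}\{T' T'^{T}\} = E D E^{T}$$
where E is the feature vector matrix of Ct, E is an orthogonal matrix, D is the feature value matrix of Ct, and D is a diagonal matrix. The linear whitening matrix is:
$$V = D^{-1/2} E^{T}$$
The data obtained after whitening is:
$$z = V T'$$
(3) Calculate the independent components with the algorithm provided in Section 3.4.1.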
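The centring and whitening steps can be sketched with NumPy as follows, assuming the standard eigenvalue-based whitening matrix V = D^(-1/2) E^T and a data layout of features by samples. The function name and the small threshold `eps`, which drops directions with near-zero variance, are assumptions of this sketch.

```python
import numpy as np

def centre_and_whiten(T, eps=1e-12):
    """T has shape (features, samples); returns (z, V) with z = V @ (T - mean)."""
    Tc = T - T.mean(axis=1, keepdims=True)       # centred data T'
    C = Tc @ Tc.T / Tc.shape[1]                  # covariance estimate of T'
    eigvals, E = np.linalg.eigh(C)               # C = E D E^T, E orthogonal
    keep = eigvals > eps                         # discard degenerate directions
    V = (E[:, keep] / np.sqrt(eigvals[keep])).T  # V = D^(-1/2) E^T
    return V @ Tc, V
```

After this step the rows of `z` are uncorrelated with unit variance, which is the input expected by the fixed-point iteration of Section 3.4.1.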

EXPERIMENTAL RESULTS AND ANALYSIS
In this section, the unsupervised text topic-related gene extraction method is verified by experiment. All code was written in MATLAB R2015a, and the experiments were run on a PC with the following configuration: HP Pavilion 15, Intel i7-6500U CPU, 8 GB RAM and Windows 10 64-bit operating system. To validate the proposed method, it was compared on several publicly available datasets with the following methods: the regularized self-representation-based unsupervised feature selection algorithm (RSR) [25], the feature clustering-based feature selection method (FSFC) [31], the mutual information-based unsupervised feature selection method (UFS-MI) [24], the strong robustness unsupervised feature selection algorithm (RUFS) [26], and the model-induced term-weighted features method (tp-bnb) [32].

Corpus Set
To verify the differences in the performance of multiple methods in different data environments, three datasets from different sources are selected for evaluation testing in this paper. Dataset C: Sohu news data (SogouCS) 20151022 corpus, including 12 classes of 10,902 files, where the largest class includes 2254 files and the smallest class includes 130 files. To verify the actual treatment effects of all methods, the corpus is supplemented and optimised to a certain degree. For example, some classes of text are supplemented, some incomplete texts are removed, and six classes are selected from the processed corpus: computer games, entertainment, sports and leisure, medicine, natural science, and art, for a total of 5493 texts. See Tab. 1 for the specific data structure. Dataset D: microblogs related to 12 hot topics on Sina Weibo collected from January 1 to January 10, 2017. Considering the large differences in the popularity of the topics, an equal proportion sampling method is adopted. After sampling, the data are labelled manually, giving 2996 relevant messages and 500 noisy messages for the 12 topics. See Tab. 2 for the specific data structure.
In the text pre-processing stage, the Chinese corpus is processed by the ICTCLAS Chinese word segmentation tool of the Chinese Academy of Sciences. The English corpus is processed by the Porter stemming algorithm. K-nearest neighbour (KNN), naïve Bayes and Support Vector Machine (SVM) are used as the classification algorithms, and k-means is selected as the clustering method. The neighbour parameter used in the comparison algorithms is set to five, and cosine similarity is selected as the vector similarity.

Evaluation Test Index
Evaluation of the algorithm's classification results was performed using the macro average recall ratio, macro average accuracy rate, macro average F1 value and other indexes. The algorithm's clustering results were evaluated according to the normalized mutual information.
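1) The macro average recall ratio is given in Eq. (11).
$$\mathrm{Macro}_r = \frac{1}{|C|} \sum_{i=1}^{|C|} r_i \qquad (11)$$
where ri is the recall ratio of class i, and |C| is the class number.
2) The macro average accuracy rate is given in Eq. (12).
$$\mathrm{Macro}_p = \frac{1}{|C|} \sum_{i=1}^{|C|} p_i \qquad (12)$$
where pi is the accuracy rate of class i.
3) The macro average F1 value is given in Eq. (13).
$$\mathrm{Macro}_{F1} = \frac{1}{|C|} \sum_{i=1}^{|C|} F1_i \qquad (13)$$
where F1i is the F1 value of class i.
4) The level of similarity between different partitions of the same dataset can be measured by the normalized mutual information, as shown in Eq. (14).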
where U and V are two different partitions of the same dataset, and I(U, V) is the mutual information of U and V.
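The macro-averaged indexes of Eqs. (11) to (13) can be computed from true and predicted labels as in the plain-Python sketch below. It assumes that pi is the per-class precision, that F1i is the harmonic mean of pi and ri, and that a class with an empty denominator contributes 0; the function name `macro_scores` is hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Return (macro recall, macro precision, macro F1) over the classes in y_true."""
    classes = sorted(set(y_true))
    recalls, precisions, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)   # correctly assigned to c
        fn = sum(1 for t, p in pairs if t == c and p != c)   # missed members of c
        fp = sum(1 for t, p in pairs if t != c and p == c)   # wrongly assigned to c
        r = tp / (tp + fn) if tp + fn else 0.0
        p_c = tp / (tp + fp) if tp + fp else 0.0
        f = 2 * p_c * r / (p_c + r) if p_c + r else 0.0
        recalls.append(r)
        precisions.append(p_c)
        f1s.append(f)
    k = len(classes)
    return sum(recalls) / k, sum(precisions) / k, sum(f1s) / k
```

Because every class has equal weight in the average, a poorly recognised small class lowers these indexes as much as a poorly recognised large one, which is why they suit unbalanced datasets.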

Experimental Testing and Analysis
To obtain experimental results with high statistical significance, this paper used five-fold cross validation for evaluation. Figs. 2 and 3, respectively, show the classification results for RSR, FSFC, UFS-MI, RUFS, tp-bnb and UTTGE obtained by the KNN and naïve Bayes classifiers on the 20 Newsgroups corpus. According to Figs. 2 and 3, the effect is best for UTTGE and worst for FSFC. Among RSR, UFS-MI, tp-bnb and RUFS, UFS-MI performed best. When fewer features are selected, the UTTGE method has higher classification accuracy than the other five methods, with UFS-MI and tp-bnb giving results similar to those of UTTGE. However, it can be seen that the classification performance of these methods is reduced to a certain degree at low dimensions. This is mainly because of the impact on classification performance of the many empty files that appear in the feature dimension reduction process. Therefore, it cannot be said that selecting fewer features gives a better result. Figs. 2 and 3 also show that when the feature number is increased beyond a critical point, the performance of the classifier declines to a certain extent, mainly because of the impact of the many invalid classification features that are introduced. Hence, the degree of feature dimension reduction must be carefully chosen within a rational range.
In Tab. 3, UTTGE has a slightly lower macro average accuracy rate than UFS-MI, but only when the feature number is 200. When the feature number is 200 or 500, UTTGE has a slightly lower macro average F1 value than tp-bnb. When the feature number is 50, the macro average accuracy rates of RSR, FSFC, UFS-MI, tp-bnb and RUFS are less than 75%, while UTTGE achieves 76.01%. The macro average F1 values of RSR, FSFC, UFS-MI, tp-bnb and RUFS are less than 70%, while UTTGE achieves 70.28%. For UTTGE, with increases in the feature number, the macro average accuracy rate tends to become stable after reaching the optimal value of 87.2%, and exceeded 85%.
The other five methods have macro average accuracy rates of less than 85%.
Tab. 4 shows the experimental results of the RSR, FSFC, UFS-MI, RUFS, tp-bnb and UTTGE methods obtained by the naïve Bayes classifier on the Sohu News dataset (SogouCS) 20151022. According to Tab. 4, UTTGE has a higher macro average accuracy rate than the other four methods except with 100 features, when it has a slightly lower macro average F1 value than tp-bnb. With 100 features, RSR, FSFC, UFS-MI and RUFS have macro average accuracy rates of less than 60%, while UKGE-MS has a macro average accuracy rate of 73.12%. Only tp-bnb has a result close to that of UTTGE, which shows that when fewer features are selected, UTTGE performs better than the other five methods. As the feature number increases, the classification accuracy of UTTGE is stable after reaching the optimal value, with a macro average accuracy rate of more than 80%, while the other four methods remain below 80% and also have lower macro average F1 values.
Tab. 5 shows the peak macro average recall ratio for the different features selected from the three datasets from different sources, obtained by the KNN classifier. According to Tab. 5, with Datasets A and C, UTTGE has a significantly higher macro average recall ratio than the other five methods. With Dataset B, UTTGE has a slightly lower macro average recall ratio than tp-bnb. However, when the Chinese unbalanced dataset is processed, the performance of UTTGE declines slightly compared with that on the English dataset, mainly because more factors affect Chinese text than English text during processing, and the semantic conductive influences of the words are more significant.
Tab. 6 shows the experimental results of the RSR, FSFC, UFS-MI, RUFS, tp-bnb and UTTGE methods obtained on Dataset D by using the LIBSVM classifier (RBF kernel function, with the dimension of the feature space vector set to 300). From Tab. 6, it can be seen that the UTTGE method significantly improves the macro average accuracy and macro average F1 value obtained by the LIBSVM classifier compared with the other methods. Based on the analysis of the experimental results in Tabs. 3-6, the UTTGE method can accurately select the optimal feature subset that comprehensively and truly reflects the topic information of texts, which effectively improves the classification and recognition of texts.
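The macro average measures used in Tabs. 3-6 give every class the same weight regardless of its size, which is why they are informative for unbalanced datasets. A minimal sketch of how such measures can be computed from true and predicted labels is given below. It assumes that the "macro average accuracy rate" denotes the mean per-class precision and that the macro average F1 value is the mean of the per-class F1 values; the function name `macro_scores` and the toy labels are illustrative and are not taken from the experiments in this paper.

```python
from collections import Counter

def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1) over the classes in y_true."""
    classes = sorted(set(y_true))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was missed
    precisions, recalls, f1s = [], [], []
    for c in classes:
        p_den = tp[c] + fp[c]
        r_den = tp[c] + fn[c]
        prec = tp[c] / p_den if p_den else 0.0
        rec = tp[c] / r_den if r_den else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy unbalanced example (assumed data): 8 samples of class "a", 2 of class "b".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]   # one "b" sample is misclassified as "a"
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
```

In this toy case the plain accuracy is 0.9, whereas the macro average recall is only 0.75, because the error on the small class counts as much as the large class does.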
Following the idea, proposed in the literature [33], of using a clustering algorithm to verify the validity of a classification algorithm, the k-means clustering algorithm is used to analyze the agreement between the clustering and the original categories of the text datasets considered in this paper. The normalized mutual information value of each dataset is used to measure the effectiveness of the algorithm.
Because k-means clustering requires the cluster number to be defined explicitly, and to reduce the impact of k-value selection on the methods, the cluster number k used for the proposed and comparison methods is set to the number of classes included in the data labels, i.e. 20, 10 and 12. Fig. 4 shows the normalized mutual information values under different conditions. It can also be seen in Fig. 4 that the proposed UTTGE method has obvious advantages compared with the other four algorithms. As shown in Figs. 4(b) and (c), when the feature number is lower, UTTGE rapidly achieves better results. Therefore, compared with the common unsupervised feature selection algorithms, the proposed algorithm performs better during unsupervised feature selection.
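For reference, the normalized mutual information between a clustering and the original category labels can be sketched as follows. The normalization by the geometric mean of the two entropies is an assumption made here for illustration, since other conventions (arithmetic mean, maximum) are also in use; the function name and the toy labels are likewise hypothetical.

```python
import math
from collections import Counter

def normalized_mutual_information(labels_true, labels_pred):
    """NMI = I(T; P) / sqrt(H(T) * H(P)), using natural logarithms."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    # Mutual information between the two partitions.
    mi = 0.0
    for (t, p), n_tp in count_joint.items():
        mi += (n_tp / n) * math.log(n * n_tp / (count_true[t] * count_pred[p]))

    # Entropy of each partition.
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())

    # Degenerate case: at least one partition consists of a single cluster.
    if h_true == 0.0 or h_pred == 0.0:
        return 1.0 if h_true == h_pred else 0.0
    return mi / math.sqrt(h_true * h_pred)

# Cluster ids are arbitrary, so a relabelled but identical partition scores 1.
print(normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0]))
# A clustering that is independent of the categories scores 0.
print(normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the measure is invariant to the naming of clusters, it can compare k-means output directly with the category labels without first matching cluster ids to classes.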

Parameter Analysis
Selection of the k value has a major impact on the results of the KNN algorithm. If a smaller k value is selected, only training samples that are close to the input sample affect the prediction, which can cause overfitting. If a larger k value is selected, the learning estimation error is reduced, but the learning approximation error may increase. In this case, training samples that are far from (dissimilar to) the input sample also affect the prediction and can cause prediction errors. Therefore, in practice, a relatively small k value is generally chosen, and the best k value is selected by cross validation. Fig. 5 shows the classification results of the KNN algorithm on the three datasets, with the UTTGE method using various k values. The plots of the experimental results form parabolic shapes. With the 20 Newsgroups corpus, the KNN algorithm has the best classification effect when k = 21 (Fig. 5a). With the Reuters-21578 dataset, the optimal value is k = 29 (Fig. 5b). Meanwhile, with the Sohu News dataset (SogouCS), k = 9 is optimal (Fig. 5c).
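The selection of k by cross validation described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the experimental setup of this paper: the helper names (`knn_predict`, `cv_accuracy`, `best_k`), the squared Euclidean distance, the fold construction and the toy one-dimensional data are all hypothetical.

```python
import random
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points nearest to `query`."""
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(train_x[i], query))
    nearest = sorted(range(len(train_x)), key=sq_dist)[:k]
    votes = Counter(train_y[i] for i in nearest)
    # On a tie, the label that was seen first (the nearer neighbour) wins.
    return votes.most_common(1)[0][0]

def cv_accuracy(xs, ys, k, folds=5, seed=0):
    """Accuracy of k-NN estimated by `folds`-fold cross validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test_idx = idx[f::folds]                  # every folds-th shuffled index
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        for i in test_idx:
            if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
                correct += 1
    return correct / len(xs)

def best_k(xs, ys, candidates, folds=5):
    """Candidate k with the highest cross-validated accuracy (first one on ties)."""
    return max(candidates, key=lambda k: cv_accuracy(xs, ys, k, folds))

# Toy data (assumed): two well-separated groups of ten points each.
xs = [(0.1 * i, 0.0) for i in range(10)] + [(10 + 0.1 * i, 0.0) for i in range(10)]
ys = [0] * 10 + [1] * 10
for k in (1, 3, 15):
    print(k, cv_accuracy(xs, ys, k))
print("selected k:", best_k(xs, ys, [1, 3, 15]))
```

With only 16 training points per fold, k = 15 forces distant points from the other group into the vote, which illustrates how an overly large k can raise the approximation error.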

CONCLUSIONS AND FUTURE WORK
This paper has explored traditional feature dimension reduction methods on unbalanced datasets, and proposes an unsupervised text topic-related gene extraction method based on the density peak, the χ² distribution matrix, and independent component analysis. This method does not need large-scale training on marked samples or valid pre-definition of class relationships and relevant features, and it overcomes the poor generalization of models that results from unbalanced distributions. On the basis of the text clustering method that rapidly searches for and finds density peaks, the distribution characteristics of text features, expressed as weighted χ² values, were determined by information entropy, which avoids the changes in the class distribution of unbalanced datasets caused by oversampling and undersampling methods. The performance of the CHI statistical selection method is significantly improved by correcting the feature class distribution. Finally, the independent implicit information components of the multi-dimensional data are extracted by the negentropy-based fast fixed-point algorithm (FastICA), and the resulting feature subset has better generalisation performance than those of the RSR, FSFC, UFS-MI, RUFS and tp-bnb methods.
Feature dimension reduction is achieved under the condition of maintaining the identifiability of the dataset. Feature dimension reduction is a key step in the preprocessing of large industrial and social datasets [34,35].
The gene extraction approach proposed in this paper is expected to play an important role in the field of "big data" processing. Future work will therefore explore how to better meet the data processing requirements in this field.

Compliance with Ethical Standards
This study was funded by the Anhui Province Philosophy and Social Science Planning Project (No. AHSKY2018D09). The authors would like to thank the anonymous reviewers for their helpful comments and suggestions.