Cross-Media Semantic Matching based on Sparse Representation

Abstract: With the rapid growth of multi-modal data, cross-media retrieval has attracted considerable research interest. In this paper, cross-media retrieval comprises two tasks: using a query image to retrieve relevant texts, and using a query text to retrieve relevant images. Building on advances in sparse representation, two independent sparse representation classifiers are used to map the heterogeneous features of images and texts into their common semantic space before similarity comparison. The proposed method makes full use of semantic information and is effective for retrieval. Its performance was evaluated on the Wikipedia dataset, the NUS-WIDE dataset, the Wikipedia dataset with CNN features, and the Pascal dataset with CNN features. The experimental results validate its effectiveness against several state-of-the-art algorithms on Mean Average Precision and other performance indexes.


INTRODUCTION
With the rapid development of multi-modal data, it is very useful to understand and mine the information contained in data by exploiting the correlations among modalities [1]. First, by analyzing pictures and textual comments on Internet social networks, it is possible to understand public opinion on current hot topics or to predict social problems affecting public safety. Second, with the development of e-commerce, online shopping websites such as Taobao and Jingdong have become an inseparable part of people's lives; by analyzing products' style, function and user reviews, these websites can adjust their marketing strategies. At the same time, the development of the Internet has also changed the way people work, learn and entertain themselves: people use an image to retrieve similar images or texts, or use keywords and textual documents to retrieve related images and videos. Through correlation analysis of multi-modal data, better services can be provided for Internet users, improving the efficiency of people's study and work. Therefore, analyzing the semantic content of multi-modal media data through their semantic correlations has become an important research topic in cross-media retrieval and pattern recognition.
Currently, correlation modeling among multi-modal data still faces challenges [2]. On the one hand, the low-level features of data from different modalities (e.g. an image and a passage of text) are heterogeneous. However, heterogeneous media data can be unified at the semantic level, i.e. the semantic consistency of heterogeneous media data; traditional media technology ignores this, which makes heterogeneous data difficult to handle. On the other hand, correlation modeling of multi-modal media data also needs the semantic information of isomorphic media data (e.g. several images are isomorphic to each other). Although such data are usually consistent in feature representation, how to mine the correlations of isomorphic media data using semantic information is another important problem in cross-media correlation modeling.
In this paper, two independent sparse representation classifiers are used to map the heterogeneous features of images and texts into their common semantic space before similarity comparison. From their outputs, the common semantic space of images and texts is obtained and then applied to cross-media retrieval. This method is named Sparse Representation-Semantic Matching (SRSM). Compared with other cross-media retrieval methods, it considers both the semantic information of isomorphic media data and the semantic consistency of heterogeneous media data. Moreover, it makes full use of semantic information and is effective.
The rest of the paper is organized as follows. Related work is introduced in Section 2. The details of SRSM are described in Section 3. Experimental results are presented in Section 4, and conclusions are drawn in Section 5.

RELATED WORK

Cross-Media Retrieval based on Subspace Learning
Currently, a significant number of cross-media retrieval works focus on subspace learning. This kind of method aims to learn a latent subspace shared by different modalities of media data (shown in Fig. 1), and it can be divided into the following four groups.

Figure 1 The framework of the subspace learning method

Subspace learning based on projection: This kind of method uses feature mappings to extract the latent subspaces shared by different modalities of media data. It can be divided into linear projection methods (e.g. Canonical Correlation Analysis, CCA [3], and Partial Least Squares, PLS [4]) and nonlinear projection methods (e.g. Kernel Canonical Correlation Analysis, KCCA [5], and Deep Canonical Correlation Analysis, DCCA [6]).
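To make the projection idea concrete, a minimal linear CCA can be sketched in a few lines of NumPy. This is an illustrative sketch, not the formulation of any of the cited papers: the function name `cca`, the small ridge term `reg` (added for numerical stability), and the toy dimensions are all choices made here.

```python
import numpy as np

def cca(X, Y, d=2, reg=1e-3):
    """Project two views into a shared d-dimensional subspace.

    X: (n, p) features of one modality, Y: (n, q) features of the other.
    Returns the projection matrices and the canonical correlations.
    """
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance matrices
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Inverse square roots via Cholesky: Wx.T @ Cxx @ Wx = I
    isqrt = lambda C: np.linalg.inv(np.linalg.cholesky(C)).T
    Wx, Wy = isqrt(Cxx), isqrt(Cyy)
    # SVD of the whitened cross-covariance yields the canonical directions
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    return Wx @ U[:, :d], Wy @ Vt.T[:, :d], s[:d]
```

On two views generated from a shared latent signal, the leading canonical correlation returned by this sketch approaches 1.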
Subspace learning based on matrix factorization: This kind of method uses matrix factorization to extract the basis vectors of the latent subspaces shared by different modalities of media data. It can be divided into nonnegative factorization methods (e.g. Joint Shared Nonnegative Matrix Factorization, JSNMF [7]) and eigen-decomposition-based methods (e.g. Multi-Output Regularized Feature Projection, MORFP [8]).

Subspace learning based on task: This kind of method learns multiple related tasks at the same time so as to improve the overall generalization performance of each task. It can be divided into multi-task learning methods (e.g. Alternating Structure Optimization, ASO [9], and Convex Multi-Task Feature Learning, CMTFL [10]), multi-label learning methods (e.g. Shared-Subspace Learning for Multi-Label Classification, SSLMC [11]) and multi-class learning methods (e.g. Shared Structures in Multi-Class Classification, SSMCC [12]).
Subspace learning based on measurement: This kind of method aims to learn a good distance metric across different modalities of media data so that meaningful distances between them can be measured. It can be divided into Euclidean distance metric methods (e.g. Multi-Modal Distance Metric Learning, MMDML [13]) and Mahalanobis distance metric methods (e.g. Shared Subspace for Multiple Metric Learning, SSMML [14]).

Sparse Representation
Research in neurophysiology shows that sparse coding exists in the primary visual cortex of humans. In 2000, Vinje and Gallant published a paper in Science [15]. By recording the responses of macaque neurons to open natural scenes and simulated natural scenes, they discovered that the responses of neurons in the visual cortex follow a sparse distribution. In 2001, a paper published by Nirenberg et al. in Nature showed similar results [16].
Sparse models are widely applied in signal and image processing: each signal can be represented by a linear combination of a small number of elements of a dictionary. The development of sparse representation for images is roughly as follows. In 1993, Mallat proposed sparse representation over an overcomplete dictionary; he used an overcomplete Gabor dictionary to represent an image and proposed the Matching Pursuit (MP) algorithm [17]. In 1996, Olshausen et al. revealed the directional selectivity of human vision [18]. Many other methods have since been proposed [19][20][21][22]. For example, in [19], a signal was sparsely coded over a set of redundant bases and classified based on its coding vector. In [20], Wright et al. introduced sparse representation for robust face recognition, which boosted research on sparse representation classification. Gao et al. [21] proposed kernel sparse representation for face recognition.
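As a concrete illustration of the dictionary view described above, the greedy Matching Pursuit iteration can be sketched as follows. This is a simplified sketch rather than Mallat's original formulation, and the dictionary is assumed here to have unit-norm columns.

```python
import numpy as np

def matching_pursuit(D, y, n_atoms=5):
    """Greedily select the dictionary atom most correlated with the
    residual, subtract its contribution, and repeat.

    D: dictionary with unit-norm columns, y: signal to represent.
    Returns the sparse coefficient vector and the final residual.
    """
    r = y.copy()
    x = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        corr = D.T @ r                      # correlation with every atom
        j = int(np.argmax(np.abs(corr)))    # best-matching atom
        x[j] += corr[j]
        r = r - corr[j] * D[:, j]           # remove its contribution
    return x, r
```

With an orthonormal dictionary, two iterations exactly recover a 2-sparse signal; with redundant dictionaries the recovery is approximate and greedy.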

SPARSE REPRESENTATION-SEMANTIC MATCHING CROSS-MEDIA RETRIEVAL
In this section, the details of SRSM are introduced; the framework of the model is shown in Fig. 2. Two independent sparse representation classifiers are used to map the heterogeneous features of images and texts into their common semantic space before similarity comparison. With the outputs of the two classifiers, the common semantic space of images and texts is obtained and then applied to cross-media retrieval. The two classifiers first unify the isomorphic features of images and of texts at the common semantic level respectively, and thereby unify the heterogeneous media data at the semantic level.

Figure 2 The framework of SRSM (low-level features of images and texts are mapped into the common semantic subspace)

Sparse Representation Classifier
With the development of compressed sensing, sparse representation expresses a test sample (e.g. an image or a text) as a linear combination over an overcomplete dictionary built from the training samples, and this representation is naturally sparse [23][24][25][26]. The total training set is defined as the overcomplete dictionary A of k classes:

A = [A_1, A_2, ..., A_k] ∈ R^{m×n},

where the i-th class is represented as

A_i = [v_{i,1}, v_{i,2}, ..., v_{i,n_i}] ∈ R^{m×n_i},

m is the sample dimension (p for images and q for texts), and n_i is the number of training samples of the i-th class. A test sample y can then be written as a linear combination of the training samples:

y = Ax,

where x = [0, ..., 0, α_{i,1}, α_{i,2}, ..., α_{i,n_i}, 0, ..., 0]^T is a coefficient vector whose entries are close to zero except those associated with the i-th class (as shown in Fig. 3), so a test sample y is effectively represented using only the training samples of its own class. The linear system y = Ax can be solved by the following l2-minimization problem:

x̂_2 = arg min ||x||_2  s.t.  Ax = y,

where ||·||_2 denotes the l2-norm. However, the minimum-l2-norm solution is generally dense and carries little class information. Since the desired coefficient vector is sparse, the sparsest solution is sought instead through the l0-minimization problem:

x̂_0 = arg min ||x||_0  s.t.  Ax = y,

where ||·||_0 denotes the l0-norm, i.e. the number of nonzero entries. This problem is NP-hard, but research has validated that its solution coincides with that of the following l1-minimization problem if x is sparse enough:

x̂_1 = arg min ||x||_1  s.t.  Ax = y.

Furthermore, real data are noisy, so a test sample cannot be represented exactly as a sparse linear combination of the training samples. The model y = Ax is therefore rewritten with a possible noise term:

y = Ax + z,

where z is a noise term with bound ||z||_2 < ε. The l1-minimization problem then becomes:

x̂_1 = arg min ||x||_1  s.t.  ||Ax − y||_2 ≤ ε.

Given a test sample y, its sparse representation x̂_1 is computed first. Ideally y is represented only by training samples of its own class, but x̂_1 may contain a few small nonzero entries associated with other classes because of modeling error and noise. Consequently, for the i-th class, let δ_i : R^n → R^n be the function that selects the coefficients associated with that class: for x ∈ R^n, δ_i(x) ∈ R^n is a vector whose entries are zero except those associated with the i-th class. The approximation of y by the i-th class is then ŷ_i = A δ_i(x̂_1), and the test sample y is classified by minimizing the residual:

identity(y) = arg min_i r_i(y),  where r_i(y) = ||y − A δ_i(x̂_1)||_2.

The procedure is summarized in Algorithm 1.

Algorithm 1: Sparse Representation-based Classification (SRC)
(1) Input: a matrix of training samples A ∈ R^{m×n}, a test sample y ∈ R^m, and an optional error tolerance ε > 0.
(2) Normalize the columns of A to have unit l2-norm.
(3) Solve the l1-minimization problem: x̂_1 = arg min ||x||_1 s.t. ||Ax − y||_2 ≤ ε.
(4) Compute the residuals r_i(y) = ||y − A δ_i(x̂_1)||_2 for i = 1, ..., k.
(5) Output: identity(y) = arg min_i r_i(y).
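A compact NumPy sketch of the SRC procedure is given below. The l1-minimization is approximated here by iterative soft-thresholding (ISTA) on the Lagrangian (Lasso) form, which is an implementation choice of this sketch rather than the paper's solver; the penalty `lam` and the iteration count are likewise illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_sparse_code(A, y, lam=0.01, n_iter=500):
    """Approximate the l1-minimization via ISTA on the Lagrangian form
    min 0.5 * ||Ax - y||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - A.T @ (A @ x - y) / L, lam / L)
    return x

def src_classify(A, labels, y, lam=0.01):
    """SRC: code y over the training dictionary, then pick the class
    whose coefficients give the smallest reconstruction residual."""
    A = A / np.linalg.norm(A, axis=0, keepdims=True)  # step 2: unit l2 columns
    x = l1_sparse_code(A, y, lam)                     # step 3: sparse code
    classes = np.unique(labels)
    resid = np.array([np.linalg.norm(y - A @ np.where(labels == c, x, 0.0))
                      for c in classes])              # step 4: delta_i residuals
    return classes[int(np.argmin(resid))], resid      # step 5: min residual
```

On a toy dictionary whose two classes span different random subspaces, a test sample drawn from the first class is reconstructed far better by its own class's columns.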

Sparse Representation-Semantic Matching
In SRSM, two independent sparse representation classifiers are used to map the heterogeneous features of images and texts into their common semantic space before similarity comparison. First, each image or text is reconstructed from all training images or texts based on Algorithm 1. Then, after the residual vectors of the testing images and texts are obtained, a small change is made to them: they are transformed into probability representations, and the maximum entry of each residual vector is set to 1 while the others are set to 0. The procedure of SRSM is given in Algorithm 2. The low-level features of images and texts are thereby mapped into their common semantic subspace, in which the feature dimensions of images and texts are the same.
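One plausible reading of the probability transformation above is a softmax over negative residuals, so that a smaller reconstruction residual yields a higher class probability; the paper's exact transformation is not preserved in this text, so the formula below is an assumption of this sketch.

```python
import numpy as np

def residuals_to_semantic(resid):
    """Map a class-residual vector to a probability-like semantic vector.

    Smaller residual -> higher class probability. The shift by the minimum
    is for numerical stability only and does not change the result.
    """
    s = np.exp(-(resid - resid.min()))
    return s / s.sum()
```

Applying this to the image residuals and the text residuals independently yields vectors of the same dimension (the number of classes), which is what makes the subsequent cross-media distance computation possible.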

EXPERIMENTS
In this section, the experimental results of SRSM are presented and compared with other cross-media retrieval methods on four datasets: the Wikipedia dataset [3], the NUS-WIDE dataset [27], the Wikipedia dataset with CNN features, and the Pascal dataset with CNN features [28][29]. The experimental results validate the effectiveness of the method.

Dataset
Wikipedia dataset [3]: It contains 2866 image-text pairs drawn from Wikipedia articles and their associated images, classified into 10 categories. In this dataset, the low-level features of texts are 10-dimensional Latent Dirichlet Allocation (LDA) features [30], while those of images are 128-dimensional Scale Invariant Feature Transform (SIFT) features [31].
NUS-WIDE dataset [27]: It contains 269,648 image-text pairs in 81 semantic categories. In the experiments, the 10 categories with the largest numbers of samples (i.e. sky, lake, grass, plants, window, water, animal, buildings, clouds and person) are selected to construct the dataset. In this dataset, the low-level features of texts are 1000-dimensional tag feature vectors, while those of images are 500-dimensional SIFT features [31].
Wikipedia-CNN dataset [28][29]: It extracts CNN features from the original images and textual features from the original texts. The low-level features of images are 4096-dimensional CNN features, while those of texts are 100-dimensional LDA features.
Pascal-CNN dataset [28]: It contains 1000 image-text pairs in 20 categories. In the experiment, 600 pairs are selected for training and 400 for testing. In this dataset, the low-level features of images are 4096-dimensional CNN features, while those of texts are 100-dimensional LDA features.

Evaluation Metric and Distance Functions
In the experiments, Mean Average Precision (MAP) and Precision-Recall (PR) curves [3, 28, 29] are used to evaluate the performance of this method and the compared ones; both are widely used for evaluating cross-media retrieval algorithms. In this paper, cross-media retrieval includes two tasks: using a query image to retrieve relevant texts, and using a query text to retrieve relevant images.
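For reference, MAP over a set of ranked result lists can be computed as follows, using the standard definition (variable names are illustrative):

```python
import numpy as np

def average_precision(relevant):
    """AP of one ranked result list; `relevant` marks, in rank order,
    whether each retrieved item is relevant to the query."""
    relevant = np.asarray(relevant, dtype=float)
    if relevant.sum() == 0:
        return 0.0
    # Precision at each rank, averaged over the ranks of relevant items
    precision_at_k = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
    return float((precision_at_k * relevant).sum() / relevant.sum())

def mean_average_precision(relevance_lists):
    """MAP: mean of the per-query average precisions."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))
```

For example, a ranking with relevance pattern [1, 0, 1] has AP = (1 + 2/3) / 2 = 5/6.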
The last step of general cross-media retrieval methods is to compute the distance between each image sample and each text sample. The candidate distance functions include L1 distance, L2 distance, Normalized Correlation (NC), Kullback-Leibler divergence (KL), and Centered Correlation (CC). For convenience, i ∈ R^{k×1} denotes a sample of Ir ∈ R^{k×n} and t ∈ R^{k×1} denotes a sample of Tr ∈ R^{k×n}.
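The paper's exact formulas for these five distances are not preserved in this text, so the conventional definitions are sketched below as an assumption; the correlation measures are written as distances (1 minus the correlation) so that smaller is always better.

```python
import numpy as np

def l1_dist(i, t):
    return float(np.abs(i - t).sum())

def l2_dist(i, t):
    return float(np.linalg.norm(i - t))

def nc_dist(i, t):
    # Normalized Correlation (cosine similarity) turned into a distance
    return 1.0 - float(i @ t / (np.linalg.norm(i) * np.linalg.norm(t)))

def kl_dist(i, t, eps=1e-12):
    # KL divergence for probability-like vectors; eps avoids log(0)
    i, t = i + eps, t + eps
    return float((i * np.log(i / t)).sum())

def cc_dist(i, t):
    # Centered Correlation: cosine similarity after mean-centering
    ic, tc = i - i.mean(), t - t.mean()
    return 1.0 - float(ic @ tc / (np.linalg.norm(ic) * np.linalg.norm(tc)))
```

All five vanish when a vector is compared with itself, which is a quick sanity check on the definitions.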
The performance of SRSM is evaluated with these five distance functions on the Wikipedia-CNN dataset in order to find the most suitable one. The experimental results are shown in Tab. 1: CC distance obtains the best performance, so it is used in all of the experiments.

Experimental Results
In these experiments, the performance of SRSM is compared with other cross-media retrieval methods on the Wikipedia dataset, which was designed for cross-media retrieval, and the NUS-WIDE dataset, which is much larger. The MAP scores on both datasets are shown in Tab. 2; for the compared methods, the results reported in [30] are cited. The results verify the effectiveness of the proposed SRSM method. It has recently been shown in many domains that CNN features offer more powerful image representations, so cross-media retrieval methods using CNN features achieve better performance. Therefore, SRSM is also compared with baseline methods on the Wikipedia-CNN and Pascal-CNN datasets, which extract CNN visual features from the original images. Several classical algorithms are selected for comparison, including MDCR, GMMFA [31], GMMLDA [32], CCA-3V [33], SCM, and CCA. The MAP scores on these two datasets are shown in Tab. 3. The PR curves for image query and text query on the Wikipedia-CNN and Pascal-CNN datasets are also presented, together with the per-class MAP scores for image query, text query and the average performance. The results verify the superior performance of SRSM compared with the other methods on the two datasets with CNN features.
Finally, qualitative retrieval examples are presented: the left column is the query image (text), and the top 5 results are listed in the right columns. The upper part uses an image to retrieve texts; the lower part uses a text to retrieve images. In both parts, the first row is a success case and the second row is a failure case. For convenience, the images corresponding to the texts are used to represent both the query texts and the retrieved texts.

CONCLUSION
In this paper, two independent sparse representation classifiers are used to map the heterogeneous features of images and texts into their common semantic space before similarity comparison. With the outputs of the two classifiers, the common semantic space of images and texts is obtained and then applied to cross-media retrieval. This method is named Sparse Representation-Semantic Matching (SRSM). The cross-media retrieval considered here includes two tasks: using a query image to retrieve relevant texts, and using a query text to retrieve relevant images. The method makes full use of semantic information, and analysis of the results shows that it is clearly effective. Its performance on the Wikipedia dataset, the NUS-WIDE dataset, the Wikipedia dataset with CNN features and the Pascal dataset with CNN features is reported, and the experimental results validate its effectiveness compared with several state-of-the-art algorithms. With this method, images and texts can be retrieved effectively.