Keyword Search in Large-Scale Databases with Topic Cluster Units

: To solve the inefficiency of the existing keyword search methods in large databases, this paper proposes TCU-based query, an offline query method based on topic cluster units. First, topic cluster units (TCUs) are constructed through vertical grouping and horizontal grouping on tables and tuples. In contrast to traditional keyword query methods, this offline method cannot only reduce the query response time, but also return results comprising richer and more complete semantic information. In order to further improve the efficiency of data preprocessing, an optimized solution for table join ordering based on the genetic algorithm is presented. Second, we select index terms using the association rule, and then we build an index on every topic cluster; by doing so we can improve the query speed significantly. Finally, we conduct extensive experiments to demonstrate that our approach greatly improves the performance of keyword search.


INTRODUCTION
Recently, keyword search has been widely used in information retrieval [1]. Because it is simple and easy to use, more and more people have accepted and applied the technology. Since documents and large amounts of data are normally stored in relational databases, it is necessary to provide a simple and effective query method targeting relational databases [2,3]. For non-expert users, it is difficult to exploit related information by means of traditional structured query methods such as SQL, because these query methods require users to have good knowledge of the underlying schema and the skill of using complex query languages. Obviously, these requirements are complicated and unfriendly for ordinary users. Thus keyword search over relational databases has been paid widespread attention.
Existing studies on keyword query over relational databases can be broadly classified into two types of methods: online query and offline query [4]. The main idea of online query is that database is modeled as a schema graph or data graph. At search time, the system traverses the graph to return one or more sub-graphs (Candidate Networks or Steinertrees) as query results [5][6][7][8][9][10][11]. Due to the frequent table joins in the process of queries, these methods can be extremely costly. In contrast, offline query methods have well solved the above problems by using virtual documents or tuple units [12,13]. These offline query methods connect tables using width-first traversal method before a keyword query is issued. Still, the offline query methods do not consider the query efficiency and thus are not practical in largescale databases. The data preprocessing is expensive and time-consuming using above methods if the database contains hundreds of tables. In addition, since the scale of obtained tables at the preprocessing phase is large, it costs a lot of time to get the query results, even if indexes are constructed.
In order to address the above problems, we emphasize the keyword search over large-scale databases, and the contributions of this paper are summarized as follows: (1) We propose a TCU-based query method for keyword search over large-scale databases.
(2) We construct a new data structure -topic cluster units. Based on an improved spectral clustering strategy and an association graph of topic cluster tuples, tables and tuples in a database are divided into several clusters. Generating a set of TCUs as search results offline cannot only significantly reduce the query response time, but also return results with richer and more complete semantics. (3) We design an optimized solution for table join ordering which can reduce the cost of data preprocessing. (4) We select terms according to the association rule algorithm, then we construct a subject index for each topic cluster, which improves the query speed remarkably. (5) We implement our approach over the real dataset Freebase [14]. The results demonstrate that the TCUbased query method achieves high efficiency and effectiveness, and outperforms state-of-the-art approaches significantly.

RELATED WORK 2.1 Online Query
DBXplorer, DISCOVER, BANKS, and BLINKS are online query methods in a relational database. These methods share the similar idea: queries are processed by a graph traversal that searches for connected tuples containing the query keywords.
DBXplorer and DISCOVER view the database as a schema graph with tables as nodes, and relationships as edges [5,6]. These methods construct and evaluate a set of CNs (Candidate Networks) for a given query, then identify potential answers that are composed of relevant tuples based on the CNs and schema graph of the database.
BANKS and BLINKS first covert the whole database into a data graph, where each node denotes a tuple, and edges connect tuples which can be joined together. The methods identify the Steiner trees from the whole graph, which is inefficient and proved to be an NP-Complete problem [8,10,11]. BANKS identifies the Steiner trees by employing a backward search strategy. A Steiner tree is a connected tree where every leaf node represents a tuple containing at least one query keyword, and internal nodes correspond to tuples that connect the leaf records. But when there are some nodes with big in-degree, the performance of BANKS will be decreased remarkably. So BLINKS proposed new techniques -a bidirectional expansion and a BLINKS index that can improve search efficiency significantly.

Offline Query
A drawback of the online query methods is the high cost of the online table joins. In general, the above problems can be solved through the pretreatment of tables and tuples in databases [12,13,15,16]. In recent years, the problem of off-line query has been discussed in-depth, and some solutions are put forward primarily. Su and Widom first propose the concept of text objects and virtual documents [12]. Multi-joining of tables is completed offline and the query efficiency is improved significantly. Improving and expanding the text objects, Feng et al. group the tuples with same attribute values to construct a more complete data structure -tuple units [13]. It integrates multiple tuple units to response to a keyword query and achieves a high performance. Because above methods only consider the simple table joins and use SQL-based methods to create the virtual documents or tuple units, they are not suitable for large-scale databases with complex schemas. Based on the offline query methods mentioned above, we define a more reasonable data structure -topic cluster units and put forward aTCU-based query method. We consider the table join order at the same time, so this method can obtain good performance in terms of retrieval efficiency and effectiveness.

OVERVIEW OF TCU-BASED QUERY METHOD
The main idea of this paper is presented in Fig. 1.
(1) Query. After users submit queries to the query processor, the query processor returns a set of topic cluster units containing one or more keywords as results to the users by scanning the subject indexes. (2) Offline data preprocessing. The offline data preprocessing includes mainly four modules: Vertical grouping,

Vertical Grouping Based on Characteristics of Databases and Query Logs
This section proposes a vertical grouping strategy based on an improved spectral clustering algorithm. In order to make the results of vertical grouping more accurate, we design a novel construction method of similarity matrix between tables. In particular, we construct the similarity matrix from two aspects of table characteristics (i.e., topology compactness and content similarity) and query logs. Thus, this method is comprehensive and reflects the user preferences. Fig. 2 shows that the Vertical grouping is divided into 3 modules: Input module, Similarity matrix construction module and Output module. It takes the relational database and schema graph as input to describe the contents and structures of databases respectively. Another input is the query logs which reflect the distribution characteristics of information in databases from a different aspect. In the Similarity matrix construction module, the topology compactness and content similarities between tables are calculated by analyzing the schema graph and database in the Input module. We then construct a similarity matrix which is made up of topology compactness matrix and content similarity matrix. Finally, we conduct a statistical analysis on query logs to adjust the above results. A set of topic clusters is produced as output.  In the process of measuring the topology compactness, this paper introduces the concept of topology potential in data fields and proposes a novel calculation method of topology compactness. In a database schema graph, we use Gaussian function to describe the interaction between nodes. Gaussian function has good mathematical characters and it can describe the short-range field (the interaction between nodes has localization character, the influence of a node tends to attenuate significantly as the logical distance between nodes increases). We suppose each node can generate a force field along the direction of each edge. Nodes in the schema graph associate with each other and the interaction force is closely related to the size of nodes and the distance between them. Obviously, the topology potential of a node contains rich structural information, it can be used to measure the topology compactness between nodes.
For a database schema graph ( ) G V,E = , the topology compactness between i v and j v is defined as follows: . σ is an influence factor, which determines the influence scope of nodes in the schema graph. When σ is small, the interaction between two nodes is weak; conversely, when σ is big, the interaction between nodes becomes strong. So we need to choose a proper value for σ. This paper employs the topology potential entropy to select the optimal σ. The specific method is as follows: we suppose there are n nodes 1 2 , , ... , n v v v in the database schema graph, where the topology potential of each node is ( ) The topology potential entropy is defined as where n is the number of nodes in the schema graph, is the normalization factor. When H reaches the minimum, the value of σ is optimal. That is, the optimal value of σ varies with the underlying schema and size of databases. ij ld denotes the logical distance between v i and v j , namely, the path length between them in the schema graph. According to the mathematical properties of Gaussian function, for a given σ, if node v i is out of the influence scope ( 3 ) of node v j , the topology compactness between them attenuates to zero rapidly [17,18]. Note:the basic form of Gaussian function is According to the rule of thum b for Gaussian function, we may know that about 99.7% of the values are within 3 standard deviations of the mean. That is, if That is, when ij ld is greater than 3 Based on Eq. (1), we can get the topology compactness between two nodes, and then we use the normalized formula to transform the results and construct a matrix of topology compactness.  Fig. 3 shows a database schema graph, which contains six nodes. Here, we take the figure as an example to show the calculation of topology compactness and the construction of topology compactness matrix c. For this schema graph, the optimal influence factor σ is 1.29 based on the minimal potential entropy; thus, the influence scope and normalized formula 10 we can get the topology compactness between two tables. . .
In the same way, we can calculate the topology compactness among the remaining tables. Thus, the topology compactness matrix C is obtained.

Figure 3 Database schema graph
(2) Content similarity Furthermore, the information stored in tuples and attributes of tables can also affect the similarity between tables and play an important role in the process of Vertical grouping. Obviously, the more similar the contents of both tables, the higher probability that they will be assigned into the same cluster. Therefore, this section will discuss the content similarity between any two tables, which provides the theoretical basis for the graph partition.
The analysis of the content similarity between tables should be divided into two aspects of name similarity and value similarity.
Name similarity imposes an important influence on the content similarity between tables. Specifically, it includes two parts, table name similarity and attribute name similarity. This paper employs Vector Space Model to calculate the name similarity [19]. First, we extract the keywords from table names T and attribute names A. Then we construct the vector V i for every table T i . At last we calculate the name similarity using Eq. (2).
The other factor influencing the content similarity is the attribute value similarity between two tables. The key steps of calculating attribute value similarity are as follows: ① Calculate the content similarity between attributes using Jaccard function.
It is a statistic measuring the extent of variation. The smaller the variation coefficient, the less the richness of attributes, and vice versa. We map the attribute values to integers in ascending order. S m (S n ) and m A ( n A ) are standard deviation and average of the mapped attribute values, respectively.
Through the above analysis, we can get the content similarity matrix.
Finally, the similarity matrix obtained by synthesizing the structural compactness matrix and content similarity matrix where α and β are balancing factors which are used to adjust the impacts of influence factors on similarity between tables. In reality, the values of α and β are usually obtained according to a number of experiments. In this paper we simplify the problems by assigning the same weight (1/2) to them based on maximum entropy model. Example 4.2: Similarly, we can get the name similarity among the remaining tables with the method described above. Then, we will continue to illustrate the calculating procedure of attribute value similarity between tables.
First, we calculate the content similarity between any two attributes of two tables using Eq. (3) in step ①. For example, attribute columns Team name in table 1 and Football team in Tab. 2 are denoted as T 1 A 2 and T 2 A 3 , respectively. The content similarity between them is as follows:

J T A T A T A T A T A T A
Second, we can get the set Z of matching attribute pairs between tables 1 and 2 according to the greedymatching strategy in step ②.

Z T A T A T A T A =
Before calculating the attribute value similarity between tables, we need to calculate the variation coefficient of each attribute column in tables. Thus we can assign a proper weight to each matching attribute pair. Next, we will discuss the calculation method of the variation coefficient with attributes T 1 A 2 and T 2 A 3 as an example.
We map the attribute values to integers in ascending order. The mapped attributes T 1 A 2 and T 2 A 3 are illustrated in Fig. 4. We will calculate the coefficient of variation of two attributes to compare their richness.

Figure 4 Comparison of attribute columns in two tables
The coefficient of variation of attributes T 1 A 2 and T 2 A 3 can be calculated as follows. T A T A > shows that the degree of variation of attribute T 1 A 2 is bigger than that of attribute T 2 A 3 , then the richness of attribute T 1 A 2 is greater than that of attribute T 2 A 3 .
Finally, we use Eq. (4) to calculate the attribute value similarity between the tables. The specific calculation process is as follows:

Sim T T max T A T A J T A T A Z
In conclusion, the content similarity between Tabs. 1 and 2 is as follows:

(3) Adjustment of similarity matrix
Query logs record a large number of query results and contain three fields: user ID, user queries and tables which contain query results. All of the information can reflect user preferences. The basic idea of the vertical grouping approach with user feedback is to perform a statistical analysis on query logs, and to use Eq. (5) adjust the result.

DB T ,T A
can be calculated by Eq. (5) as follows:  We use the information contained in the query logs to strengthen the result and get the final similarity matrix shown below.  ③ Map all nodes in V to R k , assign the nodes in R k into clusters 1 2 , ,..., k C C C with the k-means algorithm.

Join Order Optimization
In order to avoid the complex table join during query processing, the data preprocessing must be done by connecting the tables 1 2 , ,...,  Fig. 5.
To the best of our knowledge, the key step of the join order optimization is encoding data tables. We use a join tree to store and express join orders. In order to preserve more detailed information of the join tree, we encode the join order by the preorder traverse of the join tree, as shown in Fig. 6.
After the above encoding operation, we randomly select poplength join trees as the initial population and execute a genetic operation on the initial population to Corresponding relationships between join tree and integer sequence Crossover operator: exchange the sub-trees with the same size randomly. Replace the duplicate tables in the join tree with the table which does not appear in the tree.
Mutation operator: exchange tables with nonzero values in the join tree.
Then we calculate the cost of each join tree based on the existing cost function, select poplength join trees with minimal cost as the next generations. Repeat the above processes of genetics and selection until it reaches the specified iteration times. The reasonable iteration times can be acquired through many experiments. The join order represented by the minimum cost tree in the last generation of the genetic algorithm is the optimal join order.

Horizontal Grouping Based on Association Graph of Topic Cluster Tuples
After the above sections, we can get a topic cluster set 1 2 ( , ,..., ), where each C i contains a big table T' i . In many cases, multiple topic cluster tuples should be integrated to answer keyword queries. Because of that, we need to further group the topic cluster tuples using a reasonable horizontal grouping method to improve the query speed. The existing methods use SQL-based methods like group by operator to group the tuples in the process of horizontal grouping. This approach is limited to equality of the key attributes, which causes the related tuples containing discrepant values be assigned to different groups. In this section, we design a horizontal grouping method based on the association graph of topic cluster tuples. It can improve the efficiency of queries while making the results of grouping meet the users' demand. The calculation method of comprehensive similarity is put forward to make the results more accurate.
(1) Calculation of similarities between topic cluster tuples One important data model in the horizontal grouping is the association graph G = (V, E) of topic cluster tuples. It is a weighted undirected graph that uses the similarity between topic cluster tuples as weight values of edges. This section proposes a novel calculation method of similarity. ① Define different distance functions, such as Euclidean distance, Edit distance and Hamming distance; ② Map the topic cluster tuples to an n-dimensional space and calculate the distance d k (t' i , t' j ) between two tuples using the distance functions; ③ Calculate the comprehensive similarity between two tuples according to the following formula:

Edit distance:
We need at least edit 3 times to convert t' i to t' j : insert "Beijing" before the word "2015" in t' i ; remove the word "2015" from t' i ; insert the "2015" between the words "Conference" and "Data". So the Edit distance between t' i and t' j is 2 ( ', ')=3 After calculating these three distances between t' i and t' j , we can get the similarity between them using Eq. (7). 1 1

Sim t' ,t' d t' ,t' / max d t' ,t'
All parameters involved in the algorithm include the similarity threshold , the number of topic cluster units r in each topic cluster and Minsize. The first two parameters are given by the users, Minsize is generally set to 1% ~ 3% of the number of tuples in each topic cluster.

Optimization Mechanism of Index Based on Association Rule Algorithm 4.2.1 Construction of Subject Index
In order to speed up queries, we need to construct indexes for the database. Traditional indexing methods create indexes for every single word. Their efficiency is greatly affected in multi-keyword query. To solve the above problem, this section presents an index mechanism for the topic cluster units that uses the association rule algorithm to select frequent item sets as index terms. Each frequent item corresponds to an index entry. The structure of inverted index is as follows: We use Lucene toolkit (a full-text search engine which provides a complete query engine and indexing engine) to generate the index over the topic cluster units, its construction algorithm is presented as follows: Algorithm 3: Index Construction Input:

Query Processing
Using above indexes over topic cluster units, the topic cluster units containing the search keywords can be returned to users as the query results. Given a set of query keywords 1 2 { , ,...., } n K k k k = , the system scans multiple indexes in parallel, and it then calculates the score of each relevant topic cluster units based on an existing ranking function. Finally, we rank the results in descending order and return top-k results to users.

Dataset and Environment Setup
We have designed and performed a comprehensive set of experiments on the real dataset Freebase. Freebase is a shared database with a large size and complex schema. It contains approximately 2000 tables and 300 million entities. Due to the limitation in our laboratory equipment, we simplify the dataset without affecting the experimental results. We extract a small portion of data (about 400 MB) from the Freebase database for our experiments, meanwhile maintaining the database schema and the relationships between tables unchanged. To validate the performance of theTCU-based query method proposed in the paper, the following three groups of experiments are conducted. Experiment 1 examines the effectiveness of the optimization scheme of table join order by evaluating the preprocessing time of the method TCU-based and Naive (that is, a straightforward solution without optimizing the table join order). In Experiment 2, we compare our approach against the baseline methods of DBXplorer, BLANKS and SAINT [5,11,13], the results indicate that the efficiency and effectiveness of the method TCU-based have been further improved. In Experiment 3, comparative experimental results on different datasets show that the TCU-based method has a good scalability.
Our experimental platform is a computer running the Windows 7, with Intel®Core (TM) 2.5 GHz CPU, a 4 GB of RAM and 500G Disk. All the algorithms are implemented in Java.

Experimental Evaluation 5.2.1 Optimization Scheme Evaluation
Experimental comparison between TCU-based query and Naive approach is conducted on the databases with the different number of the tables. Fig. 7 reports the preprocessing time for TCU-based query and Naive approach by varying the number of the tables. From the results, we observe that the TCU-based method beats the method Naive at the preprocessing time. The main reason is that TCU-based query method optimizes the join order among the tables. Additionally, the bigger the number of tables is, the more obvious the advantage of the scheme is.

Comparison with State-of-the-Art Approaches
This section focuses on the performance of the TCUbased method. We randomly select 100 keyword queries with different numbers of keywords from query logs, then we evaluate the efficiency and the effectiveness of our method by comparing it with the baseline methods of DBXplorer, BLANKS and SAINT.

(1) Query efficiency
We vary the different numbers of input keywords to identify all the answers and compare the corresponding average top-2 response time. Query response time is the duration from the instance that a user issues a query Q, until the time it produces the top-2 results, excluding the offline data preprocessing time. As expected, the query length affects the response time significantly. Longer queries result in larger response time. The figure also demonstrates that the offline query methods SAINT and TCU-based achieve much higher efficiency than the online query methods DBXplorer and BLANKS.
The main reason is that the latter two methods require to perform complex join operations of tables during the query phases. Furthermore, it is not surprising that it takes more time for above table joins to perform keyword search over the database with a complex schema than over that with a simple schema. In contrast, the tables and tuples in the database are preprocessed in offline query methods, thereby a lot of time can be saved for real search and the query speed is greatly improved. Moreover, compared to the method SAINT, the method TCU-based proposed in this paper has a great improvement in query performance. More specifically, the TCU-based method constructs a subject index for every topic cluster, and the  Fig. 8a plots that the performance of our method is improved slightly when the number of keywords is greater than 3. It is mainly because our method selects index terms using the association rule. It can directly scan the index items containing all query keywords, which does not need join index.
In order to better evaluate the performance of our query method, we identify the top-k answers with different values of k and compare the corresponding average response time. In Fig. 8b the different k is taken as x-axis and the corresponding average response time is taken as y-axis. Obviously, the method proposed in this paper significantly outperforms the baseline methods, suggesting that using topic cluster units is crucial to obtain good performance. For example, considering finding top-12 results of the queries, our method costs about 2014 ms while DBXplorer, BLANKS, SAINT consume 13500 ms, 7452 ms and 3420 ms respectively. As k grows, the response time of our method increases much more slowly than the other three methods. The reason is very similar to the one described in Fig. 8a, we will not repeat it here. This comparison further demonstrates the high efficiency of our method.

(2) Query effectiveness
In this section, we report the effectiveness of the TCU-based query method in comparison with the other three approaches. The query effectiveness of all methods, namely, the quality of returned answers, is measured by the average precision and average recall metrics. Given a set of keyword queries, we employ the corresponding SQL queries to generate our baseline query results which are accurate and complete, so that they can be used to compare with the results of keyword query methods. The average precision and average recall of different methods mentioned above are illustrated in Fig. 9.
An interesting observation is that the accuracy of the query methods is gradually reduced when we increase the number of keywords. Specifically, the accuracy of 1keyword and 2-keyword queries is generally higher than that of 3-keyword, 4-keyword and 5-keyword. This is because the relationships between keywords become more complex when the number of query keywords increases. The TCU-based method achieves roughly 90 percent precision, which leads to about 3~7 percent over the alternative method SAINT, and it is even higher when compared with the other methods such as DBXplorer and BLANKS. Fig. 9b shows that the TCU-based method outperforms the other three kinds of methods in terms of the recall, and the recall of our approach is about 15 percent higher than that of SAINT. The reason is that our method utilizes the additional pretreatment and then integrates multiple topic cluster units to answer a keyword query, which can improve both precision and recall. The information in database is preprocessed, which includes the Vertical grouping and Horizontal grouping. The Vertical grouping takes into account both the topology compactness and content similarity. Besides, combining the query logs, it can use the history queries to adjust the result of the Vertical grouping. This makes the data tables with a higher topic similarity be classified into the same topic cluster and the result of Vertical grouping is of higher classification accuracy. Horizontal grouping classifies the tuples with high similarity into the same topic cluster units by using a hybrid similarity measurement method. Through the above preprocessing operations, the subsequent query results can contain more comprehensive and accurate information. Comparing with other methods, the TCU- TCU-Based based method can discover answers that indirectly contain query keywords and consider the semantic correlation. Consequently, theTCU-based method is effective enough to provide high-quality query service.

Scalability of TCU-Based Query
This part of the experiment investigates the scalability of TCU-based query method. We study how the size of database affects the performance of the query method. The measurement of top-5 average query response time is described in Experiment 3. In Fig. 10, the x-axis represents the size of databases while the y-axis represents top-5 average query response time. In this experiment, datasets DB 1 , DB 2 , DB 3 , DB 4 and DB 5 are used, and their sizes vary from 100 MB to 500 MB. As the size of dataset grows, the response time of our method increases slowly. This is because the size of dataset only exerts great influence on the preprocessing time but not the online indexing. This experiment shows that our method is extensible on databases with different size.

CONCLUSION
This paper has studied the keyword query over relational databases. We put forward TCU-based query method -an offline query method based on the topic cluster units, which is suitable for large-scale databases with complex schemas and enormous data. Firstly, we define the concept of topic cluster units. In order to construct topic cluster units, we use an improved spectral clustering algorithm to make a vertical grouping on tables and use an association graph of topic cluster tuples to make a horizontal grouping on tuples. Tuples with high topology compactness, content similarities and cooccurrence in query logs are clustered into the same topic cluster through the above options. We use the topic cluster units as results which contain richer and more complete semantic information. Secondly, this paper designs an optimized solution for table join ordering based on genetic algorithm to reduce the cost of data preprocessing. Finally, we select index terms using the association rule to construct indexes for topic clusters, and thus the query speed is improved significantly.
In future research, we will continue to study the parallel query in distributed databases and the dynamic construction of indexes. They will also have positive effects on the query efficiency.