An Efficient Query Optimizer with Materialized Intermediate Views in Distributed and Cloud Environment

: In cloud computing environment hardware resources required for the execution of query using distributed relational database system are scaled up or scaled down according to the query workload performance. Complex queries require large scale of resources in order to complete their execution efficiently. The large scale of resource requirements can be reduced by minimizing query execution time that maximizes resource utilization and decreases payment overhead of customers. Complex queries or batch queries contain some common subexpressions. If these common subexpressions evaluated once and their results are cached, they can be used for execution of further queries. In this research, we have come up with an algorithm for query optimization, which aims at storing intermediate results of the queries and use these by - products for execution of future queries. Extensive experiments have been carried out with the help of simulation model to test the algorithm efficiency.


INTRODUCTION
In cloud environment, bulk of applications are based on data which are managed by Database Management Systems and that forms a crucial issue on cloud platform. So, service oriented computing in cloud, however is also extended to Database as a Service (DBaaS) [1]. Database management in cloud and efficient data access for cloud users become cumbersome task for cloud service providers. Cloud computing has limit of sharing processing time and storage space for various applications in databases. The proliferation found in diverse applications which impacted exceptional cloud stages bringing about massive growth inside the length of the information created just as devoured through such applications [2]. The most effective method to put together and deal with those large databases in order to get the required information for the clients is found to be the new research area in distributed and cloud environment. The data modelling in cloud platform is the basis of cloud applications and the key issue is the searching algorithms applied [3]. The most effective method to get the information convenient, precisely and dependably; assumes a significant function in the achievement of database model in cloud platform.
In cloud platform, users can hire huge resources for short period of time to execute complex queries more efficiently on large database using group of virtual machines [4]. The hire charges of resources for users can be reduced using better query optimization technique [5]. Thus there is a need of exploring efficient techniques for query execution that would decrease runtime and response time. It will likewise upgrade optimum use of resources in cloud data centers. This paper is in continuation with [6][7][8], where an extensive survey on various query optimization approaches as well the novel architecture of an intelligent query optimizer for distributed database has been presented.
Contributions -As such, the researchers have designed and developed a technique for query optimization in distributed and cloud environment. The contributions of this research paper are as follows.
• Devise an architecture of query optimizer for distributed databases which is integrated with materialized views that are reused for evaluation of further queries in the system.
• Formulate a query optimization technique that results in a better optimization of database queries.
• Introduce an optimization strategy that may reduce bandwidth requirement in cloud environment by satisfying SLA between customer and cloud service provider.
• Demonstrate the developed model with TPC-H benchmark dataset and series of benchmark queries to test performance of devised query optimization technique.
• Improve resource utilization in cloud by reducing query evaluation time and response time.
The remainder of this research paper is structured as follows. Section 2 presents an outline of related work. The proposed work with cost model is elaborated in section 3. Section 4 describes an experimental setup with workload of benchmark dataset and series of queries for testing performance of query optimizer as well as results for evaluation of the proposed system. Section 5 concludes the paper.

RELATED WORK
Optimization of database queries is performed through two phases -search space generation and optimal plan selection from the search space [5]. Researchers have discovered different approaches on query optimization those deals with reduction of communication cost, reduction of execution time and appropriate utilization of system resources.
Execution time of queries can be reduced by eliminating common sub-expressions used. Start-fetch wrapper using request window mechanism is used by Lee R et al. [9] to develop an IGNITE system that eliminates common subexpressions. However, the communication traffic generated in IGNITE is reduced by Chen G et al. [10] and Dokeroglu et al. [4] by using efficient sets of query execution plans in their research work. The response time of queries is reduced with the help of parallel query processing systems in research works proposed by Garofalakis et al. [11] and Dokeroglu et al. [12]. Giannikis et al. [13] proposed an architecture based on sharing of computation, storage and cache memory by creating batch of queries. Similar concept of resource sharing for query optimization is presented in [14].
Iterative processing method can also be used for query optimization in which actual runtime statistics is collected by continuously monitoring execution of queries [15]. However, based on intermediate results, it would be cumbersome to collect required statistics. Hence Cole and Grafe [16,17] addressed a multiple plans generation technique at compile time. POP is the progressing query optimization approach invented in [18,19], where cardinality approximation errors are detected in mid execution of queries. DB2L learning optimizer based on use of misestimates to learn and adjust the statistics to enhance better optimization of further queries [20]. Wang et al. [21] presented adaptive query optimization approach for cloud database system based on execution time as well as monetary costs.
The technique of caching intermediate results is one of the widely used query optimization technique [22], extended by Safaeei A et al. [23] based on multiple sliding windows to improve execution of overlapping queries with common subexpressions. Laptev et al. [24] presented EARL system and Agarwal et al. [25] proposed the BlinkDB, those iteratively works for collecting larger samples to reach at the desired accuracy. The Shark, presented in [26] caches inter query data with the help of shared memory concept. In distributed cloud environment, CHive [27] and NOVA [28] are the query optimization techniques based on incremental processing of continuous data. Logothetis et al. [29] invented CBP (Continuous Bulk Processing) system in which working state is preserved during query processing to reuse it for future queries. DBaaS can offer an assistance to cloud users that ready to get results with more fragile quality in return of lower cost [30].
Query processing time can be reduced by materializing views generated from select-project join operations [31]. The policies of protecting materialized views cache are introduced by WATCHMAN [32] and DynaMat [33] systems to increase the hit ratio. Ivanova et al. [34] devised an architecture that optimizes query processing by maintaining cache to avoid repetitions of physical operations. The results from MapReduce jobs can be reused by describing them in the form of analytical query languages [35][36][37]. Perez et al. [38] presented history aware optimizer that archives intermediate results of queries so as to reuse them during future query execution. Intermediate results or views that are to be cached are determined by Cache-on-Demand [39] as well as MQT technique [40] and relations that may be useful in execution of future queries are identified by Kossmann et al. [41].
MapReduce technique results in higher processing cost for queries with more join operations, which can be reduced using pipeline approach where results of a query can be referred by next queries to continue processing. Anyanwu K et al. [42] proposed a data model to minimize the number of MapReduce cycles using pipeline approach. Automatic Query Analyser (AQUA) developed by Wu et al. [43] works for MapReduce in two phases, first phase for minimizing number of MapReduce cycles and second phase for joining intermediate results. MapReduce online system avoids materialization by pipelining midway results from map jobs to reduce jobs [44].
After navigating various approaches on database query optimization, it has been perceived that there is a need of suitable technique to reduce the runtime for database queries that minimizes resource requirement. Conventional SQL cannot predict the future requirements, therefore it reduces the performance than MapReduce technique. MapReduce uses pipeline approach to reduce query processing time, however it require higher processing cost for queries with more join operations. Hence to bridge the gap between MapReduce and Conventional SQL, there is a demand of efficient query optimization technique in distributed and cloud environment.

PROPOSED WORK
The primary aim of this research work is to design and develop an intelligent query optimizer that improves resource utilization in distributed and cloud environment by reducing query evaluation time and response time. As such, the proposed query optimizer materializes intermediate views during query evaluation so as to reuse them for execution further queries. Hence, it will reduce I/O operations, communication cost as well as execution time.

An Architecture
An architecture of the proposed query optimizer presented in Fig. 1 comprises of various components. View Navigator searches for appropriate views during query evaluation by navigating View Catalogue with the help of view matching algorithm presented in Tab. 1. It navigates through each catalogue entry and tries to find the longest match view name stored in view store and returns the location of view store where match is found. Partial intermediate results are also used for evaluation of future queries. This algorithm applies commutative rule for natural join.
View Manager maintains generated views in View Store and also updates View Catalogue. It also keeps track of view reference count as well as read/write timestamp values for every view. These values are used for decision making of view deletions from View Store and corresponding updates in View Catalogue.
Query Evaluator performs the query evaluation task after substituting matching views.
View Store materializes views those are generated from previously evaluated queries.
View Catalogue maintains the catalogue of view names those are materialized in view store. View name matched by View Manager from View Catalogue mapped into View Store.

Cost Model
In order to measure the performance of proposed query optimization technique, a cost model has been developed based on various parameters presented in Tab. 2.
Let Q be the query from workload has been divided into n subqueries {Q1, Q 2 ,… , Q n } during evaluation and trying to match with k number of materialized views {V 1 , V 2 ,… , V k }in View Store.
Total execution cost of query Q includes execution time, memory requirement, IO cost as well as total view matching time from View Store during evaluation of query Q, as shown in Eq. (1). Total view matching time for query Q is calculated as in Eq. (2).
where maximum and minimum view matching time from View Store of any subquery Q i is shown in Eq.
Memory requirement M(Q i ) and IO cost IO(Q i ) for execution of query Q i are directly proportional to the size of relations those are involved in query evaluation, shown in Eq. (5) and (6). However processing and IO cost will get reduced when view is matched from View Store.
Total effective execution time of query Q using materialized view is calculated as in Eq. (7), where effective execution time of subquery Q i is as shown in Eq. (8).

EXPERIMENTAL SETUP AND RESULTS
It is crucial to evaluate the performance of proposed algorithm on real infrastructure of large-scale database system. Hence, to ensure the large scale database system infrastructure, the simulator model for query optimization has been developed in Java which run on the top of MySQL to compare the results of the proposed technique with existing approaches.

Simulation
The simulation model contains various modules. Frequent Query Join Holder module maintains a data structure for catalogue where it makes entries of views which are cached. Before executing any query this catalogue is to be searched. The view matching algorithm has been implemented on the top of Zql parser that parses SQL statements for matching views. If any view matches then resulting entry is searched in actual view store.
Frequent Query Join module is responsible for managing view store. It stores results of previously executed queries as views. If any view is not in use for a long period of time then it deletes it and also deletes its entry from catalogue. Hence view searching time will not dominate query execution time.
DBMS Query Runner module is responsible for execution of queries with the help of Zql Parser [45]. Zql parser parses SQL statements and generates Java structures signifying query statements and expressions. Zql parser takes SQL statements with insert, delete, update, select etc. as an input and produces data structure that represent the statements it parsed.

Example:
Select ps_supplycost, pname from part, partsupp, supplier, nation where p_partkey=ps_partkkey and ps_suppkey=s_suppkey and s_nationkey=n_nationkey and n_name='India'; Zql parser extract various parts of this query and generates ZqlQuery structure. The methods getSelect(), getFrom() and getWhere() returns Select, From and Where parts of the query in their respective data structures of Java.
Result Processor module analyses results using various parameters such as total execution time of queries, average runtime per query and number of iterations to be performed.

Workload
The proposed query optimizer has been tested by generating various workloads using the standard benchmark dataset of TPC-H [46]. The TPC-H database consist of eight individual relations as shown in Tab. 3, which represents data to exercise functionalities of complex analysis application system. The testing has been performed as small scale level, after distributing these benchmark relations randomly over two nodes after the horizontal fragmentation.
TPC-H benchmark comprised of set of 22 original queries to give realistic context that represent the activity of wholesale supplier. With the variant of these 22 base queries, the workload of 50 queries is generated to test the performance of proposed technique. In order to test the effectiveness of proposed system, four variations of workload of queries are generated as shown in Tab. 4. In each query, the selection predicates are generated at random with the help of proper range of probable values as per the benchmark.  Set of repeated queries having common sub-expressions By using the cost model presented in section 3.2, efficiency of proposed query optimizer has been measured in terms of various performance parameters such as total time for query execution, average runtime per query and number of iterations.

Results and Discussion
The performance of proposed query optimization technique is analysed and compared with conventional SQL system w.r.t. performance parameters viz. total execution time for workload of queries, average runtime per query and number of iterations required. Extensive experiments have been conducted with the help of all variations of query workloads specified in Tab. 4 and results are compared with conventional SQL system as presented in Tab. 5. As workload W 1 contains queries without any repetition and/or common subexpression, it cannot take benefit of materialization. Hence, proposed system incurs extra overhead of matching views from view store. However, we observed the decrease in running time for queries in workloads W 2 to W 4 that are matched and recycled intermediate views from view store. Fig. 2 shows the comparative chart on average execution time per query using conventional SQL system and proposed query optimization technique. Figure 2 Comparative study on average runtime per query

CONCLUSION AND FUTURE WORK
In this paper the novel query optimization technique with architecture of query optimizer has been presented, which aims at recycling intermediate results of previously executed queries for execution of future queries. The queries those contain common subexpressions or repeated, can use these materialized views for their execution so that it results into decrease in the average execution time per query and increase in performance of the system. Hence using this technique resource utilization in cloud environment will also get improved. The proposed system has been tested on small scale infrastructure and it is observed that query execution time is reduced to approximately 30% as compared to the conventional query processing system due to materialization of intermediate views. As a future work, the proposed system will be evaluated on a large scale infrastructure in cloud environment to test its efficiency.

Notice
This paper was presented at IC2ST-2021 -International Conference on Convergence of Smart Technologies. This conference was organized in Pune, India by Aspire Research Foundation, January 9-10, 2021. The paper will not be published anywhere else.