Technical gazette, Vol. 30 No. 5, 2023.
Original scientific paper
https://doi.org/10.17559/TV-20230402000493
A Method for Automatically Generating Join Queries Based on Relations-Attributes Distance Matrix over Data Lakes
Caicai Zhang, Zhejiang Institute of Mechanical and Electrical Engineering, No. 528, Binwen Road, Hangzhou, Zhejiang 310053, China
Chenglang Lu, Zhejiang Institute of Mechanical and Electrical Engineering, No. 528, Binwen Road, Hangzhou, Zhejiang 310053, China
Zhuolin Mei, School of Computer and Big Data Science, Jiujiang University, No. 551, Qianjin East Road, Jiujiang, Jiangxi 332005, China
Bin Wu, School of Computer and Big Data Science, Jiujiang University, No. 551, Qianjin East Road, Jiujiang, Jiangxi 332005, China
Jing Yu, School of Computer and Big Data Science, Jiujiang University, No. 551, Qianjin East Road, Jiujiang, Jiangxi 332005, China
Abstract
Techniques for identifying joinable or unionable tables in data lakes can yield valuable information for data scientists. However, data scientists spend more than half of their working time familiarizing themselves with the metadata and correlations of datasets, so simplifying access to the information in data lakes is crucial for improving their utilization. The existing solution, which integrates correlated relations into a single large table via full disjunction, must update the integrated table whenever either data or metadata changes, which complicates data maintenance. This paper proposes a method for automatically generating join queries based on a distance matrix of the relations and attributes in a data lake. The distance matrix needs updating only when metadata changes, which simplifies data maintenance. Experimental results demonstrate that once the distance matrix is generated, the time required to generate join queries is negligible. Compared with the existing solution, executing join queries over the correlated tables takes nearly the same time as executing selection queries over the integrated table, and the two kinds of queries return the same results, demonstrating the effectiveness and efficiency of the method.
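As a rough illustration of the idea summarized in the abstract, the following minimal sketch derives a join query from metadata alone. It is not the authors' algorithm: the schema, the function names, and the rule that two relations are joinable when they share an attribute name are all assumptions made for this example, and the distance used here is a simple hop count between relations rather than the relations-attributes distance matrix defined in the paper.

```python
from collections import deque

# Hypothetical metadata: each relation maps to its set of attribute names.
SCHEMA = {
    "customers": {"customer_id", "name"},
    "orders": {"order_id", "customer_id", "order_date"},
    "order_items": {"order_id", "product_id", "quantity"},
    "products": {"product_id", "title"},
}


def build_join_graph(schema):
    """Link two relations when they share an attribute name (assumed join key)."""
    graph = {rel: {} for rel in schema}
    names = sorted(schema)
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            shared = sorted(schema[left] & schema[right])
            if shared:
                graph[left][right] = shared[0]
                graph[right][left] = shared[0]
    return graph


def distance_matrix(graph):
    """Run a BFS from every relation; return hop counts and predecessor links."""
    dist, prev = {}, {}
    for source in graph:
        dist[source] = {source: 0}
        prev[source] = {}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in sorted(graph[node]):
                if neighbor not in dist[source]:
                    dist[source][neighbor] = dist[source][node] + 1
                    prev[source][neighbor] = node
                    queue.append(neighbor)
    return dist, prev


def generate_join_query(schema, graph, prev, attributes):
    """Build a SELECT that joins the relations owning the requested attributes.

    Each attribute is assigned to the first relation (in sorted order) that
    contains it. A KeyError is raised if an owning relation is unreachable.
    """
    owners = [
        next(rel for rel in sorted(schema) if attr in schema[rel])
        for attr in attributes
    ]
    root = owners[0]
    joined = [root]
    clauses = []
    for target in owners[1:]:
        # Walk the predecessor links back from the target to the root.
        path = [target]
        while path[-1] != root:
            path.append(prev[root][path[-1]])
        path.reverse()
        for parent, child in zip(path, path[1:]):
            if child not in joined:
                key = graph[parent][child]
                clauses.append(
                    f"JOIN {child} ON {parent}.{key} = {child}.{key}"
                )
                joined.append(child)
    columns = ", ".join(f"{rel}.{attr}" for rel, attr in zip(owners, attributes))
    return " ".join([f"SELECT {columns} FROM {root}"] + clauses)


graph = build_join_graph(SCHEMA)
dist, prev = distance_matrix(graph)
query = generate_join_query(SCHEMA, graph, prev, ["name", "title"])
print(query)
```

In this sketch the graph and the distances depend only on the schema, so they would need to be rebuilt only when metadata changes, which mirrors the maintenance argument made in the abstract. Generating a query afterwards is just a walk over stored predecessor links.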
Keywords
data integration; data lakes; distance matrix; join queries
Hrčak ID: 307740
Publication date: 31.8.2023.