Original scientific paper

https://doi.org/10.17559/TV-20230402000493

A Method for Automatically Generating Join Queries Based on Relations-Attributes Distance Matrix over Data Lakes

Caicai Zhang ; Zhejiang Institute of Mechanical and Electrical Engineering, No. 528, Binwen Road, Hangzhou, Zhejiang 310053, China
Chenglang Lu ; Zhejiang Institute of Mechanical and Electrical Engineering, No. 528, Binwen Road, Hangzhou, Zhejiang 310053, China
Zhuolin Mei ; School of Computer and Big Data Science, Jiujiang University, No. 551, Qianjin East Road, Jiujiang, Jiangxi 332005, China
Bin Wu ; School of Computer and Big Data Science, Jiujiang University, No. 551, Qianjin East Road, Jiujiang, Jiangxi 332005, China
Jing Yu ; School of Computer and Big Data Science, Jiujiang University, No. 551, Qianjin East Road, Jiujiang, Jiangxi 332005, China


page 1539-1546


Abstract

Techniques for identifying joinable or unionable tables in data lakes can yield valuable information for data scientists. However, data scientists spend more than half of their working time familiarizing themselves with the metadata and correlations of datasets, so simplifying access to the information in data lakes is crucial for improving their utilization. The existing solution integrates correlated relations into a single large table via full disjunction, but the integrated table must be updated whenever either the data or the metadata changes, which complicates data maintenance. This paper proposes a method for automatically generating join queries based on a distance matrix of the relations and attributes in a data lake. The distance matrix needs updating only when the metadata changes, which simplifies data maintenance. Experimental results show that once the distance matrix has been generated, the time required to generate join queries is negligible. Compared with the existing solution, the time cost of executing join queries over the correlated tables is nearly identical to that of selection queries over the integrated table, and the two kinds of queries return the same results, demonstrating the effectiveness and efficiency of our method.
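The abstract does not spell out how the distance matrix is built, so the following is only an illustrative sketch under stated assumptions, not the authors' algorithm. It assumes that the distance from a relation to an attribute is the fewest joins needed to reach a relation holding that attribute, computed by breadth-first search over known joinable attribute pairs. The toy schema, the join edges, and the function names `build_distance_matrix` and `generate_join_query` are all hypothetical.

```python
from collections import deque

# Hypothetical toy metadata: relation -> attributes.
SCHEMA = {
    "customers": ["customer_id", "name"],
    "orders": ["order_id", "customer_id", "product_id"],
    "products": ["product_id", "title"],
}

# Hypothetical joinable pairs: (relation, attribute, relation, attribute).
JOIN_EDGES = [
    ("customers", "customer_id", "orders", "customer_id"),
    ("orders", "product_id", "products", "product_id"),
]


def build_distance_matrix(schema, edges):
    """Assumed construction: BFS hop counts over the join graph.

    Returns three dicts keyed by source relation:
      dist[src][attr]   = fewest joins from src to a relation holding attr
      owner[src][attr]  = that nearest relation
      parent[src][rel]  = (previous relation, its attribute, rel's attribute)
    Only metadata is read, so a rebuild is needed only when metadata changes.
    """
    adj = {}
    for r1, a1, r2, a2 in edges:
        adj.setdefault(r1, []).append((r2, a1, a2))
        adj.setdefault(r2, []).append((r1, a2, a1))

    dist, owner, parent = {}, {}, {}
    for src in schema:
        hops = {src: 0}
        parent[src] = {}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt, a_cur, a_nxt in adj.get(cur, []):
                if nxt not in hops:
                    hops[nxt] = hops[cur] + 1
                    parent[src][nxt] = (cur, a_cur, a_nxt)
                    queue.append(nxt)
        dist[src], owner[src] = {}, {}
        # hops is in BFS order, so the first relation seen for an attribute is nearest.
        for rel, h in hops.items():
            for attr in schema[rel]:
                if attr not in dist[src]:
                    dist[src][attr] = h
                    owner[src][attr] = rel
    return dist, owner, parent


def generate_join_query(src, attrs, owner, parent):
    """Build a SQL join query from src that exposes the requested attributes."""
    joins = {}  # relation -> ON condition, kept in a valid join order
    select = []
    for attr in attrs:
        rel = owner[src][attr]
        select.append(f"{rel}.{attr}")
        chain = []
        while rel != src:  # walk back to the source relation
            prev, a_prev, a_rel = parent[src][rel]
            chain.append((rel, f"{prev}.{a_prev} = {rel}.{a_rel}"))
            rel = prev
        for r, cond in reversed(chain):  # add joins from src outward
            joins.setdefault(r, cond)
    sql = f"SELECT {', '.join(select)} FROM {src}"
    for r, cond in joins.items():
        sql += f" JOIN {r} ON {cond}"
    return sql


DIST, OWNER, PARENT = build_distance_matrix(SCHEMA, JOIN_EDGES)
QUERY = generate_join_query("customers", ["name", "title"], OWNER, PARENT)
print(QUERY)
```

In this sketch the expensive step is the one-off matrix construction; producing a query afterwards only follows stored parent links, which mirrors the abstract's claim that query generation is cheap once the matrix exists.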

Keywords

data integration; data lakes; distance matrix; join queries

Hrčak ID:

307740

URI

https://hrcak.srce.hr/307740

Publication date:

31.8.2023.
