Tehnički vjesnik, Vol. 32 No. 2, 2025.
Original scientific paper
https://doi.org/10.17559/TV-20240212001325
Integration of Python Crawler and URL De-duplication Algorithm for Metrological Data Analysis
Xiaogang Luo
; Geely University of China, Building 17, No. 175, Nanhu West Road, Huayang Sub-district, Tianfu New Area, Chengdu City, Sichuan Province, China, 641423
*
Liang Zhou
; Geely University of China, Building 17, No. 175, Nanhu West Road, Huayang Sub-district, Tianfu New Area, Chengdu City, Sichuan Province, China, 641423
* Corresponding author.
Abstract
In network econometric data analysis, analyzing massive volumes of URL data offers insight into network behavior, helps optimize network structure, and supports the prevention of network attacks. This study therefore uses Python web scraping for data collection in the design of econometric data analysis software, and designs a hash split Bloom filter algorithm based on multiple eigenvalues for de-duplication. The results showed that the de-duplication time of the proposed algorithm was consistently lower than that of the comparison algorithms. With 10,000 uniform resource locators (URLs), the de-duplication time was only 0.21 ms, which was 9.58 ms and 3.16 ms less than the other two algorithms, respectively. The de-duplication accuracy of the proposed algorithm remained above 99.9% throughout, reaching 99.95% at 35,000 URLs, which was 0.77% and 0.35% higher than the other two algorithms, respectively. The proposed algorithm also had the lowest memory consumption: at 20,000 URLs it used only 0.28 MB, which was 1.18 MB less than the Bloom filter algorithm. Overall, the proposed algorithm combines high de-duplication efficiency and accuracy with low memory consumption, providing reliable technical support for modern network econometric data analysis.
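For orientation, the following is a minimal sketch of URL de-duplication with a plain Bloom filter, the baseline structure that the abstract compares against. It is not the paper's hash split, multiple-eigenvalue algorithm. The class name `BloomFilter`, the helper `deduplicate`, the bit-array size, the number of hash functions, and the SHA-256 double-hashing scheme are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: a bit array probed at k hash-derived positions."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7) -> None:
        # Illustrative defaults (assumed, not taken from the paper).
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two 64-bit slices
        # of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(item)
        )


def deduplicate(urls, seen=None):
    """Return URLs in first-seen order, skipping any the filter reports as seen."""
    if seen is None:
        seen = BloomFilter()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

A Bloom filter never misses a URL it has already stored, but it can report an unseen URL as a duplicate (a false positive), which is why de-duplication accuracy can fall below 100%. Memory use is fixed by the bit-array size: the assumed default of 2^20 bits occupies 128 KiB regardless of how many URLs are inserted.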
Keywords
Bloom filter; HSDBF-ME; Python crawler; quantitative data analysis; URL deduplication
Hrčak ID:
328649
Publication date:
27.2.2025.