Original scientific paper

https://doi.org/10.17559/TV-20240212001325

Integration of Python Crawler and URL De-duplication Algorithm for Metrological Data Analysis

Xiaogang Luo ; Geely University of China, Building 17, No. 175, Nanhu West Road, Huayang Sub-district, Tianfu New Area, Chengdu City, Sichuan Province, China, 641423 *
Liang Zhou ; Geely University of China, Building 17, No. 175, Nanhu West Road, Huayang Sub-district, Tianfu New Area, Chengdu City, Sichuan Province, China, 641423

* Corresponding author.


Full text: english pdf 1.026 Kb

pages 739-747


Abstract

In network econometric data analysis, analyzing massive volumes of URL data offers insight into network behavior, helps optimize network structure, and supports the prevention of network attacks. This study therefore uses Python web scraping for data collection in the design of econometric data analysis software, and designs a hash split Bloom filter algorithm based on multiple eigenvalues to perform URL de-duplication. The results showed that the de-duplication time of the proposed algorithm was consistently lower than that of the comparison algorithms. With 10,000 uniform resource locators (URLs), the de-duplication time was only 0.21 ms, which was 9.58 ms and 3.16 ms less than the other two algorithms, respectively. The de-duplication accuracy of the proposed algorithm remained above 99.9% throughout, reaching 99.95% at 35,000 URLs, which was 0.77% and 0.35% higher than the other two algorithms, respectively. The proposed algorithm also had the lowest memory consumption: at 20,000 URLs it used only 0.28 MB, which was 1.18 MB less than the Bloom filter algorithm. Overall, the proposed algorithm combines high de-duplication efficiency and accuracy with low memory consumption, providing reliable technical support for modern network econometric data analysis.
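
For readers unfamiliar with Bloom-filter-based URL de-duplication, the following is a minimal sketch of the baseline idea only. It is a plain Bloom filter, not the hash split, multi-eigenvalue variant proposed in the paper, whose details are not given in the abstract. The class name `BloomFilter`, the helper `deduplicate`, the default error rate, and the example URLs are all illustrative assumptions.

```python
import hashlib
import math


class BloomFilter:
    """Plain Bloom filter (illustrative; NOT the paper's proposed algorithm)."""

    def __init__(self, capacity: int, error_rate: float = 0.001):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the stride is never 0
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably seen" (false positives possible); False is always exact.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def deduplicate(urls, capacity: int = 10_000):
    """Return URLs in first-seen order, skipping any the filter reports as seen."""
    seen = BloomFilter(capacity)
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


if __name__ == "__main__":
    crawled = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",
    ]
    print(deduplicate(crawled))
```

A filter of this kind never reports a previously added URL as new, but it can occasionally report a new URL as already seen, which is why de-duplication accuracy is one of the metrics compared in the abstract.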

Keywords

Bloom filter; HSDBF-ME; Python crawler; quantitative data analysis; URL de-duplication

Hrčak ID:

328649

URI

https://hrcak.srce.hr/328649

Publication date:

27.2.2025.
