
Original scientific paper

https://doi.org/10.17559/TV-20240212001325

Integration of Python Crawler and URL De-duplication Algorithm for Metrological Data Analysis

Xiaogang Luo ; Geely University of China, Building 17, No. 175, Nanhu West Road, Huayang Sub-district, Tianfu New Area, Chengdu City, Sichuan Province, China, 641423 *
Liang Zhou ; Geely University of China, Building 17, No. 175, Nanhu West Road, Huayang Sub-district, Tianfu New Area, Chengdu City, Sichuan Province, China, 641423

* Corresponding author.


Full text: English PDF, 1,026 KB

pp. 739-747




Abstract

In network econometric data analysis, analyzing massive volumes of URL data offers insight into network behavior and supports the optimization of network structure and the prevention of network attacks. This study therefore uses Python web scraping to collect data in the design of econometric data analysis software, and designs a hash split Bloom filter algorithm based on multiple eigenvalues to perform de-duplication. The results showed that the de-duplication time of the proposed algorithm was consistently lower than that of the comparison algorithms. With 10,000 uniform resource locators (URLs), the de-duplication time was only 0.21 ms, which was 9.58 ms and 3.16 ms less than the other two algorithms, respectively. The de-duplication accuracy of the proposed algorithm remained above 99.9%, reaching 99.95% with 35,000 URLs, which was 0.77% and 0.35% higher than the other two algorithms, respectively. The proposed algorithm also had the lowest memory consumption: with 20,000 URLs it used only 0.28 MB, which was 1.18 MB less than the Bloom filter algorithm. The proposed algorithm thus combines high de-duplication efficiency and accuracy with low memory consumption, providing reliable technical support for modern network econometric data analysis.
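
For readers unfamiliar with the baseline technique, the following is a minimal sketch of URL de-duplication with a plain Bloom filter in Python. It is not the authors' hash split, multi-eigenvalue algorithm, whose details are not given in this abstract. The class name `BloomFilter`, the function `deduplicate`, the default sizes, and the SHA-256 double-hashing scheme are all illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over a fixed-size bit array (illustrative only)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        # Assumed defaults; a real crawler would size these from the
        # expected URL count and the tolerated false-positive rate.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit slices of one SHA-256
        # digest (double hashing: h1 + i * h2).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, so never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def deduplicate(urls, bloom=None):
    """Return URLs in first-seen order, skipping those the filter reports as seen."""
    bloom = bloom if bloom is not None else BloomFilter()
    unique = []
    for url in urls:
        if url in bloom:
            continue  # probably a duplicate; false positives are possible
        bloom.add(url)
        unique.append(url)
    return unique
```

A Bloom filter never reports a stored URL as unseen, but it can occasionally report an unseen URL as already stored. That is why de-duplication accuracy is reported alongside time and memory, and why a larger bit array trades memory for accuracy.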

Keywords

Bloom filter; HSDBF-ME; Python crawler; quantitative data analysis; URL de-duplication

Hrčak ID:

328649

URI

https://hrcak.srce.hr/328649

Publication date:

27 February 2025
