Training a Genre Classifier for Automatic Classification of Web Pages

Vidulin, Vedrana; Luštrek, Mitja; Gams, Matjaž

doi:10.2498/cit.1001137

Journal of computing and information technology, Vol. 15 No. 4, 2007.

Original scientific paper

https://doi.org/10.2498/cit.1001137

pp. 305-311

Abstract

This paper presents experiments on classifying web pages by genre. Firstly, a corpus of 1539 manually labeled web pages was prepared. Secondly, 502 genre features were selected based on the literature and the observation of the corpus. Thirdly, these features were extracted from the corpus to obtain a data set. Finally, two machine learning algorithms, one for induction of decision trees (J48) and one ensemble algorithm (bagging), were trained and tested on the data set. The ensemble algorithm achieved on average 17% better precision and 1.6% better accuracy, but slightly worse recall; F-measure did not vary significantly. The results indicate that classification by genre could be a useful addition to search engines.
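The four measures compared in the abstract (precision, recall, F-measure and accuracy) can be illustrated with a minimal sketch for a single genre treated as a binary label. The label vectors below are invented for illustration and are not taken from the paper's corpus or results; the function name `binary_metrics` and the variable names are likewise hypothetical. The toy predictions are chosen only to show how one classifier can have higher precision and accuracy than another while having lower recall.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F-measure and accuracy for one genre.

    y_true and y_pred are equal-length sequences of 0/1 values, where 1
    means "the page belongs to this genre".
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Invented toy data (assumption): 8 pages, 3 of which belong to the genre.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
tree_pred = [1, 1, 0, 1, 1, 0, 0, 0]      # stands in for a single decision tree
ensemble_pred = [1, 0, 0, 0, 0, 0, 0, 0]  # stands in for a more cautious ensemble

tree_scores = binary_metrics(y_true, tree_pred)
ensemble_scores = binary_metrics(y_true, ensemble_pred)

for name, scores in (("tree", tree_scores), ("ensemble", ensemble_scores)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

In this toy case the second classifier predicts the genre less often, so it makes no false positives (higher precision and accuracy) but misses more true members of the genre (lower recall).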

Hrčak ID:

44610

URI

https://hrcak.srce.hr/44610

Publication date:

30.12.2007.
