
Original scientific paper
https://doi.org/10.2498/cit.1001137

Training a Genre Classifier for Automatic Classification of Web Pages

Vedrana Vidulin
Mitja Luštrek
Matjaž Gams

Full text: English, PDF (373 KB), pages 305-311
APA 6th Edition
Vidulin, V., Luštrek, M. & Gams, M. (2007). Training a Genre Classifier for Automatic Classification of Web Pages. Journal of computing and information technology, 15 (4), 305-311. https://doi.org/10.2498/cit.1001137

Abstract
This paper presents experiments on classifying web pages by genre. Firstly, a corpus of 1539 manually labeled web pages was prepared. Secondly, 502 genre features were selected based on the literature and the observation of the corpus. Thirdly, these features were extracted from the corpus to obtain a data set. Finally, two machine learning algorithms, one for induction of decision trees (J48) and one ensemble algorithm (bagging), were trained and tested on the data set. The ensemble algorithm achieved on average 17% better precision and 1.6% better accuracy, but slightly worse recall; F-measure did not vary significantly. The results indicate that classification by genre could be a useful addition to search engines.
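The four measures compared in the abstract (precision, recall, F-measure and accuracy) can be illustrated with a minimal Python sketch. It assumes each genre is evaluated as a separate yes/no decision, which the abstract does not state, and the `Counts` container and the example numbers are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion counts for one genre (hypothetical container, not from the paper)."""
    tp: int  # pages of the genre that were labeled with it
    fp: int  # pages of other genres that were labeled with it
    fn: int  # pages of the genre that were not labeled with it
    tn: int  # pages of other genres that were not labeled with it


def precision(c: Counts) -> float:
    """Share of pages labeled with the genre that really belong to it."""
    labeled = c.tp + c.fp
    return c.tp / labeled if labeled else 0.0


def recall(c: Counts) -> float:
    """Share of pages belonging to the genre that were labeled with it."""
    actual = c.tp + c.fn
    return c.tp / actual if actual else 0.0


def f_measure(c: Counts) -> float:
    """Harmonic mean of precision and recall (the balanced F1 variant)."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(c: Counts) -> float:
    """Share of all pages that were classified correctly."""
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


# Illustrative counts only; these are NOT results from the paper.
example = Counts(tp=40, fp=10, fn=20, tn=130)

if __name__ == "__main__":
    for name, fn in [("precision", precision), ("recall", recall),
                     ("F-measure", f_measure), ("accuracy", accuracy)]:
        print(f"{name}: {fn(example):.3f}")
```

With counts like these, a classifier can gain precision while losing some recall, and the F-measure, which balances the two, may change very little. That is the general pattern the abstract reports for bagging compared with J48.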

Hrčak ID: 44610

URI
https://hrcak.srce.hr/44610
