Albanian Text Classification: Bag of Words Model and Word Analogies

Kadriu, Arbana; Abazi, Lejla; Abazi, Hyrije

doi:10.2478/bsrj-2019-0006

Business Systems Research : International journal of the Society for Advancing Innovation and Research in Economy, Vol. 10 No. 1, 2019.

Original scientific article


Arbana Kadriu orcid.org/0000-0003-4922-4753 ; SEE University, Tetovo, Macedonia
Lejla Abazi orcid.org/0000-0002-4354-146X ; SEE University, Tetovo, Macedonia
Hyrije Abazi orcid.org/0000-0002-6205-1431 ; SEE University, Tetovo, Macedonia

Full text: English, PDF, 943 KB

pp. 74-87

Downloads: 766

Cite

APA 6th Edition

Kadriu, A., Abazi, L., & Abazi, H. (2019). Albanian Text Classification: Bag of Words Model and Word Analogies. Business Systems Research, 10 (1), 74-87. https://doi.org/10.2478/bsrj-2019-0006

MLA 8th Edition

Kadriu, Arbana, et al. "Albanian Text Classification: Bag of Words Model and Word Analogies." Business Systems Research, vol. 10, no. 1, 2019, pp. 74-87. https://doi.org/10.2478/bsrj-2019-0006. Accessed 24 Nov. 2024.

Chicago 17th Edition

Kadriu, Arbana, Lejla Abazi, and Hyrije Abazi. "Albanian Text Classification: Bag of Words Model and Word Analogies." Business Systems Research 10, no. 1 (2019): 74-87. https://doi.org/10.2478/bsrj-2019-0006

Harvard

Kadriu, A., Abazi, L., and Abazi, H. (2019). 'Albanian Text Classification: Bag of Words Model and Word Analogies', Business Systems Research, 10(1), pp. 74-87. https://doi.org/10.2478/bsrj-2019-0006

Vancouver

Kadriu A, Abazi L, Abazi H. Albanian Text Classification: Bag of Words Model and Word Analogies. Business Systems Research [Internet]. 2019 [cited 2024 Nov 24];10(1):74-87. https://doi.org/10.2478/bsrj-2019-0006

IEEE

A. Kadriu, L. Abazi, and H. Abazi, "Albanian Text Classification: Bag of Words Model and Word Analogies", Business Systems Research, vol. 10, no. 1, pp. 74-87, 2019. [Online]. https://doi.org/10.2478/bsrj-2019-0006

Abstract

Background: Text classification is an important task in information retrieval. Its objective is to assign new text documents to one of a set of predefined classes using supervised learning algorithms. Objectives: We focus on text classification of Albanian news articles using two approaches. Methods/Approach: In the first approach, the words in a collection are treated as independent components, each assigned a corresponding vector in the vector space. Here we used nine classifiers from the scikit-learn package, training them on 80% of the news articles and testing accuracy on the remaining articles. In the second approach, words are treated according to their semantic and syntactic similarities, on the assumption that a word is formed from character n-grams. For this we used fastText, a hierarchical classifier that considers local word order as well as sub-word information. We measured the accuracy of each classifier separately and also analyzed training and testing time. Results: Our results show that the bag of words model performs better than fastText when the classification is tested on a dataset that is not large. FastText performs better when classifying multi-label text. Conclusions: News articles can serve as a benchmark for testing classification algorithms on Albanian texts. The best results are achieved with a bag of words model, with an accuracy of 94%.
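The two word representations contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy Albanian sentences, the function names `bag_of_words` and `char_ngrams`, and the n-gram range are assumptions chosen for illustration. The boundary markers `<` and `>` follow the sub-word convention described for fastText. No classifier is trained here; the sketch only shows how a document becomes a count vector and how a word breaks into character n-grams.

```python
from collections import Counter


def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary word occurs in doc; word order is ignored."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]


def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


# Toy documents (hypothetical, not taken from the paper's dataset).
docs = [
    "ekipi fitoi ndeshjen",
    "qeveria miratoi ligjin",
    "ekipi humbi ndeshjen",
]

# The vocabulary fixes the meaning of each vector position.
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

# Sub-word view of a single word, restricted to trigrams for readability.
trigrams = char_ngrams("ekipi", n_min=3, n_max=3)

print(vocabulary)
print(vectors)
print(trigrams)
```

In the bag of words view, two documents are similar only if they share whole words. In the sub-word view, inflected forms of the same stem share many n-grams, which is one reason such a representation may suit a morphologically rich language.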

Keywords

data mining; text classification; news articles; machine learning

Hrčak ID:

219005

URI

https://hrcak.srce.hr/219005

Publication date:

18 April 2019
