A Comparison of Algorithms for Text Classification of Albanian News Articles

Authors

  • Arbana Kadriu, SEE University, Macedonia
  • Lejla Abazi, SEE University, Macedonia

Keywords:

data mining, text classification, news articles, machine learning

Abstract

Text classification is an essential task in text mining and information retrieval. Many algorithms have been developed to classify computational data, and most of them have been extended to classify textual data. We used several of these algorithms to train classifiers on one part of our crawled Albanian news articles and to classify the other part with the trained classifiers. The categories used are: latest news, economy, sport, showbiz, technology, culture, and world. First, we remove all stop words from the collected articles; the output of this step is a separate text file for each category. These files are then split into sentences, and each sentence is assigned the appropriate category. The sentences are then collected into a single list of sentence/category tuples. This list is shuffled to randomize the sequence of categories and is then used to train (80% of the tuples) and to test (the remaining 20%) different classifiers. We trained and tested each classifier separately, measuring its accuracy, and also analysed the training and testing time.
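The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list, the two-category toy corpus, and all function names are hypothetical, and a small hand-written multinomial Naive Bayes stands in for the classifiers compared in the paper, which the abstract does not name.

```python
import math
import random
import re
import time
from collections import Counter, defaultdict

# Hypothetical stop-word list and toy corpus (assumptions for illustration only;
# the paper uses crawled Albanian news articles in seven categories).
STOP_WORDS = {"dhe", "në", "e", "të", "për", "me", "një", "u"}
ARTICLES = {
    "sport": "Skuadra fitoi ndeshjen në stadium. Lojtari shënoi një gol. "
             "Trajneri lavdëroi lojtarët. Kampionati fillon të shtunën. "
             "Tifozët festuan fitoren.",
    "economy": "Banka uli normën e interesit. Çmimet e naftës u rritën. "
               "Eksportet u rritën këtë vit. Qeveria miratoi buxhetin. "
               "Inflacioni ra në treg.",
}

def remove_stop_words(text):
    # Keep sentence-ending punctuation as separate tokens so splitting still works.
    tokens = re.findall(r"\w+|[.!?]", text)
    return " ".join(t for t in tokens if t.lower() not in STOP_WORDS)

def split_sentences(text):
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def build_dataset(articles, seed=0):
    # One (sentence, category) tuple per sentence, shuffled to mix the categories.
    data = [(sentence, category)
            for category, text in articles.items()
            for sentence in split_sentences(remove_stop_words(text))]
    random.Random(seed).shuffle(data)
    return data

def train_test_split(data, train_fraction=0.8):
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (a stand-in classifier)."""

    def fit(self, data):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        for sentence, category in data:
            self.class_counts[category] += 1
            self.word_counts[category].update(sentence.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())

        def score(category):
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            log_prob = math.log(self.class_counts[category] / total)
            for w in sentence.lower().split():
                log_prob += math.log((counts[w] + 1) / denom)
            return log_prob

        return max(self.class_counts, key=score)

def evaluate(classifier, train, test):
    # Time training and testing separately, as the abstract describes.
    start = time.perf_counter()
    classifier.fit(train)
    train_time = time.perf_counter() - start
    start = time.perf_counter()
    correct = sum(classifier.predict(s) == c for s, c in test)
    test_time = time.perf_counter() - start
    return correct / len(test), train_time, test_time

data = build_dataset(ARTICLES)
train, test = train_test_split(data)
accuracy, train_time, test_time = evaluate(NaiveBayes(), train, test)
print(f"accuracy={accuracy:.2f} train={train_time:.4f}s test={test_time:.4f}s")
```

To compare several classifiers in this sketch, each one would be passed to the hypothetical `evaluate` helper with the same `train` and `test` lists, so that accuracy and timing are measured on an identical split.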

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


Published

2017-10-31

Issue

Section

Mathematical and Quantitative Methods