Arabic Text Classification Framework Based on Latent Dirichlet Allocation

Zrigui, Mounir; Ayadi, Rami; Mars, Mourad; Maraoui, Mohsen

doi:10.2498/cit.1001770

Journal of computing and information technology, Vol. 20 No. 2, 2012.

Original scientific paper

Mounir Zrigui ; LaTICE Laboratory (Research Unit of Monastir), University of Monastir, Tunisia
Rami Ayadi ; Faculty of Economics and Management, University of Sfax, Tunisia
Mourad Mars ; Stendhal University, Grenoble, France
Mohsen Maraoui orcid.org/0000-0001-6598-7465 ; University of Monastir, Tunisia

Full text: English PDF (1,342 KB)

pp. 125-140

Abstract

In this paper, we present a new algorithm based on Latent Dirichlet Allocation (LDA) and the Support Vector Machine (SVM) for the classification of Arabic texts.

Current research usually adopts the Vector Space Model to represent documents in text classification applications. In this model, a document is encoded as a vector of words or n-grams. Such features cannot capture semantic or textual content, which results in a huge feature space and semantic loss. The model proposed in this work instead uses "topics" sampled by an LDA model as text features, which effectively avoids these problems. We extract the significant themes (topics) of all texts, where each theme is described by a particular distribution of descriptors; each text is then represented as a vector over these topics. Experiments are conducted using an in-house corpus of Arabic texts. Precision, recall and F-measure are used to quantify categorization effectiveness. The results show that the proposed LDA-SVM algorithm achieves high effectiveness on the Arabic text classification task (macro-averaged F1 of 88.1% and micro-averaged F1 of 91.4%).
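The abstract reports both macro-averaged and micro-averaged F1, and the two can differ noticeably when categories are unevenly sized. The following is a minimal sketch of how the two averages are commonly computed for single-label classification; it is not the authors' evaluation code. The function names `f1` and `macro_micro_f1` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; taken as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_micro_f1(y_true, y_pred):
    """Return (macro F1, micro F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    # Macro: average the per-class F1 scores, so each class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so each document counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy labels (hypothetical, not taken from the paper's corpus).
y_true = ["sport", "sport", "sport", "sport", "economy", "economy"]
y_pred = ["sport", "sport", "sport", "economy", "economy", "sport"]
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```

On this toy data the larger class is classified better than the smaller one, so the micro average (about 0.667) is higher than the macro average (0.625). The gap has the same direction as in the reported 91.4% versus 88.1%, although the abstract does not state the cause of that difference.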

Keywords

LDA; Arabic; stemming algorithm; text classification; SVM

Hrčak ID:

85083

URI

https://hrcak.srce.hr/85083

Publication date:

30 June 2012
