Principal component analysis for authorship attribution

Jamak, Amir; Savatić, Alen; Can, Mehmet

doi:10.2478/v10305-012-0012-2

Business Systems Research : International journal of the Society for Advancing Innovation and Research in Economy, Vol. 3 No. 2, 2012.

Original scientific paper

https://doi.org/10.2478/v10305-012-0012-2

Full text: English, PDF, 664 KB

page 49-56

APA 6th Edition

Jamak, A., Savatić, A. & Can, M. (2012). Principal component analysis for authorship attribution. Business Systems Research, 3 (2), 49-56. https://doi.org/10.2478/v10305-012-0012-2

Abstract

Background: To identify the authors of texts with statistical tools, one must first decide which features to use as author characteristics and then extract those features from the texts. The extracted features are mostly counts of so-called function words. Objectives: The extracted data are further compressed into a representation with fewer features, in such a way that the compressed data remain effective discriminators. The resulting feature space has lower dimensionality than the text itself. Methods/Approach: In this paper, data collected by counting words and characters in around a thousand paragraphs of each sample book underwent a principal component analysis performed using neural networks. Once the analysis was complete, the first principal component was used to distinguish the books written by a given author. Results: The results show that every author leaves a unique signature in written text that can be discovered by analyzing counts of short words per paragraph. Conclusions: This article demonstrates that authorship can be traced by applying principal component analysis to counts of short words per paragraph. The methodology could also be used for other purposes, such as fraud detection in auditing.
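The pipeline described in the abstract (count short words per paragraph, then reduce the counts to a first principal component) can be sketched roughly as follows. This is only an illustration under stated assumptions: the word list `SHORT_WORDS` and the helper names are hypothetical, and the first component is found here by plain power iteration on the covariance matrix rather than by the neural-network method used in the paper, whose details the abstract does not give.

```python
import re
from collections import Counter

# Hypothetical short-word list; the abstract does not say which words were counted.
SHORT_WORDS = ["the", "of", "and", "to", "in", "a"]


def short_word_counts(paragraph, vocab=SHORT_WORDS):
    """Return the count of each vocabulary word in one paragraph."""
    tokens = re.findall(r"[a-z']+", paragraph.lower())
    counts = Counter(tokens)
    return [float(counts[w]) for w in vocab]


def first_principal_component(rows, iterations=200):
    """Return (mean, unit direction) of the first principal component.

    Assumption: power iteration on the sample covariance matrix is used
    as a stand-in for the paper's neural-network PCA.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / max(n - 1, 1)
            for j in range(d)] for i in range(d)]

    # Asymmetric start vector, normalized to unit length.
    v = [1.0 / (j + 1) for j in range(d)]
    norm = sum(x * x for x in v) ** 0.5
    v = [x / norm for x in v]

    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # all paragraphs identical: no variance to explain
            break
        v = [x / norm for x in w]
    return mean, v


def project(row, mean, direction):
    """Score one paragraph along the first principal component."""
    return sum((x - m) * u for x, m, u in zip(row, mean, direction))
```

In use, each paragraph of a book would be turned into a count vector with `short_word_counts`, the direction would be fitted on the pooled vectors, and the distribution of `project` scores per book could then be compared across authors. The sign of the direction is arbitrary, so only relative positions of the scores are meaningful.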

Keywords

principal components; authorship attribution; stylometry; text categorization; function words; classification task; stylistic features; syntactic characteristics

Hrčak ID:

86295

URI

https://hrcak.srce.hr/86295

Publication date:

1.9.2012.
