Faculté des sciences

Authorship Attribution: A Comparative Study of Three Text Corpora and Three Languages

Savoy, Jacques

In: Journal of Quantitative Linguistics, 2012, vol. 19, no. 2, p. 132-161

The first objective of this paper is carry out three experiments intended to evaluate authorship attribution methods based on three test-collections available in three different languages (English, French, and German). In the first we represent and categorize 52 text excerpts written by nine authors and taken from 19th century English novels. In the second we work with 44 segments from French... Plus

Ajouter à la liste personnelle
    Summary
    The first objective of this paper is carry out three experiments intended to evaluate authorship attribution methods based on three test-collections available in three different languages (English, French, and German). In the first we represent and categorize 52 text excerpts written by nine authors and taken from 19th century English novels. In the second we work with 44 segments from French novels written by eleven authors, mostly from the 19th century. In the third we extract 59 German text excerpts from novels published mainly during the 19th and the beginning of the 20th century, written by 15 authors. The second objective is to analyse performance differences obtained when using word types or lemmas as text representations, and the third objective is to evaluate three authorship attribution schemes, the first of which uses principal component analysis (PCA), the second applies the Delta approach, and the third corresponds to a new authorship attribution method based on specific vocabulary. This concept is computed for a given text (or author profile) and then compared with the entire corpus. Based on this information, we show how a distance measure can be derived and by means of the nearest neighbor approach we suggest a simple and efficient authorship attribution scheme. Based on three test collections and using either word types or lemmas as features, we demonstrate that the suggested classification scheme performs better than the PCA method, and slightly better than the Delta approach.