Université de Neuchâtel

Information retrieval with Hindi, Bengali, and Marathi languages : evaluation and analysis

Savoy, Jacques ; Akasereh, Mitra ; Dolamic, Ljiljana

In: Multilingual Information Access in South Asian Languages, 2013, p. 334-352

Our first objective in participating in the FIRE evaluation campaigns is to analyze the retrieval effectiveness of various indexing and search strategies when dealing with corpora written in the Hindi, Bengali, and Marathi languages. As a second goal, we have developed new and more aggressive stemming strategies for both the Marathi and Hindi languages during this second campaign. We have compared their...

Authorship Attribution: A Comparative Study of Three Text Corpora and Three Languages

Savoy, Jacques

In: Journal of Quantitative Linguistics, 2012, vol. 19, no. 2, p. 132-161

The first objective of this paper is to carry out three experiments intended to evaluate authorship attribution methods based on three test collections available in three different languages (English, French, and German). In the first we represent and categorize 52 text excerpts written by nine authors and taken from 19th century English novels. In the second we work with 44 segments from French...

Feature Weighting Strategies in Sentiment Analysis

Kummer, Olena ; Savoy, Jacques

In: SDAD 2012 : The First International Workshop on Sentiment Discovery from Affective Data, 2012, p. 48-55

In this paper we propose an adaptation of the Kullback-Leibler divergence score for the task of sentiment and opinion classification at the sentence level. We propose to use the obtained score with the SVM model using different thresholds for pruning the feature set. We argue that the pruning of the feature set for the task of sentiment analysis (SA) may be detrimental to classifier performance...

Simple and efficient classification scheme based on specific vocabulary

Savoy, Jacques ; Zubaryeva, Olena

In: Computational management science, 2012, vol. 9, no. 3, p. 401-415

Assuming a binomial distribution for word occurrence, we propose computing a standardized Z score to define the specific vocabulary of a subset compared to that of the entire corpus. This approach is applied to weight terms (character n-gram, word, stem, lemma or sequence of them) which characterize a document. We then show how these Z score values can be used to derive a simple and...
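The Z score idea in this abstract can be illustrated with a minimal sketch. It assumes the usual standardization of a binomial count: a term with relative frequency p in the entire corpus is expected to occur n'·p times in a subset of n' tokens, with standard deviation sqrt(n'·p·(1 − p)). The function name `z_scores` and the token-list inputs are hypothetical, and the exact formulation used in the article may differ.

```python
import math
from collections import Counter


def z_scores(subset_tokens, corpus_tokens):
    """Return {term: Z} for each term of the subset (illustrative sketch).

    Assumption: subset_tokens is a part of corpus_tokens, and term
    occurrences follow a binomial distribution whose probability is
    estimated from the entire corpus.
    """
    corpus_counts = Counter(corpus_tokens)
    subset_counts = Counter(subset_tokens)
    n_corpus = len(corpus_tokens)
    n_subset = len(subset_tokens)

    scores = {}
    for term, tf_subset in subset_counts.items():
        p = corpus_counts[term] / n_corpus          # estimated occurrence probability
        expected = n_subset * p                     # binomial mean in the subset
        std = math.sqrt(n_subset * p * (1.0 - p))   # binomial standard deviation
        if std > 0:                                 # skip degenerate cases (p = 0 or p = 1)
            scores[term] = (tf_subset - expected) / std
    return scores


corpus = ["a", "a", "b", "b", "b", "b", "c", "c"]
subset = ["a", "a", "b", "c"]
scores = z_scores(subset, corpus)
```

Under these assumptions, a large positive value marks a term that is over-used in the subset, a large negative value an under-used one, and a value near zero a term used about as often as in the corpus as a whole.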

Authorship Attribution Based on Specific Vocabulary

Savoy, Jacques

In: ACM Transactions on Information Systems (TOIS), 2012, vol. 30, no. 3, p. Art. 12

In this article we propose a technique for computing a standardized Z score capable of defining the specific vocabulary found in a text (or part thereof) compared to that of an entire corpus. Assuming that the term occurrence follows a binomial distribution, this method is then applied to weight terms (words and punctuation symbols in the current study), representing the lexical specificity of...

Comparative Study of Indexing and Search Strategies for the Hindi, Marathi, and Bengali Languages

Dolamic, Ljiljana ; Savoy, Jacques

In: ACM Transactions on Asian Language Information Processing (T.A.L.I.P.), 2010, vol. 9, no. 3, p. art. 11

The main goal of this article is to describe and evaluate various indexing and search strategies for the Hindi, Bengali, and Marathi languages. These three languages are ranked among the world’s 20 most spoken languages and they share similar syntax, morphology, and writing systems. In this article we examine these languages from an Information Retrieval (IR) perspective through describing the...

Etude comparative de stratégies de sélection de prédicteurs pour l’attribution d’auteur

Savoy, Jacques

In: Actes 9ème Conférence en Recherche d’Information et Applications CORIA’12, 2012, p. 215-228

Authorship attribution can be viewed as a text categorization task divided into two steps. First we must select the most discriminative words, then apply a classification model. To choose the best terms, we evaluated seven selection functions, including pointwise mutual information, information gain, the...

Quel est l'auteur de ce roman?

Savoy, Jacques

In: Actes 8ème Conférence en Recherche d’Information et Applications CORIA’11, 2011, p. 135-150

In this article, we present the problem of attributing the authorship of a written work. To represent texts, recent studies rely on a restricted set of function words or very frequent words (50 or 100). On this basis, principal component analysis (PCA) or correspondence analysis (CA) makes it possible to visualize the affinities and...

Who Wrote this Novel? Authorship Attribution across Three Languages

Savoy, Jacques

In: Revue Tranel (Travaux neuchâtelois de linguistique), 2011, vol. 55, p. 59-75

Based on different writing style definitions, various authorship attribution schemes have been proposed to identify the real author of a given text or text excerpt. In this article we analyze the relative performance of word types or lemmas assigned to represent styles and texts. As a second objective we compare two authorship attribution approaches, one based on principal component analysis...

Variations autour de tf idf et du moteur Lucene

Savoy, Jacques ; Dolamic, Ljiljana

In: Actes 9e journées Analyse statistique des Données Textuelles JADT 2008, 2008, p. 1047-1058

Using a corpus written in French and comprising 299 queries, this article analyzes and compares the retrieval effectiveness of various indexing and search strategies based on the classical "tf idf" model. This formulation remains ambiguous and hides several variants with differing performance, with performance measured either by the...