In: Computational Management Science, 2012, vol. 9, no. 3, p. 401-415
|
In: Information Retrieval, 2007, vol. 10, no. 6, p. 509-529
|
In: Le Document ? Actes du 20e Colloque International sur le Document numérique (CiDE.20), 2017, p. 1-14
Document categorization (assigning a text to one or more predefined categories) is a multifaceted problem. Automatic indexing is one of these facets, and it relies on the semantics of the documents. Other applications, however, analyze function words, those forms that carry little or no meaning. Yet these...
|
In: Digital Scholarship in the Humanities, 2015, vol. 30, no. 2, p. 246-261
|
In: Multilingual Information Access in South Asian Languages, 2013, p. 334-352
Our first objective in participating in the FIRE evaluation campaigns is to analyze the retrieval effectiveness of various indexing and search strategies when dealing with corpora written in the Hindi, Bengali and Marathi languages. As a second goal, we have developed new and more aggressive stemming strategies for both the Marathi and Hindi languages during this second campaign. We have compared their...
|
In: Multilingual Information Access in South Asian Languages, 2013, p. 23-37
Our goal in participating in the FIRE 2011 evaluation campaign is to analyse and evaluate the retrieval effectiveness of our retrieval system for the Marathi language. We have developed a light and an aggressive stemmer for this language as well as a stopword list. In our experiment, seven different IR models (language model, DFR-PL2, DFR-PB2, DFR-GL2, DFR-I(ne)C2,...
|
In: Information Access Evaluation. Multilinguality, Multimodality, and Visualization, 2013, vol. 8138, p. 192-211
The Cultural Heritage in CLEF 2013 lab comprised three tasks: multilingual ad-hoc retrieval and semantic enrichment in 13 languages (Dutch, English, German, Greek, Finnish, French, Hungarian, Italian, Norwegian, Polish, Slovenian, Spanish, and Swedish), Polish ad-hoc retrieval and the interactive task, which studied user behavior via log analysis and questionnaires. For the multilingual and...
|
In: Actes 12e Journées internationales d’analyse statistique des données textuelles JADT 2014, 2014, p. 593-604
This paper describes a lexical study of the State of the Union addresses from 1934 to 2014. This corpus contains 81 governmental speeches delivered by thirteen presidents. This study shows that considering the most frequent lemmas does not provide useful and pertinent information. However, when analyzing the part-of-speech (POS) distribution for each president, we can see that some...
|
In: Actes 10e Journées Analyse statistique des Données Textuelles JADT 2010, 2010, p. 653-664
This paper describes the problem of classifying opinions in blogs. After retrieving relevant sentences, the search system must categorize them as opinionated or factual. To achieve this objective, different representations and automatic categorization models could be used. As a baseline system, we have used the Naïve Bayes approach to classify the retrieved sentences as opinionated or not....
|
In: Journal of Quantitative Linguistics, 2010, vol. 17, no. 2, p. 123-141
This article describes a US political corpus comprising 245 speeches given by senators John McCain and Barack Obama during the years 2007–2008. We present the main characteristics of this collection and compare the common English words most frequently used by these political leaders with ordinary usage (Brown corpus). We then discuss and compare certain metrics capable of extracting terms best...
|