Faculté des sciences

Feature weighting approaches in sentiment analysis of short text

Kummer, Olena ; Savoy, Jacques (Dir.)

Thèse de doctorat : Université de Neuchâtel, 2012.

In this thesis, we propose a supervised classification scheme based on computation of the statistical scores for the textual features. More specifically, we consider binary classification (opinionated or factual, positive or negative) of the short text in the domains of movie reviews and newspaper articles. We analyze the performance of the proposed models on the corpora with the unequal sizes of... Plus

Ajouter à la liste personnelle
    Summary
    In this thesis, we propose a supervised classification scheme based on computation of the statistical scores for the textual features. More specifically, we consider binary classification (opinionated or factual, positive or negative) of the short text in the domains of movie reviews and newspaper articles. We analyze the performance of the proposed models on the corpora with the unequal sizes of the training categories.
    Based on our participation in different evaluation campaigns, we analyze advantages and disadvantages of the classification schemes that use Z scores for the purpose of classifying a sentence into more than two categories, e.g. positive, negative, neutral and factual. As a new feature weighting measure, we give an adaptation of the calculation of the Kullback-Leibler divergence score, called KL score. Considering the performance of different weighting measures on training corpora with unequal sizes, we chose two best performing scores, Z score and KL score. Thus, we propose a new classification model based on the calculation of normalized Z score and KL score for the features per each classification category. One of the advantages of this model is its flexibility to incorporate external scores, for example, from sentiment dictionaries.
    The experiments on datasets in Chinese and Japanese show a comparable level of performance of the proposed scheme with the results obtained on the English datasets without any use of natural language specific techniques. The advantage of the approaches analyzed in this thesis is that they can work as quick and easily interpretable baselines for short text classification.