Home
This wiki concisely explains my project on Fake Review Detection, carried out during my tenure as a research intern at the Indian Institute of Technology, Bombay.
The project studied various techniques for fake review detection. Given the increasing sophistication of the techniques spammers use to pass off their reviews as genuine, it is imperative to develop smarter detection methods. For this reason, we do not treat the problem as binary classification; instead, we try to estimate the degree of spam present in a review. We believe this is a step toward helping consumers make more informed decisions about products, since fake reviews can be weeded out.
Over the course of the summer, we studied two techniques for review spam detection:
- Support Vector Regression
- KL Divergence score
We used Myle Ott's gold-standard corpus of deceptive opinion spam, which is publicly available for download. This corpus consists of truthful and deceptive hotel reviews of 20 Chicago hotels. The data is described in two papers according to the sentiment of the reviews: positive-sentiment reviews in [1] and negative-sentiment reviews in [2]. The corpus contains:
- 400 truthful positive reviews from TripAdvisor (described in [1])
- 400 deceptive positive reviews from Mechanical Turk (described in [1])
- 400 truthful negative reviews from Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor and Yelp (described in [2])
- 400 deceptive negative reviews from Mechanical Turk (described in [2])
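Below is a minimal Python sketch of loading the corpus. The directory names (`positive_polarity`, `negative_polarity`, and subfolders such as `deceptive_from_MTurk`) follow the layout of the distributed archive as I recall it, and should be checked against the actual download.

```python
import os

def load_reviews(root):
    """Walk the (assumed) corpus directory layout and return
    (text, polarity, label) tuples, where label is 1 for deceptive."""
    reviews = []
    for polarity in ("positive_polarity", "negative_polarity"):
        pol_dir = os.path.join(root, polarity)
        for source in os.listdir(pol_dir):
            # Folder names such as deceptive_from_MTurk encode the label.
            label = 1 if source.startswith("deceptive") else 0
            for dirpath, _, filenames in os.walk(os.path.join(pol_dir, source)):
                for name in filenames:
                    if name.endswith(".txt"):
                        with open(os.path.join(dirpath, name)) as f:
                            reviews.append((f.read(), polarity, label))
    return reviews

# Example usage (path is hypothetical):
# data = load_reviews("op_spam_v1.4")
# print(len(data))  # expected: 1600 reviews
```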
Support Vector Regression
In SVM regression, the input is first mapped into an m-dimensional feature space using some fixed (nonlinear) mapping, and a linear model is then constructed in this feature space.
One advantage of Support Vector Machines, and of Support Vector Regression as part of that family, is that the difficulties of working with linear functions in a high-dimensional feature space are avoided: the optimization problem is transformed into a dual convex quadratic programme. The loss functions used typically lead to a sparse representation of the decision rule, giving significant algorithmic and representational advantages.
For more details regarding Support Vector Regression, refer to [3].
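As an illustration, the sketch below fits an SVR model with scikit-learn, mirroring the setup described in the next paragraph (feature normalization, optional PCA, and a choice of linear or RBF kernel). The feature matrix `X`, the spam-degree targets `y`, and the number of PCA components are placeholders, not the project's actual values.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVR

# Placeholder data: 1600 reviews x 16 features, spam degree in [0, 1].
rng = np.random.default_rng(0)
X = rng.random((1600, 16))
y = rng.random(1600)

model = Pipeline([
    ("scale", StandardScaler()),    # features were normalized
    ("pca", PCA(n_components=10)),  # optional, to gauge variation in performance
    ("svr", SVR(kernel="rbf")),     # also tried: kernel="linear"
])
model.fit(X, y)
scores = model.predict(X)           # predicted degree of spam per review
```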
Each review was converted to a feature vector before being fed to the model. A total of 16 features were used, all subject to normalization. In addition, Principal Component Analysis was performed to gauge the variation in performance, and various SVM kernels (linear, RBF) were tried for the same purpose. The features, grouped by type, are as follows (a small extraction sketch follows the list):
- Lexical features
  - Lexical validity
  - Text orientation
  - Sentiment orientation
  - Lexical diversity
  - Content diversity
  - Part-of-Speech (POS) n-gram diversity
  - Lexical entropy
- Syntactical features
  - Percentage of nouns
  - Percentage of pronouns
  - Percentage of verbs
  - Ratio of modals to verbs
- Stylistic features
  - Capitalized diversity
  - Repetition diversity
  - Emotiveness diversity
  - Spelling check
  - Self-reference diversity
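A minimal sketch of computing a few of these features is shown below, assuming NLTK for tokenization and POS tagging. The exact feature definitions (and the self-reference word list) are illustrative assumptions and may differ from those used in the project.

```python
import math
from collections import Counter

import nltk  # assumes the punkt and averaged_perceptron_tagger data are installed

def extract_features(text):
    words = [t for t in nltk.word_tokenize(text) if t.isalpha()]
    tags = [tag for _, tag in nltk.pos_tag(words)]
    n = max(len(words), 1)

    counts = Counter(w.lower() for w in words)
    # Lexical diversity: fraction of distinct word types.
    lexical_diversity = len(counts) / n
    # Lexical entropy of the unigram distribution.
    lexical_entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Syntactical features: percentages of selected POS classes.
    pct_nouns = sum(t.startswith("NN") for t in tags) / n
    pct_pronouns = sum(t.startswith("PRP") for t in tags) / n
    verbs = sum(t.startswith("VB") for t in tags)
    modal_verb_ratio = tags.count("MD") / max(verbs, 1)
    # Stylistic features: capitalized-word and self-reference diversity.
    capitalized_diversity = sum(w[0].isupper() for w in words) / n
    self_refs = {"i", "me", "my", "mine", "myself", "we", "us", "our"}
    self_reference = sum(w.lower() in self_refs for w in words) / n

    return [lexical_diversity, lexical_entropy, pct_nouns, pct_pronouns,
            modal_verb_ratio, capitalized_diversity, self_reference]
```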
KL Divergence score
The primary purpose of this approach is to detect similar reviews based on semantic similarity. If the semantic content of a review is largely generated from another review, this suggests that both reviews may not sincerely reflect their writers’ true opinions. However, this does not mean that every pair of reviews with high semantic overlap must be fake. We therefore developed a probabilistic ranking of suspicious reviews, so that the most suspicious ones can be eliminated.
The primary intuition behind using semantic similarity rather than the very popular cosine similarity is to cover cases where two reviews are identical save for a few words. For example, merely replacing the word “love” with “like” might make two reviews look different under cosine similarity when they are in fact similar; semantic similarity recognizes this and rates the two reviews as similar.
Therefore, to address these tactics, we need to estimate the probability of an unseen term in a document. This can be done by taking into account the relationships between words. The goal is to assign a more reasonable probability to an unseen term when evaluating two reviews.
KL divergence [Kullback and Leibler 1951] is a well-known measure commonly used to estimate the distance between two probability distributions, and it has been successfully applied to Web spam detection [Martinez-Romo and Araujo 2009; Mishne et al. 2005]. Accordingly, we apply a negative KL divergence to measure the similarity between pairs of language models representing two reviews. If the negative KL divergence of the two language models is large, the distance between the corresponding probability distributions is small; in other words, the semantic contents of the pair of reviews are quite similar, and the reviews are likely to be spam.
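A minimal sketch of this scoring is shown below: each review is represented as a smoothed unigram language model, and the negative KL divergence −D_KL(P‖Q) = −Σ_w P(w) log(P(w)/Q(w)) serves as the similarity score. Laplace smoothing is used here as a simple stand-in for the more sophisticated unseen-term estimation discussed above.

```python
import math
from collections import Counter

def language_model(text, vocab, alpha=1.0):
    """Laplace-smoothed unigram distribution over a shared vocabulary.
    (Smoothing stands in for the unseen-term estimation discussed above.)"""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def neg_kl_divergence(review_a, review_b):
    """Negative KL divergence between the language models of two reviews;
    a larger value (closer to zero) means more similar, hence more suspicious."""
    vocab = set(review_a.lower().split()) | set(review_b.lower().split())
    p = language_model(review_a, vocab)
    q = language_model(review_b, vocab)
    return -sum(p[w] * math.log(p[w] / q[w]) for w in vocab)

# Example: a near-duplicate pair scores higher than an unrelated pair.
a = "we love this hotel the staff was wonderful"
b = "we like this hotel the staff was wonderful"
c = "the food at the airport was terrible and overpriced"
print(neg_kl_divergence(a, b) > neg_kl_divergence(a, c))  # True
```

Ranking all review pairs by this score then yields the probabilistic ranking of suspicious reviews described above.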
[1] M. Ott, Y. Choi, C. Cardie, and J.T. Hancock. 2011. Finding Deceptive Opinion Spam by Any Stretch of the Imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
[2] M. Ott, C. Cardie, and J.T. Hancock. 2013. Negative Deceptive Opinion Spam. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.