This repository contains two sentiment prediction models trained on the MELD dataset.


DATASET: MELD (https://github.com/declare-lab/MELD). The models use only the train split of the data due to the time restrictions of the project.

Data preparation: the models were trained on a small sample (around 3,560 utterances), so the reported accuracy may be biased.

The dataset was built from two sources: the preprocessed data (https://github.com/declare-lab/MELD/tree/master/data/MELD_Dyadic) and the raw mp4 files provided by the MELD creators.

Steps undertaken:

Importing the preprocessed data (“Utterance”, “Speaker”, “Sentiment”) and the raw files.

Creating a “Filename” column for each entry in the preprocessed data so the video data can be aligned with the rest of the data.

Extracting head-movement and posture features from the raw files with MediaPipe, averaged over each “Utterance”.

Aligning the two datasets (a sketch follows this list). Final dataset structure: “Filename”, “Avg Head Movement (x, y)”, “Avg Posture”, “Speaker”, “Sentiment”, “Utterance”.
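A minimal sketch of the extraction and alignment steps, assuming Python with OpenCV, MediaPipe Pose, and pandas. Head movement is approximated here by the mean frame-to-frame displacement of the nose landmark, and posture by the shoulder-line tilt; these feature definitions, the CSV and column names, and the paths are illustrative assumptions rather than what this repository documents. MELD clips follow the dia{Dialogue_ID}_utt{Utterance_ID}.mp4 naming convention, which is what the “Filename” column reproduces:

```python
import cv2
import mediapipe as mp
import numpy as np
import pandas as pd

mp_pose = mp.solutions.pose

def extract_video_features(path):
    """Average head-movement and posture features for one utterance clip.

    Assumptions: head movement = mean frame-to-frame displacement of the
    nose landmark; posture = mean vertical offset between the shoulders.
    """
    cap = cv2.VideoCapture(path)
    nose_xy, tilts = [], []
    with mp_pose.Pose(static_image_mode=False) as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks is None:
                continue  # skip frames where no person is detected
            lm = result.pose_landmarks.landmark
            nose = lm[mp_pose.PoseLandmark.NOSE]
            nose_xy.append((nose.x, nose.y))
            tilts.append(abs(lm[mp_pose.PoseLandmark.LEFT_SHOULDER].y
                             - lm[mp_pose.PoseLandmark.RIGHT_SHOULDER].y))
    cap.release()
    if len(nose_xy) < 2:
        return {"Avg Head Movement x": 0.0, "Avg Head Movement y": 0.0,
                "Avg Posture": 0.0}
    diffs = np.abs(np.diff(np.array(nose_xy), axis=0))
    return {"Avg Head Movement x": float(diffs[:, 0].mean()),
            "Avg Head Movement y": float(diffs[:, 1].mean()),
            "Avg Posture": float(np.mean(tilts))}

# Align the preprocessed CSV with the raw clips via the "Filename" column.
df = pd.read_csv("train_sent_emo_dya.csv")  # assumed MELD_Dyadic train CSV
df["Filename"] = ("dia" + df["Dialogue_ID"].astype(str)
                  + "_utt" + df["Utterance_ID"].astype(str) + ".mp4")
video = pd.DataFrame([
    {"Filename": f, **extract_video_features(f"train_splits/{f}")}
    for f in df["Filename"].unique()
])
data = df.merge(video, on="Filename")[
    ["Filename", "Avg Head Movement x", "Avg Head Movement y",
     "Avg Posture", "Speaker", "Sentiment", "Utterance"]
]
```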

Two models were created:

Text-Only, where the model predicts the sentiment from the “Utterance” text alone.

Text + Video, where the model combines the text and the video features for each utterance (see the sketch below).
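The repository does not name the classifiers, so the sketch below stands in with a TF-IDF plus logistic-regression baseline; the multimodal variant simply concatenates the scaled head-movement and posture columns onto the text features. All names continue from the alignment sketch above and are assumptions:

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

train, test = train_test_split(data, test_size=0.2, random_state=42,
                               stratify=data["Sentiment"])

# Shared text representation for both models.
vec = TfidfVectorizer(max_features=5000)
T_train = vec.fit_transform(train["Utterance"])
T_test = vec.transform(test["Utterance"])

# Text-Only model: sentiment from the utterance text alone.
text_clf = LogisticRegression(max_iter=1000).fit(T_train, train["Sentiment"])

# Text + Video model: append the scaled MediaPipe features to the text features.
num_cols = ["Avg Head Movement x", "Avg Head Movement y", "Avg Posture"]
scaler = StandardScaler().fit(train[num_cols])
F_train = hstack([T_train, csr_matrix(scaler.transform(train[num_cols]))])
F_test = hstack([T_test, csr_matrix(scaler.transform(test[num_cols]))])
fused_clf = LogisticRegression(max_iter=1000).fit(F_train, train["Sentiment"])
```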

As the models are trained on a limited set of extracts from a single TV series, it was suspected that the multimodal model could exhibit per-speaker bias.

Consequently, several metrics were used to analyse and compare the models thoroughly (a computation sketch follows the list):

Accuracy

F1 Score

Classification Report: per-class Precision, Recall, and F1 Score

Per-Speaker Accuracy
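With scikit-learn, all of these metrics can be computed in one loop over the two (assumed) models from the sketch above; the weighted F1 average and the groupby used for per-speaker accuracy are assumptions about how the reported numbers were produced:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, f1_score

for name, clf, X in [("Text-Only", text_clf, T_test),
                     ("Text + Video", fused_clf, F_test)]:
    pred = clf.predict(X)
    y = test["Sentiment"].values
    print(f"== {name} ==")
    print("Accuracy:", accuracy_score(y, pred))
    print("Weighted F1:", f1_score(y, pred, average="weighted"))
    print(classification_report(y, pred))  # per-class precision/recall/F1
    # Per-speaker accuracy: mean correctness of test predictions per Speaker.
    per_speaker = (pd.DataFrame({"Speaker": test["Speaker"].values,
                                 "correct": pred == y})
                   .groupby("Speaker")["correct"].mean())
    print(per_speaker.sort_values(ascending=False))
```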

RESULTS

| Metric   | Text-Only Model | Text + Video Model |
| -------- | --------------- | ------------------ |
| Accuracy | 66.25%          | 65.13%             |
| F1 Score | 0.6571          | 0.6543             |

Although the results are comparable, the Text-Only model achieves slightly better overall results.

TEXT ONLY

| Class    | Precision | Recall | F1-Score |
| -------- | --------- | ------ | -------- |
| Negative | 0.59      | 0.61   | 0.60     |
| Neutral  | 0.70      | 0.79   | 0.74     |
| Positive | 0.68      | 0.49   | 0.57     |
| Average  | 0.66      | 0.63   | 0.64     |

TEXT + VIDEO

| Class    | Precision | Recall | F1-Score |
| -------- | --------- | ------ | -------- |
| Negative | 0.54      | 0.69   | 0.61     |
| Neutral  | 0.76      | 0.66   | 0.70     |
| Positive | 0.65      | 0.59   | 0.62     |
| Average  | 0.65      | 0.65   | 0.64     |

Per class, the Text + Video model achieves higher precision for Neutral and higher recall (and F1) for Positive and Negative, while the Text-Only model retains higher precision for Negative and Positive.

| Speaker              | Text-Only Model Accuracy | Text + Video Model Accuracy |
| -------------------- | ------------------------ | --------------------------- |
| Wayne                | 1.0000                   | 1.0000                      |
| Dr. Miller           | 1.0000                   | 1.0000                      |
| Mike                 | 1.0000                   | 1.0000                      |
| Marc                 | 1.0000                   | 1.0000                      |
| Kristin              | 1.0000                   | 1.0000                      |
| Mrs. Geller          | 1.0000                   | 1.0000                      |
| Mrs. Green           | 1.0000                   | 1.0000                      |
| Drunken Gambler      | 1.0000                   | 1.0000                      |
| Dr. Green            | 1.0000                   | 1.0000                      |
| Mr. Tribbiani        | 1.0000                   | 0.0000                      |
| Charlie              | 1.0000                   | 1.0000                      |
| Raymond              | 1.0000                   | 1.0000                      |
| Bernice              | 1.0000                   | 0.0000                      |
| Receptionist         | 0.7500                   | 0.7500                      |
| Dana                 | 0.7143                   | 0.7143                      |
| The Casting Director | 0.7143                   | 0.7143                      |
| Chandler             | 0.7108                   | 0.6988                      |
| Phoebe               | 0.6882                   | 0.6989                      |
| Joey                 | 0.6818                   | 0.5545                      |
| Barry                | 0.6667                   | 0.5000                      |
| Ross                 | 0.6667                   | 0.6538                      |
| Pete                 | 0.6667                   | 1.0000                      |
| Julie                | 0.6667                   | 0.6667                      |
| Rachel               | 0.6599                   | 0.6327                      |
| Monica               | 0.6019                   | 0.6481                      |
| The Fireman          | 0.6000                   | 0.8000                      |
| Tag                  | 0.5625                   | 0.7500                      |
| Nurse                | 0.5000                   | 0.0000                      |
| Bobby                | 0.5000                   | 0.5000                      |
| Stage Director       | 0.3333                   | 0.3333                      |
| Mona                 | 0.1667                   | 0.6667                      |
| Robert               | 0.0000                   | 1.0000                      |
| Both                 | 0.0000                   | 0.0000                      |
| Ross and Joey        | 0.0000                   | 1.0000                      |
| All                  | 0.0000                   | 0.5000                      |

Per-Speaker Accuracy:

Both models achieve 1.0 accuracy for several of the same speakers. This may contradict the hypothesis that the multimodal model bases its predictions on speaker identification rather than sentiment prediction.

Overall, both models achieve similar results. Importantly, the dataset is small (around 3,560 utterances), so the results may differ with a larger sample.

Video features, although they slightly improve some “Neutral” and “Negative” predictions, do not seem to be a valuable element for sentiment prediction.
