THAI SER

@jilamikaw jilamikaw released this 25 Jan 07:19
5fa3288

The AI Research Institute of Thailand (AIResearch), a collaboration between the Vidyasirimedhi Institute of Science and Technology (VISTEC) and the Digital Economy Promotion Agency (depa), in cooperation with the Department of Computer Engineering, Faculty of Engineering, and the Department of Dramatic Arts, Faculty of Arts, Chulalongkorn University, has published THAI SER, an open Thai speech emotion recognition dataset, with sponsorship from Advanced Info Services Public Company Limited (AIS).

The dataset covers five main emotions assigned to actors: Neutral, Anger, Happiness, Sadness, and Frustration. The recordings total 41 hours 36 minutes (27,854 utterances) and were performed by 200 professional actors (112 female, 88 male), directed by students, alumni, and professors from the Faculty of Arts, Chulalongkorn University.

THAI SER contains 100 recordings, separated into two main categories: Studio and Zoom. Studio recordings were made in two environments: Studio A, a controlled studio room with soundproof walls, and Studio B, a normal room without soundproofing or noise control. The recording environments can be summarized as follows:

StudioA (noise controlled, soundproof wall)
└─ studio001
└─ studio002
...
└─ studio018

StudioB (Normal room without soundproof wall)
└─ studio019
└─ studio020
...
└─ studio080

Zoom (Recorded online via Zoom and Zencastr)
└─ zoom001
└─ zoom002
...
└─ zoom020
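
The ranges above can be turned into a small lookup. This is only a sketch: the function name `recording_environment` and its arguments are hypothetical and not part of the dataset's tooling; only the number ranges come from the listing above.

```python
def recording_environment(kind: str, number: int) -> str:
    """Map a recording (kind + number) to its environment label.

    The ranges follow the listing above: studio001-018 are StudioA,
    studio019-080 are StudioB, and zoom001-020 are Zoom.
    """
    if kind == "studio":
        if 1 <= number <= 18:
            return "StudioA"
        if 19 <= number <= 80:
            return "StudioB"
    elif kind == "zoom":
        if 1 <= number <= 20:
            return "Zoom"
    # Anything outside the documented ranges is rejected explicitly.
    raise ValueError(f"unknown recording: {kind}{number:03d}")


print(recording_environment("studio", 19))  # StudioB
```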

Each recording is separated into two sessions: Script Session and Improvisation Session.

To map each utterance to an emotion, we use the majority vote of answers from 3-8 annotators, collected through crowdsourcing (wang.in.th).
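
A minimal sketch of majority voting over annotator answers is shown below. The exact rules used for the dataset are not stated here, so two points are assumptions: ties are reported as `None`, and the agreement score is taken to be the fraction of annotators who chose the winning label. The function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return (winning_label, agreement) for a list of annotator labels.

    Assumptions (not confirmed by the dataset description):
    - a tie for first place yields None as the label;
    - agreement = votes for the top label / number of annotators.
    """
    if not labels:
        raise ValueError("at least one annotation is required")
    ranked = Counter(labels).most_common()
    top_label, top_votes = ranked[0]
    agreement = top_votes / len(labels)
    # A second label with the same vote count means there is no majority.
    if len(ranked) > 1 and ranked[1][1] == top_votes:
        return None, agreement
    return top_label, agreement


print(majority_vote(["Sad", "Sad", "Neutral", "Sad"]))  # ('Sad', 0.75)
```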


Script session


In the script session, the actor was assigned three sentences:

sentence 1: พรุ่งนี้มันวันหยุดราชการนะรู้รึยัง หยุดยาวด้วย
            (Did you know tomorrow is a public holiday? It's a long one, too.)
sentence 2: อ่านหนังสือพิมพ์วันนี้รึยัง รู้ไหมเรื่องนั้นกลายเป็นข่าวใหญ่ไปแล้ว
            (Have you read today's newspaper? That story has become big news.)
sentence 3: ก่อนหน้านี้ก็ยังเห็นทำตัวปกติดี ใครจะไปรู้หล่ะ ว่าเค้าคิดแบบนั้น
            (He/she was acting normally until recently; who would have thought that he/she was thinking like that?)

The actor was asked to speak each sentence twice for each emotion at two emotional intensity levels (normal, strong), plus an additional neutral expression.


Improvisation session

For the Improvisation session, two actors were asked to improvise according to a provided emotion and scenario.

Scenario | Actor A | Actor B
1 | (Neutral) A hotel receptionist trying to explain things to and serve a customer | (Angry) An angry customer who is dissatisfied with the hotel's services
2 | (Happy) A person excitedly talking with B about his/her marriage plans | (Happy) A person happily talking with A and helping him/her plan the ceremony
3 | (Sad) A patient feeling depressed | (Neutral) A doctor attempting to talk with A neutrally
4 | (Angry) A furious boss talking with an employee | (Frustrated) A frustrated person attempting to argue with his/her boss
5 | (Frustrated) A person talking in frustration about another person's actions | (Sad) A person feeling guilty and sad about his/her actions
6 | (Happy) A happy hotel staff member | (Happy) A happy customer
7 | (Sad) A sad person who feels insecure about the upcoming marriage | (Frustrated) A person who is frustrated by the other person's insecurity
8 | (Frustrated) A frustrated patient | (Neutral) A doctor talking with the patient
9 | (Neutral) A worker assigned to tell his/her co-worker about the company's bad situation | (Sad) An employee feeling sad after listening
10 | (Angry) A person raging about another person's behavior | (Angry) A person who feels blamed by the other person
11 | (Frustrated) A director who is dissatisfied with a co-worker | (Frustrated) A frustrated person who tries their best at the job
12 | (Happy) A person who gets a new job or promotion | (Sad) A person who is desperate about his/her job
13 | (Neutral) A patient inquiring for information | (Happy) A happy doctor giving his/her patient more information
14 | (Angry) A person who is upset with his/her work | (Neutral) A calm friend who listens to the other person's problem
15 | (Sad) A person sadly telling another person about a relationship | (Angry) A person who feels angry after listening to the other person's bad relationship
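
For scripting, the assigned emotions in the scenario table can be written as a lookup. The sketch below copies the (Actor A, Actor B) emotions from the table; the names `SCENARIO_EMOTIONS` and `assigned_emotion` are hypothetical and not part of the dataset.

```python
# Scenario number -> (emotion assigned to Actor A, emotion assigned to Actor B),
# transcribed from the scenario table above.
SCENARIO_EMOTIONS = {
    1: ("Neutral", "Angry"),
    2: ("Happy", "Happy"),
    3: ("Sad", "Neutral"),
    4: ("Angry", "Frustrated"),
    5: ("Frustrated", "Sad"),
    6: ("Happy", "Happy"),
    7: ("Sad", "Frustrated"),
    8: ("Frustrated", "Neutral"),
    9: ("Neutral", "Sad"),
    10: ("Angry", "Angry"),
    11: ("Frustrated", "Frustrated"),
    12: ("Happy", "Sad"),
    13: ("Neutral", "Happy"),
    14: ("Angry", "Neutral"),
    15: ("Sad", "Angry"),
}


def assigned_emotion(scenario: int, role: str) -> str:
    """Return the emotion assigned to role 'A' or 'B' in a scenario."""
    if role not in ("A", "B"):
        raise ValueError("role must be 'A' or 'B'")
    emotion_a, emotion_b = SCENARIO_EMOTIONS[scenario]
    return emotion_a if role == "A" else emotion_b


print(assigned_emotion(4, "B"))  # Frustrated
```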


File naming convention

Each file has a unique filename and is provided in .flac format with a sample rate of about 44.1 kHz. The filename consists of a 5- to 6-part identifier (e.g., s002_clip_actor003_impro1_1.flac, s002_clip_actor003_script1_1_1a.flac). These identifiers define the stimulus characteristics:

File Directory Management

studio (e.g., studio1-10)
└─ <studio-num> (studio1, studio2, ...)
    └─ <mic-type> (con, clip, middle)
        └─ <audio-file> (.flac)

zoom (e.g., zoom1-10)
└─ <zoom-num> (zoom1, zoom2, ...)
    └─ <mic-type> (mic)
        └─ <audio-file> (.flac)
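
Given the layout above, where each audio file sits inside a folder named after its microphone type, a quick inventory of a local copy might look like the following. This is a sketch under that assumption; the function name `count_flac_by_mic` and the `root` argument are hypothetical, and the exact top-level folder names on disk may differ.

```python
from collections import Counter
from pathlib import Path


def count_flac_by_mic(root):
    """Count .flac files under root, grouped by their parent folder name.

    Assumes the layout above, i.e. the immediate parent folder of each
    audio file is the microphone type (con, clip, middle, or mic).
    """
    counts = Counter()
    for path in Path(root).rglob("*.flac"):
        counts[path.parent.name] += 1
    return counts
```

Calling `count_flac_by_mic("path/to/dataset")` returns a `Counter` such as one entry per microphone folder found, which is a cheap way to check that a download is complete before training.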

Filename identifiers

  • Recording ID (s = studio recording, z = zoom recording)

    • Number of recording (e.g., s001, z001)
  • Microphone type (clip, con, middle, mic)

    Zoom recording session

    • mic = An actor's microphone-of-choice

    Studio recording session

    • con = Condenser microphone (cardioid polar pattern) placed 0.5 m from the actor's position
    • clip = Lavalier microphone (Omni-directional patterns) attached to the actor’s shirt collar
    • middle = Condenser microphone (Figure-8 polar patterns) which was placed between actors
  • Actor ID (actor001 to actor200: Odd-numbered actors are Actor A, even-numbered actors are Actor B in improvisation session).

  • Session ID (impro = Improvisation Session, script = Script Session)

    • Script Session (e.g., _script1_1_1a)
      • Sentence ID (script1-script3)
      • Repetition (1 = 1st repetition, 2 = 2nd repetition)
      • Emotion (1 = Neutral, 2 = Angry, 3 = Happy, 4 = Sad, 5 = Frustrated)
      • Emotional intensity (a = Normal, b = Strong)
    • Improvisation Session (e.g., _impro1_1)
      • Scenario ID (impro1-15)
      • Utterance no. (e.g., _impro1_1 , _impro1_2)

Filename example: s002_clip_actor003_impro1_1.flac

  1. Studio recording number 2 (s002)
  2. Recorded with the lavalier microphone (clip)
  3. Actor number 3 (actor003)
  4. Improvisation session, scenario 1 (impro1)
  5. 1st utterance of scenario recording (1)
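
The naming convention above can be parsed with a regular expression. The sketch below is an illustration based only on the identifiers and the two example filenames given in this document; `parse_filename` and the dictionary keys it returns are hypothetical names, and real files that deviate from the documented pattern will raise an error.

```python
import re

EMOTIONS = {1: "Neutral", 2: "Angry", 3: "Happy", 4: "Sad", 5: "Frustrated"}
INTENSITIES = {"a": "Normal", "b": "Strong"}

# <s|z><recording>_<mic>_actor<id>_ followed by either
#   impro<scenario>_<utterance>  or  script<sentence>_<repetition>_<emotion><intensity>
_PATTERN = re.compile(
    r"^(?P<env>[sz])(?P<rec>\d+)_(?P<mic>clip|con|middle|mic)_actor(?P<actor>\d+)_"
    r"(?:impro(?P<scenario>\d+)_(?P<utt>\d+)"
    r"|script(?P<sentence>\d+)_(?P<rep>\d+)_(?P<emo>[1-5])(?P<intensity>[ab]))"
    r"\.flac$"
)


def parse_filename(name: str) -> dict:
    """Split a THAI SER filename into its documented identifiers."""
    m = _PATTERN.match(name)
    if m is None:
        raise ValueError(f"unrecognized filename: {name!r}")
    actor = int(m["actor"])
    info = {
        "environment": "studio" if m["env"] == "s" else "zoom",
        "recording": int(m["rec"]),
        "mic": m["mic"],
        "actor": actor,
    }
    if m["scenario"] is not None:
        # Odd-numbered actors are Actor A, even-numbered are Actor B.
        info.update(session="impro", scenario=int(m["scenario"]),
                    utterance=int(m["utt"]),
                    role="A" if actor % 2 == 1 else "B")
    else:
        info.update(session="script", sentence=int(m["sentence"]),
                    repetition=int(m["rep"]),
                    emotion=EMOTIONS[int(m["emo"])],
                    intensity=INTENSITIES[m["intensity"]])
    return info


print(parse_filename("s002_clip_actor003_impro1_1.flac")["role"])  # A
```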


Other Files

  1. emotion_label.json - a dictionary containing, for each recording ID, the assigned emotion (assigned_emo), majority emotion (emotion_emo), annotated emotions from crowdsourcing (annotated), and majority agreement score (agreement).
  2. actor_demography.json - a dictionary that contains information about the age and sex of actors.


Version

  • Version 1 (26 March 2021): THAI SER contains 100 recordings (80 studio and 20 Zoom), totaling 41 hours 36 minutes and 27,854 utterances, all of which are labeled.

Dataset statistics

Recording environment | Session | Number of utterances | Duration (hrs)
Zoom (20) | Script | 2,398 | 4.0279
Zoom (20) | Improvisation | 3,606 | 5.8860
Studio (80) | Script | 9,582 | 13.6903
Studio (80) | Improvisation | 12,268 | 18.0072
Total (100) | | 27,854 | 41.6114
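
As a quick consistency check, the per-session rows in the statistics can be summed and compared with the reported totals. The snippet below is only a worked arithmetic example using the figures from this section; the variable names are arbitrary.

```python
# (environment, session, utterances, duration in hours), copied from the statistics.
ROWS = [
    ("Zoom", "Script", 2398, 4.0279),
    ("Zoom", "Improvisation", 3606, 5.8860),
    ("Studio", "Script", 9582, 13.6903),
    ("Studio", "Improvisation", 12268, 18.0072),
]

total_utterances = sum(row[2] for row in ROWS)
total_hours = round(sum(row[3] for row in ROWS), 4)

# Convert fractional hours to whole hours and whole minutes.
hours = int(total_hours)
minutes = int((total_hours - hours) * 60)

print(total_utterances)              # 27854
print(total_hours)                   # 41.6114
print(f"{hours} h {minutes} min")    # 41 h 36 min
```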


Dataset sponsorship and license


Advanced Info Services Public Company Limited

This work is published under a Creative Commons BY-SA 4.0 license.