Simple CFG development in JSGF
My understanding of JSGF and its <unk> tags is rudimentary. From what I have gathered from the documentation (linked in the pdf) and the explanations in the pdf, tokens without an explicit mapping to a tag are tagged <unk>; I have assumed so throughout my tasks.
I organized the generatable sentences as <play_request> <music_entity>, where <play_request> is implicitly defined as (i want to listen to | [can you] (play [me] | put on)).
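As a minimal sketch of that structure in JSGF (the grammar name, the entity entries, and the use of `{tag}` annotations are my own assumptions, not necessarily what the actual file contains):

```jsgf
#JSGF V1.0;
grammar music;

// Implicit play-request phrasing; these tokens carry no tag, so they fall back to <unk>.
<play_request> = i want to listen to | [can you] (play [me] | put on);

// Tagged music entities (illustrative entries only).
<music_entity> = beatles {artist}
               | paranoid android {song}
               | jazz {genre};

public <request> = <play_request> <music_entity>;
```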
The extended grammar, in [english_extension.java](english_extension.java), is able to generate:
- [can you play]<unk> [beatles]<artist>
- [can you put on]<unk> [paranoid android]<song>
- [i want to listen to]<unk> [jazz]<genre> [music]<unk>
- [play me]<unk> [ummagumma]<album> [by]<unk> [pink floyd]<artist>
As the grammar is scaled, music entities will likely come to include phrases like "can you play", "want to listen", and other tokens that should be tagged <unk>, causing them to be parsed as <music_entity>. It may then be beneficial to explicitly define <play_request> and have the model identify both <play_request> and <music_entity>. Tokens like [music] and [by] may also need to be explicitly defined.
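One way to sketch that explicit definition in JSGF (the tag placement and the <filler> rule name are my own illustration, not taken from the files) is to attach a tag to the whole request rule so its tokens no longer fall back to <unk>:

```jsgf
// Tag the entire request phrasing so "can you play" etc. are labeled {play_request}.
<play_request> = (i want to listen to | [can you] (play [me] | put on)) {play_request};

// Explicitly tag connective tokens like [music] and [by] instead of leaving them <unk>.
<filler> = (music | by) {filler};
```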
The localization, in [korean_localization.java](korean_localization.java), explicitly defines <play_request> and introduces <want_listen>, which can be used as a request on its own or alongside <play_request>.
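A minimal sketch of how those rules might be laid out in JSGF, assuming a UTF-8 header and illustrative entity entries (the exact alternatives and grammar name are my assumptions, not the file's contents):

```jsgf
#JSGF V1.0 UTF-8 ko;
grammar music_ko;

<play_request> = (재생해줘 | 틀어줘 | 틀어줄 수 있니) {play_request};
<want_listen>  = (듣고싶어 | 듣고싶은데) {want_listen};
<music_entity> = 비틀즈 {artist_ko} | 재즈 {genre_ko};

// <want_listen> can stand alone as the request, or precede <play_request>.
public <request> = <music_entity> (<want_listen> [<play_request>] | <play_request>);
```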
The localized grammar is able to generate:
- [비틀즈]<artist_ko> [재생해줘]<play_request>
- [파라노이드 안드로이드]<song_ko> [듣고싶어]<want_listen>
- [재즈]<genre_ko> [음악]<unk> [듣고싶은데]<want_listen> [틀어줘]<play_request>
- [핑크플로이드]<artist_ko> [의]<unk> [우마구마]<album_ko> [틀어줄 수 있니]<play_request>
Scaling a grammar is no easy task, and I think one significant issue that may arise in scaling the above grammar is the following:
Korean transliterations of songs, artists, and genres of non-Korean origin may not accurately reflect how they are pronounced. For example, the genre jazz is most commonly written as 재즈 but pronounced 째즈. Varying levels of fluency in English or other languages may also affect how a Korean speaker pronounces a song, artist, or genre. A corpus generated by a grammar that does not reflect such variance may not be effective for musical-entity voice recognition, and may require a hybrid corpus that trains not only on 재즈 but also on 째즈 and jazz.
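In grammar terms, one hedged sketch of such a hybrid corpus is simply to list the written, pronounced, and source-language variants as alternatives under a single tag (this rule is my own illustration):

```jsgf
// All three surface forms map to the same genre tag, so generated
// training sentences cover the written form, the common pronunciation,
// and the original English word.
<genre_ko> = (재즈 | 째즈 | jazz) {genre_ko};
```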
Among Korean-specific difficulties, I believe conjugation poses a particularly challenging problem.
Korean sentences have many conjugations. For example, depending on what the user considers the voice recognition module to be (an older friend, an assistant, a younger friend, a subordinate, etc.), they may use different conjugations. In that case, 재생해줘 from 2.2[^1] may take other forms such as 재생해주세요, 재생해줘요, 재생해, or 재생해주라. In addition, sentence-ending conjugation may vary with the user's mood, their native dialect, and other factors. From an end-to-end perspective this is less of a problem, since most pre-trained transformer models should learn to focus on the token 재생; but even if hardware conditions limit computational capacity to lightweight or rule-based models, I believe a feature-engineering or data-augmentation approach (both ideally human-in-the-loop) should be sufficient to address this issue.
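For the rule-based case, a hedged sketch of how the grammar itself could absorb this variance (the <play_ending> rule name and the factoring are my own) is to separate the stem 재생 from its endings, so register variants are enumerated once rather than duplicated across every request rule:

```jsgf
// Enumerate sentence-ending variants once; the stem stays fixed.
<play_ending>  = 해줘 | 해주세요 | 해줘요 | 해 | 해주라;

// Any stem + ending combination is tagged as a play request.
<play_request> = 재생 <play_ending> {play_request};
```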