This repository accompanies our paper, Evaluating the Language ENvironment Analysis System for Korean (McDonald et al., 2021), published in the Journal of Speech, Language, and Hearing Research (JSLHR).
Our data and coding manual are shared on https://osf.io/uztxr/.
The original LENA transcripts are CHAT (.cha) files exported from the LENA software.
The human transcripts are TextGrids (.TextGrid).
The spreadsheet clip_data.xlsx includes the relevant variables for each of the 60 clips, such as AWC and CVC.
evaluation.py contains the methods to parse the CHAT and TextGrid transcripts into a common data structure, to calculate classification accuracy, and to extract features such as the word and turn counts.
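As a rough illustration of what a common data structure and an accuracy calculation could look like, the sketch below represents both transcripts as lists of labelled time segments and compares them at fixed time steps. The names `Segment`, `label_at`, and `frame_accuracy`, the `"SIL"` default label, and the 0.1 s step are all assumptions made for this example; they are not taken from evaluation.py.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """One labelled stretch of audio, in seconds (hypothetical structure)."""
    start: float
    end: float
    speaker: str


def label_at(segments: List[Segment], t: float, default: str = "SIL") -> str:
    """Return the label of the first segment covering time t, else the default."""
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg.speaker
    return default


def frame_accuracy(reference: List[Segment], hypothesis: List[Segment],
                   duration: float, step: float = 0.1) -> float:
    """Fraction of frames on which the two transcripts carry the same label."""
    n_frames = int(round(duration / step))
    if n_frames == 0:
        return 0.0
    matches = 0
    for i in range(n_frames):
        t = (i + 0.5) * step  # sample each frame at its midpoint
        if label_at(reference, t) == label_at(hypothesis, t):
            matches += 1
    return matches / n_frames


# Toy data: the two transcripts disagree between 1.0 s and 1.5 s.
human = [Segment(0.0, 1.0, "FEM"), Segment(1.0, 2.0, "CHI")]
lena = [Segment(0.0, 1.5, "FEM"), Segment(1.5, 2.0, "CHI")]
accuracy = frame_accuracy(human, lena, duration=2.0)
```

Sampling at frame midpoints keeps the comparison independent of how each tool places its segment boundaries, at the cost of ignoring disagreements shorter than one step.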
Since the output classes of LENA differ from those of the human transcripts, diarization / identification evaluation is done by mapping both to a common set of classes.
The mappings are defined in mappings/ as JSON files, which makes it convenient to experiment with different options.
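The idea of mapping both label sets onto shared classes can be sketched as follows. The JSON layout, the human labels (`MOT`, `FAT`, `CHI`, `NOISE`), and the common classes (`adult`, `child`, `other`) are hypothetical and chosen only for illustration; the actual files in mappings/ may be organized differently.

```python
import json
from collections import Counter

# Hypothetical mapping content, inlined so the example is self-contained.
MAPPING_JSON = """
{
  "lena":  {"FAN": "adult", "MAN": "adult", "CHN": "child", "SIL": "other"},
  "human": {"MOT": "adult", "FAT": "adult", "CHI": "child", "NOISE": "other"}
}
"""


def to_common(labels, table, default="other"):
    """Translate source-specific labels into the shared class set."""
    return [table.get(label, default) for label in labels]


def confusion_counts(reference, predicted):
    """Count (reference, predicted) class pairs, i.e. confusion matrix cells."""
    return Counter(zip(reference, predicted))


mapping = json.loads(MAPPING_JSON)
human = to_common(["MOT", "CHI", "FAT", "CHI"], mapping["human"])
lena = to_common(["FAN", "CHN", "CHN", "SIL"], mapping["lena"])
counts = confusion_counts(human, lena)
```

Keeping the mapping in data rather than in code means a different grouping of classes only requires editing or swapping the JSON.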
The results.ipynb Jupyter notebook contains some of the high-level code used to generate the results and figures in the paper, such as confusion matrices.
The remaining errors, error rates, correlations, and graphs comparing LENA and human codings are calculated in results.R.
The recommended Python version is 3.8.
The recommended R version is 3.5.3.
Python dependencies can be installed with pip (possibly within a virtual environment):
pip install -r requirements.txt
To calculate the morpheme count for Korean text, the Mecab library must additionally be installed locally.
Otherwise, the konlpy library will throw an error when a method that depends on Mecab is called.
Installation instructions can be found at https://konlpy.org/en/latest/install/.
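A minimal sketch of counting morphemes with konlpy is shown below. The helper names `count_morphemes` and `load_mecab` are hypothetical and not part of this repository; the sketch only assumes konlpy's `Mecab` tagger and its `morphs` method. The konlpy import is deferred so that the rest of the code can still be loaded on a machine where Mecab is not installed.

```python
def count_morphemes(text, tagger):
    """Count morphemes in text using any object that exposes morphs(text)."""
    if not text.strip():
        return 0
    return len(tagger.morphs(text))


def load_mecab():
    """Create a konlpy Mecab tagger; fails if konlpy or Mecab is missing."""
    # Imported lazily so that a missing installation only matters when
    # morpheme counts are actually requested.
    from konlpy.tag import Mecab
    return Mecab()
```

With Mecab installed, a call such as `count_morphemes("아기가 웃어요", load_mecab())` returns the number of morphemes the tagger finds in the utterance.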
If you refer to our code or analysis in your work, please consider citing our paper:
@article{doi:10.1044/2020\_JSLHR-20-00489,
author = {Margarethe McDonald and Taeahn Kwon and Hyunji Kim and Youngki Lee and Eon-Suk Ko },
title = {Evaluating the Language ENvironment Analysis System for Korean},
journal = {Journal of Speech, Language, and Hearing Research},
volume = {64},
number = {3},
pages = {792-808},
year = {2021},
doi = {10.1044/2020\_JSLHR-20-00489},
URL = {https://pubs.asha.org/doi/abs/10.1044/2020_JSLHR-20-00489},
eprint = {https://pubs.asha.org/doi/pdf/10.1044/2020_JSLHR-20-00489}
}
Distributed under the MIT License. See LICENSE for more information.
This work was supported by an Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01367, Infant-Mimic Neurocognitive Developmental Machine Learning from Interaction Experience with Real World (BabyMind)).