\documentclass[a4paper]{article}
\usepackage{commath}
\usepackage[utf8x]{inputenc}
\usepackage[left=0.8in,right=.8in,top=1in,bottom=1in]{geometry}
\usepackage{verbatim} %comments
\usepackage{natbib}
\bibliographystyle{apalike}
\usepackage{titlesec}
\setcounter{secnumdepth}{4}
\titleformat{\paragraph}
{\normalfont\normalsize\bfseries}{\theparagraph}{1em}{}
\titlespacing*{\paragraph}
{0pt}{3.25ex plus 1ex minus .2ex}{1.5ex plus .2ex}
\usepackage{hyperref}
\title{Review of research conducted for Punctuation Retrieval.}
\author{Xing Yu Ng}
\date{}
\usepackage{multicol}
\setlength{\columnsep}{1cm}
\begin{document}
\maketitle
\begin{multicols}{2}
\section{Introduction}
\label{introduction}
Punctuation retrieval is an important aspect of any Automatic Speech Recognition (ASR) pipeline for two reasons: it improves the readability of auto-generated transcripts in video or podcast subtitling and voice dictation applications, and it better captures the meaning of speech transcripts, improving the performance of downstream Natural Language Processing (NLP) tasks.
This review examines the ideas considered by different authors and discusses potential areas of exploration to improve ASR output.
\section{Punctuation Features}
The majority of research into punctuation retrieval on English speech transcripts condenses all punctuation into four classes --- (Period .), (Comma ,), (Question Mark ?) and (None), using a custom-defined mapping function to replace other punctuation with these four classes. This is done to combat the imbalance in punctuation occurrence in most datasets, with less frequent punctuation like semicolons or dashes occurring in less than 1\% of the entire \href{http://opus.nlpl.eu/OpenSubtitles-v2016.php}{OpenSubtitles v2016 English corpus}. An earlier paper by \cite{dynamiccrf} included the (Exclamation mark !) class, but did not comment on the prediction performance on the less common classes.
The paper by \cite{birnnattention} also used a different mapping scheme for the two languages evaluated --- Estonian and English, with the following differences:
\begin{center}
\begin{tabular}{|c|c|c|}
\hline
&\multicolumn{2}{|c|}{Mapped to} \\
\hline
From & Estonian & English \\
\hline
; ! & Period . & Period . \\
: & Period . & Comma , \\
- --- & None & Comma ,\\
\hline
\end{tabular}
\end{center}
While these mappings are logical, with (?) and (!) marking sentence boundaries and the em dash (---) indicating a break in sentence structure, the mapping results in some meaning and context being lost. Hence, expanding the set of punctuation classes to all punctuation which can be inferred from speech, for instance \{! , - . : ; ? --- \textellipsis "\} for English, would improve the readability of the ASR output and allow the text to better represent the meaning within the audio input.
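As a concrete illustration, such a class mapping can be applied when converting raw text into training examples. The sketch below is only illustrative; the class set and mapping choices are assumptions of mine rather than the exact scheme of any cited paper.
\begin{verbatim}
# Illustrative sketch: convert punctuated text
# into (word, label) pairs for sequence tagging.
# The mapping is an example, not the exact
# scheme used by any cited paper.
PUNCT_MAP = {".": "PERIOD", "!": "PERIOD",
             ";": "PERIOD", ":": "COMMA",
             ",": "COMMA",  "?": "QUESTION"}

def to_examples(text):
    pairs = []
    for tok in text.split():
        label = "NONE"
        # strip trailing punctuation, keeping the
        # class of the mark closest to the word
        while tok and tok[-1] in PUNCT_MAP:
            label = PUNCT_MAP[tok[-1]]
            tok = tok[:-1]
        if tok:
            pairs.append((tok.lower(), label))
    return pairs

# to_examples("Hello, how are you?") returns
# [('hello', 'COMMA'), ('how', 'NONE'),
#  ('are', 'NONE'), ('you', 'QUESTION')]
\end{verbatim}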
\section{Data}
The data sources used for training punctuation retrieval models can be categorised into purely textual sources and audio aligned with text.
The most common source used for training punctuation retrieval is the IWSLT dataset, many versions of which feature transcripts from TED talks and OpenSubtitles. This textual data source is used to train models that rely only on lexical features for punctuation retrieval. \cite{jointlearningcorrbirnn} also used transcripts from the Intelligence Squared debate show. \cite{medicalasr} used an internal medical speech transcript dataset to demonstrate the ability of their model to transfer to a domain with a large vocabulary and data scarcity.
\textbf{Use of non-lexical features} \cite{multimodalsemi} used the audio and text data from the Fisher corpus to train a model using both acoustic and lexical features. \cite{adversarial} also made use of the Penn Treebank-3 dataset to train a POS tagger to improve the performance of the punctuation retrieval task. \cite{birnnattention} made use of Estonian text annotated with pause durations (e.g. $\langle sil=0.030\rangle$ for 30 milliseconds of silence).
\textbf{Data Augmentation} In performing ASR, the generated text from speech might contain word errors in the form of insertions, deletions, and substitutions, which would affect the accuracy of the downstream punctuation retrieval task. To make the model more robust to such word errors, \cite{noisy} simulated the three forms of word errors, inserting or substituting existing tokens with the unknown token and randomly deleting tokens. Their RoBERTa-large model obtained higher F1 scores on the test set --- ASR output generated by the Google Cloud Speech API to simulate word errors --- when trained on augmented text, demonstrating the effectiveness of introducing noise into the training data in improving the model's robustness to word errors.
\cite{speechtranslationrobust} presented four strategies to generate noise for ASR training: substitution with a placeholder, substitution with a random token, substitution with a more frequent character, and substitution with a homophone-based character. The model was trained and evaluated on a Chinese corpus, in which many homophonous words carry different meanings. The proposed strategies proved effective in improving speech-to-text translation and can be adapted to punctuation retrieval.
Other possible noise injection techniques include splitting words into subwords, as well as homophone-based or random substitution, insertion, and deletion at the subword level.
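To make the word-level scheme concrete, the sketch below injects noise along the lines described by \cite{noisy}; the probabilities, the use of a single unknown token, and the handling of label alignment are my own assumptions.
\begin{verbatim}
import random

# Illustrative word-level noise injection: random
# deletion, plus insertion or substitution of an
# unknown token. Probabilities are placeholders.
# Note: the label sequence must be kept aligned
# with the surviving tokens (not shown here).
def add_word_noise(tokens, p_del=0.05, p_sub=0.05,
                   p_ins=0.05, unk="<unk>"):
    noisy = []
    for tok in tokens:
        r = random.random()
        if r < p_del:
            continue                # deletion
        elif r < p_del + p_sub:
            noisy.append(unk)       # substitution
        else:
            noisy.append(tok)       # keep as-is
        if random.random() < p_ins:
            noisy.append(unk)       # insertion
    return noisy
\end{verbatim}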
\section{Models}
There are two categories of models that have been applied to the task of punctuation retrieval: Recurrent Neural Networks and feed-forward networks.
The current state-of-the-art models as of 2020 feature a BERT (variant) base layer with a variety of layer combinations above it. These models take a sequence of text represented as tokens or word embeddings as input and output a sequence of corresponding punctuation labels, similar to other sequence tagging tasks like Part-of-Speech (POS) or Named-Entity Recognition (NER) tagging.
\subsection{Tag decoders}
The final stage of a punctuation retrieval model consists of a decoder which generates a sequence of punctuation labels from a sequence of representations.
\subsubsection{Conditional Random Fields (CRF)}
CRFs \citep{crf} are discriminative, undirected Markov models which represent a conditional probability distribution of a structured output variable $y$ given an observation $x$. They are commonly used as an alternative to a Multi-layer Perceptron (MLP) head.
\cite{dynamiccrf} trained their model jointly on punctuation prediction and sentence boundary labelling. They demonstrate the effectiveness of the dynamic CRF in improving punctuation prediction by learning a sentence boundary tag (e.g. start of question sentence or within declarative sentence) along with the punctuation tags. The introduction of sentence boundary information allows long-range dependencies to be captured by the model, and helps to relax the strong dependency assumptions associated with linear-chain CRFs.
Likewise, \cite{crfhighorderdependencies} demonstrate that high-order CRFs and semi-Markov CRFs outperform linear-chain CRFs and first-order semi-Markov CRFs on the punctuation prediction task. They proposed introducing new tags denoting the start of sequences, similar to the second set of labels introduced by \citet{dynamiccrf}, which allows the model to better capture long-range dependencies.
\cite{hybridsemimarkovcrf} presents a similar model that also achieves performance gains over the linear-chain CRF and the Gated Recursive Semi-Markov CRF (grSemi-CRF) \citep{grsemicrf} on the NER task.
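For reference, decoding with a linear-chain CRF reduces to a Viterbi search over per-token emission scores and a learned transition matrix. The sketch below shows only the decoding step, not the CRF training objective, and is written in plain NumPy for clarity.
\begin{verbatim}
import numpy as np

# Viterbi decoding for a linear-chain CRF.
#   emissions: (T, K) per-token label scores
#   trans:     (K, K) transition scores, where
#              trans[i, j] scores label i -> j
def viterbi_decode(emissions, trans):
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = score[:, None] + trans
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0) + emissions[t]
    # follow back-pointers from the best end label
    best = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        best.append(int(back[t, best[-1]]))
    return best[::-1]
\end{verbatim}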
\subsection{\texorpdfstring{Recurrent Neural Network / \\ Multi-layered RNN}{Recurrent Neural Network / Multi-layered RNN}}
Prior to the release of the BERT models, BiLSTMs were among the best performing models for NLP, being able to utilise information from both directions when predicting each element of the output sequence. The introduction of gates within the Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) cells allows the model to preserve information across time-steps, giving it the ability to capture long-term dependencies.
\cite{birnnattention} demonstrates the effectiveness of bi-directionality in improving the performance of punctuation retrieval, as punctuation relies on cues from words in both directions. They also observed a slight improvement in restoration of question-marks when an attention mechanism is added to determine the relative importance of each BiLSTM output in determining each output label.
\cite{kim_2019} built upon this research, utilising a deep RNN model with a multi-head attention mechanism as proposed by \cite{attentionisallyouneed}. It relies on the stacked RNNs to extract sequential context before using multi-head attention to focus on the context at each time-step. The use of multiple RNN layers allows the model to learn various representations of the sequence, and the reported results show a positive relationship between the number of RNN layers and the F1 score of the model. The paper does not provide any source code or details regarding the training process. Even so, the sequential nature of the model and its depth make it harder to tune and slower to train than a single BiLSTM layer, and possibly even a BERT model. Further research is needed to evaluate the efficiency and performance of stacked RNN layers compared to the BERT model.
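Since the paper releases no code, the PyTorch sketch below reflects only my reading of such an architecture; the dimensions, number of layers, and the way the attention output feeds the classifier are assumptions.
\begin{verbatim}
import torch.nn as nn

# Rough sketch of a stacked BiLSTM tagger with
# multi-head self-attention over its outputs.
# Dimensions and layout are assumptions.
class RnnAttnTagger(nn.Module):
    def __init__(self, vocab, n_labels,
                 emb=300, hidden=256,
                 layers=3, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hidden,
                           num_layers=layers,
                           bidirectional=True,
                           batch_first=True)
        self.attn = nn.MultiheadAttention(
            2 * hidden, heads, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, tokens):
        x = self.embed(tokens)
        h, _ = self.rnn(x)
        # self-attention over the RNN states
        c, _ = self.attn(h, h, h)
        return self.out(c)  # (B, T, n_labels)
\end{verbatim}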
\subsection{Pretrained models}
\subsubsection{GloVe}
The use of word embeddings pretrained on a large generic corpus (e.g. Wikipedia or Common Crawl) allows the model to generalise better to unseen examples, as words with similar meanings have similar representations.
Both \citet{birnnattention} and \citet{kim_2019} obtained visible F1 score improvements when using pretrained GloVe \citep{glove} embeddings as compared to randomly initialised word embeddings.
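As an aside, building an embedding matrix from the publicly distributed GloVe text files (one word followed by its vector per line) is straightforward; the handling of out-of-vocabulary words below is an assumption.
\begin{verbatim}
import numpy as np

# Build an embedding matrix from a GloVe text
# file; words missing from GloVe are randomly
# initialised.
def load_glove(path, vocab, dim=300):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(
                parts[1:], dtype=np.float32)
    mat = np.random.normal(
        0, 0.1, (len(vocab), dim)
    ).astype(np.float32)
    for i, word in enumerate(vocab):
        if word in vectors:
            mat[i] = vectors[word]
    return mat
\end{verbatim}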
\subsubsection{BERT}
The BERT model is pretrained on two tasks: Masked Language Modelling (MLM) and Next Sentence Prediction (NSP), which allows the model to learn bidirectional representations. Pretraining on a large corpus gives the model a preliminary understanding of the language, reducing training time and increasing performance on the fine-tuned task.
\paragraph{Self-attention}
The use of self-attention brings many benefits over RNNs. It reduces the computational complexity of each layer, increases the parallelism of the model, and reduces the path lengths of long-range dependencies, allowing the model to better learn such dependencies \citep{attentionisallyouneed}.
\paragraph{Positional Encodings}
The attention mechanism itself is order-invariant, so sequential information is injected through positional encodings. The implementation by \cite{attentionisallyouneed} suggests a fixed representation of sine and cosine functions of different frequencies:
\[
PE_{(pos,2i)} = \sin\left(pos/10000^{2i/d_{\text{model}}}\right), \qquad
PE_{(pos,2i+1)} = \cos\left(pos/10000^{2i/d_{\text{model}}}\right).
\]
The paper also claims that learned positional embeddings produce similar levels of performance. \cite{positionembedding} analyse the ability of transformer models to learn the meaning of positions, and discover that BERT and RoBERTa perform much worse than Generative Pretraining (GPT-2) and the sinusoidal encoding in capturing positional information.
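The fixed encoding above can be computed directly; the short NumPy sketch below assumes an even model dimension and is included only to illustrate the formula.
\begin{verbatim}
import numpy as np

# Sinusoidal positional encodings; assumes an
# even model dimension d_model.
def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
\end{verbatim}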
\cite{floater} presents an alternative approach --- FLOATER --- to encode positional information for non-recurrent models that can potentially improve the performance of punctuation retrieval. The paper proposes to use a dynamical system to model the position representations, characterised by the function: \[p(t)=p(s)+\int_s^t{ h(\tau,p(\tau);\theta_h)d\tau}, \quad 0\leq s \leq t < \infty\]
The model appears to perform better on the task of machine translation and is claimed to be inductive, performing well on long sentences. While the method adds a non-negligible time and memory overhead, the paper proposes several methods to streamline the process, allowing for minimal overhead when applying it to fine-tune BERT models.
\paragraph{BERT variants}
\cite{medicalasr} and \cite{efficientbertrobust} compare various variants of BERT. In general, RoBERTa performs better than BERT, either due to its significantly larger pretraining corpus or the removal of the NSP task. Distillation of the model is a possible alternative where training or inference efficiency is more critical than accuracy, giving a model that is 12\% smaller with a 1.2x inference speedup but a 9.1\% lower accuracy \citep{efficientbertrobust}.
\subsubsection{BERT-(BiLSTM)-CRF}
A common architecture featured in many papers is BERT-BiLSTM-CRF. However, there is little evidence that adding a BiLSTM layer increases the performance of the tagger. \cite{bertcrf} performed a comparison between different combinations of BERT with BiLSTM and CRF, and the addition of the BiLSTM layer led to a drop in the model's performance. \cite{chinesebertbilstm} presented the BERT-BiLSTM and BERT-BiLSTM-CRF models but not the BERT-CRF or BERT model, providing no information about the effectiveness of the BiLSTM layer.
Similarly, \citet{rosvall2019comparison} observed a slight performance gain on NER when using a CRF layer on top of BERT rather than a feedforward network.
\citet{tuneornottotune} demonstrates that the BERT-CRF model can outperform the BERT-BiLSTM-CRF model when the base layer is unfrozen, and that gradual unfreezing (unfreezing single layers and training for one epoch each, from the last to the first layer) is necessary to match the performance of the frozen BERT-BiLSTM-CRF. This demonstrates the ability of the BERT model's attention mechanism to act as a substitute for the BiLSTM layer if trained properly.
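A rough sketch of gradual unfreezing with a Hugging Face token-classification model is given below; the attribute path to the encoder layers and the single-epoch training helper are assumptions about one particular implementation, not the procedure of the cited paper.
\begin{verbatim}
# Sketch of gradual unfreezing: start with the
# BERT encoder frozen, then unfreeze one
# transformer layer per epoch, from the last
# layer downwards. Assumes a Hugging Face
# BertForTokenClassification model and a
# hypothetical train_one_epoch() helper.
def gradual_unfreeze(model, train_one_epoch,
                     n_epochs):
    layers = model.bert.encoder.layer
    for p in model.bert.parameters():
        p.requires_grad = False
    for epoch in range(n_epochs):
        idx = len(layers) - 1 - epoch
        if idx >= 0:
            # unfreeze the next layer from the top
            for p in layers[idx].parameters():
                p.requires_grad = True
        train_one_epoch(model)
\end{verbatim}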
I initially hypothesised that a BERT layer with a high-order semi-Markov CRF tag decoder would be able to outperform the BERT-BiLSTM-(linear) CRF model. However, while a standalone high-order or semi-Markov CRF can outperform a linear-chain CRF, the attention mechanism within the BERT layer might already represent such higher-order dependencies, reducing the benefit of a high-order or semi-Markov CRF over the linear-chain CRF. Since a punctuation label is also highly dependent on the subsequent word, considering both the current and the next input might be more effective than the current input alone; whether this corresponds to a second-order CRF remains to be verified. A CRF also jointly predicts the sequence of punctuation labels rather than each label individually, favouring more probable label sequences.
Several questions remain open: how a semi-Markov CRF compares with a weak semi-Markov CRF; whether training on text with inserted punctuation labelled entirely as blank would help, or simply over-represent the blank label; and whether a weighted self-adjusting soft dice loss would mitigate such over-representation.
\subsection{Multi-task learning}
\begin{enumerate}
\item Punctuation with POS-tagger, adversarial learning \citep{adversarial}
\item Capitalisation \citep{jointlearningcorrbirnn, chunkmerging}
\item Sentence boundary prediction \citep{dynamiccrf}
\item Use of acoustic features
\end{enumerate}
\section{Training process}
\subsection{Overlapping chunks}
\citet{chunkmerging} look into the impact of overlapping the successive chunks fed into the model for inference, and observe that labels at the ends of the output sequence tend to be less accurate than those in the middle. Thus, using a 50\% stride and discarding the overlapped heads and tails of sequences contributed to a substantial improvement in their punctuation retrieval model.
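A sketch of such overlapped-chunk inference is given below; the chunk size and the exact rule for merging overlapping predictions are assumptions, as the cited strategy is only summarised above.
\begin{verbatim}
# Inference over long inputs with 50% overlap
# between successive chunks. Each interior
# position is covered by two chunks; we keep the
# prediction from the chunk in which it is more
# central. predict_chunk() is assumed to return
# one label per input token.
def predict_long(tokens, predict_chunk,
                 chunk=200):
    stride = chunk // 2
    quarter = chunk // 4
    n = len(tokens)
    out = [None] * n
    start = 0
    while start < n:
        piece = tokens[start:start + chunk]
        preds = predict_chunk(piece)
        lo = 0 if start == 0 else quarter
        hi = len(piece)
        if start + chunk < n:
            hi = chunk - quarter
        for j in range(lo, hi):
            out[start + j] = preds[j]
        if start + chunk >= n:
            break
        start += stride
    return out
\end{verbatim}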
\subsection{Dealing with class imbalance}
There are various possible approaches to dealing with imbalance across punctuation classes. The bulk of research deals with this imbalance by absorbing the minority classes into a more generic class like period or comma. This is feasible as most punctuation can be generalised into two classes --- periods for sentence boundaries and commas for separating different parts of a sentence. However, this merely sidesteps the problem, with punctuation like (?) remaining under-represented. Dealing with class imbalance directly can improve the performance on weaker classes like (?) or (!) and can even allow for the training of a system that retrieves other classes of punctuation like (``") or (---).
\subsubsection{Weighted Cross-entropy (W-CEL)}
Multiplying the component losses by a weight inversely proportional to the class frequency within the corpus allows the model to converge faster on the weaker classes. A general formula for this loss function is given by: \[
J_{wcel}=-\frac{1}{M}\sum_{k=1}^{K}\sum_{m=1}^{M}w_k \, y_m^k \log f_k(x_m)
\] where $w_k$ represents the class weights, $M$ the number of training examples, $K$ the number of classes, and $f_k(x_m)$ the predicted probability of class $k$ for example $x_m$.
This approach was utilised by \cite{adaptivenerunbalanceddata} in their binary classifier to identify weaker classes. \citet{efficientbertrobust} mentions that their implementation of focal loss or class weights did not outperform the generic cross-entropy loss. This might be due to the use of just three punctuation classes with only a slightly weaker (?) class, making any gains less pronounced. The lack of parameter tuning in their experiment may also have reduced the performance of the models using focal loss or class weights.
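In PyTorch, such weights can be passed directly to the cross-entropy loss; the sketch below uses inverse-frequency weights, with the normalisation being an arbitrary choice.
\begin{verbatim}
import torch
import torch.nn as nn
from collections import Counter

# Cross-entropy with class weights inversely
# proportional to label frequency.
def make_weighted_ce(train_label_ids, n_classes):
    counts = Counter(train_label_ids)
    total = sum(counts.values())
    weights = torch.tensor(
        [total / max(counts[c], 1)
         for c in range(n_classes)],
        dtype=torch.float)
    weights = weights / weights.sum()
    return nn.CrossEntropyLoss(weight=weights)

# usage (K = 4 classes, logits: (N, K),
# targets: (N,) integer labels):
#   loss_fn = make_weighted_ce(train_labels, 4)
#   loss = loss_fn(logits, targets)
\end{verbatim}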
\subsubsection{Focal Loss}
\citet{focallosspunct} demonstrated the effectiveness of Focal Loss over Cross-entropy loss, obtaining a clear improvement both in their controlled experiments and when compared to the Bert Punct (Base) model \citep{pandababa}.
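A minimal multi-class focal loss can be written as below; the $\gamma$ value is a placeholder, and the formulations used in the cited papers may differ in details such as an additional class-weighting term.
\begin{verbatim}
import torch.nn.functional as F

# Multi-class focal loss: down-weights easy,
# confident examples by (1 - p_t)^gamma.
#   logits: (N, K), targets: (N,) integer labels
def focal_loss(logits, targets, gamma=2.0):
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(
        1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-((1 - pt) ** gamma) * log_pt).mean()
\end{verbatim}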
\subsubsection{Sørensen–Dice coefficient / F1 Score}
%On the Bayes-Optimality of F-Measure Maximizers
%Optimization for Medical Image Segmentation: Theory and Practice When Evaluating With Dice Score or Jaccard Index
\citet{li2020dice} proposed the use of a self-adjusting soft dice loss as an alternative to cross-entropy loss. The Sørensen–Dice coefficient (DSC): $DSC(A,B)=\frac{2 \times\abs{A \cap B}}{\abs{A} + \abs{B}}$ is used to measure the similarity of two sets. The proposed loss function introduces a decaying factor into the DSC to down-weight easy, confident examples, allowing the training process to focus on the less confident ones. Their proposed loss function is as follows: \[
DSC(x_i)=\frac{2{(1-p_{i1})}^{\alpha}p_{i1}\cdot y_{i1}+\lambda}{{(1-p_{i1})}^{\alpha}p_{i1}+ y_{i1}+\lambda} ,\] with $p_{i1}$ being the predicted probability that token $x_i$ belongs to the positive class, and $y_{i1}$ the corresponding ground-truth indicator.
The model trained with the DSC loss achieved a higher F1 score across all evaluated tasks as compared to Focal Loss and Dice Loss. The results obtained when trained on datasets of varying levels of imbalance further supports the hypothesis that the DSC loss is effective in combating the issue of class imbalance.
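Following the formula above, the sketch below applies the loss per class in a one-vs-rest fashion and averages the result; the $\alpha$ and $\lambda$ values and the reduction over classes are my assumptions.
\begin{verbatim}
import torch.nn.functional as F

# Self-adjusting soft dice loss, applied per
# class (one-vs-rest) and averaged. alpha and
# lam are placeholder hyperparameters.
#   logits: (N, K), targets: (N,) integer labels
def self_adjusting_dice(logits, targets,
                        alpha=1.0, lam=1.0):
    probs = F.softmax(logits, dim=-1)
    onehot = F.one_hot(
        targets, probs.size(-1)).float()
    w = ((1 - probs) ** alpha) * probs
    dsc = (2 * w * onehot + lam) / (
        w + onehot + lam)
    return (1 - dsc).mean()
\end{verbatim}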
% \citet{dicescorejaccardindex}
\subsection{Transfer learning for unseen domain}
\citet{medicalasr} demonstrated the effectiveness of fine-tuning the pretrained BERT model on medical-domain data in improving the performance of medical transcript punctuation retrieval. This proved effective, but also highlighted the inability of a BERT model trained on non-medical data to generalise to the medical domain, obtaining F1 scores of 0.2 and 0.39 on the Period and Comma classes respectively. This shows the importance of selecting a training corpus similar to the target domain, with fewer unknown tokens in the test set, so that the surface features the model learns carry over to the target domain. This section looks further into possible ways to improve performance on unseen domains, or domains with minimal training data, where the target domain contains minimal out-of-vocabulary tokens.
\subsubsection{Attaining linguistic generalisation}
\citet{robertaAcquireLinguisticPreference} study the ability of RoBERTa to prefer linguistic over surface generalisations during the pretraining phase. They obtain the following results:
(1) models learn to represent both surface features and linguistic features with relatively little data;
(2) RoBERTa begins to acquire a linguistic bias with over 1B words of pretraining data;
(3) increasing pretraining data strengthens linguistic bias;
(4) there is considerable variation in models’ preferences between specific pairs of linguistic and surface features \citep{robertaAcquireLinguisticPreference}.
In addition, their inclusion of manually designed inoculation data ranging from 0\% to 1\% allowed the models to attain stronger linguistic generalisation. Thus, increasing the amount of training data and the inclusion of some inoculating data manually or programmatically are possible approaches for punctuation retrieval transfer learning.
\textbf{Multi-task learning} An approach taken by various researchers \citep{adversarial,jointlearningcorrbirnn,dynamiccrf} is to use multi-task learning (i.e. POS tagging, capitalisation prediction, sentence boundary prediction) to guide the model towards a more linguistically generalised solution.
\citet{domainAdaptationBERT} presents a method to improve domain adaptation on unlabelled target data, but requires some examples of target data before training begins. The method features a BERT classifier to sort source data in decreasing order of similarity to target data. This sorted source data is then fed into the learning algorithm, allowing the model to learn examples of increasing difficulty.
\citet{clim} tackles the same issue using a different approach. They propose
% \section*{}
\bibliography{asr.bib}
\end{multicols}
\end{document}