Chatbot_Evaluation_Metrics.md
Context

Chatbot evaluation metrics can be examined from three perspectives: 1) metrics for AI-based chatbots, 2) metrics for generative AI-based chatbots, and 3) metrics for evaluating a complete chatbot system. In the following, I will introduce the metrics used in each.

Common evaluation metrics for AI-based chatbots

The common evaluation metrics for AI-based chatbots include:

  • Perplexity: a measure of how well a language model predicts the next word in a sentence. It is calculated as the exponential of the average negative log-probability the model assigns to the test set, with lower values indicating a better model [1, 2] (a short computational sketch follows this list).

  • BLEU: BLEU (Bilingual Evaluation Understudy) is a measure of the similarity between the predicted response and the reference response. It compares the n-grams (i.e., sequences of words) in the predicted response to the n-grams in the reference response, and calculates a score based on how many n-grams match. A higher BLEU score indicates a higher degree of similarity between the predicted response and the reference response [3-5].

  • METEOR: METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a measure of the similarity between the predicted response and the reference response, which also takes into account synonyms and stemming. It uses an alignment algorithm to align the words in the predicted response and reference response, and calculates a score based on how many aligned words match. A higher METEOR score indicates a higher degree of similarity between the predicted response and the reference response [6-8].

  • ROUGE: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a measure of the similarity between the predicted response and the reference response, which is also commonly used for summarization tasks. It compares the n-grams in the predicted response to the n-grams in the reference response, and calculates a recall score based on how many n-grams in the reference response also appear in the predicted response. A higher ROUGE score indicates a higher degree of similarity between the predicted response and the reference response [9-11] (a combined usage sketch for BLEU, METEOR, and ROUGE follows this list).

  • Embedding Average Cosine Similarity (EACS): EACS is a measure of the semantic similarity between the predicted response and the reference response, based on word embeddings. It calculates the cosine similarity between the average word embeddings of the predicted response and the reference response. A higher EACS score indicates a higher degree of semantic similarity between the two responses [12-16] (see the sketch after this list).

  • Intent accuracy: Intent accuracy is a measure of how well the chatbot recognizes the user's intent. It is calculated as the proportion of user inputs that are correctly classified by the chatbot's intent recognition module.

  • Task completion rate: Task completion rate is a measure of how well the chatbot is able to complete the task assigned to it. It is calculated as the proportion of user inputs for which the chatbot is able to provide a correct and complete response.
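
A minimal sketch of the perplexity calculation, assuming the per-token log-probabilities have already been obtained from whatever language model is being evaluated (the numbers below are made up for illustration):

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities over a test set.

    Lower perplexity means the model assigns higher probability to the
    observed text. The log-probabilities are assumed to come from the
    language model under evaluation.
    """
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)

# Toy example: three tokens predicted with probabilities 0.5, 0.25, 0.125.
print(perplexity([math.log(0.5), math.log(0.25), math.log(0.125)]))  # ~4.0
```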
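
The n-gram overlap metrics (BLEU, METEOR, ROUGE) are most easily computed with existing libraries. A usage sketch, assuming the `nltk` and `rouge-score` packages are installed (and `nltk.download('wordnet')` has been run for METEOR); the example sentences are invented:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

reference = "the bot booked a table for two at seven"
prediction = "the bot reserved a table for two at seven"
ref_tokens, pred_tokens = reference.split(), prediction.split()

# BLEU: n-gram precision overlap; smoothing avoids zero scores on short texts.
bleu = sentence_bleu([ref_tokens], pred_tokens,
                     smoothing_function=SmoothingFunction().method1)

# METEOR: unigram matching with stemming and synonym support
# (recent NLTK versions expect pre-tokenized inputs).
meteor = meteor_score([ref_tokens], pred_tokens)

# ROUGE-L: longest-common-subsequence based precision/recall/F1.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = rouge.score(reference, prediction)["rougeL"].fmeasure

print(f"BLEU={bleu:.3f}  METEOR={meteor:.3f}  ROUGE-L={rouge_l:.3f}")
```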
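
EACS can be sketched in a few lines, assuming pre-trained word embeddings (e.g., GloVe or word2vec) are available as a token-to-vector lookup; the tiny embedding table below is made up purely for illustration:

```python
import numpy as np

def sentence_embedding(tokens, embeddings):
    """Average the word vectors of the tokens found in the embedding table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0) if vectors else None

def eacs(predicted, reference, embeddings):
    """Cosine similarity between the averaged embeddings of two responses."""
    a = sentence_embedding(predicted.split(), embeddings)
    b = sentence_embedding(reference.split(), embeddings)
    if a is None or b is None:
        return 0.0
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" standing in for real pre-trained vectors.
embeddings = {
    "hello": np.array([0.9, 0.1, 0.0]),
    "hi":    np.array([0.8, 0.2, 0.1]),
    "there": np.array([0.1, 0.9, 0.2]),
}
print(eacs("hello there", "hi there", embeddings))
```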

Common evaluation metrics for Generative AI-based chatbots

For Generative AI-based chatbots, such as Blenderbot or DialoGPT, some additional evaluation metrics that are commonly used include:

  • Distinct n-gram: Distinct n-gram is a measure of the diversity of the generated text. It is calculated as the proportion of unique n-grams (i.e., sequences of words) in the generated text. The higher the distinct-n value, the more diverse the generated text, and hence the more likely it is to be human-like (see the sketch after this list).

  • Self-BLEU: Self-BLEU is a measure of how similar the model's generated responses are to each other, and is commonly used as a diversity metric. Each generated response is scored with BLEU against the other generated responses, and the scores are averaged. A lower Self-BLEU score indicates more diverse (less repetitive) generated text (see the sketch after this list).

  • Fraternity: Fraternity is a measure of the similarity of the generated text to the text generated by other models trained on the same data. It compares the generated text to the text generated by other models, and calculates a score based on how many n-grams match. A higher Fraternity score indicates a higher degree of similarity between the generated text and the text generated by other models.

  • Human evaluation: Human evaluation is a measure of the quality of the generated text as judged by human evaluators. This can include criteria such as coherence, fluency, and overall quality. Human evaluation can be carried out through surveys, where evaluators rate the quality of the generated text on a scale, or through a Turing-test-style setup, where evaluators are asked to judge whether the text was generated by a machine or by a human.

  • Adversarial evaluation: Adversarial evaluation is a measure of the model's ability to generate text that is indistinguishable from text written by humans. This is achieved by training a discriminator model to distinguish between human-written text and machine-generated text, and then using the discriminator to evaluate the quality of the generated text. A high performance on adversarial evaluation indicates that the generated text is difficult to distinguish from human-written text.
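
Distinct-n is straightforward to compute directly. A short sketch over a set of generated responses (the sample responses are invented):

```python
def distinct_n(responses, n):
    """Proportion of unique n-grams across all generated responses."""
    ngrams = []
    for response in responses:
        tokens = response.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

generated = ["i like pizza", "i like pasta", "tell me more about pasta"]
print(distinct_n(generated, 1), distinct_n(generated, 2))  # distinct-1, distinct-2
```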
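
Self-BLEU, in the sense described above, scores each generated response against the remaining generated responses and averages the results. A sketch assuming `nltk` is installed (the sample responses are again invented):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(generated, weights=(0.5, 0.5)):
    """Average BLEU of each generated response against all the others.

    Lower values indicate a more diverse (less self-similar) set of outputs.
    """
    smoother = SmoothingFunction().method1
    scores = []
    for i, hypothesis in enumerate(generated):
        references = [r.split() for j, r in enumerate(generated) if j != i]
        scores.append(sentence_bleu(references, hypothesis.split(),
                                    weights=weights,
                                    smoothing_function=smoother))
    return sum(scores) / len(scores)

generated = ["i like pizza", "i like pasta", "tell me more about pasta"]
print(self_bleu(generated))
```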

Common evaluation metrics for a whole chatbot system

When evaluating a whole chatbot system, which includes both the natural language understanding (NLU) and natural language generation (NLG) components, common evaluation metrics include:

  • Task completion rate: Task completion rate is a measure of how well the chatbot is able to complete the task assigned to it. It is calculated as the proportion of user inputs for which the chatbot is able to provide a correct and complete response.

  • Intent accuracy: Intent accuracy is a measure of how well the chatbot recognizes the user's intent. It is calculated as the proportion of user inputs that are correctly classified by the chatbot's intent recognition module.

  • Dialogue-level metrics: a measure of how well the chatbot is able to carry out a dialogue with the user, such as dialogue length (i.e., the number of turns in a dialogue), and the number of successful dialogues (i.e., the number of dialogues that successfully reach the desired outcome).

  • User satisfaction: a measure of how satisfied users are with the chatbot, which can be assessed through surveys or interviews. It can include metrics such as chatbot usability, helpfulness, and friendliness.

  • Engagement: a measure of how long users spend interacting with the chatbot, how many interactions they have, and how often they return to it. It can include metrics such as session duration, number of interactions, and return rate.

  • Retention: a measure of how many users return to the chatbot after their first interaction. It can be calculated as the proportion of users who return to the chatbot after their first interaction.

  • Business-level metrics: a measure of how well the chatbot meets business objectives, such as customer service efficiency (how quickly the chatbot responds to customer queries), conversion rate (how many users complete a desired action, e.g., making a purchase), and revenue (how much revenue the chatbot generates). A sketch computing several of these system-level metrics from interaction logs follows this list.
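
Several of these system-level metrics are simple proportions over interaction logs. A minimal sketch, assuming a hypothetical log format with one record per dialogue session (field names and values are made up):

```python
from collections import Counter

# Hypothetical interaction log: one record per dialogue session.
sessions = [
    {"user": "u1", "turns": 6, "intent_correct": 5, "task_completed": True,  "converted": True},
    {"user": "u2", "turns": 3, "intent_correct": 3, "task_completed": False, "converted": False},
    {"user": "u1", "turns": 4, "intent_correct": 4, "task_completed": True,  "converted": False},
]

# Task completion rate: share of sessions in which the task was completed.
task_completion_rate = sum(s["task_completed"] for s in sessions) / len(sessions)

# Intent accuracy: correctly classified turns over all turns.
intent_accuracy = sum(s["intent_correct"] for s in sessions) / sum(s["turns"] for s in sessions)

# Average dialogue length (turns per session).
avg_dialogue_length = sum(s["turns"] for s in sessions) / len(sessions)

# Retention: users who came back for more than one session, over all users.
session_counts = Counter(s["user"] for s in sessions)
retention = sum(1 for c in session_counts.values() if c > 1) / len(session_counts)

# Conversion rate: sessions that ended in the desired action (e.g., a purchase).
conversion_rate = sum(s["converted"] for s in sessions) / len(sessions)

print(task_completion_rate, intent_accuracy, avg_dialogue_length, retention, conversion_rate)
```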

References: