We evaluate Recall and Macro F1 on a small and a large dataset derived from the home automation dataset of “Benchmarking Natural Language Understanding Services for Building Conversational Agents” (2019). The data is available on GitHub.
The benchmark was conducted between 28th July and 4th August 2022.
All models were trained on a single fold, with the test set kept as a holdout set during training.
Full predictions on the test set, together with their confidence scores, are aligned with the corresponding ground-truth labels and provided in the folder.
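For reference, the sketch below shows how such an aligned prediction file could be scored. The file name, column names, and the macro averaging mode for Recall are assumptions for illustration, not the repository's actual layout.

```python
# Minimal scoring sketch, assuming a hypothetical CSV where each row pairs
# a ground-truth intent with a service's predicted intent. File and column
# names are illustrative, not the actual layout in this repository.
import pandas as pd
from sklearn.metrics import f1_score, recall_score

df = pd.read_csv("predictions.csv")          # hypothetical file name
y_true = df["ground_truth"]                  # hypothetical column name
y_pred = df["predicted_intent"]              # hypothetical column name

# Macro averaging weights every intent equally, regardless of how many
# test sentences it has. The averaging mode for Recall is an assumption.
print("Recall:", recall_score(y_true, y_pred, average="macro"))
print("F1 (Macro):", f1_score(y_true, y_pred, average="macro"))
```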
Disclaimer:
- Google Cloud AutoML:
  - The benchmark results are based on the confidence threshold that yields the best F1-Score.
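A minimal sketch of that threshold selection, assuming predictions whose confidence falls below the candidate threshold are mapped to a fallback label before scoring. The fallback convention and all names here are assumptions, not AutoML's documented behaviour.

```python
# Sweep candidate thresholds and keep the one that maximises Macro F1.
# Assumption: low-confidence predictions are replaced with a fallback
# label rather than dropped; this is illustrative, not AutoML's API.
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_pred, confidences, fallback="no_intent"):
    best_t, best_f1 = 0.0, -1.0
    for t in np.arange(0.0, 1.0, 0.01):
        adjusted = [p if c >= t else fallback
                    for p, c in zip(y_pred, confidences)]
        score = f1_score(y_true, adjusted, average="macro", zero_division=0)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1
```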
640 Training Sentences - 10 Sentences per Intent
1076 Test Sentences
| Metric | Sprinklr | Google Cloud | Azure Language Studio | AWS Comprehend |
| --- | --- | --- | --- | --- |
| Recall | 0.867 | 0.782 | 0.789 | 0.725 |
| F1 (Macro) | 0.870 | 0.799 | 0.789 | 0.700 |
| | Sprinklr | Google Cloud | Azure Language Studio | AWS Comprehend |
| --- | --- | --- | --- | --- |
| Intent (Pred) | calendar_query | general_dontcare | general_dontcare | calendar_remove |
| Confidence | 0.73 | 0.15 | 0.49 | 0.09 |
| | Sprinklr | Google Cloud | Azure Language Studio | AWS Comprehend |
| --- | --- | --- | --- | --- |
| Intent (Pred) | alarm_query | alarm_set | alarm_set | alarm_set |
| Confidence | 0.7 | 0.96 | 1.0 | 0.27 |
1908 Training Sentences - ~30 Sentences per Intent
5518 Test Sentences
| Metric | Sprinklr | Google Cloud | Azure Language Studio | AWS Comprehend |
| --- | --- | --- | --- | --- |
| Recall | 0.901 | 0.836 | 0.860 | 0.876 |
| F1 (Macro) | 0.903 | 0.862 | 0.860 | 0.867 |
| | Sprinklr | Google Cloud | Azure Language Studio | AWS Comprehend |
| --- | --- | --- | --- | --- |
| Intent (Pred) | qa_factoid | qa_currency | qa_maths | general_quirky |
| Confidence | 0.72 | 0.42 | 0.34 | 0.30 |
| | Sprinklr | Google Cloud | Azure Language Studio | AWS Comprehend |
| --- | --- | --- | --- | --- |
| Intent (Pred) | calendar_query | calendar_remove | general_quirky | calendar_set |
| Confidence | 0.74 | 0.85 | 0.73 | 0.29 |