.ctm in data simulator annotator compliant with RT-09 specification #8004

Merged
merged 47 commits into from
Jan 8, 2024
Changes from 16 commits
Commits
45baf76
.ctm fix for data simulation
popcornell Dec 9, 2023
6528af1
Merge branch 'main' into ctm
tango4j Dec 9, 2023
6e2aa4e
Merge branch 'main' into ctm
tango4j Dec 11, 2023
c9a5b53
.ctm fix, channel should be 1 not 0
popcornell Dec 11, 2023
6a278db
Merge remote-tracking branch 'popcornell/ctm' into ctm
popcornell Dec 11, 2023
ca27864
.ctm fix, only two na, type and confidence
popcornell Dec 11, 2023
8b62af8
Revised all the parts in NeMo touching CTM files
tango4j Dec 12, 2023
a4dd387
Making all files writing CTM uses the same function
tango4j Dec 12, 2023
dcfb24a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 12, 2023
ccb6005
Updated tutorial, nemo-docs and tests for CTM formats
tango4j Dec 12, 2023
f0e7ea7
Fixed type_of_token arg name in create_alignment_manifest script
tango4j Dec 12, 2023
e76e85a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 12, 2023
d447492
Fixed the docstrings in create_alignment_manifest.py
tango4j Dec 12, 2023
2e86423
Merge branch 'ctm' of https://github.com/popcornell/NeMo into ctm
tango4j Dec 12, 2023
9714854
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 12, 2023
343c9a7
Merge branch 'main' into ctm
tango4j Dec 12, 2023
0de3390
Some missing refactored variables for type_of_token
tango4j Dec 12, 2023
223dc9c
Merge branch 'ctm' of https://github.com/popcornell/NeMo into ctm
tango4j Dec 12, 2023
6e6344e
Another un-fixed part in data_simulation_utils.py
tango4j Dec 12, 2023
a0397ea
Merge branch 'main' into ctm
tango4j Dec 12, 2023
7659d0c
Merge branch 'main' into ctm
popcornell Dec 12, 2023
8d94c36
Reflected comments from PR
tango4j Dec 13, 2023
b4bf970
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 13, 2023
4896351
Merge branch 'main' into ctm
tango4j Dec 13, 2023
724adf3
Reflected another precision related comments from PR
tango4j Dec 13, 2023
a35a24e
Merge branch 'ctm' of https://github.com/popcornell/NeMo into ctm
tango4j Dec 13, 2023
86d5198
Updated tests to use decimal rounding of 2
tango4j Dec 13, 2023
604be4d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 13, 2023
28cc0d5
Changed beg_time to start_time and fixed unit tests
tango4j Dec 15, 2023
eb163fe
Merge branch 'ctm' of https://github.com/popcornell/NeMo into ctm
tango4j Dec 15, 2023
4a97f44
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 15, 2023
23331aa
Merge branch 'main' into ctm
tango4j Dec 15, 2023
566c0f3
Merge branch 'main' into ctm
tango4j Dec 18, 2023
2eaa24b
Merge branch 'main' into ctm
tango4j Dec 18, 2023
1166231
Fixed typos and errors in manifest_utils.py
tango4j Dec 18, 2023
3c82369
Resolved the merge conflicts
tango4j Dec 18, 2023
4c2421f
Resolved another merge conflict
tango4j Dec 18, 2023
9bd0e25
Merge branch 'main' into ctm
tango4j Dec 19, 2023
9d73f23
Merge branch 'main' into ctm
tango4j Dec 29, 2023
378a14d
Merge branch 'main' into ctm
stevehuang52 Jan 2, 2024
e6f16c7
Merge branch 'main' into ctm
tango4j Jan 2, 2024
a56d047
Fixed the test errors
tango4j Jan 5, 2024
3889596
Merge branch 'main' into ctm
tango4j Jan 5, 2024
851f716
Fixed the missed commented lines
tango4j Jan 5, 2024
68cf52f
Merge branch 'ctm' of https://github.com/popcornell/NeMo into ctm
tango4j Jan 5, 2024
56a2123
Merge branch 'main' into ctm
tango4j Jan 5, 2024
329bc0d
Merge branch 'main' into ctm
tango4j Jan 6, 2024
6 changes: 3 additions & 3 deletions docs/source/asr/speaker_diarization/datasets.rst
@@ -205,14 +205,14 @@ The following are descriptions about each field in an input manifest JSON file.
``ctm_filepath`` (Optional):

-CTM file is used for the evaluation of word-level diarization results and word-timestamp alignment. CTM file follows the following convention: ``<uniq-id> <speaker ID> <word start time> <word end time> <word> <confidence>`` Since confidence is not required for evaluating diarization results, it can have any value. Note that the ``<speaker_id>`` should be exactly matched with speaker IDs in RTTM.
+The CTM file is used for the evaluation of word-level diarization results and word-timestamp alignment. The CTM file follows this convention: ``<session name> <channel ID> <start time> <duration> <word> <confidence> <type of token> <speaker>``. Note that the ``<speaker>`` should exactly match speaker IDs in RTTM. Since confidence is not required for evaluating diarization results, we assign ``<confidence>`` the value ``NA``. For word tokens, we assign ``<type of token>`` the value ``lex``.

Example lines of CTM file:

.. code-block:: bash
-TS3012d.Mix-Headset MTD046ID 12.879 0.32 okay 0
-TS3012d.Mix-Headset MTD046ID 13.203 0.24 yeah 0
+TS3012d.Mix-Headset 1 12.879 0.32 okay NA lex MTD046ID
+TS3012d.Mix-Headset 1 13.203 0.24 yeah NA lex MTD046ID
Evaluation on Benchmark Datasets
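
To make the new eight-field layout concrete, here is a minimal sketch of reading one such line back into named fields. `CtmEntry` and `parse_ctm_line` are illustrative names that do not exist in NeMo; the sketch assumes whitespace-separated fields and treats an `NA` confidence as missing.

```python
from typing import NamedTuple, Optional


class CtmEntry(NamedTuple):
    # Hypothetical container; field order follows the convention documented above.
    session: str
    channel: str
    start: float
    duration: float
    word: str
    confidence: Optional[float]
    token_type: str
    speaker: str


def parse_ctm_line(line: str) -> CtmEntry:
    """Split one RT-09 style CTM line into its eight fields."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    session, channel, start, duration, word, conf, token_type, speaker = fields
    return CtmEntry(
        session=session,
        channel=channel,
        start=float(start),
        duration=float(duration),
        word=word,
        # "NA" means no confidence was computed, so map it to None.
        confidence=None if conf == "NA" else float(conf),
        token_type=token_type,
        speaker=speaker,
    )


entry = parse_ctm_line("TS3012d.Mix-Headset 1 12.879 0.32 okay NA lex MTD046ID")
print(entry.speaker, entry.start, entry.duration)
```

The speaker ID now sits in the last column rather than the second, so readers written for the old six-field layout would pick up the channel number instead.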
@@ -63,8 +63,6 @@ def main(cfg):

# If RTTM is provided and DER evaluation
if diar_score is not None:
-metric, mapping_dict, _ = diar_score
-
# Get session-level diarization error rate and speaker counting error
der_results = OfflineDiarWithASR.gather_eval_results(
diar_score=diar_score,
19 changes: 17 additions & 2 deletions nemo/collections/asr/parts/utils/data_simulation_utils.py
@@ -25,7 +25,13 @@

from nemo.collections.asr.parts.preprocessing.perturb import AudioAugmentor
from nemo.collections.asr.parts.preprocessing.segment import AudioSegment
-from nemo.collections.asr.parts.utils.manifest_utils import read_manifest, write_ctm, write_manifest, write_text
+from nemo.collections.asr.parts.utils.manifest_utils import (
+    get_ctm_line,
+    read_manifest,
+    write_ctm,
+    write_manifest,
+    write_text,
+)
from nemo.collections.asr.parts.utils.speaker_utils import labels_to_rttmfile
from nemo.utils import logging

@@ -774,7 +780,16 @@
prev_align = 0 if i == 0 else alignments[i - 1]
align1 = round(float(prev_align + start), self._params.data_simulator.outputs.output_precision)
align2 = round(float(alignments[i] - prev_align), self._params.data_simulator.outputs.output_precision)
-text = f"{session_name} {speaker_id} {align1} {align2} {word} 0\n"
+text = get_ctm_line(
+    source=session_name,
+    channel=1,
+    beg_time=align1,
+    duration=align2,
+    token=word,
+    conf=None,
+    type_of_token='lex',
+    speaker=speaker_id,
+)
arr.append((align1, text))
return arr
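
The timing arithmetic in this hunk can be sketched in isolation. The helper below is illustrative only: `words_to_ctm_entries` is a hypothetical name, the line is formatted inline rather than through `get_ctm_line`, and skipping empty words is an assumption about how silence is represented in the word list.

```python
from typing import List, Tuple


def words_to_ctm_entries(
    session_name: str,
    speaker_id: str,
    words: List[str],
    alignments: List[float],
    start: float,
    precision: int = 3,
) -> List[Tuple[float, str]]:
    """Turn per-word end times into (begin_time, CTM line) pairs."""
    entries = []
    for i, word in enumerate(words):
        if word == "":
            # Assumption: empty strings mark silence and produce no CTM line.
            continue
        # alignments[i] is the end time of word i relative to the utterance,
        # so the previous end time is the begin time of the current word.
        prev_align = 0.0 if i == 0 else alignments[i - 1]
        begin = round(float(prev_align + start), precision)
        duration = round(float(alignments[i] - prev_align), precision)
        line = f"{session_name} 1 {begin} {duration} {word} NA lex {speaker_id}\n"
        entries.append((begin, line))
    return entries


for begin, line in words_to_ctm_entries(
    "sess", "spk0", ["", "hello", "world"], [0.5, 0.9, 1.4], start=10.0
):
    print(begin, line, end="")
```

Rounding both values matters here because differences such as `1.4 - 0.9` are not exactly representable in floating point and would otherwise leak long fractions into the file.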

64 changes: 64 additions & 0 deletions nemo/collections/asr/parts/utils/manifest_utils.py
@@ -33,6 +33,70 @@
from nemo.utils.data_utils import DataStoreObject


def get_ctm_line(
    source: str,
    channel: int,
    beg_time: float,
    duration: float,
    token: str,
    conf: float,
    type_of_token: str,
    speaker: str,
    NA_token: str = 'NA',
    UNK: str = 'unknown',
    default_channel: str = '1',
    output_precision: int = 3,
) -> str:
    """
    Get a line in Conversation Time Mark (CTM) format. This follows the CTM format described in the
    `Rich Transcription Meeting Eval Plan: RT09` document.

    CTM Format:
        <SOURCE><SP><CHANNEL><SP><BEG-TIME><SP><DURATION><SP><TOKEN><SP><CONF><SP><TYPE><SP><SPEAKER><NEWLINE>

    Reference:
        https://web.archive.org/web/20170119114252/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf

    Args:
        source (str): <SOURCE> is the name of the source file, session name or utterance ID.
        channel (int): <CHANNEL> is the channel number. Defaults to 1 when None is passed.
        beg_time (float): <BEG-TIME> is the begin time of the word.
Collaborator

I kind of feel that start_time is easier to understand than beg_time, without needing to look at the documentations

Contributor Author

I agree it is easier to understand and probably beg_time was an ill-suited choice in the initial .ctm specification.
On the other hand, if someone is using this function without looking at the specification maybe he should not use it.
IDK I think if we use the original names it is less confusion (even if yeah, it is less intuitive).

Collaborator

Changed beg_time to start_time and then mentioned the difference in the docstring.

        duration (float): <DURATION> is the duration of the word.
        token (str): <TOKEN> is the token or word for the current entry.
        conf (float): <CONF> is a floating point number between 0 (no confidence) and 1 (certainty). A value of “NA” is used (in CTM format data)
            when no confidence is computed and in the reference data.
        type_of_token (str): <TYPE> is the token type. The legal values of <TYPE> are “lex”, “frag”, “fp”, “un-lex”, “for-lex”, “non-lex”, “misc”, or “noscore”.
        speaker (str): <SPEAKER> is a string identifier for the speaker who uttered the token. This should be “null” for non-speech tokens and “unknown” when
            the speaker has not been determined.
        NA_token (str, optional): Token written in place of `conf` or `speaker` when they are None. Defaults to 'NA'.
        UNK (str, optional): Token written in place of `type_of_token` when it is None. Defaults to 'unknown'.
        default_channel (str, optional): Channel written when `channel` is None. Defaults to '1'.
        output_precision (int, optional): Number of decimal places used when rounding `beg_time` and `duration`. Defaults to 3.

    Returns:
        str: A line in CTM format filled with the given information.
    """
    VALID_TOKEN_TYPES = ["lex", "frag", "fp", "un-lex", "for-lex", "non-lex", "misc", "noscore"]
    if type(beg_time) != float:
        beg_time = round(float(beg_time), output_precision)
    if type(duration) != float:
        duration = round(float(duration), output_precision)
Collaborator

Currently, beg_time and duration do not get rounded if they are floats already. Please remove the if-statements, I don't think they are necessary.

Collaborator

Updated to always round the number. Also checking whether beg_time is either float or string containing floating point number.

    if channel is not None and type(channel) != int:
        channel = str(channel)
    if conf is not None and type(conf) != float:
        raise ValueError(f"`conf` must be a float, but got {type(conf)}")
    if conf is not None and not (0 <= conf <= 1):
        raise ValueError(f"`conf` must be between 0 and 1, but got {conf}")
    if type_of_token is not None and type(type_of_token) != str:
        raise ValueError(f"`type_of_token` must be a string, but got {type(type_of_token)}")
    if type_of_token is not None and type_of_token not in VALID_TOKEN_TYPES:
        raise ValueError(f"`type_of_token` must be one of {VALID_TOKEN_TYPES}, but got {type_of_token}")
    if speaker is not None and type(speaker) != str:
        raise ValueError(f"`speaker` must be a string, but got {type(speaker)}")
    channel = default_channel if channel is None else channel
    conf = NA_token if conf is None else conf
    speaker = NA_token if speaker is None else speaker
    type_of_token = UNK if type_of_token is None else type_of_token
    return f"{source} {channel} {beg_time} {duration} {token} {conf} {type_of_token} {speaker}\n"
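
The review thread above asks for the time values to be rounded unconditionally, and the reply mentions accepting floats or strings that contain a floating point number. A minimal sketch of that behaviour, using a hypothetical `normalize_time` helper (not the code that was eventually merged), could look like this:

```python
from typing import Union


def normalize_time(value: Union[int, float, str], precision: int = 3) -> float:
    """Coerce a time value to float and always round it.

    Hypothetical helper: it mirrors the behaviour discussed in the review
    (round unconditionally, accept numeric strings), not NeMo's actual code.
    Unlike a stricter type check, it also accepts plain integers.
    """
    try:
        return round(float(value), precision)
    except (TypeError, ValueError) as err:
        raise ValueError(f"time value must be numeric, but got {value!r}") from err


print(normalize_time(12.87949))      # float input is rounded as well
print(normalize_time("0.3204", 2))   # numeric strings are accepted
```

Routing both `beg_time` and `duration` through one helper like this would keep the rounding rule in a single place, which is the concern raised in the comment.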


def rreplace(s: str, old: str, new: str) -> str:
"""
Replace end of string.
68 changes: 53 additions & 15 deletions scripts/speaker_tasks/create_alignment_manifest.py
@@ -16,12 +16,41 @@
import os
import shutil
from pathlib import Path
+from typing import List

-from nemo.collections.asr.parts.utils.manifest_utils import read_manifest, write_ctm, write_manifest
+from nemo.collections.asr.parts.utils.manifest_utils import get_ctm_line, read_manifest, write_ctm, write_manifest
from nemo.utils import logging


-def get_unaligned_files(unaligned_path):
+def get_seg_info_from_ctm_line(
+    ctm_list: List[str],
+    output_precision: int,
+    speaker_index: int = 7,
+    beg_time_index: int = 2,
+    duration_index: int = 3,
+):
+    """
+    Get time stamp information and speaker labels from CTM lines.
+    This follows the CTM format described in the `Rich Transcription Meeting Eval Plan: RT09` document.
+
+    Args:
+        ctm_list (list): List containing CTM items. e.g.: ['sw02001-A', '1', '0.000', '0.200', 'hello', '0.98', 'lex', 'speaker3']
+        output_precision (int): Number of decimal places used when rounding the returned times.
+        speaker_index (int): Index of the speaker field in `ctm_list`. Defaults to 7.
+        beg_time_index (int): Index of the begin-time field in `ctm_list`. Defaults to 2.
+        duration_index (int): Index of the duration field in `ctm_list`. Defaults to 3.
+
+    Returns:
+        start (float): Start time of the segment.
+        end (float): End time of the segment.
+        speaker_id (str): Speaker ID of the segment.
+    """
+    speaker_id = ctm_list[speaker_index]
+    start = float(ctm_list[beg_time_index])
+    end = float(ctm_list[beg_time_index]) + float(ctm_list[duration_index])
+    start = round(start, output_precision)
+    end = round(end, output_precision)
+    return start, end, speaker_id
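
As a quick check of the field indices, here is a self-contained sketch that applies the same arithmetic to one line. `seg_info` is a stand-in written for this example, not an import from NeMo. One detail that may be worth checking in the caller: with the speaker now in the last field, splitting an unstripped line on a single space would leave the trailing newline attached to the speaker ID, so the sketch uses `split()` without arguments.

```python
from typing import List, Tuple


def seg_info(ctm_fields: List[str], precision: int = 3) -> Tuple[float, float, str]:
    """Return (start, end, speaker) from the fields of an RT-09 style CTM line."""
    start = float(ctm_fields[2])           # <BEG-TIME>
    end = start + float(ctm_fields[3])     # <BEG-TIME> + <DURATION>
    speaker = ctm_fields[7]                # <SPEAKER> is the eighth field
    return round(start, precision), round(end, precision), speaker


line = "sw02001-A 1 0.000 0.200 hello 0.98 lex speaker3\n"
start, end, speaker = seg_info(line.split())  # split() also drops the newline
print(start, end, speaker)
```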


+def get_unaligned_files(unaligned_path: str) -> List[str]:
"""
Get files without alignments in order to filter them out (as they cannot be used for data simulation).
In the unaligned file, each line contains the file name and the reason for the unalignment, if necessary to specify.
@@ -71,7 +100,17 @@ def create_new_ctm_entry(session_name, speaker_id, wordlist, alignments, output_
# note that using the current alignments the first word is always empty, so there is no error from indexing the array with i-1
align1 = float(round(alignments[i - 1], output_precision))
align2 = float(round(alignments[i] - alignments[i - 1], output_precision,))
-text = f"{session_name} {speaker_id} {align1} {align2} {word} 0\n"
+text = get_ctm_line(
+    source=session_name,
+    channel=speaker_id,
+    beg_time=align1,
+    duration=align2,
+    token=word,
+    conf=None,
+    type_of_token='lex',
+    speaker=speaker_id,
+    output_precision=output_precision,
+)
arr.append((align1, text))
return arr

@@ -206,11 +245,7 @@ def create_manifest_with_alignments(
prev_end = 0
for i in range(len(lines)):
ctm = lines[i].split(' ')
-speaker_id = ctm[1]
-start = float(ctm[2])
-end = float(ctm[2]) + float(ctm[3])
-start = round(start, output_precision)
-end = round(end, output_precision)
+start, end, speaker_id = get_seg_info_from_ctm_line(ctm_list=ctm, output_precision=output_precision)
interval = start - prev_end

if (i == 0 and interval > 0) or (i > 0 and interval > silence_dur_threshold):
@@ -231,13 +266,16 @@ def create_manifest_with_alignments(
end_times.append(f['duration'])

# build target manifest entry
-target_manifest.append({})
-target_manifest[tgt_i]['audio_filepath'] = f['audio_filepath']
-target_manifest[tgt_i]['duration'] = f['duration']
-target_manifest[tgt_i]['text'] = f['text']
-target_manifest[tgt_i]['words'] = words
-target_manifest[tgt_i]['alignments'] = end_times
-target_manifest[tgt_i]['speaker_id'] = speaker_id
+target_manifest.append(
+    {
+        'audio_filepath': f['audio_filepath'],
+        'duration': f['duration'],
+        'text': f['text'],
+        'words': words,
+        'alignments': end_times,
+        'speaker_id': speaker_id,
+    }
+)

src_i += 1
tgt_i += 1
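
The gap test in `create_manifest_with_alignments` (`interval = start - prev_end` compared against `silence_dur_threshold`) can be illustrated with a small stand-alone sketch. The function name and the way silence is recorded, as an empty word ending where the next word starts, are assumptions made for this example; the diff does not show that part of the loop.

```python
from typing import List, Sequence, Tuple


def build_words_and_alignments(
    segments: Sequence[Tuple[float, float, str]],
    silence_dur_threshold: float = 0.1,
) -> Tuple[List[str], List[float]]:
    """Convert (start, end, word) segments into parallel word and end-time lists.

    Assumption: a gap is recorded as an empty word whose end time is the start
    of the next spoken word.
    """
    words: List[str] = []
    end_times: List[float] = []
    prev_end = 0.0
    for i, (start, end, word) in enumerate(segments):
        interval = start - prev_end
        # Same condition as in the script: any leading gap counts,
        # later gaps only count when they exceed the threshold.
        if (i == 0 and interval > 0) or (i > 0 and interval > silence_dur_threshold):
            words.append("")
            end_times.append(start)
        words.append(word)
        end_times.append(end)
        prev_end = end
    return words, end_times


segments = [(0.5, 0.9, "hello"), (0.95, 1.4, "world"), (2.0, 2.3, "again")]
words, end_times = build_words_and_alignments(segments)
print(words)
print(end_times)
```

In this example the 0.05 s pause between the first two words is below the threshold and is absorbed, while the 0.6 s pause before the last word becomes its own empty entry.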
130 changes: 128 additions & 2 deletions tests/collections/asr/utils/test_data_simul_utils.py
@@ -29,6 +29,7 @@
normalize_audio,
read_noise_manifest,
)
from nemo.collections.asr.parts.utils.manifest_utils import get_ctm_line


@pytest.fixture()
@@ -129,6 +130,131 @@
return words, alignments, speaker_id


class TestGetCtmLine:
@pytest.mark.unit
@pytest.mark.parametrize("conf", [0, 1])
def test_wrong_type_conf_values(self, conf):
# Test with wrong integer confidence values
with pytest.raises(ValueError):
result = get_ctm_line(
source="test_source",
channel=1,
beg_time=0.123,
duration=0.456,
token="word",
conf=conf,
type_of_token="lex",
speaker="speaker1",
)
expected = f"test_source 1 0.123 0.456 word {conf} lex speaker1\n"
assert result == expected, f"Failed on valid conf value {conf}"

@pytest.mark.unit
@pytest.mark.parametrize("conf", [0.0, 0.5, 1.0, 0.001, 0.999])
def test_valid_conf_values(self, conf):
# Test with valid confidence values
result = get_ctm_line(
source="test_source",
channel=1,
beg_time=0.123,
duration=0.456,
token="word",
conf=conf,
type_of_token="lex",
speaker="speaker1",
)
expected = f"test_source 1 0.123 0.456 word {conf} lex speaker1\n"
assert result == expected, f"Failed on valid conf value {conf}"

@pytest.mark.unit
@pytest.mark.parametrize("conf", [-0.1, 1.1, 2, -1, 100, -100])
def test_invalid_conf_ranges(self, conf):
# Test with invalid confidence values
with pytest.raises(ValueError):
get_ctm_line(
source="test_source",
channel=1,
beg_time=0.123,
duration=0.456,
token="word",
conf=conf,
type_of_token="lex",
speaker="speaker1",
)

@pytest.mark.unit
def test_valid_input(self):
# Test with completely valid inputs
result = get_ctm_line(
source="test_source",
channel=1,
beg_time=0.123,
duration=0.456,
token="word",
conf=0.789,
type_of_token="lex",
speaker="speaker1",
)
expected = "test_source 1 0.123 0.456 word 0.789 lex speaker1\n"
assert result == expected, "Failed on valid input"

@pytest.mark.unit
@pytest.mark.parametrize(
"beg_time, duration",
[
("not a float", 1.0),
(1.0, "not a float"),
(1, 2.0), # Integers should be converted to float
(2.0, 3), # Same as above
],
)
def test_invalid_types_for_time_duration(self, beg_time, duration):
# Test with invalid types for beg_time and duration
with pytest.raises(ValueError):
get_ctm_line(
source="test_source",
channel=1,
beg_time=beg_time,
duration=duration,
token="word",
conf=0.5,
type_of_token="lex",
speaker="speaker1",
)

@pytest.mark.unit
@pytest.mark.parametrize("conf", [-0.1, 1.1, "not a float"])
def test_invalid_conf_values(self, conf):
# Test with invalid values for conf
with pytest.raises(ValueError):
get_ctm_line(
source="test_source",
channel=1,
beg_time=0.123,
duration=0.456,
token="word",
conf=conf,
type_of_token="lex",
speaker="speaker1",
)

@pytest.mark.unit
def test_default_values(self):
# Test with missing optional parameters
result = get_ctm_line(
source="test_source",
channel=None,
beg_time=0.123,
duration=0.456,
token="word",
conf=None,
type_of_token=None,
speaker=None,
)
expected = "test_source 1 0.123 0.456 word NA unknown NA\n"
assert result == expected, "Failed on default values"


class TestDataSimulatorUtils:
# TODO: add tests for all util functions
@pytest.mark.parametrize("max_audio_read_sec", [2.5, 3.5, 4.5])
@@ -253,11 +379,11 @@
)
assert ctm_list[0] == (
alignments[1],
-f"{session_name} {speaker_id} {alignments[1]} {alignments[1]-alignments[0]} {words[1]} 0\n",
+f"{session_name} 1 {alignments[1]} {alignments[1]-alignments[0]} {words[1]} NA lex {speaker_id}\n",
)
assert ctm_list[1] == (
alignments[2],
-f"{session_name} {speaker_id} {alignments[2]} {alignments[2]-alignments[1]} {words[2]} 0\n",
+f"{session_name} 1 {alignments[2]} {alignments[2]-alignments[1]} {words[2]} NA lex {speaker_id}\n",
)

