Work in progress
A module to align an STT (speech-to-text) transcription with accurate text that has speaker labels.
Use it when you already have a transcription (e.g. in a CSV file) with speaker names but no timecodes, and you want to add timecodes without losing the speaker labels or the accurate text.
git clone git@github.com:pietrop/align-diarized-text.git
cd align-diarized-text
See the 'analyses' docs notes for more info on getting the media and transcripts in the right format for this.
npm install align-diarized-text
See example usage in /src/add-timecodes-to-quotes/
const alignDiarizedText = require('./index.js');
// Accurate text with speaker labels, but no timecodes
const linesWithSpeaker = require('../../sample-data/input-example.json');
// STT transcript with word-level timecodes
const sttJson = require('../../sample-data/stt-transcript.json');
const res = alignDiarizedText(linesWithSpeaker, sttJson);
// do something with output json
To troubleshoot the alignment you can also use generateInteractiveTranscript to generate an index.html file. See example usage in /src/generate-html-to-check-alignement/
An array of objects with text and speaker attributes.
[
  {
    "id": "6af9762b-d0aa-42d2-9d6d-1c114f9219db",
    "text": "Thank you. It's good to be here.",
    "speaker": "Elizabeth Warren"
  },
  {
    "id": "b02da04e-044e-436a-8862-166139568136",
    "text": "So I think of it this way, who is this economy really working for? It's doing great for a thinner and thinner slice at the top. It's doing great for giant drug companies. This is not doing great for people are trying to get a prescription filled. It's doing great for people who want to invest in private prisons just not for the African-Americans and Latinx whose families are torn apart whose lives are destroyed and whose communities are ruined.",
    "speaker": "Elizabeth Warren"
  },
  ...
The text is a human-accurate transcription: timecodes are missing, and speaker diarization info is present.
This could originate from a .tsv or .csv file, converted to JSON with 'convert-csv-to-json'.
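As a rough illustration of that conversion step, here is a minimal sketch that turns tab-separated rows into the array shape shown above. The function name `tsvToLinesWithSpeaker` is hypothetical and is not part of this module. The sketch assumes one `speaker<TAB>text` row per line, with no header row and no quoted fields, and it assumes any unique string is acceptable as an `id` (the sample data uses UUIDs).

```javascript
const crypto = require('crypto');

// Hypothetical helper, not part of align-diarized-text.
// Assumes one "speaker<TAB>text" row per line, no header row, no quoted fields.
function tsvToLinesWithSpeaker(tsv) {
  return tsv
    .split(/\r?\n/)
    // Skip blank lines
    .filter((row) => row.trim() !== '')
    .map((row) => {
      // First column is the speaker, everything after it is the text
      const [speaker, ...rest] = row.split('\t');
      return {
        // Assumption: a random hex string stands in for the UUIDs in the sample data
        id: crypto.randomBytes(16).toString('hex'),
        text: rest.join('\t').trim(),
        speaker: speaker.trim()
      };
    });
}

const sampleTsv = [
  "Elizabeth Warren\tThank you. It's good to be here.",
  'Elizabeth Warren\tSo I think of it this way, who is this economy really working for?'
].join('\n');

const linesWithSpeaker = tsvToLinesWithSpeaker(sampleTsv);
console.log(JSON.stringify(linesWithSpeaker, null, 2));
```

Real CSV files with quoted fields or embedded commas need a proper parser, which this sketch does not attempt.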
An STT array of timecoded words. This is generated from the video/audio file of the debate. See the sample-data folder for more.
See the sample-data folder for example output.
Something like this:
[
  {
    "start": 144.94,
    "end": 168.31,
    "text": "Thank you. It's good to be here.",
    "words": [
      {
        "end": 145.18,
        "start": 144.94,
        "text": "Thank"
      },
      {
        "start": 145.2,
        "end": 145.66,
        "text": "you."
      },
      ...
    ],
    "id": "6af9762b-d0aa-42d2-9d6d-1c114f9219db",
    "speaker": "Elizabeth Warren"
  },
  ...
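To show one way of consuming output in this shape, here is a small sketch that prints each paragraph as a timecoded line with its speaker. The helpers `formatTimecode` and `toPlainTextLines` are hypothetical names, not part of this module, and the sketch assumes `start` is expressed in seconds, as the example values suggest.

```javascript
// Hypothetical helpers, not part of align-diarized-text.
// Assumes `start` is a number of seconds.
function formatTimecode(seconds) {
  const total = Math.floor(seconds);
  const pad = (n) => String(n).padStart(2, '0');
  const hh = pad(Math.floor(total / 3600));
  const mm = pad(Math.floor((total % 3600) / 60));
  const ss = pad(total % 60);
  return `${hh}:${mm}:${ss}`;
}

// One line per paragraph: "[hh:mm:ss] Speaker: text"
function toPlainTextLines(paragraphs) {
  return paragraphs.map((p) => `[${formatTimecode(p.start)}] ${p.speaker}: ${p.text}`);
}

// A trimmed copy of the example output above
const aligned = [
  {
    start: 144.94,
    end: 168.31,
    text: "Thank you. It's good to be here.",
    words: [
      { end: 145.18, start: 144.94, text: 'Thank' },
      { start: 145.2, end: 145.66, text: 'you.' }
    ],
    id: '6af9762b-d0aa-42d2-9d6d-1c114f9219db',
    speaker: 'Elizabeth Warren'
  }
];

console.log(toPlainTextLines(aligned).join('\n'));
// prints: [00:02:24] Elizabeth Warren: Thank you. It's good to be here.
```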
TBC
There's a docs folder in this repository.
docs/notes contains dev draft notes on various aspects of the project. These would generally be converted into either ADRs or guides when ready.
- npm > 6.1.0
- Node 10 - dubnium
The Node version is set in the node version manager file, .nvmrc