subtitle line remains stuck for 30 mins | awk script > 2mins length #975

Open
mrfragger opened this issue May 31, 2023 · 3 comments
@mrfragger

05:14:41.490 --> 05:14:42.820
any possible chance

05:14:42.820 --> 05:46:50.320
this text remains for 32 mins in subs

05:46:50.320 --> 05:14:50.420
which is correct

05:14:50.420 --> 05:14:55.710
and transcription is correct

05:14:55.710 --> 05:14:57.460
but it runs over 

05:14:57.460 --> 05:15:03.590
so gotta come up with a sed or awk script

05:15:03.590 --> 05:15:05.300
to detect if say subtitle duration

05:15:05.300 --> 05:15:11.220
exceeds 2 mins let's say

Had this happen yesterday too; I think it was in a 48-hour audiobook I was doing. It just happened again today with a 10-hour audiobook. What happens is the line

this text remains for 32 mins in subs

stays on screen constantly for 32 minutes, and whatever new subtitles can fit are shown just above it.

Obviously to correct this just change
05:14:42.820 --> 05:46:50.320
to
05:14:42.820 --> 05:14:50.320

05:46:50.320 --> 05:14:50.420
to
05:14:50.320 --> 05:14:50.420

in both cases just changing the xx:46:xx.xxx to xx:14:xx.xxx
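That manual minutes fix could probably be approximated automatically: the bad stamps are the ones that jump far past their in-order neighbours, so you can detect them against a max-jump threshold and interpolate across the run. A rough Python sketch, assuming timestamps as integer milliseconds; the `repair` helper, the 2-minute threshold, and the interpolation choice are my own, not anything from whisper.cpp:

```python
def to_ms(ts):
    """'05:46:50.320' -> integer milliseconds."""
    h, m, rest = ts.split(":")
    s, ms = rest.split(".")
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def repair(stamps, max_jump_ms=120_000):
    """Walk the (supposedly non-decreasing) cue timestamps; any stamp that
    runs backwards or jumps more than max_jump_ms past the last good stamp
    is treated as corrupt and replaced by interpolating between the good
    stamps on either side (clamped to the last good stamp at end of file)."""
    out = list(stamps)
    good = 0                      # index of the last accepted stamp
    i = 1
    while i < len(out):
        if out[good] <= out[i] <= out[good] + max_jump_ms:
            good = i
            i += 1
            continue
        j = i + 1                 # scan ahead for the next plausible stamp
        while j < len(out) and not (out[good] <= out[j] <= out[good] + max_jump_ms):
            j += 1
        if j == len(out):         # nothing plausible after: clamp
            for k in range(i, j):
                out[k] = out[good]
        else:                     # interpolate across the bad run
            span = out[j] - out[good]
            for k in range(i, j):
                out[k] = out[good] + span * (k - good) // (j - good)
        good = j
        i = j + 1
    return out
```

On the stuck example above, the two copies of 05:46:50.320 come out between 05:14:42.820 and 05:14:50.420, restoring a monotonic timeline; it won't reproduce the exact value a human would pick, but the cue no longer sticks.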

My current command to pipe WAV, with max length 78 and split at word:

for f in *.opus ; do
  ffmpeg -i "$f" -f wav -ar 16000 -ac 1 - | ~/whisper/whisper.cpp/main -m ~/whisper/whisper.cpp/models/ggml-medium.en.bin - -ovtt -of "$f" -l en -ml 78 -sow -t 8
done
mkdir -p vttsubs/
for v in *.opus.vtt ; do
  sed -r -i .bak -e 's|Yellow|yellow|g' -e 's|blue|Blue|g' -e 's|Pink|pink|g' "$v"
  mv -i -- "$v" "vttsubs/${v%.opus.vtt}.vtt"
done
rm *.bak

I'll try to figure out an awk script that can automatically check whether any subtitle cue's duration exceeds, say, 2 minutes.
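Until I figure out the awk version, here's the same check as a Python sketch (`find_runaways` and the 2-minute default are my own names and choices; an awk script matching the `-->` lines would do the same job):

```python
import re

CUE = re.compile(
    r"(\d+):(\d{2}):(\d{2})\.(\d{3}) --> (\d+):(\d{2}):(\d{2})\.(\d{3})")

def _ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def find_runaways(lines, max_ms=120_000):
    """Yield (line_number, cue) for every VTT cue whose duration exceeds
    max_ms, or whose end timestamp is earlier than its start."""
    for n, line in enumerate(lines, 1):
        m = CUE.match(line.strip())
        if not m:
            continue
        start, end = _ms(*m.groups()[:4]), _ms(*m.groups()[4:])
        if end < start or end - start > max_ms:
            yield n, line.strip()
```

Run it with something like `for n, cue in find_runaways(open("subs.vtt")): print(n, cue)`; on the snippet at the top of this issue it flags both the 32-minute cue and the backwards one right after it.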

@mrfragger

While correcting these, here's an instance where

"the sea and moon, more together than"

stays on screen for an hour, hiding the new subtitles. No music or anything, so no idea what caused it.

02:42:11.810 --> 02:42:13.840
here, with nothing but

02:42:13.840 --> 03:25:12.370
the sea and moon, more together than

03:25:12.370 --> 03:33:12.370
in that crowd, or even in my rooms.

03:33:12.370 --> 02:42:20.800
Don't you understand that?"

02:42:21.600 --> 02:42:23.430
"I don't understand anything," she

02:42:23.430 --> 02:42:25.820
said with decision, determined to

To fix it, I changed it to:
02:42:11.810 --> 02:42:13.840
here, with nothing but

02:42:13.840 --> 02:42:15.370
the sea and moon, more together than

02:42:15.370 --> 02:42:18.370
in that crowd, or even in my rooms.

02:42:18.370 --> 02:42:20.800
Don't you understand that?"

02:42:21.600 --> 02:42:23.430
"I don't understand anything," she

@mrfragger

mrfragger commented Jun 5, 2023

OK, I tried many combinations: with and without max length, with and without split on word.

It's definitely some calculation bug related to max-length. I'll have to keep using it, though, since without it lines sometimes go way over 100 characters, so I'll just scan for the problem cues and fix them manually. It only occurs in about 2% of the files I've done, so it's hard to say exactly what the culprit is. Once it was music, but many other problematic files had no music at all. Without max-length it seems to cap lines at around 100 characters.

maxlength80sow-example.vtt
Line# 3126: 00:56 55
Line# 8754: 02:34 41
Line# 8757: 03:41 34
maxlength90sow-example.vtt
Line# 6450: 02:34 41
Line# 6453: 03:41 34
nomaxlength-example.vtt (this one had no timing issues with subs)
nosplitonword-example.vtt
Line# 3195: 00:56 55
Line# 8916: 02:34 41
Line# 8919: 03:41 34
splitonword-example.vtt
Line# 3162: 00:56 55
Line# 8829: 02:34 41
Line# 8832: 03:41 34

I tried looking at some of the code, but most of it is way over my head.

//  500 -> 00:05.000
// 6000 -> 01:00.000
static std::string to_timestamp(int64_t t, bool comma = false) {
    int64_t msec = t * 10;
    int64_t hr = msec / (1000 * 60 * 60);
    msec = msec - hr * (1000 * 60 * 60);
    int64_t min = msec / (1000 * 60);
    msec = msec - min * (1000 * 60);
    int64_t sec = msec / 1000;
    msec = msec - sec * 1000;

    char buf[32];
    snprintf(buf, sizeof(buf), "%02d:%02d:%02d%s%03d", (int) hr, (int) min, (int) sec, comma ? "," : ".", (int) msec);

    return std::string(buf);
}
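That C++ helper just converts whisper.cpp's 10 ms tick counter into a VTT/SRT timestamp; as a sanity check, here's the same logic in Python (my own port, not part of either project):

```python
def to_timestamp(t, comma=False):
    """Python port of whisper.cpp's to_timestamp: t counts 10 ms ticks."""
    msec = t * 10
    hr, msec = divmod(msec, 1000 * 60 * 60)
    minute, msec = divmod(msec, 1000 * 60)
    sec, msec = divmod(msec, 1000)
    return f"{hr:02d}:{minute:02d}:{sec:02d}{',' if comma else '.'}{msec:03d}"
```

So 500 ticks is five seconds and 6000 ticks is one minute, matching the comment above the C++ version.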

This part is from openai-whisper and seems to indicate that word timestamps are required when using max line width. However, I don't want VTT or SRT subtitles with word timestamps, as they significantly increase file size. Definitely useful for karaoke or language learning, I suppose.

parser.add_argument("--max_line_width", type=optional_int, default=None, help="(requires --word_timestamps True) the maximum number of characters in a line before breaking the line")

parser.add_argument("--max_line_count", type=optional_int, default=None, help="(requires --word_timestamps True) the maximum number of lines in a segment")

Here's an example of one I corrected; the corrected timecodes are in parentheses:

02:34:48.430 --> 02:34:49.280
xxxxxxxx xx xxx

02:34:49.280 --> 03:41:15.260 (02:34:54.260)
xxxxxxx xxxxx xxxxxxx xx xxxx xxxxxx xxxxx xx xxx xxxxxxxxx xxxx xxx xxxxxx

(02:34:54.260) 03:41:15.260 --> 02:34:55.120
xxx xxxxx xx xxxxxxxxx

02:34:55.120 --> 02:34:59.840
xx xxxxx x xxxxxx xxxxxxxxxx xxxxxxxx xxxx xx xxx xxxxxx

@mrfragger

These are just random notes on code I was looking at, but like I said, it's over my ability.

I think this one is from openai-whisper, if I remember correctly:

condition_on_previous_text: bool
       if True, the previous output of the model is provided as a prompt for the next window;
       disabling may make the text inconsistent across windows, but the model becomes less prone to
       getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.

https://github.com/openai/whisper/blob/main/whisper/transcribe.py


            consecutive = torch.where(timestamp_tokens[:-1] & timestamp_tokens[1:])[0]
            consecutive.add_(1)
            if len(consecutive) > 0:
                # if the output contains two consecutive timestamp tokens
                slices = consecutive.tolist()
                if single_timestamp_ending:
                    slices.append(len(tokens))

                last_slice = 0
                for current_slice in slices:
                    sliced_tokens = tokens[last_slice:current_slice]
                    start_timestamp_pos = (
                        sliced_tokens[0].item() - tokenizer.timestamp_begin
                    )
                    end_timestamp_pos = (
                        sliced_tokens[-1].item() - tokenizer.timestamp_begin
                    )
                    current_segments.append(
                        new_segment(
                            start=time_offset + start_timestamp_pos * time_precision,
                            end=time_offset + end_timestamp_pos * time_precision,
                            tokens=sliced_tokens,
                            result=result,
                        )
                    )
                    last_slice = current_slice

                if single_timestamp_ending:
                    # single timestamp at the end means no speech after the last timestamp.
                    seek += segment_size
                else:
                    # otherwise, ignore the unfinished segment and seek to the last timestamp
                    last_timestamp_pos = (
                        tokens[last_slice - 1].item() - tokenizer.timestamp_begin
                    )
                    seek += last_timestamp_pos * input_stride
            else:
                duration = segment_duration
                timestamps = tokens[timestamp_tokens.nonzero().flatten()]
                if (
                    len(timestamps) > 0
                    and timestamps[-1].item() != tokenizer.timestamp_begin
                ):
                    # no consecutive timestamps but it has a timestamp; use the last one.
                    last_timestamp_pos = (
                        timestamps[-1].item() - tokenizer.timestamp_begin
                    )
                    duration = last_timestamp_pos * time_precision

                current_segments.append(
                    new_segment(
                        start=time_offset,
                        end=time_offset + duration,
                        tokens=tokens,
                        result=result,
                    )
                )
                seek += segment_size
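As far as I can tell, the `consecutive` line at the top just finds positions where two timestamp tokens sit back-to-back; a torch-free sketch of that one step (my own toy version, using plain Python lists of booleans):

```python
def consecutive_pairs(timestamp_tokens):
    """List-based version of torch.where(ts[:-1] & ts[1:])[0] + 1:
    returns each index i where tokens i-1 and i are both timestamps."""
    return [i for i in range(1, len(timestamp_tokens))
            if timestamp_tokens[i - 1] and timestamp_tokens[i]]
```

Each returned index is a boundary where one segment ends and the next begins, which is what the slicing loop then uses.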

https://github.com/openai/whisper/blob/main/whisper/decoding.py

class ApplyTimestampRules(LogitFilter):
    def __init__(
        self, tokenizer: Tokenizer, sample_begin: int, max_initial_timestamp_index: Optional[int]
    ):
        self.tokenizer = tokenizer
        self.sample_begin = sample_begin
        self.max_initial_timestamp_index = max_initial_timestamp_index

    def apply(self, logits: Tensor, tokens: Tensor):
        # suppress <|notimestamps|> which is handled by without_timestamps
        if self.tokenizer.no_timestamps is not None:
            logits[:, self.tokenizer.no_timestamps] = -np.inf

        # timestamps have to appear in pairs, except directly before EOT; mask logits accordingly
        for k in range(tokens.shape[0]):
            seq = [t for t in tokens[k, self.sample_begin :].tolist()]
            last_was_timestamp = len(seq) >= 1 and seq[-1] >= self.tokenizer.timestamp_begin
            penultimate_was_timestamp = len(seq) < 2 or seq[-2] >= self.tokenizer.timestamp_begin

            if last_was_timestamp:
                if penultimate_was_timestamp:  # has to be non-timestamp
                    logits[k, self.tokenizer.timestamp_begin :] = -np.inf
                else:  # cannot be normal text tokens
                    logits[k, : self.tokenizer.eot] = -np.inf

        # apply the `max_initial_timestamp` option
        if tokens.shape[1] == self.sample_begin and self.max_initial_timestamp_index is not None:
            last_allowed = self.tokenizer.timestamp_begin + self.max_initial_timestamp_index
            logits[:, last_allowed + 1 :] = -np.inf

        # if sum of probability over timestamps is above any other token, sample timestamp
        logprobs = F.log_softmax(logits.float(), dim=-1)
        for k in range(tokens.shape[0]):
            timestamp_logprob = logprobs[k, self.tokenizer.timestamp_begin :].logsumexp(dim=-1)
            max_text_token_logprob = logprobs[k, : self.tokenizer.timestamp_begin].max()
            if timestamp_logprob > max_text_token_logprob:
                logits[k, : self.tokenizer.timestamp_begin] = -np.inf
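If I'm reading the pairing rule right, it only ever allows three situations; a toy sketch of the decision (the name `allowed_next` and the string labels are mine, not whisper's):

```python
def allowed_next(seq, timestamp_begin):
    """Toy version of the pairing rule above: given the tokens generated so
    far (timestamp tokens are ids >= timestamp_begin), say what may follow.
    After an opening timestamp or a completed pair, only text may follow;
    after a single closing timestamp, only another timestamp or EOT."""
    last_was_timestamp = len(seq) >= 1 and seq[-1] >= timestamp_begin
    penultimate_was_timestamp = len(seq) < 2 or seq[-2] >= timestamp_begin
    if last_was_timestamp and penultimate_was_timestamp:
        return "text"
    if last_was_timestamp:
        return "timestamp_or_eot"
    return "any"
```

This is why timestamps always come out in pairs; the stuck-subtitle bug in whisper.cpp presumably happens despite this constraint, in the values the timestamps decode to rather than their pairing.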
