Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store language code in database #118

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

philmcmahon
Copy link
Contributor

What does this change?

I'm currently experimenting with https://github.com/m-bain/whisperX. WhisperX appears to support fewer languages than whisper.cpp. To understand the impact of moving to whisperX this PR sets the transcription service to record the language code of the transcription in the dynamo database.

I made the new field optional in the zod type to avoid breaking existing transcriptions. In a few weeks we could make it compulsory.

How to test

Needs verifying on CODE but this is a very small change.

@philmcmahon philmcmahon requested a review from a team as a code owner January 14, 2025 16:04
@@ -39,6 +39,7 @@ export const TranscriptionDynamoItem = z.object({
userEmail: z.string(),
completedAt: z.optional(z.string()), // dynamodb can't handle dates so we need to use an ISO date
isTranslation: z.boolean(),
languageCode: z.optional(z.string()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an iso country code? If so this type could be more narrow if we use our LanguageCode type

@@ -1,5 +1,5 @@
import { z } from 'zod';
import { languageCodeToLanguage } from './languages';
import { LanguageCode, languageCodeToLanguage } from './languages';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your zod enum means we can get rid of zodLanguageCode just below which does the same thing (and also getKeys). Your solution of making a zod enum from the array of country codes is way simpler than what I did!

keyof LanguageCodeToLanguage,
];

export const LanguageCode = z.enum(languageCodes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think zod objects are camelCase cause they're just consts? although we're not following this convention everywhere

Suggested change
export const LanguageCode = z.enum(languageCodes);
export const languageCode = z.enum(languageCodes);

@@ -63,6 +64,14 @@ const handleTranscriptionSuccess = async (
metrics: MetricsService,
sourceMediaDownloadUrl: string,
) => {
const languageCode = LanguageCode.safeParse(transcriptionOutput.languageCode);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In processMessage we're parsing the sqs event object before calling this function. It might be cleaner to change TranscriptionOutputSuccess so that its languageCode is an enum than to do a second round of parsing here.
Although I'm now realising that this would be more of a refactor of our types and zod parsers rather than the original goal of just storing language! So feel free to ignore this or just put in a TODO

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants