-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Store language code in database #118
base: main
Are you sure you want to change the base?
Conversation
@@ -39,6 +39,7 @@ export const TranscriptionDynamoItem = z.object({ | |||
userEmail: z.string(), | |||
completedAt: z.optional(z.string()), // dynamodb can't handle dates so we need to use an ISO date | |||
isTranslation: z.boolean(), | |||
languageCode: z.optional(z.string()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an iso country code? If so this type could be more narrow if we use our LanguageCode
type
@@ -1,5 +1,5 @@ | |||
import { z } from 'zod'; | |||
import { languageCodeToLanguage } from './languages'; | |||
import { LanguageCode, languageCodeToLanguage } from './languages'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your zod enum means we can get rid of zodLanguageCode
just below which does the same thing (and also getKeys
). Your solution of making a zod enum from the array of country codes is way simpler than what I did!
keyof LanguageCodeToLanguage, | ||
]; | ||
|
||
export const LanguageCode = z.enum(languageCodes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think zod objects are camelCase cause they're just consts? although we're not following this convention everywhere
export const LanguageCode = z.enum(languageCodes); | |
export const languageCode = z.enum(languageCodes); |
@@ -63,6 +64,14 @@ const handleTranscriptionSuccess = async ( | |||
metrics: MetricsService, | |||
sourceMediaDownloadUrl: string, | |||
) => { | |||
const languageCode = LanguageCode.safeParse(transcriptionOutput.languageCode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In processMessage
we're parsing the sqs event object before calling this function. It might be cleaner to change TranscriptionOutputSuccess
so that its languageCode is an enum than to do a second round of parsing here.
Although I'm now realising that this would be more of a refactor of our types and zod parsers rather than the original goal of just storing language! So feel free to ignore this or just put in a TODO
What does this change?
I'm currently experimenting with https://github.com/m-bain/whisperX. WhisperX appears to support fewer languages than whisper.cpp. To understand the impact of moving to whisperX this PR sets the transcription service to record the language code of the transcription in the dynamo database.
I made the new field optional in the zod type to avoid breaking existing transcriptions. In a few weeks we could make it compulsory.
How to test
Needs verifying on CODE but this is a very small change.