Microphone & OpenAI Transcribe API #8

Merged 39 commits on Nov 19, 2023

Collaborator

@tigerpaws01 tigerpaws01 commented Nov 17, 2023

What This Branch Did

  • Microphone Integration: The keyboard now takes input from microphone.
  • OpenAI Transcribe API: The keyboard now transcribes recorded audio clips.
  • Permissions & Settings:
    • If microphone permissions are not granted upon starting recording, a message is shown, and the user is redirected to the app.
    • In the app's MainActivity, one can set an API Key, and navigate to the app settings panel with a button (for manual permission configuration).
    • Upon entering MainActivity, the user is prompted to grant microphone permissions.
  • Exception Handling: Basic handling via message Toasts. Common exceptions will (very likely) not block or crash the keyboard.

Known Issues & Future Directions

  • Motormouth Countering: Prevent recording an overly long audio clip.
  • Are You Done?: Implement automatic sentence break detection.
  • Be My Spokesman?: A totally silent audio clip seems to produce odd sentences such as 多謝您收睇時局新聞,再會! ("Thank you for watching the current affairs news, goodbye!"), among many others.
  • Not A Province: The whisper-1 model produces both simplified and traditional Chinese (Mandarin) characters.
  • Whisper To My Ear: Currently, audio clips are recorded and saved as files (with hardcoded names). Whether they can be stored in memory / streams, and whether this is a better option, is unknown.
  • Configurations: Several settings may have room for improvement.
    • Ktor Engine: OkHttp
    • Output format: MPEG4
    • Audio Encoder: AMR_NB

Testing This Branch

  • As with previous branches, start an emulator.
  • Connect the microphone to the host audio input:
    image
  • Configure an API Key.
  • Test the transcription utility.

Notes

It is recommended NOT to read all the references thoroughly, as there are a lot of them. Reading only the paragraphs of interest should suffice.


Closes: #2

This is essential for requesting microphone usage.
It is considered a dangerous permission. Unlike normal permissions, `RECORD_AUDIO` must also be requested explicitly at runtime through `ActivityCompat`.

Ref:
- MediaRecorder: https://developer.android.com/guide/topics/media/platform/mediarecorder
- Permissions: https://developer.android.com/training/permissions/requesting#normal-dangerous
It seems that permissions cannot be requested from a service, only from an activity (via `ActivityCompat`).
Therefore, the user will be redirected to either the `MainActivity` or the App Settings Panel.
It is unclear whether this is the best practice, or a recommended one at all, but it works rather intuitively.
Requests permission via `ActivityCompat.requestPermissions`.
A request code is given to distinguish between requests. It has no other meaning.
`onRequestPermissionsResult` is overridden to process request results. In this case, if the permission is not granted, a toast message shows up.
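A minimal sketch of the request flow described above, assuming an `AppCompatActivity`. The class name `PermissionDemoActivity`, the request code value, and the toast text are illustrative and not taken from this branch:

```kotlin
import android.Manifest
import android.content.pm.PackageManager
import android.os.Bundle
import android.widget.Toast
import androidx.appcompat.app.AppCompatActivity
import androidx.core.app.ActivityCompat
import androidx.core.content.ContextCompat

// Hypothetical activity; names and values are illustrative only.
class PermissionDemoActivity : AppCompatActivity() {

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        requestMicrophonePermission()
    }

    private fun requestMicrophonePermission() {
        val granted = ContextCompat.checkSelfPermission(
            this, Manifest.permission.RECORD_AUDIO
        ) == PackageManager.PERMISSION_GRANTED
        if (!granted) {
            // Shows the system dialog; the result arrives in onRequestPermissionsResult.
            ActivityCompat.requestPermissions(
                this, arrayOf(Manifest.permission.RECORD_AUDIO), REQUEST_CODE_MIC
            )
        }
    }

    override fun onRequestPermissionsResult(
        requestCode: Int, permissions: Array<out String>, grantResults: IntArray
    ) {
        super.onRequestPermissionsResult(requestCode, permissions, grantResults)
        if (requestCode != REQUEST_CODE_MIC) return
        // An empty result array means the request was interrupted.
        val granted = grantResults.isNotEmpty() &&
            grantResults[0] == PackageManager.PERMISSION_GRANTED
        if (!granted) {
            Toast.makeText(this, "Microphone permission denied", Toast.LENGTH_SHORT).show()
        }
    }

    private companion object {
        const val REQUEST_CODE_MIC = 200 // arbitrary; only distinguishes requests
    }
}
```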

Refs:
- https://developer.android.com/guide/topics/media/platform/mediarecorder
- https://developer.android.com/training/permissions/requesting
…button.

The button opens up the application settings panel for the user to manually configure microphone settings.
Ref: https://stackoverflow.com/a/32822298
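As a rough sketch of what such a button handler might call (the helper name `openAppSettings` is hypothetical; the intent action and the `package` URI scheme follow the linked answer):

```kotlin
import android.content.Context
import android.content.Intent
import android.net.Uri
import android.provider.Settings

// Hypothetical helper: opens this app's page in the system settings.
fun openAppSettings(context: Context) {
    val intent = Intent(Settings.ACTION_APPLICATION_DETAILS_SETTINGS).apply {
        data = Uri.fromParts("package", context.packageName, null)
        // Needed when the context is not an Activity (e.g. a Service).
        addFlags(Intent.FLAG_ACTIVITY_NEW_TASK)
    }
    context.startActivity(intent)
}
```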
This is planned to be refactored later into a specialized class, just like keyboard and job manager were.
Ref: https://developer.android.com/guide/topics/media/platform/mediarecorder
…n is granted.

The same code as in `MainActivity`.
- Checks permission upon microphone usage (as suggested in https://developer.android.com/training/permissions/requesting#principles).
- If permission is not granted, opens up the `MainActivity`, where the permission can either be automatically or manually set.
- Otherwise, starts the MediaRecorder.
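The three steps above could look roughly like the following. This is a sketch under assumptions: the helper name is hypothetical, `activityClass` stands in for `MainActivity`, and `startRecording` stands in for whatever actually starts the MediaRecorder in this branch.

```kotlin
import android.Manifest
import android.content.Context
import android.content.Intent
import android.content.pm.PackageManager
import androidx.core.content.ContextCompat

// Hypothetical helper, callable from a Service such as an input method.
fun startRecordingOrAskPermission(
    context: Context,
    activityClass: Class<*>,      // assumed to be MainActivity in this app
    startRecording: () -> Unit,   // placeholder for the real recorder start
) {
    val granted = ContextCompat.checkSelfPermission(
        context, Manifest.permission.RECORD_AUDIO
    ) == PackageManager.PERMISSION_GRANTED
    if (granted) {
        startRecording()
    } else {
        // A Service has no task of its own, so the activity needs NEW_TASK.
        val intent = Intent(context, activityClass)
            .addFlags(Intent.FLAG_ACTIVITY_NEW_TASK)
        context.startActivity(intent)
    }
}
```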
Including recording cancellation & window events.
This is required to make OpenAI API calls, as indicated by the exception encountered:
`SecurityException: Permission denied (missing INTERNET permission?)`
…API.

Followed the setup in the (un?)official OpenAI API for Kotlin: https://github.com/aallam/openai-kotlin/tree/main
- `mavenCentral()` is omitted.
  - It's included in settings.gradle. Also refer to the following link, stating a change in the Gradle standards.
  - https://stackoverflow.com/questions/69163511/build-was-configured-to-prefer-settings-repositories-over-project-repositories-b
- Setting up a Ktor engine: OkHttp was chosen based on the information here:
  - https://ktor.io/docs/http-client-engines.html
  - The version is the latest entry found here: https://mvnrepository.com/artifact/io.ktor/ktor-client-okhttp (under the Central tag)
  - Without a client engine set up, exceptions will be thrown. See: ktorio/ktor#1070
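For illustration, the dependency setup might look like this in Gradle's Kotlin DSL. The file name and the use of the Kotlin DSL are assumptions (the project may use Groovy), and the version placeholders are deliberately left unfilled:

```kotlin
// app/build.gradle.kts (assumed location and DSL)
dependencies {
    // OpenAI client for Kotlin from aallam/openai-kotlin.
    implementation("com.aallam.openai:openai-client:<version>")
    // Ktor engine used by the client; OkHttp is the engine chosen in this branch.
    implementation("io.ktor:ktor-client-okhttp:<ktor-version>")
}
```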
…format.

- Passes the recorded audio file name to `WhisperJobManager` so it can make transcription calls with that filename.
- Renamed the variable to be consistent.
- Changed audio output format to MPEG4 (.m4a) so it's supported by OpenAI (['flac', 'm4a', 'mp3', 'mp4', 'mpeg', 'mpga', 'oga', 'ogg', 'wav', 'webm']).
- Whether this is the best format remains to be checked.
- Whether `AMR_NB` is the best audio encoder remains to be checked.
- Whether there are other configs to improve the audio / performance remains to be checked.
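A minimal sketch of a recorder configured with the settings listed above (MPEG4 container, `AMR_NB` encoder). The function name and the `outputPath` parameter are hypothetical; the real code in this branch may be structured differently.

```kotlin
import android.content.Context
import android.media.MediaRecorder
import android.os.Build

// Hypothetical factory; `outputPath` should end in .m4a to match the container.
fun createRecorder(context: Context, outputPath: String): MediaRecorder {
    val recorder = if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.S) {
        MediaRecorder(context)
    } else {
        // The no-argument constructor is deprecated from API 31 onward.
        @Suppress("DEPRECATION")
        MediaRecorder()
    }
    return recorder.apply {
        // Order matters: source, then format, then encoder, then prepare().
        setAudioSource(MediaRecorder.AudioSource.MIC)
        setOutputFormat(MediaRecorder.OutputFormat.MPEG_4)
        setAudioEncoder(MediaRecorder.AudioEncoder.AMR_NB)
        setOutputFile(outputPath)
        prepare() // may throw IOException if the file cannot be opened
    }
}
```

The caller would then invoke `start()`, and later `stop()` followed by `release()`.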

Refs:
- https://developer.android.com/reference/android/media/MediaRecorder.AudioSource
- https://developer.android.com/reference/android/media/MediaRecorder.AudioEncoder
- https://developer.android.com/reference/android/media/MediaRecorder.OutputFormat
Done via Android Studio (Ctrl + Alt + Shift + L).
This class is responsible for encapsulating the process of starting and stopping a MediaRecorder.
… list of required permissions.

Kotlin does not have the `static` keyword. Instead, using `companion object`s is advised.
Ref: https://stackoverflow.com/questions/40352684/what-is-the-equivalent-of-java-static-methods-in-kotlin
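A small, self-contained illustration of the `companion object` pattern. The class and member names are hypothetical; the string is the value of Android's `Manifest.permission.RECORD_AUDIO` constant, written out so the example runs without the Android SDK.

```kotlin
// Hypothetical holder for the permissions the app needs to request.
class PermissionList {
    companion object {
        // Accessed as PermissionList.REQUIRED, like a Java static field.
        val REQUIRED: Array<String> = arrayOf("android.permission.RECORD_AUDIO")

        // Accessed as PermissionList.isRequired(...), like a Java static method.
        fun isRequired(permission: String): Boolean = permission in REQUIRED
    }
}

fun main() {
    println(PermissionList.REQUIRED.joinToString())
    println(PermissionList.isRequired("android.permission.CAMERA"))
}
```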
The code is almost the same as in `WhisperInputService`.
Code is almost the same as in `WhisperInputService`, but works on multiple permissions.
DataStore is a data storage solution. It provides two interfaces:
- Preference: key-value pairs
- Proto: protocol buffer based typed objects

This will be used to store the API key (from user input).
For simplicity, Preference Datastore is used.

Ref: https://developer.android.com/topic/libraries/architecture/datastore#preferences-create
1. First, disable the API key input field and the "set API key" button. Apply a "loading" hint to the input field.
2. Retrieve the stored api key from the dataStore in the IO thread.
  - dataStore seems to be a (static-like?) variable accessible under a Context. This is defined with `val Context.dataStore: ...` using a "delegate".
  - dataStore.data is a `Flow<Preferences>`.
  - A `Flow` has emitters and collectors working asynchronously, decoupled from each other.
  - Emitters can emit data into the flow, while collectors can collect data from the flow.
  - dataStore uses this model to implement event- or data-driven programming.
  - `map` transforms `Flow<T1>` into `Flow<T2>`. Here, a flow of `Preferences` is transformed into a flow of the data stored in each `Preferences`.
  - `first()` captures the first element emitted by the flow.
    - Using `last()` would capture the last element emitted by the flow. In testing, this never returned and kept the coroutine waiting, which suggests that DataStore keeps emitting `Preferences` without ever completing the flow.
    - `collect` takes a function that processes each collected value. This also seems never to return.
  - `first()` would throw an error if the flow were empty, but it seems that DataStore always has data ready in the flow.
  - `first()` waits until a value is emitted, so it is run in the IO thread.
  - This variant of DataStore (Preferences DataStore) offers no type safety. `stringPreferencesKey` tells DataStore that the data stored under the key "api-key" is expected to be a String.
3. After the stored API Key data is retrieved, the input field is set depending on whether there exists a stored api key.
  - If null or empty, the hint displays "Enter API Key" message.
  - Otherwise, display the stored api key.
  - The following situations have been tested:
    - DataStore can retrieve newer data, if the data is updated.
    - DataStore can retrieve older data multiple times (i.e., the stored data won't be eliminated or exhausted after reading).
4. Finally, re-enable the input field and button, and assign the set api key button onclick event (to avoid setting the api key before retrieval).
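The read and write paths described in the steps above can be sketched as follows. The DataStore file name `"settings"` and the function names are assumptions; only the key name "api-key" comes from the notes above.

```kotlin
import android.content.Context
import androidx.datastore.core.DataStore
import androidx.datastore.preferences.core.Preferences
import androidx.datastore.preferences.core.edit
import androidx.datastore.preferences.core.stringPreferencesKey
import androidx.datastore.preferences.preferencesDataStore
import kotlinx.coroutines.flow.first
import kotlinx.coroutines.flow.map

// Property delegate that creates the DataStore lazily; "settings" is an assumed file name.
val Context.dataStore: DataStore<Preferences> by preferencesDataStore(name = "settings")

// Typed key: tells DataStore the value under "api-key" is a String.
private val API_KEY = stringPreferencesKey("api-key")

// Reads the current value once; returns null if no key has been stored.
suspend fun readApiKey(context: Context): String? =
    context.dataStore.data.map { prefs -> prefs[API_KEY] }.first()

// Persists a new value; later reads observe it.
suspend fun writeApiKey(context: Context, value: String) {
    context.dataStore.edit { prefs -> prefs[API_KEY] = value }
}
```

Both functions are suspending, so they would be called from a coroutine, for instance one launched on `Dispatchers.IO` as in step 2, before switching back to the main thread to update the input field.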

Refs:
(Recommended NOT to be thoroughly read. There are quite a lot.)
- Using DataStore: https://developer.android.com/topic/libraries/architecture/datastore
- Using Flows (generally, official): https://developer.android.com/kotlin/flow
- Using Flows (generally): https://www.baeldung.com/kotlin/flow-intro
- Flow.first: https://kotlinlang.org/api/kotlinx.coroutines/kotlinx-coroutines-core/kotlinx.coroutines.flow/first.html
- DataStore v.s. SharedPreferences: https://juejin.cn/post/7112486451626901540
…ll. Reformatted code.

Transcription results can be null in case of cancellation and exception.
As the callback function expects a nullable `String?`, it makes more sense to have the callback handle it being null, instead of preventing it from running at all.
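A simplified, self-contained sketch of this design choice. The function name and the `transcribe` parameter are hypothetical stand-ins for the real API call, which in this branch is presumably asynchronous:

```kotlin
// Hypothetical helper: always invokes the callback, passing null on failure.
fun runTranscription(transcribe: () -> String, onResult: (String?) -> Unit) {
    val text: String? = try {
        transcribe()
    } catch (e: Exception) {
        null // failure or cancellation: report "no result" instead of skipping the callback
    }
    onResult(text)
}

fun main() {
    runTranscription({ "hello" }) { result -> println(result ?: "(no transcription)") }
    runTranscription({ throw IllegalStateException("network error") }) { result ->
        println(result ?: "(no transcription)")
    }
}
```

Letting the callback see `null` keeps all result handling (for example, restoring the keyboard's idle state) in one place.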
…nto feature/03-mic-integration

change sync with master
Owner

@j3soon j3soon left a comment


Thanks for opening this PR. Just confirmed this works on a Pixel_3a_API_34_extension_level_7_x86_64 emulator. @ijsun has also tested the exported APK on a physical Android device.

I only have some minor comments as below.

@tigerpaws01 tigerpaws01 requested a review from j3soon November 19, 2023 03:52
Owner

@j3soon j3soon left a comment


Looks good to me. I appreciate the well-organized and intuitive code. Thank you!

@j3soon j3soon merged commit 379cb8c into master Nov 19, 2023
@j3soon j3soon deleted the feature/03-mic-integration branch November 19, 2023 06:13

Successfully merging this pull request may close these issues.

Connect the Speech Input to OpenAI Whisper API