Skip to content

ifak-prototypes/speech_recognition_srcsd

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Speech Recognition Client/Server and Desktop Application

The software being developed in this project uses Systran Faster Whisper technology to input text via speech and use the resulting text in any other application.

Natural language processing often requires a lot of resources. It makes sense to use a graphics card, which greatly speeds up the processing of recorded speech files. Therefore, the processing of speech with this software should be realizable in two ways:

  • Local processing: the speech is recorded on the local computer, processed and transferred to local applications.
  • Decentralized processing: The speech is recorded locally and processed into a character string on a remote computer (GPU server) and transferred back to the source computer as a character string. There, the text is then transferred to local applications.

Installation

Microsoft Windows

The following command must be executed in a terminal window (cmd):

.\bin\prepare.bat

Linux

Run the following command in a shell:

bash ./bin/prepare.sh

Start the application

Microsoft Windows

Use the following command in a terminal window (cmd):

.\run.bat

Linux

Run the following command in a shell:

bash ./run.sh

Currently the client only works with the X Window System. Wayland is not supported.

Advanced setup options

You may modify the run.bat or respectively the run.sh file when you want to modify the program behavior. For example when you are on a Linux machine and you definitely don't want to use the GPU, then you could modify the python program call in run.sh file to

python src/srcsd/tkclient.py --device=cpu

There are following options:

setting explanation default value
device defines, on which computing resource is computed; one of [cpu, gpu]; ignored if local=false gpu if available, else cpu
local defines, whether the client uses local audio data processing or not. In latter case a remote GPU server can be used for processing.
one of [true, false], requires a running server if false
true
host string representing the hostname to be used for requests; only used if local is false localhost
port port to use for requests to the host; only used if local is false 8001
ssl_selfsigned whether the server is using a selfsigned certificate; client requires a copy of the certificate to trust it true
ssl_cert path of the ssl certificate file to use; only used if local=false , required ssl_selfsigned=true ./keys/cert.pem

Start the server

The server is only used, if you start the app with with local set to false. (see "Advanced setup options")

An SSL certificate and a private key are required, if you want to use the server. The key and certificate included in this repository are only valid for accessing the server via the following addresses: localhost, 0.0.0.0, 127.0.0.1 and 192.168.0.100.

If you want to access the server via a different ip/dns address you can either use your own existing certificate and key, or add the necessary addresses to ./keys/cert.conf by appending them to the [ sans ] section or replacing any of the existing addresses in that section. You can then generate the certificate and private key using OpenSSL with the following command:

openssl req -x509 -out ./keys/cert.pem -keyout ./keys/privkey.pem -newkey rsa:4096 -sha256 -days 365 -extensions ext -config ./keys/cert.conf

Microsoft Windows

Use the following command in a terminal window (cmd):

.\run_server.bat

Linux

Run the following command in a shell:

bash ./run_server.sh

Advanced setup options

You may modify the run_server.bat or respectively the run_server.sh file when you want to modify the program behavior. For example when you are on a Linux machine and you want to host the server on port 4000, you can modify the python program call in run_server.sh:

python src/srcsd/server.py --port=4000

There are following options:

setting explanation default value
port port to listen for requests on 8001
ssl_key path of the ssl private key file to use ./keys/privkey.pem
ssl_cert path of the ssl certificate file to use ./keys/cert.pem
device defines, on which computing resource is computed; one of [cpu, gpu] gpu if available, else cpu

Usage

The program contains the following setting options:

Setting Usage
Model This is a selection of the Whisper model. Smaller models are faster but also more imprecise.
Language The original language
Task transcribe means, that the text is created in the original language; translate means, that the text is translated to English.
Format normal takes the text from the Whisper model as is, while stripped means, that leading and trailling whitespaces are omitted. Stripped text is preffered, when working with spread-sheets or presentation programs, while normal is including white spaces - so it is preffered for floating text.
Pause The processing of speech starts after a little break. (e.g. pause between 2 sentences) This parameter determines the duration length of this break.
Active Determines whether audio data should be processed or not.
Insert via CTR-V Defines whether the system automatically puts the recognized text into the system clipboard and the CTRL-V key combination is automatically pressed. On linux systems, this requires xclip.

The text input field contains the recognized text.

Data privacy

Local data processing

The program stores recorded audio files on the computer into the directory 'audio_data'. Processed audio data files will be deleted directly after converting them into text. If the program is killed, there could be residual files in the audio_data directory. They can be safely deleted manually or they are deleted at the next program start.

Known Issues/Restrictions

  • sometimes random outputs when recording (background noise)

Client/Server based audio data processing

The client stores recorded audio files into the directory audio_data and the server stores received audio files into the directory .uploads. Audio data files will be deleted, after they have been transferred/used. If the program is killed, there could be residual files in those directories. They can be safely deleted manually or they are deleted at the next program start.

Copyright and License Information

Copyright (c) 2023, Institut für Automation und Kommunikation e.V. (ifak e.V.) and Keanu-Farell Kunzi. See the LICENSE file for licensing conditions (MIT license).

About

Speech Recognition Client/Server and Desktop Application

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published