Speech to text app

Hi, does anyone know of a speech-to-text app for simple notes, either native or Android (AD), that does not require Google voice?

PocketSphinx was developed to run on mobile devices, so there wouldn’t be any need to send speech data to a cloud service. I guess the simplest solution would be to record a sound file and use that as input for PocketSphinx to create a text file.
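
A rough sketch of that record-then-transcribe idea in Python. It assumes the classic `pocketsphinx_continuous` command-line tool is installed and that the recording is a 16 kHz mono WAV file; newer PocketSphinx releases ship a different command, so treat the binary name as an assumption. The helper names `build_command` and `transcribe_to_file` are made up for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(wav_path: str) -> List[str]:
    """Build the recogniser command line for one recorded file."""
    # Assumption: the classic `pocketsphinx_continuous` binary is on PATH.
    # `-infile` makes it read a recorded file instead of the microphone.
    return ["pocketsphinx_continuous", "-infile", wav_path]


def transcribe_to_file(wav_path: str, txt_path: str) -> str:
    """Run the recogniser on wav_path and save the recognised text to txt_path."""
    result = subprocess.run(
        build_command(wav_path),
        capture_output=True,  # hypotheses go to stdout, log messages to stderr
        text=True,
        check=True,           # raise if the recogniser exits with an error
    )
    text = result.stdout.strip()
    Path(txt_path).write_text(text + "\n", encoding="utf-8")
    return text
```

Calling `transcribe_to_file("note.wav", "note.txt")` would then leave the recognised text in `note.txt`, with nothing sent to a cloud service.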

Thanks, I hadn’t realized that all other solutions are cloud based. I’ll check it out.

Mozilla has DeepSpeech, which is offline, I think. But it's not an app, and I have no idea how powerful a machine needs to be to run it.

The Speech Note app recently appeared on OpenRepos and uses Mozilla DeepSpeech with a collection of different languages.
:wink:

https://openrepos.net/content/mkiol/speech-note

Link for the lazy. If the dev reads this: PLEASE bring it to the official store.

Whisper from openai works quite well:

[defaultuser@Xperia10III 2010 - Book 3 - Death's End]$ whisper --model tiny.en 01\ -\ Preface.mp3
/home/defaultuser/.local/lib/python3.8/site-packages/whisper/transcribe.py:78: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:02.480]  This is audible.
[00:02.480 --> 00:05.320]  Macmillan audio presents,
[00:05.320 --> 00:07.080]  Death's End,
[00:07.080 --> 00:09.360]  Bites Asinleo,
[00:09.360 --> 00:11.960]  Translated by Ken Lu.
[00:11.960 --> 00:13.240]  Read for you,
[00:13.240 --> 00:14.880]  by P. J. Oclan.
[00:19.680 --> 00:22.440]  Exerped from the preface to
[00:22.440 --> 00:24.840]  a past outside of time.
[00:24.840 --> 00:28.840]  I suppose this ought to be called
[00:28.840 --> 00:32.840]  history, but since all I can rely on is my memory,
[00:32.840 --> 00:35.840]  it lacks the rigor of history.
[00:35.840 --> 00:38.840]  It's not even accurate to call it the past,
[00:38.840 --> 00:41.840]  for the events related in these pages
[00:41.840 --> 00:43.840]  didn't occur in the past,
[00:43.840 --> 00:45.840]  aren't taking place now,
[00:45.840 --> 00:48.840]  and will not happen in the future.
[00:48.840 --> 00:51.840]  I don't want to record the details,
[00:51.840 --> 00:53.840]  only a frame for a history

You'll need python3-pip and ffmpeg-tools, then just run 'pip install --user git+https://github.com/openai/whisper.git'
(--user so it doesn't waste rootfs space; you might need PATH=$PATH:~/.local/bin in .bash_profile)
The medium model supposedly requires 5 GB of RAM, so tiny/base/small should be fine.
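
If you would rather call Whisper from Python than from the shell, something like the sketch below should be roughly equivalent to the command above. `whisper.load_model` and `model.transcribe` are the package's documented entry points; the helpers `format_timestamp`, `format_segments` and `transcribe` are my own names, and the timestamp layout is only modelled on the CLI output shown earlier. Passing `fp16=False` should avoid the FP16 warning on CPU.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS.mmm, like the timestamps in the CLI output."""
    millis = round(seconds * 1000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{minutes:02d}:{secs:02d}.{millis:03d}"


def format_segments(segments: List[Dict]) -> List[str]:
    """Turn Whisper segments into '[start --> end] text' lines."""
    return [
        f"[{format_timestamp(s['start'])} --> {format_timestamp(s['end'])}] {s['text']}"
        for s in segments
    ]


def transcribe(path: str, model_name: str = "tiny.en") -> List[str]:
    """Transcribe an audio file on the CPU and return formatted lines."""
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(path, fp16=False)  # FP32, since there is no GPU
    return format_segments(result["segments"])
```

For example, `print("\n".join(transcribe("01 - Preface.mp3")))` should print lines much like the ones in the log above.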

Edit: it also does translation. The tiny model was 0/10 and just made stuff up, but the default small model (461 MB) actually works surprisingly well:

whisper --task translate testPL.wav
/home/defaultuser/.local/lib/python3.8/site-packages/whisper/transcribe.py:78: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Polish
[00:00.000 --> 00:02.000]  Hi, how are you?
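
The same translation can probably be done from Python by passing `task="translate"` to `model.transcribe`. This is a minimal sketch: `segments_to_note` and `translate_to_english` are hypothetical helper names, and the small model is used only because it is the one that worked in the run above.

```python
from typing import Dict, List


def segments_to_note(segments: List[Dict]) -> str:
    """Join the text of all non-empty segments into one plain-text note."""
    parts = (s["text"].strip() for s in segments)
    return " ".join(p for p in parts if p)


def translate_to_english(path: str, model_name: str = "small") -> str:
    """Translate speech in any supported language into English text."""
    import whisper  # imported lazily so segments_to_note works without it

    model = whisper.load_model(model_name)
    # task="translate" asks for English output; the language is auto-detected.
    result = model.transcribe(path, task="translate", fp16=False)
    return segments_to_note(result["segments"])
```

With the file from the run above, `translate_to_english("testPL.wav")` would be expected to return the English text as a single string.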

Whisper - wow, looks spectacular! It is definitely something I have to integrate into the Speech Note/Keyboard apps.

@mikol Thank you for the Speech Note / Speech Keyboard apps.
I use these apps very often.
It would be great to have a spell check for upper and lower case in German.
The recognition rate is really good.

I didn’t even know Speech Keyboard existed!

Is there a way to integrate it better with OKBoard? Edit: my bad, it works. Just have to switch OKBoard off and on again.

Also, Speech Note doesn’t capitalise at all here, which means I have to edit everything before I send it. I also don’t want each sentence on a new line, though some people might.
Punctuation isn’t in the model, though it might be after some training.

Coqui (huge) EN is better and faster than Mozilla EN, though still slow on my XA2. I take it we’re not using the GPU here?