Hi, does anyone know of a speech-to-text app for simple notes, either native or AD, that does not require Google Voice?
PocketSphinx was developed to run on mobile devices, so there wouldn’t be any need to send speech data to a cloud service. I guess the simplest solution would be to record a sound file and use that as input for PocketSphinx to create a text file.
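A rough sketch of a pre-check for that approach. As far as I know, the stock PocketSphinx acoustic models expect 16 kHz, mono, 16-bit PCM audio, so it is worth verifying the recording before handing it over. The function name `wav_problems` and the 16 kHz default are my own assumptions, and the recognition step itself is left out because the invocation differs between PocketSphinx versions.

```python
import wave

def wav_problems(path, rate=16000):
    """Return a list of reasons the WAV file may not suit the recogniser.

    An empty list means the file is mono, 16-bit and at the expected rate.
    The expected rate is an assumption; adjust it to the model you use.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != rate:
            problems.append(f"expected {rate} Hz, got {wav.getframerate()} Hz")
    return problems
```

If the list is not empty, converting the recording (for example with ffmpeg) before recognition is probably the easiest fix.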
Thanks, I hadn’t realized that all other solutions are cloud based. I’ll check it out.
Mozilla has DeepSpeech, which is offline, I think. But it's not an app, and I have no idea how powerful a machine needs to be to run it.
The Speech Note app recently appeared on OpenRepos and uses Mozilla DeepSpeech with a collection of different languages.
https://openrepos.net/content/mkiol/speech-note
Link for the lazy. If the dev reads this: PLEASE bring it to the official store.
Whisper from openai works quite well:
[defaultuser@Xperia10III 2010 - Book 3 - Death's End]$ whisper --model tiny.en 01\ -\ Preface.mp3
/home/defaultuser/.local/lib/python3.8/site-packages/whisper/transcribe.py:78: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:02.480] This is audible.
[00:02.480 --> 00:05.320] Macmillan audio presents,
[00:05.320 --> 00:07.080] Death's End,
[00:07.080 --> 00:09.360] Bites Asinleo,
[00:09.360 --> 00:11.960] Translated by Ken Lu.
[00:11.960 --> 00:13.240] Read for you,
[00:13.240 --> 00:14.880] by P. J. Oclan.
[00:19.680 --> 00:22.440] Exerped from the preface to
[00:22.440 --> 00:24.840] a past outside of time.
[00:24.840 --> 00:28.840] I suppose this ought to be called
[00:28.840 --> 00:32.840] history, but since all I can rely on is my memory,
[00:32.840 --> 00:35.840] it lacks the rigor of history.
[00:35.840 --> 00:38.840] It's not even accurate to call it the past,
[00:38.840 --> 00:41.840] for the events related in these pages
[00:41.840 --> 00:43.840] didn't occur in the past,
[00:43.840 --> 00:45.840] aren't taking place now,
[00:45.840 --> 00:48.840] and will not happen in the future.
[00:48.840 --> 00:51.840] I don't want to record the details,
[00:51.840 --> 00:53.840] only a frame for a history
You’ll need python3-pip and ffmpeg-tools, then just ‘pip install --user git+https://github.com/openai/whisper.git’
(--user to not waste rootfs; you might need PATH=$PATH:~/.local/bin in .bash_profile)
The medium model supposedly requires 5 GB of RAM, so tiny/base/small should be fine.
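For anyone who prefers calling it from Python instead of the command line, here is a minimal sketch. The memory figures are the approximate ones I remember from the Whisper README (listed there as VRAM), so treat them as rough guidance rather than measurements on a phone. `pick_model`, `transcribe` and the `free_gb` parameter are names I made up for illustration.

```python
# Approximate memory needs in GB, as I recall them from the Whisper README.
# These are assumptions for illustration, not measurements on this device.
APPROX_GB = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large", 10)]

def pick_model(free_gb):
    """Return the largest model whose approximate need fits, or None."""
    fitting = [name for name, need in APPROX_GB if need <= free_gb]
    return fitting[-1] if fitting else None

def transcribe(path, free_gb=2):
    """Transcribe an audio file with the largest model that should fit."""
    import whisper  # imported lazily so pick_model works without it installed

    name = pick_model(free_gb)
    if name is None:
        raise MemoryError("not enough free memory for any model")
    model = whisper.load_model(name)
    # fp16=False avoids the "FP16 is not supported on CPU" warning
    result = model.transcribe(path, fp16=False)
    return result["text"]
```

Usage would be something like `print(transcribe("01 - Preface.mp3", free_gb=2))`, which should pick the small model under these assumptions.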
Edit: it also does translation. The tiny model was 0/10 and just made stuff up, but the default small model (461 MB) actually works surprisingly well:
whisper --task translate testPL.wav
/home/defaultuser/.local/lib/python3.8/site-packages/whisper/transcribe.py:78: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Polish
[00:00.000 --> 00:02.000] Hi, how are you?
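If you only want the text for a note, the timestamped lines shown above can be stripped with a few lines of Python. This is just a sketch: `parse_segments` and `as_note` are made-up names, and the pattern only covers the `mm:ss.mmm` timestamps seen in this output, so it may need adjusting for other timestamp layouts.

```python
import re

# Matches lines like "[00:00.000 --> 00:02.480] This is audible."
LINE = re.compile(
    r"^\[(\d+):(\d{2})\.(\d{3}) --> (\d+):(\d{2})\.(\d{3})\]\s*(.*)$"
)

def parse_segments(output):
    """Return (start_seconds, end_seconds, text) for each timestamped line."""
    segments = []
    for line in output.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skips warnings and language detection messages
        sm, ss, sms, em, es, ems, text = match.groups()
        start = int(sm) * 60 + int(ss) + int(sms) / 1000
        end = int(em) * 60 + int(es) + int(ems) / 1000
        segments.append((start, end, text))
    return segments

def as_note(output):
    """Join the text of all segments into a single paragraph."""
    return " ".join(text for _, _, text in parse_segments(output))
```

Piping the command output into a file and passing its contents to `as_note` gives one paragraph without the timestamps.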
Whisper - Wow, looks spectacular! It is definitely something I have to integrate into the Speech Note/Keyboard apps.
@mikol Thank you for speech note / keyboard app.
I use these apps very often.
It would be great to have a spell check for upper and lower case in German.
The recognition rate is really good.
I didn’t even know Speech Keyboard existed!
Is there a way to integrate it better with OKBoard? Edit: my bad, it works. Just have to switch OKBoard off and on again.
Also Speech Note doesn’t capitalise at all here. It means I have to edit everything before I send it. I also don’t want each sentence on a new line, though some people might.
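As a stopgap, the copied text could be tidied with a small script. This is only a sketch of a workaround, not something Speech Note does itself: `tidy` is a made-up name, and it simply treats each line as one sentence, so it will not fix capitalisation inside a line (German nouns, for example).

```python
def tidy(raw, one_paragraph=True):
    """Capitalise each non-empty line and optionally join them together.

    Each line is assumed to be one sentence, which is a simplification.
    """
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    lines = [line[0].upper() + line[1:] for line in lines]
    separator = " " if one_paragraph else "\n"
    return separator.join(lines)
```

Calling `tidy(text, one_paragraph=False)` keeps one sentence per line for people who prefer that layout.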
Punctuation isn’t in the model though might be after some training:
Coqui (huge) EN is better and faster than Mozilla EN, though still slow on my XA2. I take it we’re not using the GPU here?