Predictive text for more languages

ApB · 12 September 2020 18:27

So far the predictive text support is only available for a subset of the languages and only for officially supported devices.

While it’s something you can live without, having it would make a difference for community-supported devices and for those whose language is not popular enough.

The community (@rinigus, @martonmiklos, @ljo) has made an effort already:
http://talk.maemo.org/showthread.php?t=100266

So I am opening this topic to discuss what is missing, possible solutions, help with finding corpora (if you are a linguist, we probably need you) and any roadblocks.

martonmiklos · 12 September 2020 19:05

If someone provides a reference input for their preferred language, I am glad to generate the ngram database and package it for OpenRepos.

Also, in general there are some areas for improvement in the presage-based predictor, like adding a UI for removing/editing ngrams. But yeah, it is on my TODO list, and since I moved to the XA2 (which was a while ago) I have not even installed the predictor.

ljo · 12 September 2020 19:14

Just to mention it explicitly here: I am also working on a fully Unicode-aware, rewritten and relicensed version of presage, which would make it possible to include it in base SailfishOS and make the number of supported languages much larger. Unfortunately the covid situation made progress much slower, but hopefully I can get the students back later this semester so I can make some needed final pushes.

ApB · 12 September 2020 19:54

My main issue is finding a corpus. I contacted the linguistics department of a local university and explained what I wanted to do with the dataset, but they didn’t seem to want to cooperate.

filip.k · 12 September 2020 20:08

I don’t understand what you mean by “reference input”, but in Slovenian (which is fully translated and officially supported) prediction only works in Calendar and not in the Message application, which seems very odd to me. Any explanation?

ApB · 12 September 2020 20:10

Reference input = a big chunk of text = corpus that the database will use to predict the words.

martonmiklos · 12 September 2020 20:31

filip.k wrote: “prediction only works in Calendar but not in Message application.”

That is not prediction but rather a completer.

filip.k wrote: “I don’t understand what you mean by ‘reference input’”

Some explanation on the topic: Jolla licensed a predictive engine called XT9 from Nuance, and this is what they ship to the Jolla1/C/AquaFish and Xperia X licensed customers. However, this prediction engine does not support the community-supported languages and is not available for non-licensed SFOS installations, i.e. ports.

So we decided to hack our own prediction plugin based on the presage library.

It has several predictors; the most important is the ngram predictor.

This works roughly as follows: we grab a large amount of text containing a representative sample of the words and phrases used in a given language. This is called a corpus. For some languages corpora are available, usually maintained by universities, but for others there are none. For example, for Hungarian I used some novels and some polite letters, which works somewhat but is not ideal.

To install language support for the predictor, this corpus is sliced into ngrams. Ngrams are basically two-, three- and four-word chains extracted from the corpus. These ngrams are put into a database, and the predictor makes suggestions based on the ngrams matching your last typed words.
Large ngram sets make the queries slow, so a good corpus should be small (while still representative).
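
To make the slicing and lookup steps concrete, here is a minimal Python sketch of the idea. It is not presage's code or database format: the function names (`tokenize`, `build_ngrams`, `predict`), the in-memory `Counter` instead of a real database, and the default chain length of three words are all assumptions made for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (\\w is Unicode-aware in Python 3)."""
    return re.findall(r"\w+", text.lower())


def build_ngrams(tokens, max_n=3):
    """Count every word chain of length 1..max_n found in the token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict(counts, typed_words, max_n=3, limit=3):
    """Suggest next words, trying the longest matching context first."""
    for n in range(max_n, 0, -1):
        # For an n-gram we need the last n-1 typed words as context.
        context = tuple(typed_words[-(n - 1):]) if n > 1 else ()
        if len(context) != n - 1:
            continue  # not enough typed words for this chain length
        candidates = Counter()
        # Linear scan over all ngrams: simple, but slow for large sets.
        for ngram, count in counts.items():
            if len(ngram) == n and ngram[:-1] == context:
                candidates[ngram[-1]] = count
        if candidates:
            return [word for word, _ in candidates.most_common(limit)]
    return []


# Hypothetical toy corpus, far too small to be representative.
corpus = "the cat sat on the mat the cat ate the fish"
counts = build_ngrams(tokenize(corpus))
print(predict(counts, ["on", "the"]))  # trigram context matches: ['mat']
```

The scan in `predict` touches every stored ngram, which gives a feel for why a large ngram set slows down queries; a real database would use an index, but its size still grows with the corpus.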

martonmiklos · 12 September 2020 20:36

ljo wrote: “I am also working on a fully Unicode-aware, rewritten and relicensed version of presage, which would make it possible to include it in base SailfishOS and make the number of supported languages much larger.”

Is it available somewhere in a git repo?

rozgwi · 12 September 2020 20:58

What language are you looking for?

Corpora are hard to come by, as good-quality text sources are usually copyrighted (journals, books, newspapers).

Have a look at the Europarl Corpus. It’s generated from the proceedings of the European Parliament and contains a number of texts for the languages spoken in the EU.

It might not be the ideal source given the mostly legal content, but it could be a start.

ljo · 12 September 2020 21:09

Not yet. I have a handful of company-internal svn repos that I need to migrate, probably to GitLab. Hopefully it will be in time to be an xmas treat for everyone. But since it is a rewrite, I need to be able to say with full certainty that there are no doubts about the licensing of the original work, and that the company paid for the work, so there are no copyright issues.

ApB · 12 September 2020 21:19

I am looking for Greek, since that is what I need, and the link you gave includes it. It can probably be used, but it’s kind of official. I mean, you don’t text your friends in the same way you speak in the EU parliament.

ljo · 12 September 2020 21:27

I don’t think we should recap everything, since there is a reference to the original thread in the thread start. But yes, Europarl could be a part of the language-specific corpus covering formal language, while additional parts of informal language are needed to boost the vocabulary. Even though the learning functionality works well, you need a large enough ngram database to start with. So we should make sure we know the languages of interest and whether they are feasible right now. So please don’t forget to tell us which language.

rinigus · 12 September 2020 21:31

@ljo, if you need to have my code re-licensed, let me know. Don’t remember what we had for presage bits, but if it is GPLv3 we may probably bump it down a version. Assuming that Jolla still doesn’t want to touch GPLv3 code. Good luck with getting students on it!

ljo · 12 September 2020 21:33

Thanks for mentioning that it was Greek you were talking about.

ApB · 12 September 2020 21:36

We should get a list of all the languages needed. People can post here.

ljo · 12 September 2020 21:39

Presage is GPLv2 (updated, my battery ran out). Jolla said they would accept LGPLv3 when I offered. @rinigus, thanks.

martonmiklos · 13 September 2020 07:30

ljo wrote: “Presage is gplv3. Jolla said they would accept lgplv3 when I offered.”

Only the input handler plugin is GPLv3; presage itself is GPL-2.0, see:

github.com/sailfish-keyboard/presage (COPYING file on master: GNU General Public License, Version 2, June 1991)

I managed to contact the original author (Matteo Vescovi) at the time when we worked on the predictor, and we could still ask him about the feasibility of relicensing, if you have not done too much work on the rewrite yet and have not already asked about relicensing.

rinigus · 13 September 2020 07:49

I think the issue was with linking to it then. Although, as far as I remember, it does have a D-Bus API, so this can be worked around by running it in a separate process.

dexic · 13 September 2020 07:51

Hello!
Would it be possible to just have this predictive text input with an empty database, so that I fill in the ngrams while I type? That way I would already have something usable from scratch?
And how would I install it? Through OpenRepos?
I am very excited about this, as I did a clean install without AD on an XA2 a few weeks ago, and this is the only function I am missing.
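
To illustrate what "filling in the ngrams while typing" could mean in principle, here is a small Python sketch of a predictor that starts with an empty store and learns word pairs from each sent message. It is a toy under stated assumptions, not how the presage plugin implements learning: the class name `LearningPredictor`, its methods, and the restriction to two-word chains are all hypothetical.

```python
from collections import Counter


class LearningPredictor:
    """Toy predictor that starts empty and learns bigrams from typed text."""

    def __init__(self):
        # Maps (previous_word, next_word) -> how often the pair was typed.
        self.bigrams = Counter()

    def learn(self, sentence):
        """Record every adjacent word pair of a finished sentence."""
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            self.bigrams[(prev, nxt)] += 1

    def suggest(self, last_word, limit=3):
        """Return the most frequently seen followers of last_word."""
        last_word = last_word.lower()
        followers = Counter(
            {nxt: count
             for (prev, nxt), count in self.bigrams.items()
             if prev == last_word}
        )
        return [word for word, _ in followers.most_common(limit)]


predictor = LearningPredictor()
print(predictor.suggest("good"))  # empty store, nothing to suggest: []
predictor.learn("good morning")
predictor.learn("good morning")
predictor.learn("good night")
print(predictor.suggest("good"))  # ['morning', 'night']
```

The first call shows the drawback of an empty start: until enough text has been typed, there is nothing to suggest.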

ljo · 13 September 2020 08:04

Yes, sorry, my battery ran out before I could update the post (now fixed; typing blindly into the text box behind the keyboard is a challenge, especially when almost falling asleep). The rewrite has already progressed far enough, so there is no need to ask. But as @rinigus and you said, it will in some cases be necessary for other minor parts.