Predictive text for more languages

I am looking for Greek, since that is what I need, and the link you gave includes it, so it can probably be used. But it’s rather formal: you don’t text your friends the same way you speak in the EU Parliament.

I don’t think we should recap everything, since the thread start links to the original thread. But yes, Europarl could be the part of a language-specific corpus that covers formal language; additional informal material is needed to boost the vocabulary. Even though the learning functionality works well, you need a large enough n-gram database to start with. So we should find out which languages are of interest and whether they are feasible right now. Please don’t forget to say which language you need.

@ljo, if you need to have my code re-licensed, let me know. I don’t remember what licence we had for the Presage bits, but if it is GPLv3 we could probably bump it down a version, assuming that Jolla still doesn’t want to touch GPLv3 code. Good luck with getting students on it!

Thanks for telling us it was Greek you were talking about :slight_smile:

We should get a list of all the languages needed. People can post here.

Presage is GPLv2 (updated, my battery ran out). Jolla said they would accept LGPLv3 when I offered. @rinigus, thanks.


Presage is gplv3. Jolla said they would accept lgplv3 when I offered.

Only the input handler plugin is GPLv3; Presage itself is GPL-2.0, see:

I managed to contact the original author (Matteo Vescovi) back when we worked on the predictor. We could still ask him whether relicensing is feasible, if you have not done too much work on the rewrite and have not asked about relicensing yet.


I think the issue back then was with linking to it. It does have a D-Bus API, though, as far as I remember, so the linking problem can be worked around by running it in a separate process.

Would it be possible to just have this predictive text input with an empty database, so that the n-grams are filled in while I type? That way I would already have something usable from scratch.
And how would I be able to install it? Through OpenRepos?
I am very excited about this, as I went to a clean install without AD on an XA2 a few weeks ago, and this is the only function I am missing. :slight_smile:
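To illustrate the "start empty and learn while typing" idea, here is a minimal sketch of a bigram predictor. It is not how Presage or the keyboard plugin is implemented; the class `BigramPredictor` and its methods are hypothetical names made up for this example.

```python
from collections import Counter, defaultdict


class BigramPredictor:
    """Toy next-word predictor (hypothetical, not Presage's design).

    It starts with an empty database and learns only from typed text.
    """

    def __init__(self):
        # Maps a previous word to a Counter of the words seen right after it.
        self.following = defaultdict(Counter)

    def learn(self, text):
        """Record every adjacent word pair (bigram) found in the text."""
        words = text.lower().split()
        for prev, nxt in zip(words, words[1:]):
            self.following[prev][nxt] += 1

    def predict(self, prev_word, limit=3):
        """Return up to `limit` words most often seen after prev_word."""
        counts = self.following.get(prev_word.lower())
        if not counts:
            return []  # nothing learned for this word yet
        return [word for word, _ in counts.most_common(limit)]


predictor = BigramPredictor()
print(predictor.predict("good"))  # empty database, so no suggestions

predictor.learn("good morning everyone")
predictor.learn("good morning again")
predictor.learn("good night")
print(predictor.predict("good"))  # suggestions now come from typed text
```

The catch is visible in the first `print`: with an empty database there are no suggestions at all until enough text has been typed, which is why a starting corpus still helps.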


Yes, sorry, my battery ran out before I could update the post (now fixed; typing blindly into the textbox behind the keyboard is a challenge :slight_smile: especially when almost falling asleep). It has already progressed far enough, so there is no need to ask. But as @rinigus and you said, it will in some cases be necessary for other minor parts.

Depending on your language, other parts are also still needed to make the experience smooth, such as inflection and compounding. In comparison, finding a small corpus should not be too hard, unless it is a very small language or a language that is not supported right now.

Sure, but for a start this would be nice. The corpus can be added afterwards with an update. If the software works and only the corpus is missing, give it a try. I am from a country of 7 million people, and I don’t know where or how to get a corpus for it. So filling up the words while using the phone sounds like a good alternative to me. :slight_smile:

@dexic: for reference, we have a processed Estonian corpus, for a country of 1.4 m people. 7 m should be fine - just look for data. Contact a language institute or lab and ask what they can offer. That is how I got the Estonian corpus.


I sent an e-mail to a friend in Belgrade to take a shot. Wish him luck!


About the corpus: a fairly good one can be prepared from Wikipedia articles. I have used that for OkBoard together with some university corpora, but I think a Wikipedia dump alone is also OK.


This topic was touched briefly in the meeting today.

if Presage could be integrated out-of-process, then licensing issues could be addressed
but if Presage has some UTF limitations, there has been another effort by ljo and his team
so hope for better news in the future for all SFOS languages

@sledges (or anyone at Jolla), could you elaborate a bit on whether something can be done so that more languages are supported on the phone without having to install extra stuff (or at least so that whatever you need can be installed from the official store)?


Also, unrelated to the above, I’d like to add a way of getting a corpus from Wikipedia in case you have trouble finding one.

Download this: GitHub - attardi/wikiextractor: A tool for extracting plain text from Wikipedia dumps

Download a _locale_wiki-latest-pages-articles.xml file from:

and run: python3 --infn _locale_wiki-latest-pages-articles.xml

you will get a large .txt file to use as corpus.

In the above, substitute _locale_ with the code of your preferred language, e.g. cs for Czech (Index of /cswiki/latest/).
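Once you have the plain-text file, turning it into n-gram counts could look roughly like the sketch below. This is not the tooling Presage uses to build its databases; `count_ngrams` and `count_ngrams_in_file` are names made up here, and the file is assumed to be UTF-8 plain text.

```python
import re
import unicodedata
from collections import Counter

# Runs of word characters with digits and underscores excluded, so
# accented and non-Latin letters (e.g. Greek) are kept as part of words.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_ngrams(lines, n=2):
    """Count n-grams of lowercased words, never crossing line boundaries."""
    counts = Counter()
    for line in lines:
        # NFC keeps accented letters as single code points.
        text = unicodedata.normalize("NFC", line).lower()
        words = WORD_RE.findall(text)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def count_ngrams_in_file(path, n=2):
    """Stream a (possibly large) UTF-8 text file line by line."""
    with open(path, encoding="utf-8") as handle:
        return count_ngrams(handle, n)


# Small in-memory demo, so the sketch runs without a dump file.
demo = count_ngrams(["Καλημέρα σας, καλημέρα σας!"], n=2)
print(demo.most_common(1))
```

For a real dump you would call `count_ngrams_in_file` on the extracted .txt file and keep only the n-grams above some frequency threshold, since the raw counts for a whole Wikipedia get large.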


If jolla-keyboard could communicate with the Presage engine without linking it directly (e.g. via D-Bus), then Presage’s GPLv2 licence shouldn’t be a problem.
I do not have the details on how exactly Presage is not Unicode-aware. I had kept in touch with @ljo, but I do not have the latest on their effort.


There is a D-Bus service for it - presage/apps/dbus at master · sailfish-keyboard/presage · GitHub. We may be missing a few extra API calls that we used for the predictive keyboard, but those should be easy to add (from the project README, it looks like only forget is missing).

Re Unicode: it will be needed for some languages, but many could work without it in this context. See the currently supported languages to judge applicability, assuming that similar languages can be supported as well.


If jolla-keyboard could communicate with the Presage engine without linking it directly (e.g. via D-Bus), then Presage’s GPLv2 licence shouldn’t be a problem.

But guys! (Just before anyone invests expensive time in this area.)
Has anyone contacted Matteo Vescovi (the original Presage author) with a licence change proposal?

If not, then I am more than happy to ask him kindly. What licence requirements does Jolla have for integration?

I am not into the licensing business, so if anyone could summarise the reason for the change, that would be helpful.
