Predictive text for more languages

So far the predictive text support is only available for a subset of the languages and only for officially supported devices.

While it's something you can live without, having it can make a real difference on community-supported devices and for those whose language is not popular enough.

The community (@rinigus, @martonmiklos, @ljo) has made an effort already:
http://talk.maemo.org/showthread.php?t=100266

So I am opening this topic to discuss what is missing, possible solutions, help with finding corpora (if you are a linguist, we probably need you), and any roadblocks.

If someone provides a reference input for their preferred language, I am glad to generate the ngram database and package it for OpenRepos.

Also, in general there are some areas for improvement in the presage-based predictor, like adding a UI for removing/editing ngrams. But yeah, it is on my TODO list, and since I moved to the XA2 (which was a while ago) I have not even installed the predictor. :frowning:

Just to mention it explicitly here: I am also working on a fully Unicode-aware, rewritten, relicensed version of presage, which would make it possible to include it in base SailfishOS and make the number of supported languages much larger. Unfortunately the covid situation made progress much slower, but hopefully I can get the students back later this semester so I can make some needed final pushes.

My main issue is finding a corpus. I contacted the linguistics department of a local university and explained to them what I wanted to do with the dataset, but they didn’t seem to want to cooperate. :roll_eyes:

I don’t understand what you mean by “reference input”, but in the Slovenian language (which is fully translated and officially supported) prediction only works in Calendar but not in the Message application, which seems very odd to me. Any explanation?

Reference input = a big chunk of text = corpus that the database will use to predict the words.

prediction only works in Calendar but not in Message application.

That is not prediction but rather a completer.

I don’t understand what you mean by “reference input”

Some explanation on the topic: Jolla licensed a predictive text engine called XT9 from Nuance, and this is what they ship to licensed customers on the Jolla1/C/AquaFish and Xperia X. However, this engine does not support the community-supported languages and is not available on non-licensed SFOS installations, i.e. ports.

So we decided to hack our own prediction plugin based on the presage library.

It has several predictors; the most important is the ngram predictor.

This works roughly as follows: we grab a large amount of text containing a representative sample of the words and phrases used in a given language. This is called a corpus. For some languages corpora are available, usually maintained by universities; other languages have none. For Hungarian, for example, I used some novels and some polite letters, which works somewhat but is not ideal.

To install language support for the predictor, the corpus is sliced into ngrams. Ngrams are basically chains of two, three, or four words extracted from the corpus. These ngrams are put into a database, and the predictor makes suggestions based on the ngrams matching your last typed words.
Large ngram sets make the queries slow, so a good corpus should be small (while still representative).
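
To make the idea concrete, here is a minimal Python sketch of slicing a corpus into ngrams and suggesting the next word from the last typed words. It is only an illustration of the general technique: the function names (`tokenize`, `build_ngrams`, `predict`) are made up for this example, the table lives in memory rather than in a database, and none of this is presage's actual API or storage format.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase the text and keep runs of word characters (Unicode-aware).
    return re.findall(r"\w+", text.lower())


def build_ngrams(tokens, max_n=3):
    """Map each context (a tuple of 0..max_n-1 words) to counts of the next word."""
    table = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            table[tuple(context)][word] += 1
    return table


def predict(table, typed_words, max_n=3, k=3):
    """Suggest up to k next words, trying the longest matching context first."""
    suggestions = []
    for size in range(max_n - 1, -1, -1):
        context = tuple(typed_words[-size:]) if size else ()
        for word, _count in table.get(context, Counter()).most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions


corpus = "The cat sat on the mat. The cat ate the fish."
table = build_ngrams(tokenize(corpus))
print(predict(table, ["on", "the"]))
```

The table grows with every distinct word chain in the corpus, which is why a huge corpus makes lookups slower. Falling back from three-word to two-word to single-word contexts is one simple way to still get suggestions when the longer chain was never seen.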

I am also working on an fully unicode-aware rewritten relicensed version of presage which would make it possible to include in base SailfishOS and make the number of supported languages much larger.

Is it available somewhere in a git repo?

What language are you looking for?

Corpora are hard to come by, as good-quality text sources are usually copyrighted (journals, books, newspapers).

Have a look at the Europarl Corpus. It’s generated from the proceedings of the European Parliament and contains a number of texts for the languages spoken in the EU.

It might not be the ideal source given its mostly legal content, but it could be a start.

Not yet. I have a handful of company-internal svn repos I need to migrate, probably to GitLab. Hopefully it will be in time to be an Xmas treat for everyone. But since it is a rewrite, I need to be able to say with full certainty that there is no doubt about the licensing: it is original work and the company paid for it, so there are no copyright issues.

I am looking for Greek, since that is what I need, and the link you gave includes it, so it can probably be used. But it's kind of official. I mean, you don’t text your friends in the same way you speak in the EU parliament.

I don’t think we should recap everything, since there is a reference to the original thread in the thread start. But yes, Europarl could be the part of a language-specific corpus that covers formal language; additional informal text is needed to boost the vocabulary. Even though the learning functionality works well, you need a large enough ngram database to start with. So we should find out which languages are of interest and whether they are feasible right now. Please don’t forget to tell us which language.

@ljo, if you need to have my code re-licensed, let me know. Don’t remember what we had for presage bits, but if it is GPLv3 we may probably bump it down a version. Assuming that Jolla still doesn’t want to touch GPLv3 code. Good luck with getting students on it!

Thanks for telling us it was Greek you were talking about :slight_smile:

We should get a list of all the languages needed. People can post here.

Presage is gplv2 (updated, my battery ran out). Jolla said they would accept lgplv3 when I offered. @rinigus, thanks.

Presage is gplv3. Jolla said they would accept lgplv3 when I offered.

Only the input handler plugin is GPLv3; presage itself is GPL-2.0, see:

I managed to contact the original author (Matteo Vescovi) back when we worked on the predictor, and we could still ask him about the feasibility of relicensing, if you have not done too much work on the rewrite and have not yet asked about relicensing.

I think the issue was with linking to it then. Although, as far as I remember, it does have a D-Bus API, so this can be worked around by using separate processes.

Hello!
Would it be possible to just have this predictive text input with an empty database, so that I fill in the ngrams while I type? That way I would already have something usable from scratch.
And how would I be able to install it? Through OpenRepos?
I am very excited about this, as I did a clean install without AD on an XA2 a few weeks ago, and this is the only function I am missing. :slight_smile:

Yes, sorry, my battery ran out before I could update the post (now fixed; typing blindly into the textbox behind the keyboard is a challenge :slight_smile: especially when almost falling asleep). The rewrite has already progressed far enough, so there is no need to ask. But as @rinigus and you said, it will in some cases be necessary for other minor parts.