Predictive text for more languages

ljo · 13 September 2020 08:15

Depending on your language, other parts are also still needed to make the experience smooth, such as inflection and compounding. In comparison, finding a small corpus should not be too hard, unless it is a very small language or one that is not supported right now.

dexic · 13 September 2020 08:35

Sure, but for a start this would be nice. The corpus can be added afterwards with an update. If the software works and only the corpus is missing, give it a try. I am from a country of 7 million people. Where and how to get a corpus for it, I don’t know. So filling up the words while using the phone sounds like a good alternative to me.

rinigus · 13 September 2020 11:12

@dexic: for reference, we have an Estonian corpus processed, for a country of 1.4 million people. 7 million should be fine, just look for data. Contact a language institute or lab and ask what they can offer. That is how I got the Estonian corpus.

dexic · 13 September 2020 11:40

I sent an e-mail to a friend in Belgrade to take a shot. Wish him luck!

pemek · 14 September 2020 09:10

About the corpus: quite a good one can be prepared from Wikipedia articles. I have used that for OkBoard together with some university corpora. But I think a Wikipedia dump alone is also OK.

ApB · 3 June 2021 19:49

This topic was touched on briefly in the meeting today.

if Presage could be integrated out-of-process, then licensing issues could be addressed
but if Presage has some UTF limitations, there has been another effort by ljo and his team
so hope for better news in the future for all SFOS languages

@sledges (or anyone at Jolla), could you elaborate a bit more on whether something can be done so that we can have more languages supported on the phone without needing to install extra stuff (or at least be able to install whatever is needed from the official store)?

ApB · 3 June 2021 19:50

Also, unrelated to the above, I’d like to add a way of getting a corpus from Wikipedia in case you have trouble finding one.

Download wikiextractor (GitHub: attardi/wikiextractor), a tool for extracting plain text from Wikipedia dumps.

Download a _locale_wiki-latest-pages-articles.xml file from:

https://dumps.wikimedia.org/_locale_wiki/latest/

and run: python3 WikiExtractor.py --infn _locale_wiki-latest-pages-articles.xml

You will get a large .txt file to use as a corpus.

In the above, replace _locale_ with the code of your preferred language. For example, for Czech use cs, which gives cswiki (see Index of /cswiki/latest/).
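As a quick sanity check on the extracted text, a word frequency count shows whether the corpus looks usable. The following is only a minimal sketch: the function names are made up for illustration, it assumes the extractor wraps each article in `<doc ...>` and `</doc>` lines, and it is not the format Presage itself consumes.

```python
import re
from collections import Counter

# Runs of Unicode letters: word characters that are neither digits nor "_".
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(lines):
    """Count lowercased words in an iterable of text lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Assumption: article boundaries appear as <doc ...> and </doc> lines.
        if line.startswith("<doc") or line.startswith("</doc"):
            continue
        counts.update(word.lower() for word in WORD_RE.findall(line))
    return counts


def count_words_in_file(path):
    """Same count, streamed from a UTF-8 text file so large dumps fit in memory."""
    with open(path, encoding="utf-8") as handle:
        return count_words(handle)
```

Calling `count_words_in_file("corpus.txt").most_common(20)` (with `corpus.txt` standing in for whatever file you got) should list the most frequent words of the language. If markup fragments show up near the top instead, the extraction probably needs another look.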

sledges · 11 June 2021 15:23

If jolla-keyboard could communicate with the Presage engine without linking it directly (e.g. via D-Bus), then Presage’s GPLv2 licence shouldn’t be a problem.
I do not have the details on how exactly Presage is not Unicode-aware. I had kept in touch with @ljo, but do not have the latest on their effort.

rinigus · 11 June 2021 15:36

There is a D-Bus service for it: presage/apps/dbus at master · sailfish-keyboard/presage · GitHub. We may be missing a few extra API calls that we used for the predictive keyboard, but those should be easy to add (from the project README, it looks like only forget will be missing).

Re Unicode: it will be needed for some languages, but many could work without it in this context. See the currently supported languages to judge applicability, assuming that similar languages can be supported as well.

martonmiklos · 13 June 2021 19:13

If jolla-keyboard could communicate with the Presage engine without linking it directly (e.g. via D-Bus), then Presage’s GPLv2 licence shouldn’t be a problem.

But guys, before investing any expensive time in this area:
Has anyone contacted Matteo Vescovi (the original Presage author) about a licence change proposal?

If not, then I am more than happy to ask him kindly. What licence requirements does Jolla have for integration?

I am not into the licensing business, so if anyone could summarise the reason for the change, that would be helpful.