Depending on your language, other parts are also still needed to make the experience smooth, such as inflection and compounding. In comparison, finding a small corpus should not be too hard, unless it is a very small language or one that is not supported right now.
Sure, but for a start, this would be nice. The corpus can be added afterwards with an update. But if the software works and only the corpus is missing: give it a try. I am from a country of 7 million people. I don't know where or how to get a corpus for it. So filling up the words while using the phone sounds like a good alternative to me.
@dexic: for reference, we have an Estonian corpus processed, for a country of 1.4 million people. 7 million should be fine, just look for data. Contact a language institute or lab and ask what they can offer. That is how I got the Estonian corpus.
I sent an e-mail to a friend in Belgrade to take a shot. Wish him luck!
About the corpus: quite a good corpus can be prepared from Wikipedia articles. I used them for OkBoard together with some university corpora. But I think a Wikipedia dump alone is also OK.
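For anyone wondering what "preparing" such a text might involve, here is a minimal sketch in Python, assuming the corpus is already plain UTF-8 text. It only counts word n-grams, which is the kind of statistic a predictive engine is typically trained on. It is not the format or tooling that Presage or OkBoard actually use, and count_ngrams is a made-up helper name for illustration.

```python
import re
from collections import Counter

# Runs of Unicode letters, optionally joined by an apostrophe (e.g. "don't").
# Digits and underscores are excluded on purpose.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def count_ngrams(text, n=1):
    """Count lowercase word n-grams in a plain-text corpus.

    Hypothetical helper for illustration only. Sentence boundaries are
    ignored, so n-grams may span two sentences.
    """
    words = [w.lower() for w in WORD_RE.findall(text)]
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


if __name__ == "__main__":
    sample = "The cat saw the dog. The cat ran."
    print(count_ngrams(sample, 1).most_common(3))
    print(count_ngrams(sample, 2).most_common(3))
```

For a real Wikipedia dump you would read the extracted file in chunks rather than load it all at once, but the counting itself stays the same.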
This topic was touched briefly in the meeting today.
if Presage could be integrated out-of-process, then licensing issues could be addressed
but if Presage has some UTF limitations, there has been another effort by ljo and his team
so hope for better news in the future for all SFOS languages
@sledges (or anyone at Jolla), could you elaborate a bit more on whether something can be done so that we can have more languages supported on the phone without needing to install extra stuff (or at least be able to install whatever you need from the official store)?
Also, unrelated to the above, I'd like to add a way of getting a corpus from Wikipedia in case you have trouble finding one.
Download this: GitHub - attardi/wikiextractor: A tool for extracting plain text from Wikipedia dumps
Download a _locale_wiki-latest-pages-articles.xml file from:
https://dumps.wikimedia.org/_locale_wiki/latest/
and run: python3 WikiExtractor.py --infn _locale_wiki-latest-pages-articles.xml
you will get a large .txt file to use as corpus.
In the above, substitute _locale_ with the language code you want. For example, for Czech use cs, and so on. (Index of /cswiki/latest/)
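The substitution above can be sketched as a small Python helper. This is only an illustration: wikipedia_dump_url is a made-up name, and the sketch assumes the dump is published compressed as .xml.bz2 in the /latest/ directory, so check the index page for your language before relying on it.

```python
def wikipedia_dump_url(locale):
    """Build the URL of the latest articles dump for a Wikipedia language code.

    Hypothetical helper. Assumes the usual layout
    https://dumps.wikimedia.org/<code>wiki/latest/ and a .xml.bz2 file name.
    Only plain alphanumeric codes such as "cs" or "et" are handled here.
    """
    code = locale.strip().lower()
    if not code.isalnum():
        raise ValueError(f"not a plausible language code: {locale!r}")
    wiki = f"{code}wiki"
    return (
        f"https://dumps.wikimedia.org/{wiki}/latest/"
        f"{wiki}-latest-pages-articles.xml.bz2"
    )


if __name__ == "__main__":
    # Czech, as in the example above
    print(wikipedia_dump_url("cs"))
```

The .bz2 file has to be decompressed (or passed to a tool that reads it directly) before you get the .xml mentioned above.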
If jolla-keyboard could communicate with the Presage engine without linking it directly (e.g. via D-Bus), then Presage's GPLv2 licence shouldn't be a problem.
I do not have the details on how exactly Presage is not Unicode-aware. I had kept in touch with @ljo, but I do not have the latest on their effort.
There is a D-Bus service for it: presage/apps/dbus at master · sailfish-keyboard/presage · GitHub. We may be missing a few extra API calls that we used for the predictive keyboard, though. But those should be easy to add (from the project README, it looks like just forget will be missing).
Re Unicode: it will be needed for some languages, but many could work without it in this context. See the currently supported languages to judge applicability, assuming that similar languages can be supported as well.
If jolla-keyboard could communicate with the Presage engine without linking it directly (e.g. via D-Bus), then Presage's GPLv2 licence shouldn't be a problem.
But guys! (Just before investing any expensive time in this area.)
Did anyone contact Matteo Vescovi (the original Presage author) about a licensing change proposal?
If not, then I am more than happy to ask him kindly. What licence requirements does Jolla have for integration?
I am not into the licensing business, so if anyone could summarise the reason for the change, that would be helpful.