
Voikko
General architecture

Voikko consists of a set of separately released components that form a stack of layers as illustrated in the picture below.

[Figure: General architecture of Voikko]

In this project we develop the components shown with a blue background, and some of the components shown with a yellow background:

libvoikko
Libvoikko is the high level library that contains, among other things, the algorithms that generate spelling suggestions and perform rule based hyphenation. It can also cache the results of common spell checking operations to improve performance. All of the grammar checking is done within libvoikko as well. Recent versions of libvoikko contain a built-in implementation of the Malaga parser, whereas earlier versions used the separate Malaga library. The library source also contains small command line test tools and Python bindings to the library.
Suomi-malaga
Suomi-malaga is a description of Finnish morphology implemented with Malaga. It was designed by Hannu Väisänen for text indexing and therefore accepted many common spelling mistakes and historical word forms that should be rejected when doing spell checking. For this reason, Voikko originally used its own branch of Suomi-malaga, with version numbers matching 0.7.X. Now the versions have been merged, and Suomi-malaga 1.0 and later can be used to build a morphology suitable for spell checking and text indexing from the same source package.
libreoffice-voikko
Libreoffice-voikko is a LibreOffice extension that uses Voikko to provide Finnish spell checking, hyphenation and grammar checking.
Enchant Voikko plugin
A Voikko provider plugin for the multi-backend Enchant spell checker library is included in Enchant version 1.4 and later.
tmispell-voikko
Tmispell-voikko is an ispell compatible spell checker that uses Voikko to provide Finnish spell checking and falls back to real ispell for other languages. Tmispell-voikko was originally written by Pauli Virtanen for the freely distributable but closed source spell checker Soikko. Tmispell-voikko also contains an Enchant provider plugin for Enchant version 1.3. Tmispell-voikko is deprecated and not actively developed anymore. Developers using ispell to add spell checking capability in their applications should consider switching to Enchant instead.
Joukahainen
Joukahainen is our web application used to maintain the vocabulary. Joukahainen is designed to store and provide vocabulary data in an application independent format which should make it easier to use and experiment with the data outside the Voikko project.
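
The result caching mentioned in the libvoikko entry above can be illustrated with a small sketch. The source does not describe how libvoikko implements its cache, so the code below is only an assumption-laden illustration of the general idea: a bounded cache placed in front of an expensive spell check. The names `make_cached_speller` and `slow_check`, the word set and the cache size are all hypothetical and are not part of libvoikko.

```python
from functools import lru_cache
from typing import Callable


def make_cached_speller(
    check_word: Callable[[str], bool], max_entries: int = 4096
) -> Callable[[str], bool]:
    """Wrap an expensive spell-check function with a bounded LRU cache."""

    @lru_cache(maxsize=max_entries)
    def cached(word: str) -> bool:
        return check_word(word)

    return cached


# Hypothetical stand-in for a slow morphological analysis; it records every
# word that actually reaches the underlying checker.
calls = []


def slow_check(word: str) -> bool:
    calls.append(word)
    return word in {"kissa", "koira", "talo"}


spell = make_cached_speller(slow_check)
results = [spell(w) for w in ["kissa", "kisssa", "kissa", "talo", "kissa"]]

print(results)  # one boolean per checked word
print(calls)    # only the words that missed the cache
```

Common words repeat often in running text, so even a small cache of this kind can avoid a large share of the full analyses.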

The upstream version of Malaga by Björn Beutel is used to compile and debug Suomi-malaga. The Malaga implementation within libvoikko is also based on the original Malaga, but it has been modified to make it more suitable for our purposes.

The reasons for using Malaga instead of Hunspell

This project started in late 2005 under the name Hunspell-fi, with the aim of creating Finnish vocabulary and affix files for Hunspell. The Hunspell based implementation was developed for roughly six months. There were no serious problems, but it was also evident that the work progressed rather slowly. In early 2006 Hannu Väisänen published Suomi-malaga, which contained a vocabulary that was (depending on how one defines "word") roughly ten times larger than the Hunspell-fi vocabulary at that time. Additionally, the Hunspell-fi implementation did not support compound words and supported only a few derived word forms, whereas Suomi-malaga supported both.

Suomi-malaga had a lot of correctness problems from the spell checking point of view that did not exist in our Finnish Hunspell dictionary, but with the limited resources we had at that time we really could not afford to ignore the huge amount of work that had gone into producing Suomi-malaga. Using the vocabulary of Suomi-malaga in Hunspell was not possible because the two projects classified words with different semantics. It would be somewhat easier now that the data has been moved to Joukahainen and the classification has been modernised.

There are still some problems with the Malaga based approach that might not exist in Hunspell. Malaga is not thread safe (this is going to be fixed within libvoikko), and the performance is sufficient but not great. Writing an accurate Finnish morphology with Malaga is not easy, but there are currently only a few cases (mostly involving inflection within compound words) where no satisfactory solution has been found yet. However, it is unlikely that Hunspell is any better in this regard. The COMPOUNDRULE patterns in Hunspell would make some things easier that are somewhat complicated to do with Malaga, but there are other major limitations in Hunspell (or at least there have been; some may have been fixed in recent versions) that should be considered:

All of the problems above could definitely be solved within Hunspell, but might require a lot of work. Compromising quality just to become compatible with Hunspell is not an option, because Finnish people have come to expect really good results from their spell checkers (we have had advanced compound word checking in commercial text editors for well over ten years).

Currently work is going on to port the Finnish Malaga morphology into finite state form. Voikko already supports some finite state backends for various languages. Once all of our supported formats are finite state and compatible with Hunspell licensing it will be possible to consider merging the code.

How to distribute Voikko

The core parts of Voikko are all hidden behind the public interface of libvoikko, which is designed to be distributed as a shared library and used by any number of applications in the operating system. Our goal is to get the software shipped as a part of various Linux distributions so that Finnish writing aids would work out of the box for anyone who needs them. In the best case users would not even know that they are using Voikko. The source packages released by us should be suitable for easy packaging in different distributions (if not, tell us and we will try to improve them). Just make sure that you package a compatible set of modules.

Note that the interface between libvoikko and Suomi-malaga is currently not considered to be fully stable, although it has remained unchanged for quite a long time. We still have some requirements left that cannot be implemented without changing this interface. We will do our best to make libvoikko handle missing or incorrect versions of Suomi-malaga lexicon files as gracefully as possible. However, we think that binary packages of libvoikko should depend on the binary package of Suomi-malaga (commonly called voikko-fi), since the library is essentially useless without it.

The suggestion above implies that the Enchant provider plugin should not be distributed in the same binary package as the Enchant main library. Otherwise the dependency chain will drag the Suomi-malaga binaries (the largest component of Voikko) onto almost every Linux desktop on the planet, regardless of the installation language. Luckily the provider plugin can easily be shipped in a separate package, since Enchant detects and loads provider plugins at runtime with dlopen. This way no Voikko specific material gets installed on systems where Finnish spell checking is not needed.
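
The reason runtime loading makes separate packaging possible can be shown with a rough sketch. This is not Enchant's actual code. It is a Python illustration of the same pattern, in which a host scans a directory and tries to load whatever shared objects happen to be installed there. The function name, the directory layout and the plugin file name are hypothetical. On POSIX systems `ctypes.CDLL` calls dlopen internally.

```python
import ctypes
import tempfile
from pathlib import Path


def load_provider_plugins(plugin_dir: Path) -> dict:
    """Try to load every shared object found in plugin_dir.

    Returns a mapping from file name to the loaded library handle. Files that
    cannot be loaded are skipped, so a missing or broken plugin never prevents
    the host from starting.
    """
    loaded = {}
    if not plugin_dir.is_dir():
        return loaded
    for candidate in sorted(plugin_dir.glob("*.so")):
        try:
            # ctypes.CDLL uses dlopen() on POSIX systems.
            loaded[candidate.name] = ctypes.CDLL(str(candidate))
        except OSError:
            # Not a loadable shared object: ignore it and carry on.
            continue
    return loaded


# Demonstration with a temporary, hypothetical plugin directory.
with tempfile.TemporaryDirectory() as tmp:
    plugin_dir = Path(tmp)
    # No plugin package installed: the host simply finds nothing.
    empty_result = load_provider_plugins(plugin_dir)
    # A file that is not a valid shared object is skipped without an error.
    (plugin_dir / "hypothetical_voikko_provider.so").write_bytes(b"not a library")
    broken_result = load_provider_plugins(plugin_dir)

# The temporary directory no longer exists here, which is handled as well.
missing_result = load_provider_plugins(plugin_dir / "does-not-exist")
```

Because the host has no link-time dependency on any provider, the package containing the main library does not need to depend on the package containing the plugin.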

We do not have official reference packaging available, but Fedora packages of Voikko follow the guidelines above and could be used as a starting point for packages for other distributions.

To make application integration easier, it would be preferable to have a unified standard interface between linguistic tools and the applications using them. The proposed freedesktop.org Desktop Language Checking Spec is a step in this direction.

For Windows and OS X the packaging may have to be done a bit differently, as neither of them natively supports software packaged this way. However, on OS X one can use similar third party packaging systems such as Fink and MacPorts.