Transitioning the Finnish dictionary from Malaga to VFST

Libvoikko 4.0 and voikko-fi 2.0 were released on 2015-12-14. In these versions the default dictionary format for the Finnish language was changed, and other configuration changes were made as well. This document provides detailed information about the change and its effects on developers, software distributors and end users.

This document may occasionally be updated with corrections or new information. If you are interested in the latest changes, please look at the GitHub change history.

History, rationale and goals

The first version of Voikko, consisting of libvoikko 1.0 and Suomi-malaga 0.7.1, was released in August 2006 to support spell checking and hyphenation for the Finnish language. Grammar checking was introduced in 2008 (libvoikko 2.0). In 2009 the dictionary format was changed slightly to add support for morphological analysis (libvoikko 2.2).

Since then all of these features have seen gradual improvements, but the limits of the Malaga grammar formalism have been reached. There is no reasonable way to make the Malaga interpreter any faster than it already is. Generating Finnish word forms is not possible either, which has ruled out many interesting applications such as an inflecting thesaurus.

Rewriting the morphology in VFST (Varissuo Finite State Transducer) format was started in 2012. The work has been done in parallel with maintenance work on the Malaga dictionary. Indeed, many useful fixes committed to our Malaga dictionary during the past three years have resulted from this work. The primary goals for the VFST dictionary format were the following:

The benefits of VFST dictionary format

The initial version of Voikko with VFST dictionaries fulfills the goals mentioned above. Without loss of functionality, it offers the following improvements over the Malaga format:

Performance of different dictionary formats with equal coverage just before the release of libvoikko 4.0 and voikko-fi 2.0. Speed measurements are from spell checking a list of 5 700 000 strings from Finnish Wikipedia on an old laptop using a single CPU core. The default configurations for Malaga and VFST are shown in bold. Smaller numbers are better.
Dictionary                                            Time (min:sec)   Size (MB)
Malaga (full functionality)                           9:06             12
Malaga (spelling, grammar and hyphenation)            7:46             7.6
VFST (full functionality)                             4:44             5.8
VFST (spelling, grammar, hyphenation and baseforms)   3:57             3.8
VFST (spelling, grammar and hyphenation)              3:45             1.5
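
To make the comparison easier to read, the figures in the table above can be turned into speed and size ratios. The following is a small illustrative sketch in Python, not part of Voikko itself. The names RESULTS, to_seconds, strings_per_second and compare are made up for this example. The only inputs are the times, sizes and string count reported above.

```python
# Figures copied from the table above: (time as "min:sec", size in MB).
RESULTS = {
    "Malaga (full functionality)": ("9:06", 12.0),
    "Malaga (spelling, grammar and hyphenation)": ("7:46", 7.6),
    "VFST (full functionality)": ("4:44", 5.8),
    "VFST (spelling, grammar, hyphenation and baseforms)": ("3:57", 3.8),
    "VFST (spelling, grammar and hyphenation)": ("3:45", 1.5),
}

# Number of strings in the Finnish Wikipedia test list.
STRINGS_CHECKED = 5_700_000


def to_seconds(min_sec):
    """Convert a "min:sec" string such as "9:06" to whole seconds."""
    minutes, seconds = min_sec.split(":")
    return int(minutes) * 60 + int(seconds)


def strings_per_second(name):
    """Average spell checking throughput for one dictionary."""
    time, _size = RESULTS[name]
    return STRINGS_CHECKED / to_seconds(time)


def compare(baseline, candidate):
    """Return (speedup, size_ratio) of candidate relative to baseline.

    Values above 1 mean the candidate is faster or smaller.
    """
    base_time, base_size = RESULTS[baseline]
    cand_time, cand_size = RESULTS[candidate]
    speedup = to_seconds(base_time) / to_seconds(cand_time)
    size_ratio = base_size / cand_size
    return speedup, size_ratio


if __name__ == "__main__":
    pairs = [
        ("Malaga (full functionality)", "VFST (full functionality)"),
        ("Malaga (spelling, grammar and hyphenation)",
         "VFST (spelling, grammar and hyphenation)"),
    ]
    for baseline, candidate in pairs:
        speedup, size_ratio = compare(baseline, candidate)
        print(f"{candidate}: {speedup:.2f}x faster, "
              f"{size_ratio:.2f}x smaller than {baseline}")
```

With these figures the full-functionality VFST dictionary is roughly 1.9 times as fast as its Malaga counterpart at about half the size. The spelling, grammar and hyphenation variant is about twice as fast and about five times smaller.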

Libvoikko 4.0.1 and 4.0.2 contain additional improvements that make the VFST backend more than 20 % faster.

Programs using libvoikko do not require any changes

Programs that use libvoikko through the C, Java, Python or .NET API do not need to be modified to use the new dictionary format. The spell checking, hyphenation and grammar checking APIs will work as before. Morphological analysis with VFST will produce all of the attributes that were produced with Malaga. However, the following changes may affect some programs:

Default build configuration of libvoikko has changed

The default build configuration of libvoikko 4.0 has been changed. Those who build libvoikko from source code may need to adjust their configurations:

Suomi-malaga is now voikko-fi

The Finnish dictionary source package suomi-malaga has been renamed to voikko-fi. Voikko-fi can be used to build both Malaga and VFST dictionaries for use with libvoikko. Using the VFST format is now recommended where possible.

Removal of Malaga based dictionary format

To support the transition to the VFST dictionary format, libvoikko and voikko-fi supported both dictionary formats for a period of 14 months. Support for the Malaga based dictionary format was removed from the master branch of our Git repository in March 2017. The latest versions supporting both formats were libvoikko 4.1.1 and voikko-fi 2.1.

Questions, bug reports and support requests

If you have comments or need support related to this transition, please send email to Harri Pitkänen (hatapitk@iki.fi) or to the Voikko mailing lists.