Voikko
Transitioning the Finnish dictionary from Malaga to VFST
Libvoikko 4.0 and voikko-fi 2.0 were released on 2015-12-14. These versions change the default dictionary
format for the Finnish language and make other configuration changes as well. This
document describes the change in detail and explains its effects on developers,
software distributors and end users.
This document may occasionally be updated with corrections or new information.
If you are interested in the latest changes, please see the
GitHub change history.
History, rationale and goals
The first version of Voikko, consisting of libvoikko 1.0 and Suomi-malaga 0.7.1, was released
in August 2006 to support spell checking and hyphenation for the Finnish language. Grammar checking
was introduced in 2008 (libvoikko 2.0). In 2009 the dictionary format was changed slightly
to add support for morphological analysis (libvoikko 2.2).
Since then all of
these features have improved gradually, but the limits of the Malaga grammar formalism
have been reached. There is no reasonable way to make the Malaga interpreter
any faster than it already is, and it cannot generate Finnish word forms,
which has made many interesting applications, such as an inflecting thesaurus, impossible.
Rewriting the morphology in VFST (Varissuo Finite State Transducer) format was started in 2012.
The work has been done in parallel with maintenance work on the Malaga dictionary. Indeed, many
useful fixes committed to our Malaga dictionary during the past three years have
resulted from this work. The primary goals for the VFST dictionary format were the following:
- The format should be usable for algorithms that generate word forms.
- Performance of existing functionality (spell checking, hyphenation and morphological analysis)
should increase.
- The tools needed for compiling the dictionary should be easily available and actively maintained
(the Malaga software has not been developed since about 2008).
- The dictionary format should be platform independent (the Malaga format depends on platform
endianness).
- Regressions in features or memory use are not allowed.
The benefits of VFST dictionary format
The initial version of Voikko with VFST dictionaries fulfills the goals mentioned above. Without loss
of functionality it offers the following improvements over the Malaga format:
- The tools needed for building the dictionary are actively maintained. voikkovfstc is
developed as a part of libvoikko, and Foma had its latest release in 2015. Additionally,
it is relatively easy to replace Foma with another Xerox lexc-compatible compiler if necessary.
- The VFST dictionary format is platform independent. The dictionary data can be directly
mapped into process memory on little-endian architectures. For the (rare) cases where this optimisation
is needed on a big-endian system, voikkovfstc can also produce big-endian optimised dictionaries.
- Processing with the VFST format is about twice as fast as it was with the Malaga format. Additionally,
there are some unexplored optimisation opportunities that may lead to further improvements in future releases.
- The size of the dictionary file, both on disk and in memory, is roughly half that of an equivalent Malaga dictionary.
- It is possible to transform the dictionary programmatically using regular expression operations.
- The format is suitable for word form generation and some specialised uses in grammar checking.
Performance of different dictionary formats with equal coverage just before the release of libvoikko 4.0 and voikko-fi 2.0.
Speed measurements are from spell checking a list of 5 700 000 strings from Finnish Wikipedia using an
old laptop and single CPU core. The default configurations for Malaga and VFST are shown in bold.
Smaller numbers are better.
Dictionary | Time (min:sec) | Size (MB) |
Malaga (full functionality) | 9:06 | 12 |
Malaga (spelling, grammar and hyphenation) | 7:46 | 7.6 |
VFST (full functionality) | 4:44 | 5.8 |
VFST (spelling, grammar, hyphenation and baseforms) | 3:57 | 3.8 |
VFST (spelling, grammar and hyphenation) | 3:45 | 1.5 |
Libvoikko 4.0.1 and 4.0.2 contain further improvements that increase the speed of the VFST backend by more than 20 %.
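The "about twice as fast" and "roughly half the size" claims can be checked against the table with a little arithmetic. The following Python sketch only reuses figures from the table; the helper names (to_seconds, ratios) are made up for this illustration.

```python
# Figures copied from the table above: (time as "min:sec", size in MB).
RESULTS = {
    "Malaga (full functionality)": ("9:06", 12.0),
    "Malaga (spelling, grammar and hyphenation)": ("7:46", 7.6),
    "VFST (full functionality)": ("4:44", 5.8),
    "VFST (spelling, grammar and hyphenation)": ("3:45", 1.5),
}


def to_seconds(min_sec):
    """Convert a "min:sec" string such as "9:06" to seconds."""
    minutes, seconds = min_sec.split(":")
    return int(minutes) * 60 + int(seconds)


def ratios(malaga_key, vfst_key):
    """Return (speed-up factor, size reduction factor) of VFST over Malaga."""
    m_time, m_size = RESULTS[malaga_key]
    v_time, v_size = RESULTS[vfst_key]
    return to_seconds(m_time) / to_seconds(v_time), m_size / v_size


# Compare configurations that offer the same functionality.
full = ratios("Malaga (full functionality)", "VFST (full functionality)")
basic = ratios(
    "Malaga (spelling, grammar and hyphenation)",
    "VFST (spelling, grammar and hyphenation)",
)
print(f"full functionality: {full[0]:.2f}x faster, {full[1]:.2f}x smaller")
print(f"spelling, grammar and hyphenation: {basic[0]:.2f}x faster, {basic[1]:.2f}x smaller")
```

For both pairs the speed-up comes out close to a factor of two. The size reduction is about a factor of two for full functionality and considerably larger for the spelling, grammar and hyphenation configuration.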
Programs using libvoikko do not require any changes
Programs that use libvoikko through the C, Java, Python or .NET API do not need to be modified to use
the new dictionary format. The spell checking, hyphenation and grammar checking APIs will work as before.
Morphological analysis with VFST will produce all of the attributes that were produced with Malaga.
However, the following changes may affect some programs:
- The baseline of attributes that are expected to be returned in all dictionary configurations
will include the following attributes that were not part of the baseline Malaga dictionary:
- TENSE
- KYSYMYSLIITE
- POSSESSIVE
Programs that misbehave when they encounter attributes they do not know may need to be fixed
so that they tolerate such attributes.
- BASEFORM is now produced in the standard dictionary configuration but may still be omitted in
dictionaries optimised for low memory use.
- WORDBASES (and WORDIDS) will return a more detailed chain of derivation for some words. For example,
WORDBASES for the word kanavoitumisen is now +kanavoi(kanavoida)+tua(+tua) instead of
+kanavoitua(kanavoida). The syntax of these attributes is still the same as before.
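As an illustration of how a program might cope with these changes, here is a minimal Python sketch. It does not call libvoikko. It works on plain strings and dictionaries, and the function names (parse_wordbases, pick_attributes) are hypothetical. The parser handles only the +surface(base) segment shape seen in the example above; whether other segment shapes occur is not covered here.

```python
import re

# One "+surface(base)" segment, as in "+kanavoi(kanavoida)+tua(+tua)".
# Assumption: only this shape is handled; anything else is skipped.
_SEGMENT = re.compile(r"\+([^+(]+)\(([^)]*)\)")


def parse_wordbases(value):
    """Split a WORDBASES string into a list of (surface, base) pairs."""
    return _SEGMENT.findall(value)


def pick_attributes(analysis, wanted):
    """Return only the wanted attributes that are present in the analysis.

    Attributes the caller does not know about are ignored instead of being
    treated as errors, and wanted attributes that are absent are left out.
    """
    return {name: analysis[name] for name in wanted if name in analysis}


old = parse_wordbases("+kanavoitua(kanavoida)")
new = parse_wordbases("+kanavoi(kanavoida)+tua(+tua)")
print(old)
print(new)

# Hypothetical analysis result: the values are placeholders, not real output.
sample = {"BASEFORM": "placeholder", "TENSE": "placeholder", "SOMETHING_NEW": "placeholder"}
print(pick_attributes(sample, ["BASEFORM", "CLASS"]))
```

Code written this way keeps working whether a word yields one derivation segment or several, and whether or not a dictionary returns attributes beyond those the program asked for.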
Default build configuration of libvoikko has changed
The default build configuration of libvoikko 4.0 has been changed. Those who build libvoikko from source
code may need to adjust their configurations:
- Previously only the Malaga backend (dictionary format 2) was enabled by default. Now HFST (dictionary format 3)
and Finnish VFST (dictionary format 5) are enabled instead.
- HFST (dictionary format 3) requires the hfstospell library. It can be disabled with --disable-hfst.
- Finnish VFST (dictionary format 5) can be disabled with --disable-vfst.
- The VFST compiler tool voikkovfstc is now built by default. It can be disabled with --disable-buildtools.
- Some VFST-related features are still experimental. They remain disabled by default, but the switch to enable them has been renamed.
It is now called --enable-expvfst.
- If you still need support for Malaga (dictionary format 2), please use libvoikko 4.1.1 and compile with --enable-malaga.
- The default dictionary path, used as a last-resort location to look for dictionary files, is no longer set by default.
This choice was made because such a default path was only ever really useful on Unix systems. It is still recommended to set a default dictionary
path on systems where it is useful: --with-dictionary-path=/usr/lib/voikko
- In the default build configuration it is now possible to use and distribute the libvoikko library under the MPL 1.1 / GPL 2+ / LGPL 2.1+ tri-license.
This license offers some additional flexibility over the plain "GPL 2 or later" license that was required by the Malaga backend. The new license is
the same as the one used by the Hunspell spell checker library. Please note that the Finnish VFST dictionary is still only available under the GPL license.
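The switches listed above can be combined on a single configure command line. The following Python sketch assembles such a command from the options mentioned in this section. It assumes the usual Autotools ./configure entry point, and the function name configure_args is made up for this example. It only prints the command and does not run it.

```python
import shlex


def configure_args(hfst=True, vfst=True, buildtools=True,
                   experimental_vfst=False, dictionary_path=None):
    """Build a ./configure argument list from the switches described above.

    The keyword defaults mirror the libvoikko 4.0 defaults listed in the
    text, so calling this without arguments yields a bare "./configure".
    """
    args = ["./configure"]
    if not hfst:
        args.append("--disable-hfst")
    if not vfst:
        args.append("--disable-vfst")
    if not buildtools:
        args.append("--disable-buildtools")
    if experimental_vfst:
        args.append("--enable-expvfst")
    if dictionary_path is not None:
        args.append(f"--with-dictionary-path={dictionary_path}")
    return args


# Example: a build without the hfstospell dependency and with a
# last-resort dictionary path for a Unix system.
cmd = configure_args(hfst=False, dictionary_path="/usr/lib/voikko")
print(shlex.join(cmd))
```

The --enable-malaga switch is left out on purpose, since according to the text it applies only to libvoikko 4.1.1.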
Suomi-malaga is now voikko-fi
The Finnish dictionary source package suomi-malaga has been renamed to voikko-fi. Voikko-fi can be used to
build both Malaga and VFST dictionaries for use with libvoikko. It is now recommended to use VFST format
if possible.
- Building the VFST dictionaries requires Foma. For those who prefer building with an
Autotools-based build system, the autotools2 branch of Foma is also available.
You only need the foma tool; shared libraries and development headers are not needed.
- Building VFST dictionaries requires the voikkovfstc tool, which is built as a part of libvoikko. Thus you now need to build and
install libvoikko before building voikko-fi (previously the order was not significant).
- Python 3 has replaced Python 2 as the build scripting tool. Python 3 is needed even if you build only Malaga dictionaries.
- Commands to build and install the VFST dictionaries are make vvfst and make vvfst-install.
- The VFST dictionaries will produce BASEFORM attributes by default for morphological analysis. We recommend that dictionaries
built for general use do not disable this feature. It is possible to make the dictionaries more compact by compiling with the option
VVFST_BASEFORMS=no. You can do this when you know that the baseforms will not be used (for example, in embedded dictionaries produced just for
spell checking). Such dictionaries will still provide baseforms but they will be incorrect.
- Some rarely used build options are not supported with the VFST format and are thereby deprecated. The support may be restored in
future releases if demand for these options still exists. Please see the README in the source distribution for detailed information.
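The prerequisites and build steps above can be summarised in a short Python sketch that checks for the required tools before printing the make commands. The helper names are hypothetical. The executable name python3 is an assumption that may not hold on every platform, and so is passing VVFST_BASEFORMS=no as a variable on the make command line; the README is the authoritative source for the exact usage.

```python
import shutil

# Tools the list above names as required for building VFST dictionaries.
# Assumption: the Python 3 interpreter is installed under the name "python3".
REQUIRED_TOOLS = ("foma", "voikkovfstc", "python3")


def missing_tools(lookup=shutil.which):
    """Return the required tools that `lookup` cannot find on PATH."""
    return [tool for tool in REQUIRED_TOOLS if lookup(tool) is None]


def build_commands(baseforms=True):
    """Return the make invocations for building and installing VFST dictionaries.

    Assumption: VVFST_BASEFORMS=no is given as a make variable on the
    command line of the build step.
    """
    build = ["make", "vvfst"]
    if not baseforms:
        build.append("VVFST_BASEFORMS=no")
    return [build, ["make", "vvfst-install"]]


missing = missing_tools()
if missing:
    # voikkovfstc is missing until libvoikko has been built and installed.
    print("Install these first:", ", ".join(missing))
else:
    for command in build_commands():
        print(" ".join(command))
```

Checking for voikkovfstc first reflects the new build order: libvoikko has to be built and installed before voikko-fi.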
Removal of Malaga based dictionary format
To support the transition to the VFST dictionary format, libvoikko and voikko-fi supported both dictionary formats for a period of 14 months.
Support for the Malaga-based dictionary format was removed from the master branch of our Git repository in March 2017. The latest versions
supporting both formats were libvoikko 4.1.1 and voikko-fi 2.1.
Questions, bug reports and support requests
If you have comments or need support related to this transition please send email to Harri Pitkänen (hatapitk@iki.fi)
or the Voikko mailing lists.