languagemodel-slavic
Binaries are available from Maven Central.
Please follow this link for project documentation.
Stemming engines available
Hunspell
Dictionary locations:
- /usr/share/hunspell
- /usr/local/share/hunspell
On Mac OS X, which also relies on hunspell for spell checking purposes, additional dictionary locations can be examined:
- /System/Library/Spelling
- /Library/Spelling
- ~/Spelling
- /opt/local/share/hunspell (if MacPorts is installed)
- /sw/share/hunspell (if Fink is installed)
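
The lookup over these directories could be scripted roughly as follows. This is a minimal sketch, not part of the project: the function name `find_dictionaries` is hypothetical, the directory list is copied from the locations above, and it assumes a usable dictionary is a `.aff` file with a matching `.dic` file next to it.

```python
from pathlib import Path

# Candidate directories copied from the lists above; adjust for your system.
CANDIDATE_DIRS = [
    "/usr/share/hunspell",
    "/usr/local/share/hunspell",
    "/System/Library/Spelling",
    "/Library/Spelling",
    "~/Spelling",
    "/opt/local/share/hunspell",  # MacPorts
    "/sw/share/hunspell",         # Fink
]


def find_dictionaries(dirs=CANDIDATE_DIRS):
    """Return {name: (aff_path, dic_path)} for every complete .aff/.dic pair."""
    found = {}
    for d in dirs:
        base = Path(d).expanduser()
        if not base.is_dir():
            continue  # directory is absent on this system
        for aff in sorted(base.glob("*.aff")):
            dic = aff.with_suffix(".dic")
            # Assumption: both files are required; earlier directories win.
            if dic.is_file() and aff.stem not in found:
                found[aff.stem] = (aff, dic)
    return found


if __name__ == "__main__":
    for name, (aff, dic) in sorted(find_dictionaries().items()):
        print(name, aff, dic)
```

Directories earlier in the list take precedence here; that ordering is a choice made for the sketch, not documented hunspell behaviour.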
Morphological analysis
If the dictionaries provide such information, hunspell can also perform morphological analysis; see the hunspell(4) man page, section "Optional data fields".
Java API status
Two projects are available: one is based on JNA, the other on BridJ.
Mac OS X status
Although Mac OS X relies on hunspell for spell checking and supports third-party hunspell dictionaries, its Objective-C API supports neither stemming nor morphological analysis (see the NSSpellChecker class reference).
Mac OS X Java API
"Mac OS X for Java Geeks" (Chapter 11, "The Mac OS X Spelling Framework") refers to the com.apple.spell.ui Java package, but the book was published in 2003 and covers Mac OS X 10.2 and JDK 1.4. The package is missing from the Mac OS X 10.9 distribution. The Apple-shipped Java packages are instead:
- apple.applescript
- apple.awt
- apple.keychain (JDK 1.4 only)
- apple.laf
- apple.launcher
- apple.security
- apple.util
- com.apple.concurrent
- com.apple.crypto
- com.apple.dnssd
- com.apple.eawt
- com.apple.eio
- com.apple.java
- com.apple.jobjc (particularly, contains com.apple.jobjc.appkit.NSSpellChecker and com.apple.jobjc.foundation.NSSpellServer classes)
- com.apple.laf
- com.apple.mrj
- com.apple.resources
seman by aot.ru
Stemming and morphological analysis (Linux)
$ for w in 'друг' 'друзья' 'люди' 'какая'; do echo $w; done | iconv -t CP1251 | ./TestLem russian | iconv -f CP1251
Loading..
Input a word..
+ ДРУГ С од мр,им,ед 147889 ДРУ'Г
+ ДРУГ С од мр,им,мн 147889 ДРУЗЬЯ'
+ ЧЕЛОВЕК С од мр,им,мн 135031 ЛЮ'ДИ
+ КАКАТЬ ДЕЕПРИЧАСТИЕ нп,нс дст,нст 151931 КА'КАЯ + КАКОЙ МС-П но,од,жр,им,ед 148987 КАКА'Я
Syntax analysis (Linux)
$ echo 'Варкалось, хливкие шорьки пырялись по наве' | iconv -t CP1251 | ./TestSynan russian | iconv -f CP1251
ok
sentences count: 1
sentences count: 1
<chunk>
<input>Варкалось, хливкие шорьки пырялись по наве</input>
<sent>
<synvar>
<clause type="ГЛ_ЛИЧН">Варкалось , хливкие шорьки пырялись по наве</clause>
<group type="ПРИЛ_СУЩ">хливкие шорьки</group>
<group type="ОДНОР_ИГ">Варкалось , хливкие шорьки</group>
<group type="ПГ">по наве</group>
</synvar>
<rel name="ПРИЛ_СУЩ" gramrel="вн,им,мн," lemmprnt="ШОРЕК" grmprnt="но,мр,вн,им,мн," lemmchld="ХЛИВКИЙ" grmchld="но,од,вн,им,мн," > шорьки -> хливкие </rel>
<rel name="ПГ" gramrel="пр," lemmprnt="ПО" grmprnt="" lemmchld="НАВ" grmchld="но,мр,пр,ед," > по -> наве </rel>
<rel name="ОДНОР_ИГ" gramrel="вн,им,мн," lemmprnt="," grmprnt="" lemmchld="ВАРКАЛОСЬ" grmchld="но,ср,жр,мр,пр,тв,вн,дт,рд,им,ед,мн," > , -> Варкалось </rel>
<rel name="ОДНОР_ИГ" gramrel="вн,им,мн," lemmprnt="," grmprnt="" lemmchld="ШОРЕК" grmchld="но,мр,вн,им,мн," > , -> шорьки </rel>
<rel name="ПОДЛ" gramrel="" lemmprnt="ПЫРЯТЬСЯ" grmprnt="дст,нп,нс,прш,мн," lemmchld="ВАРКАЛОСЬ" grmchld="но,ср,жр,мр,пр,тв,вн,дт,рд,им,ед,мн," > пырялись -> Варкалось </rel>
</sent>
</chunk>
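
The XML-like part of this output can be post-processed with a standard XML parser. The sketch below is only an illustration: the function name `extract_relations` is hypothetical, and it assumes the output is well-formed XML once the leading status lines are skipped, as in the sample above.

```python
import xml.etree.ElementTree as ET


def extract_relations(output):
    """Return (name, parent_lemma, child_lemma) for each <rel> element."""
    # Skip the status lines ("ok", "sentences count: ...") before the XML.
    start = output.index("<chunk>")
    end = output.rindex("</chunk>") + len("</chunk>")
    # Wrap in a synthetic root so that several <chunk> elements still parse.
    root = ET.fromstring("<root>" + output[start:end] + "</root>")
    return [
        (rel.get("name"), rel.get("lemmprnt"), rel.get("lemmchld"))
        for rel in root.iter("rel")
    ]
```

Applied to the output above, this would yield one tuple per `<rel>` line, built from the `name`, `lemmprnt` and `lemmchld` attributes.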
mystem by Yandex
Setting up
Version 2.1 for Mac OS X is linked incorrectly against /usr/local/Cellar/gcc47/4.7.2/gcc/lib/libstdc++.6.dylib:
$ otool -L mystem
mystem:
/usr/local/Cellar/gcc47/4.7.2/gcc/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.17.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)
and dumps a core when run. The problem can be fixed with install_name_tool:
$ install_name_tool -change /usr/local/Cellar/gcc47/4.7.2/gcc/lib/libstdc++.6.dylib /usr/lib/libstdc++.6.dylib mystem
$ otool -L mystem
mystem:
/usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.17.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)
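
The check and the fix above could be automated along these lines. This is a rough sketch under stated assumptions: the helper names are hypothetical, the `otool -L` line format is taken from the listing above, and any path outside `/usr/lib/` or `/System/Library/` is treated as suspicious, which is a simplification.

```python
import posixpath
import re

# Matches one dependency line of `otool -L`, e.g.
#   /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)
DEP_LINE = re.compile(r"^\s*(\S.*?)\s+\(compatibility version .*\)\s*$")

# Assumption: libraries under these prefixes ship with the OS.
SYSTEM_PREFIXES = ("/usr/lib/", "/System/Library/")


def non_system_dependencies(otool_output):
    """Return library paths from `otool -L` output that are not system paths."""
    deps = []
    for line in otool_output.splitlines():
        m = DEP_LINE.match(line)
        if m and not m.group(1).startswith(SYSTEM_PREFIXES):
            deps.append(m.group(1))
    return deps


def fix_command(binary, old_path, lib_dir="/usr/lib"):
    """Build the install_name_tool argument list that repoints old_path."""
    new_path = posixpath.join(lib_dir, posixpath.basename(old_path))
    return ["install_name_tool", "-change", old_path, new_path, binary]
```

The argument list returned by `fix_command` can be passed to `subprocess.run` on a Mac; the sketch does not verify that the replacement library actually exists in `lib_dir`.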
Invocation example:
$ echo -e 'какая\nдрузья\nлюди\nваркалось\nхливкие\nшорьки\nглокая\nкуздра' | ./mystem -n -e utf-8 -i -l
какать=V,несов,нп=непрош,деепр|какой=APRO=им,ед,жен
друг=S,муж,од=им,мн
человек=S,муж,од=им,мн
варкаться?=V,несов,нп=прош,ед,изъяв,сред
хливкий?=A=им,мн,полн|?=A=вин,мн,полн,неод
шорька?=S,жен,неод=им,мн|?=S,жен,неод=род,ед|?=S,жен,неод=вин,мн
глокать?=V,несов,нп=непрош,деепр|глокий?=A=им,ед,полн,жен
куздра?=S,ед,жен,неод=им|куздра?=S,гео,жен,неод=им,ед
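
Output in this shape is easy to split into structured records. The parser below is a minimal sketch based only on the sample lines above; the function name and the dictionary keys are hypothetical. It assumes that `|` separates alternative analyses, that `=` separates the lemma, the lexical grammemes and the form grammemes, that a trailing `?` marks a lemma guessed for an out-of-dictionary word, and that an alternative with an empty lemma repeats the previous one.

```python
def parse_mystem_line(line):
    """Split one line of `mystem -n -i -l` output into analysis records."""
    analyses = []
    previous_lemma = None
    for alternative in line.strip().split("|"):
        if not alternative:
            continue
        parts = alternative.split("=")
        raw_lemma = parts[0]
        # Assumption: "?" flags a hypothesis for a word not in the dictionary.
        guessed = raw_lemma.endswith("?")
        # Assumption: an empty lemma means "same lemma as the previous one".
        lemma = raw_lemma.rstrip("?") or previous_lemma
        previous_lemma = lemma
        lexical = parts[1].split(",") if len(parts) > 1 and parts[1] else []
        form = parts[2].split(",") if len(parts) > 2 and parts[2] else []
        analyses.append({
            "lemma": lemma,
            "guessed": guessed,
            "pos": lexical[0] if lexical else None,  # first tag, e.g. S, V, A
            "lexical": lexical[1:],                  # remaining lexeme-level tags
            "form": form,                            # tags of this word form
        })
    return analyses
```

For example, `parse_mystem_line("друг=S,муж,од=им,мн")` gives a single record with lemma `друг`, part of speech `S`, lexical tags `муж`, `од` and form tags `им`, `мн`.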
Apache Lucene
Apache Lucene contains a Java port of the C++ hunspell API; see the API documentation.
LanguageTool
Feature comparison
Human languages support
| Product | Russian | Ukrainian | English | German | Morphological Analysis | Syntax Analysis |
|---|---|---|---|---|---|---|
| hunspell | yes | yes | yes | yes | yes (if supported by dictionaries) | no |
| seman | yes | no | yes | yes | yes | yes |
| mystem | yes | no | no | no | yes | no |
| LanguageTool | yes | yes | yes | yes | yes | no |
| Lucene | ? | ? | ? | ? | ? | no |
Programming languages support
| Product | C++ | Java |
|---|---|---|
| hunspell | yes | yes |
| seman | yes | no |
| mystem | yes | no |
| LanguageTool | no | yes |
| Lucene | no | yes |
OS support
| Product | Windows | Linux | Mac OS X |
|---|---|---|---|
| hunspell | yes | yes | yes |
| seman | yes | yes | no |
| mystem | yes | yes | yes |
| LanguageTool | yes | yes | yes |
| Lucene | yes | yes | yes |
License
| Product | License | Can be distributed with Caché? |
|---|---|---|
| hunspell | GPL/LGPL/MPL | yes |
| seman | LGPL | yes |
| mystem | non-commercial | no |
| LanguageTool | LGPL | yes |
| Lucene | Apache License | yes |
Native (C++) implementation
For a C++ implementation, it is possible to link against hunspell, seman, or mystem and return the results of morphological analysis as a JSON object using Boost Property Tree.
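
One possible shape for such a JSON object is sketched below, in Python for brevity rather than in C++ with Boost Property Tree. The function name and all field names (`word`, `analyses`, `lemma`, `pos`, `grammemes`) are hypothetical; the source does not define a schema.

```python
import json


def analyses_to_json(word, analyses):
    """Serialize the analyses of one word as a JSON object (hypothetical layout)."""
    document = {"word": word, "analyses": analyses}
    # ensure_ascii=False keeps Cyrillic readable instead of \uXXXX escapes.
    return json.dumps(document, ensure_ascii=False, sort_keys=True)


if __name__ == "__main__":
    print(analyses_to_json("друзья", [
        {"lemma": "друг", "pos": "S", "grammemes": ["им", "мн"]},
    ]))
```

A C++ implementation would build the equivalent tree in whichever engine wrapper is used and serialize it there; only the layout is illustrated here.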
The ZWARRAYP type can be used to pass strings to and from Caché.
