NLP tools for East Asian languages
Columns: Name, URL, Language, Purpose, Availability, License, Found, Publication, Comments, Experiences of using the tool

Name: Stanford Word Segmenter
URL: https://nlp.stanford.edu/software/segmenter.shtml
Language: Chinese
Purpose: tokenization
Availability: free download
License: GNU General Public License (GPL)
Publication: Pi-Chuan Chang, Michel Galley and Chris Manning. 2008. Optimizing Chinese Word Segmentation for Machine Translation Performance. In WMT.
Comments: NLTK's tokenize package includes an interface to the Stanford Word Segmenter.

Name: GATE
URL: https://gate.ac.uk/gate/plugins/Lang_Chinese/src/chinese/
Language: Chinese, Korean
Purpose: plug-ins for numerous functions, including gazetteer, information extraction and tokenizer
Availability: free download
License: LGPLv3
Comments: These are plug-ins for the GATE software environment.

Name: USAS
URL: http://phlox.lancs.ac.uk/ucrel/semtagger/chinese
Language: Chinese
Purpose: semantic tagging
Availability: free online
License: none (limited online interface)
Comments: Maximum input of 3000 characters.

Name: TreeTagger
URL: https://vlo.clarin.eu/record?q=chinese&fqType=resourceClass:or&fq=resourceClass:Software&fq=resourceClass:Tool&docId=http_58__47__47_hdl.handle.net_47_11022_47_1007-0000-0000-8E2A-2
Language: Chinese
Purpose: POS tagging, lemmatization
License: ACA (CLARIN academic-use category)
Found: CLARIN VLO

Name: SPPAS
URL: https://vlo.clarin.eu/record?q=chinese&fqType=resourceClass:or&fq=resourceClass:Software&fq=resourceClass:Tool&docId=oai_58_sldr.org_58_sldr000800
Language: Mandarin Chinese, Taiwanese, Cantonese and Japanese
Purpose: audio annotation
Availability: free download
License: PUB (CLARIN public category)
Found: CLARIN VLO

Name: Prozed
URL: https://vlo.clarin.eu/record?q=chinese&fqType=resourceClass:or&fq=resourceClass:Software&fq=resourceClass:Tool&docId=oai_58_sldr.org_58_sldr000778
Language: Chinese
Purpose: prosody editor
Availability: free download
License: CC BY-NC-SA 4.0
Found: CLARIN VLO

Name: Voyant
URL: http://voyant-tools.org/?lang=ja
Language: Japanese
Purpose: analysis, exploration, visualization
Availability: free online
License: none (online service)
Found: http://guides.library.upenn.edu/japanesetext
Comments: Suite of text analysis tools; now works with Japanese, including tokenization.

Name: topic-modelling-tool
URL: https://github.com/senderle/topic-modeling-tool
Language: Japanese
Purpose: topic modelling
Availability: free download
Found: http://guides.library.upenn.edu/japanesetext
Comments: A point-and-click tool for creating and analyzing topic models produced by MALLET.

Name: i2ocr
URL: http://www.i2ocr.com/
Language: Chinese, Japanese, Korean, Thai, Malay, Malayalam, Tagalog
Purpose: OCR
Availability: free online
License: none (online)

Name: Convertio
URL: https://convertio.co/ocr/
Language: Chinese, Japanese, Korean, Thai, Malay, Malayalam, Tagalog
Purpose: OCR
Availability: 10 free pages, then pricing plan
License: none (online)
Experiences of using the tool: Not tested.

Name: KoNLPy
URL: http://konlpy.org
Language: Korean
Purpose: POS tagging, corpus analysis (collocations, chunking, wordclouds)
Availability: free download
License: GPL v3
Publication: Eunjeong L. Park, Sungzoon Cho. “KoNLPy: Korean natural language processing in Python”, Proceedings of the 26th Annual Conference on Human & Cognitive Language Technology, Chuncheon, Korea, Oct 2014.
Comments: Versions exist for Linux, Mac and Windows.

Name: awesome-korean-nlp
URL: https://github.com/insikk/awesome-korean-nlp
Language: Korean
Comments: A curated list of resources dedicated to natural language processing for Korean.

Name: pycantonese
URL: http://pycantonese.org/
Language: Cantonese
Purpose: Python corpus search functions as well as various analytic and annotation tools
Availability: free download
License: Apache License, Version 2.0
Publication: http://pycantonese.org/papers.html
Comments: Also available at https://github.com/pycantonese/pycantonese

Name: AntConc
URL: http://www.laurenceanthony.net/software/antconc/
Language: not language specific
Purpose: text and corpus analysis
Availability: free download
License: custom licence
Comments: Widely used with Japanese and Chinese. See also the full suite of tools at http://www.laurenceanthony.net/software.html

Name: SegmentAnt
URL: http://www.laurenceanthony.net/software/segmentant/
Language: Chinese, Japanese, Korean, Thai, Malay, Malayalam, Tagalog
Purpose: tokenization
Availability: free download
License: custom licence
Comments: Multi-platform (Windows/Mac/Linux). A freeware Japanese and Chinese segmentation (tokenizing) tool. POS tagging tools are under development.

Name: UDPipe
URL: https://lindat.mff.cuni.cz/services/udpipe/?process&model=japanese&data=%27%EF%BC%91%EF%BC%99%EF%BC%92%EF%BC%93%E5%B9%B4%E3%81%AE%E9%96%A2%E6%9D%B1%E5%A4%A7%E9%9C%87%E7%81%BD%E3%81%A7%E8%B5%B7%E3%81%8D%E3%81%9F%E6%9C%9D%E9%AE%AE%E4%BA%BA%E8%99%90%E6%AE%BA%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6%E3%80%81%E5%9B%BD%E6%94%BF%E6%96%B0%E5%85%9A%E3%80%8C%E5%B8%8C%E6%9C%9B%E3%81%AE%E5%85%9A%E3%80%8D%E3%82%92%E7%AB%8B%E3%81%A1%E4%B8%8A%E3%81%92%E3%81%9F%E5%B0%8F%E6%B1%A0%E7%99%BE%E5%90%88%E5%AD%90%E3%83%BB%27
Language: Japanese
Purpose: tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files
Availability: online service with restricted usage conditions
License: UDPipe is free software distributed under the Mozilla Public License 2.0. The linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
Found: LINDAT
Publication: http://ufal.mff.cuni.cz/udpipe#udpipe_acknowledgements
Comments: Appears to be restricted to use with a specific dataset or data type.

Name: On-line Chinese Tools
URL: http://mandarintools.com/
Language: Chinese
Purpose: tokenization, encoding detector, dictionary
Availability: variable; some tools are free to download
Comments: Portal offering links to various reference, pedagogic and processing resources; includes a few relevant pieces of software.

Name: MeCab
URL: http://taku910.github.io/mecab/
Language: Japanese
Purpose: tokenization, morphological analysis, POS tagging
Availability: not known
Experiences of using the tool: Worked well a few years ago, with the help of a Japanese speaker; not tested recently (CLARIN-SI).

Name: ChaSen
URL: http://chasen.naist.jp/hiki/ChaSen/
Language: Japanese
Purpose: tokenization
Availability: not known
License: not known
Comments: Documentation is in Japanese.

Name: Language Grid
URL: http://langrid.org/service_manager/language-services
Language: Japanese
Purpose: web services orchestration
Comments: Portal for a number of NLP web services, mainly based in Japan.

Name: janome
URL: https://github.com/mocobeta/janome
Language: Japanese
Purpose: tokenization, POS tagging
Availability: free download
License: Apache License, Version 2.0
Found: suggestion
Comments: Japanese morphological analysis engine written in pure Python.

Name: SentiStrength
URL: http://sentistrength.wlv.ac.uk/#Non-English
Language: Chinese
Purpose: sentiment analysis
Availability: free online
License: none (online)
Comments: Requires pre-tokenization with spaces.

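SentiStrength's requirement for pre-tokenization with spaces means that unsegmented Chinese text has to be split into words before it is submitted. The following is a minimal sketch of that preparation step, assuming a greedy forward maximum-matching strategy and a tiny hypothetical lexicon (`LEXICON`); it is not the method used by any of the tools listed above, which rely on large dictionaries or trained models.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest lexicon entry that matches;
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"我们", "喜欢", "自然", "语言", "自然语言", "处理"}


def pre_tokenize(text, lexicon=LEXICON):
    """Return the text with tokens separated by single spaces."""
    return " ".join(segment(text, lexicon))


print(pre_tokenize("我们喜欢自然语言处理"))  # 我们 喜欢 自然语言 处理
```

In practice the `segment` function would be replaced by a call to one of the segmenters in the table; only the final space-joining step is specific to the SentiStrength requirement.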
Glossary
encoding detector: recognizing the character set and encoding used for a text
gazetteer: geographical index or dictionary, used to help identify and resolve place names in texts
lemmatization: identifying the headword for inflected word forms in the text, e.g. 'go' for 'went'
morphological analysis: identifying the building blocks from which words are built (e.g. 'mean-ing-ful')
named entity recognition: identifying references to things like people, places, dates and events in texts
NLP: 'natural language processing', an umbrella term for software used to process language (includes all of the terms in this glossary)
OCR: 'optical character recognition', extracting electronic text from images, especially from scanned images of printed texts
parsing: assigning syntactic structure to text
POS tagging: assigning word classes or 'part-of-speech' (POS) tags to words, e.g. noun, verb, etc.
semantic tagging: assigning classifications relating to meaning, often relating to semantic fields, e.g. 'food and drink', 'the natural world', 'emotions'
sentiment analysis: identifying opinions or attitudes in texts, usually 'good' or 'bad' sentiments towards a particular topic or concept
tokenization: identifying the boundaries of words and other relevant linguistic units, also sometimes known as segmentation
topic modelling: an automatic process to identify topics or themes in a text
segmentation: see tokenization
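
To make the 'encoding detector' entry concrete, here is a minimal sketch of the simplest possible approach: trying a fixed list of candidate encodings and returning the first one that decodes the bytes without error. The candidate list and its order are assumptions chosen for illustration. Real detectors use statistical evidence, because the same bytes are often valid in several East Asian encodings and a clean decode does not prove the guess is right.

```python
# Candidate encodings, tried in order (an assumption for this sketch).
# UTF-8 comes first because it is strict and rarely matches by accident.
CANDIDATES = ("utf-8", "shift_jis", "euc_kr", "gb18030")


def guess_encoding(data, candidates=CANDIDATES):
    """Return the first candidate that decodes `data` cleanly, or None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


sample = "日本語".encode("shift_jis")
print(guess_encoding(sample))  # shift_jis
```

Because the result depends on the order of `CANDIDATES`, this kind of check is best treated as a first filter rather than a reliable identification.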