p4perf4ce/pythainlp

Thai Natural Language Processing in Python.

PyThaiNLP

PyThaiNLP is a Python package for text processing and linguistic analysis, similar to NLTK but with a focus on the Thai language.

This document describes the development branch (post 2.0). Things will break.

Capabilities

  • Convenient character and word classes, like Thai consonants (pythainlp.thai_consonants), vowels (pythainlp.thai_vowels), digits (pythainlp.thai_digits), and stop words (pythainlp.corpus.thai_stopwords) -- comparable to constants like string.ascii_letters, string.digits, and string.punctuation
  • Thai word segmentation (word_tokenize), including subword segmentation based on Thai Character Clusters (subword_tokenize)
  • Thai transliteration (transliterate)
  • Thai part-of-speech taggers (pos_tag)
  • Reading numbers out as Thai words (bahttext, num_to_thaiword)
  • Thai collation (sorting in dictionary order) (collate)
  • Fix for text typed with the wrong Thai/English keyboard layout (eng_to_thai, thai_to_eng)
  • Thai spelling suggestion and correction (spell and correct)
  • Thai soundex (soundex) with three engines (lk82, udom83, metasound)
  • Thai WordNet wrapper
  • and much more - see examples in PyThaiNLP Get Started notebook.
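
As a rough illustration of how a few of the functions listed above might be called, here is a minimal sketch. It assumes PyThaiNLP 2.x is installed, that word_tokenize lives in pythainlp.tokenize, and that bahttext, collate, and num_to_thaiword live in pythainlp.util. The helper name quick_tour is made up for this example, and module paths may differ between releases, so check the documentation for your version.

```python
def quick_tour(text, number):
    """Run a few PyThaiNLP functions on sample input (hypothetical helper)."""
    # Imported inside the function: these module paths are an assumption
    # about PyThaiNLP 2.x and are only resolved when the helper is called.
    from pythainlp.tokenize import word_tokenize
    from pythainlp.util import bahttext, collate, num_to_thaiword

    # Segment the text into words with the "newmm" engine.
    words = word_tokenize(text, engine="newmm")
    return {
        "words": words,
        # The same words sorted in Thai dictionary order.
        "sorted_words": collate(words),
        # The number read out as Thai words.
        "number_in_words": num_to_thaiword(number),
        # The number read out as an amount in baht.
        "baht_text": bahttext(number),
    }
```

Calling quick_tour("ฉันรักภาษาไทย", 120) should return a dictionary with the segmented words, the same words in dictionary order, and the number read out in Thai words and as baht text.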

Installation

PyThaiNLP uses PyPI as its main distribution channel; see https://pypi.org/project/pythainlp/

Stable release

Standard installation:

$ pip install pythainlp

Development release:

$ pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip

Some advanced features, such as word vectors, may need extra packages. Install them by passing these options to pip install:

$ pip install pythainlp[extra1,extra2,...]

where extras can be

  • artagger (to support artagger part-of-speech tagger)*
  • deepcut (to support deepcut machine-learnt tokenizer)
  • icu (for ICU, International Components for Unicode, support in transliteration and tokenization)
  • ipa (for IPA, International Phonetic Alphabet, support in transliteration)
  • ml (to support fastai 1.0.22 ULMFiT models)
  • ner (for named-entity recognizer)
  • thai2fit (for Thai word vector)
  • thai2rom (for machine-learnt romanization)
  • full (install everything)

* Note: the standard artagger package from PyPI will not work on Windows; please pip install https://github.com/wannaphongcom/artagger/tarball/master#egg=artagger instead.

See extras and extras_require in setup.py for package details.
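
Because the extras are optional, code that depends on one of them may want to check that the underlying module is importable before using it. The sketch below uses only the standard library. The helper name missing_modules is hypothetical, and the assumption that an extra such as deepcut installs an importable module of the same name should be checked against setup.py.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Module names are illustrative; an extra may install differently named modules.
missing = missing_modules(["pythainlp", "deepcut"])
if missing:
    print("Not installed:", ", ".join(missing))
```

Only top-level names are passed in here, since looking up a dotted name imports its parent package first.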

Documentation

See https://thainlp.org/pythainlp/docs/2.0/

License

PyThaiNLP code is released under the Apache License 2.0.

Contribute to PyThaiNLP

Please do fork and create a pull request :)

For style guide and other information, including references to algorithms we use, please refer to our contributing page.

ภาษาไทย (Thai)

Thai language processing in Python

PyThaiNLP is a Python library for natural language processing, with a focus on supporting the Thai language. It is distributed free of charge (forever), for Thai people and for everyone in the world!

Because the world keeps moving forward through sharing.

This document is for the development version and may change at any time.

  • The latest stable release is 2.0.5
  • PyThaiNLP 2 supports Python 3.6 and above
  • Users of Python 2.7+ can still use PyThaiNLP 1.6

📫 Follow news and updates on Facebook: Pythainlp

Capabilities

  • Convenient sets of Thai character and word constants, such as consonants (pythainlp.thai_consonants), vowels (pythainlp.thai_vowels), Thai digits (pythainlp.thai_digits), and stop words (pythainlp.corpus.thai_stopwords) -- comparable to constants like string.ascii_letters, string.digits, and string.punctuation
  • Thai word segmentation (word_tokenize), with support for subword segmentation using Thai Character Clusters (subword_tokenize)
  • Transliteration of Thai into Latin script and phonetic symbols (transliterate)
  • Thai part-of-speech tagging (pos_tag)
  • Reading numbers out as Thai text (bahttext, num_to_thaiword)
  • Sorting words in Thai dictionary order (collate)
  • Fix for text typed without switching the keyboard language (eng_to_thai, thai_to_eng)
  • Thai spell checking (spell, correct)
  • Thai soundex (soundex) with three methods (lk82, udom83, metasound)
  • Thai WordNet wrapper
  • and more; see examples in the PyThaiNLP Get Started notebook
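
To illustrate why dictionary-order sorting needs more than a plain code point sort, here is a simplified standard-library sketch. Thai leading vowels (เ แ โ ใ ไ) are written before the consonant they belong to, so the sort key below swaps each such pair before comparing. The names thai_sort_key and sort_thai are hypothetical, tone marks and other details are ignored, and this is not presented as the algorithm behind collate.

```python
import re

# A leading vowel (U+0E40..U+0E44) followed by a consonant (U+0E01..U+0E2E).
_LEADING_VOWEL_THEN_CONSONANT = re.compile("([\u0e40-\u0e44])([\u0e01-\u0e2e])")


def thai_sort_key(word):
    """Move each leading vowel behind its consonant so the consonant sorts first."""
    return _LEADING_VOWEL_THEN_CONSONANT.sub(r"\2\1", word)


def sort_thai(words):
    """Sort words in approximate Thai dictionary order (simplified sketch)."""
    return sorted(words, key=thai_sort_key)


words = ["ขาว", "เกม", "กา"]
print(sorted(words))     # code point order puts "เกม" last
print(sort_thai(words))  # "เกม" is filed under ก, before "ขาว"
```

With the swapped key, "เกม" is compared as if it began with ก, which is where a Thai dictionary files it.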

Installation

Stable release

$ pip install pythainlp

Development release

$ pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip

Some additional capabilities, such as word vectors, require extra supporting packages. Install them by specifying these options during pip install:

$ pip install pythainlp[extra1,extra2,...]

where extras can be

  • artagger (for the artagger part-of-speech tagger)*
  • deepcut (for the deepcut tokenizer)
  • icu (for transliteration to phonetic symbols and word segmentation with ICU)
  • ipa (for transliteration to the International Phonetic Alphabet (IPA))
  • ml (for ULMFiT model support)
  • ner (for named-entity tagging)
  • thai2fit (for word vectors)
  • thai2rom (for transliteration into Latin script)
  • full (install everything)

* Note: the standard artagger package from PyPI may have text-decoding problems on Windows. Before installing PyThaiNLP, please install the patched artagger with pip install https://github.com/wannaphongcom/artagger/tarball/master#egg=artagger instead.

See extras and extras_require in setup.py for details of the extra packages.

Documentation

See https://thainlp.org/pythainlp/docs/2.0/

License

  • PyThaiNLP code is licensed under the Apache Software License 2.0
  • Corpora and data created by the PyThaiNLP project are licensed under the Creative Commons Attribution-ShareAlike 4.0 International License
  • Other corpora and data that may be distributed with the PyThaiNLP package may use other licenses; please see the Corpus License document

Logo

Designed by วรุตม์ พสุธาดล, from the contest at https://www.facebook.com/groups/408004796247683/permalink/475864542795041/ and https://www.facebook.com/groups/408004796247683/permalink/474262752955220/

Support and contribute

You can contribute to this project by forking it and sending a pull request back.

Languages

Jupyter Notebook 50.4%, Python 48.6%, Makefile 0.6%, Shell 0.3%, Batchfile 0.1%
Apache License 2.0
Created June 19, 2019
Updated October 9, 2019