solthiruthi package
Submodules
solthiruthi.Ezhimai module
solthiruthi.WordSpeller module
solthiruthi.datastore module
-
class solthiruthi.datastore.DTrie
Bases: solthiruthi.datastore.Trie
A trie in which the number of letters stored at each node grows over time. The implementation uses a dictionary, and each node carries a count attribute that records the frequency of the letter.
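The dictionary-backed design described above can be sketched as follows. This is a minimal illustration and not the library's implementation: the names DictTrieNode, DictTrie, add, contains, and letter_count are hypothetical, and the assumption that count is incremented once per inserted word passing through a letter is ours.

```python
class DictTrieNode:
    """One node of a dictionary-backed trie (hypothetical name)."""

    def __init__(self):
        self.children = {}    # letter -> DictTrieNode; grows as new letters arrive
        self.count = 0        # assumed meaning: words inserted through this letter
        self.is_word = False  # True if an inserted word ends here


class DictTrie:
    """Sketch of a trie whose per-node alphabet grows on demand."""

    def __init__(self):
        self.root = DictTrieNode()

    def add(self, word):
        node = self.root
        for letter in word:
            # setdefault creates the child only the first time a letter is seen
            node = node.children.setdefault(letter, DictTrieNode())
            node.count += 1
        node.is_word = True

    def _find(self, text):
        node = self.root
        for letter in text:
            node = node.children.get(letter)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def letter_count(self, prefix):
        node = self._find(prefix)
        return 0 if node is None else node.count
```

Because children are kept in a dict, a node only pays for the letters actually seen, unlike a fixed-width array indexed over the full alphabet.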
-
class solthiruthi.datastore.Queue
Bases: list
-
ExceptionMsg
= u'Queue does not support list method %s'¶
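The ExceptionMsg attribute suggests that the class rejects ordinary list methods. A minimal sketch of that pattern is shown below. The class name RestrictedQueue, the enqueue and dequeue methods, the choice of RuntimeError, and the list of blocked methods are all assumptions made for illustration; only the message text comes from the attribute above.

```python
class RestrictedQueue(list):
    """FIFO queue built on list (hypothetical stand-in, not the library class)."""

    ExceptionMsg = "Queue does not support list method %s"

    def enqueue(self, item):
        # call the base-class method directly, bypassing the blocked override
        list.append(self, item)

    def dequeue(self):
        return list.pop(self, 0)


def _unsupported(name):
    """Build a method that always raises with the formatted message."""
    def method(self, *args, **kwargs):
        raise RuntimeError(RestrictedQueue.ExceptionMsg % name)
    return method


# Assumed set of list methods to block; the real class may differ.
for _name in ("append", "extend", "pop", "remove", "reverse", "sort"):
    setattr(RestrictedQueue, _name, _unsupported(_name))
```

Calling q.append("x") on such a queue raises an error whose text names the rejected method, while enqueue and dequeue keep first-in, first-out order.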
-
-
class solthiruthi.datastore.RTrie(is_tamil=False)
Bases: solthiruthi.datastore.DTrie
-
class solthiruthi.datastore.TamilTrie(get_idx=<function getidx>, invert_idx=<function tamil>, alphabet_len=323)
Bases: solthiruthi.datastore.Trie
Stores a list of words in a trie data structure.
solthiruthi.dictionary module
solthiruthi.dom module
-
class solthiruthi.dom.Document(filename)
Bases: solthiruthi.datastore.Queue
Opens the given file and reads its contents on load.
-
class solthiruthi.dom.Entity(word, flagged=False, **kwargs)
Bases: solthiruthi.dom.Position
-
class solthiruthi.dom.WordEntity(word, **kwargs)
Bases: solthiruthi.dom.Entity
solthiruthi.heuristics module
-
class solthiruthi.heuristics.AdjacentConsonants(freq=2)
Bases: solthiruthi.heuristics.Rule
Do not allow adjacent consonants in a word. This rule may not be as useful as the AdjacentVowels rule.
-
agaram_letters
= set([u'\u0ba3', u'\u0ba4', u'\u0ba9', u'\u0ba8', u'\u0baa', u'\u0baf', u'\u0bae', u'\u0bb1', u'\u0bb0', u'\u0bb3', u'\u0bb2', u'\u0b95', u'\u0bb4', u'\u0b99', u'\u0bb5', u'\u0b9a', u'\u0b9f', u'\u0b9e'])¶
-
mei_letters
= set([u'\u0b9a\u0bcd', u'\u0baf\u0bcd', u'\u0ba4\u0bcd', u'\u0b99\u0bcd', u'\u0bae\u0bcd', u'\u0ba3\u0bcd', u'\u0bb5\u0bcd', u'\u0b95\u0bcd', u'\u0baa\u0bcd', u'\u0b9f\u0bcd', u'\u0bb4\u0bcd', u'\u0ba9\u0bcd', u'\u0b9e\u0bcd', u'\u0bb3\u0bcd', u'\u0ba8\u0bcd', u'\u0bb2\u0bcd', u'\u0bb1\u0bcd', u'\u0bb0\u0bcd'])¶
-
reason
= u'\u0b92\u0ba9\u0bcd\u0bb1\u0bc8\u0ba4\u0bcd\u0ba4\u0bca\u0b9f\u0bb0\u0bcd\u0ba8\u0bcd\u0ba4\u0bc1\u0b92\u0ba9\u0bcd\u0bb1\u0bc1 \u0bae\u0bc6\u0baf\u0bcd \u0b8e\u0bb4\u0bc1\u0ba4\u0bcd\u0ba4\u0bc1\u0b95\u0bcd\u0b95\u0bb3\u0bcd \u0bb5\u0bb0\u0b95\u0bcd\u0b95\u0bc2\u0b9f\u0bbe\u0ba4\u0bc1. \u0b87\u0ba4\u0bc1 \u0baa\u0bc6\u0bb0\u0bc1\u0bae\u0bcd\u0baa\u0bbe\u0bb2\u0bc1\u0bae\u0bcd \u0baa\u0bbf\u0bb4\u0bc8\u0baf\u0bbe\u0b95 \u0b87\u0bb0\u0bc1\u0b95\u0bcd\u0b95\u0bc1\u0bae\u0bcd.'¶
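A rough sketch of the check this rule describes, assuming a mei letter is one of the agaram_letters followed by the pulli (U+0BCD), as the mei_letters set above indicates. The function name has_adjacent_mei and its threshold parameter are hypothetical; whether threshold corresponds to the freq argument of the class is not confirmed by this page.

```python
PULLI = "\u0bcd"

# The 18 base consonants listed in agaram_letters above.
AGARAM = set(
    "\u0b95\u0b99\u0b9a\u0b9e\u0b9f\u0ba3\u0ba4\u0ba8\u0baa"
    "\u0bae\u0baf\u0bb0\u0bb2\u0bb5\u0bb4\u0bb3\u0bb1\u0ba9"
)


def has_adjacent_mei(word, threshold=2):
    """Return True if `threshold` or more mei letters occur back to back.

    `threshold` is an illustrative parameter, not the library's `freq`.
    """
    run = 0
    i = 0
    while i < len(word):
        is_mei = (
            word[i] in AGARAM
            and i + 1 < len(word)
            and word[i + 1] == PULLI
        )
        if is_mei:
            run += 1
            if run >= threshold:
                return True
            i += 2  # skip the consonant and its pulli
        else:
            run = 0
            i += 1
    return False
```

With threshold=2 this also flags legitimate clusters such as ர் followed by த், which is consistent with the note above that the rule may be less useful than AdjacentVowels.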
-
-
class solthiruthi.heuristics.AdjacentVowels
Bases: solthiruthi.heuristics.Rule
Do not allow adjacent vowels in a word. For example, ஆஅக்காள் (intended as அக்காள்) will be flagged.
-
reason
= u'\u0b92\u0ba9\u0bcd\u0bb1\u0bc8\u0ba4\u0bcd\u0ba4\u0bca\u0b9f\u0bb0\u0bcd\u0ba8\u0bcd\u0ba4\u0bc1\u0b92\u0ba9\u0bcd\u0bb1\u0bc1 \u0b89\u0baf\u0bbf\u0bb0\u0bc6\u0bb4\u0bc1\u0ba4\u0bcd\u0ba4\u0bc1\u0b95\u0bcd\u0b95\u0bb3\u0bcd \u0bb5\u0bb0\u0b95\u0bcd\u0b95\u0bc2\u0b9f\u0bbe\u0ba4\u0bc1. \u0b87\u0ba4\u0bc1 \u0baa\u0bc6\u0bb0\u0bc1\u0bae\u0bcd\u0baa\u0bbe\u0bb2\u0bc1\u0bae\u0bcd \u0baa\u0bbf\u0bb4\u0bc8\u0baf\u0bbe\u0b95 \u0b87\u0bb0\u0bc1\u0b95\u0bcd\u0b95\u0bc1\u0bae\u0bcd.'¶
-
uyir_letters
= set([u'\u0b85', u'\u0b87', u'\u0b86', u'\u0b89', u'\u0b88', u'\u0b8a', u'\u0b8f', u'\u0b8e', u'\u0b90', u'\u0b93', u'\u0b92', u'\u0b94'])¶
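The adjacent-vowel check can be illustrated with a short sketch that reuses the twelve code points from uyir_letters. The function name has_adjacent_vowels is hypothetical, and the code only mirrors the behaviour described in the docstring; it is not taken from the library source.

```python
# The twelve uyir (vowel) letters listed in uyir_letters above.
UYIR = set(
    "\u0b85\u0b86\u0b87\u0b88\u0b89\u0b8a"
    "\u0b8e\u0b8f\u0b90\u0b92\u0b93\u0b94"
)


def has_adjacent_vowels(word):
    """Return True if two independent vowel letters appear side by side."""
    return any(a in UYIR and b in UYIR for a, b in zip(word, word[1:]))
```

Applied to the docstring's example, the misspelled ஆஅக்காள் is flagged and the intended அக்காள் is not.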
-
-
class solthiruthi.heuristics.BadIME
Bases: solthiruthi.heuristics.Rule
Do not allow vowels carrying a kombu, thunaikaal, or similar vowel sign in a word. For example, ஆாள் (intended as ஆள்) will be flagged.
-
reason
= u'\u0b9a\u0bca\u0bb2\u0bcd\u0bb2\u0bbf\u0bb2\u0bcd \u0baa\u0bbf\u0bb4\u0bc8 \u0b95\u0bbe\u0bb0\u0ba3\u0bae\u0bcd, \u0b87\u0bb2\u0bcd\u0bb2\u0bbe\u0ba4 \u0ba4\u0bae\u0bbf\u0bb4\u0bcd \u0b8e\u0bb4\u0bc1\u0ba4\u0bcd\u0ba4\u0bc1..'¶
-
uyir_letters
= set([u'\u0b85', u'\u0b87', u'\u0b86', u'\u0b89', u'\u0b88', u'\u0b8a', u'\u0b8f', u'\u0b8e', u'\u0b90', u'\u0b93', u'\u0b92', u'\u0b94'])¶
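One way to picture this rule is as a scan for an independent vowel immediately followed by a dependent vowel sign, a sequence that input methods can produce by mistake. The sketch below assumes that reading of the docstring; the function name has_vowel_with_sign and the VOWEL_SIGNS set (the Tamil dependent vowel signs from the Unicode Tamil block) are our own and are not attributes of the class.

```python
# The twelve uyir (vowel) letters listed in uyir_letters above.
UYIR = set(
    "\u0b85\u0b86\u0b87\u0b88\u0b89\u0b8a"
    "\u0b8e\u0b8f\u0b90\u0b92\u0b93\u0b94"
)

# Dependent vowel signs (thunaikaal, kombu, and so on); assumed set.
VOWEL_SIGNS = set(
    "\u0bbe\u0bbf\u0bc0\u0bc1\u0bc2"
    "\u0bc6\u0bc7\u0bc8\u0bca\u0bcb\u0bcc"
)


def has_vowel_with_sign(word):
    """Return True if an independent vowel is followed by a vowel sign."""
    return any(
        a in UYIR and b in VOWEL_SIGNS for a, b in zip(word, word[1:])
    )
```

Vowel signs are only meaningful after a consonant, so any hit from this scan points to a letter that does not exist in Tamil.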
-
-
class solthiruthi.heuristics.RepeatedLetters
Bases: solthiruthi.heuristics.Rule
Do not allow more than one repetition of a letter in a word.
-
reason
= u'\u0b92\u0bb0\u0bc7 \u0b8e\u0bb4\u0bc1\u0ba4\u0bcd\u0ba4\u0bc1 \u0baa\u0bb2 \u0bae\u0bc1\u0bb0\u0bc8 (>= 2) \u0ba4\u0bca\u0b9f\u0bb0\u0bcd\u0b9a\u0bcd\u0b9a\u0bbf\u0baf\u0bbe\u0b95 \u0bb5\u0ba8\u0bcd\u0ba4\u0bbe\u0bb2\u0bcd \u0b85\u0ba4\u0bc1 \u0baa\u0bbf\u0bb4\u0bc8\u0baf\u0bbe\u0ba9 \u0b9a\u0bca\u0bb2\u0bcd \u0b86\u0b95\u0bc1\u0bae\u0bcd'¶
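A sketch of how such a check might work is given below. It first groups code points into letters by attaching combining marks (vowel signs and the pulli) to the preceding base character, then looks for a letter repeated consecutively more than max_repeats times. The helper names split_letters and has_repeated_letter, the max_repeats parameter, and the grouping strategy are assumptions; the library may tokenize Tamil letters differently and may use a different threshold.

```python
import unicodedata


def split_letters(word):
    """Group each base character with the combining marks that follow it."""
    letters = []
    for ch in word:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch  # vowel sign or pulli joins the previous letter
        else:
            letters.append(ch)
    return letters


def has_repeated_letter(word, max_repeats=1):
    """Return True if a letter repeats consecutively more than max_repeats times.

    One repetition means the letter appears twice in a row.
    """
    letters = split_letters(word)
    run = 1
    for prev, cur in zip(letters, letters[1:]):
        run = run + 1 if cur == prev else 1
        if run - 1 > max_repeats:
            return True
    return False
```

Comparing whole letters rather than code points keeps ம் and மா distinct, so a word such as அம்மா is not treated as a repetition.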
-
-
class solthiruthi.heuristics.Rule
Bases: object
-
apply(word, ctx)
word is the word to check. ctx is a dict giving the number of previous and next words together with lists of those surrounding words, e.g. ctx = {'NPrev': 4, 'Prev': [w1, w2, w3, w4], 'NNext': 2, 'Next': [w1, w2]}. The return value should be a boolean (False if an error is found), with an optional reason as the second value.
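The contract described above can be sketched with a small stand-alone example. SketchRule, NoDigits, and check are hypothetical names invented here; they only mirror the documented shape of apply (a word and a ctx dict in, a boolean and an optional reason out) and are not part of solthiruthi.

```python
class SketchRule(object):
    """Hypothetical base rule: accepts every word."""

    reason = ""

    def apply(self, word, ctx=None):
        # (True, None) means no error was found
        return True, None


class NoDigits(SketchRule):
    """Toy rule used only to demonstrate the return convention."""

    reason = "word contains a digit"

    def apply(self, word, ctx=None):
        if any(ch.isdigit() for ch in word):
            return False, self.reason
        return True, None


def check(word, rules, ctx):
    """Collect the reasons from every rule that rejects the word."""
    results = (rule.apply(word, ctx) for rule in rules)
    return [reason for ok, reason in results if not ok]


# ctx follows the layout given in the docstring above.
ctx = {"NPrev": 1, "Prev": ["w1"], "NNext": 0, "Next": []}
print(check("ab1", [SketchRule(), NoDigits()], ctx))
```

Returning the reason alongside the boolean lets a caller run several rules over a word and report every failure at once.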
-
solthiruthi.morphology module
-
class solthiruthi.morphology.RemoveHyphenatesNumberDate
Bases: solthiruthi.morphology.RemoveCaseSuffix
Done correctly (மேல்) 65536-மேல், ivan paritchayil இரண்டாவது, 2-வது