Hello,
I am writing an open-source stemmer in Java for Indic languages, which admit a large number of suffixes.
The Java stemmer requires the suffix strings to be sorted by length, with all strings of the same length arranged in a single group and sorted alphabetically within it. Moreover, each group needs a header giving the length of its strings, say
5 6 7 8 etc.
Since the languages in question each have over 300 suffixes, sorting them by length, working out the length of each string and grouping them by hand is difficult.
An example will make this clear.
Input:
आधी इतक इतपत ईचना ईचनात ई ईना ईन
Expected output:
1 ई 2 ईन 3 आधी इतक ईना 4 इतपत ईचना 5 ईचनात
Since handling such a large database by hand is laborious, is it possible to write a script in AWK or Perl that would produce the above output?
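To make the requirement concrete, here is a minimal sketch of the grouping I have in mind. It is written in Python purely for illustration (an AWK or Perl script would follow the same steps), and it rests on a few assumptions: the function names `group_by_length` and `format_groups` are hypothetical, the length of a suffix is taken to be its number of Unicode code points, "alphabetically" is taken to mean plain code-point order rather than a language-specific collation, and the output layout (one line per group, header first) is only a guess at what the stemmer needs.

```python
from collections import defaultdict


def group_by_length(suffixes):
    """Group suffixes by length (in code points), each group sorted.

    Returns a dict mapping length -> sorted list of suffixes,
    with the lengths in ascending order.
    """
    groups = defaultdict(list)
    for suffix in suffixes:
        # len() counts Unicode code points, so a vowel sign (matra)
        # counts as one character, as in the example above.
        groups[len(suffix)].append(suffix)
    # sorted() on strings compares code points (assumed to be an
    # acceptable stand-in for alphabetical order here).
    return {length: sorted(groups[length]) for length in sorted(groups)}


def format_groups(groups):
    """Render one line per group: the length header, then its suffixes."""
    return "\n".join(
        f"{length} {' '.join(words)}" for length, words in groups.items()
    )


if __name__ == "__main__":
    sample = "आधी इतक इतपत ईचना ईचनात ई ईना ईन".split()
    print(format_groups(group_by_length(sample)))
```

For the sample input this should print five lines, from `1 ई` up to `5 ईचनात`, matching the groups in the expected output. A real suffix list would presumably be read from a file, one suffix per line, instead of a hard-coded string.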
Your help would go a long way towards making Java-based stemmers for different languages available to the open-source community.
Many thanks in advance for your kind help.