Channel: UNIX and Linux Forums

Sorting on length with identification of number of characters

Hello,
I am writing an open-source stemmer in Java for Indic languages which admit a large number of suffixes.
The stemmer requires the suffix strings to be sorted by length, with all strings of the same length placed in a single group and sorted alphabetically within that group. In addition, each group needs a header line giving the length of its strings, for example:
Quote:

5
6
7
8
etc.
Since the languages in question have more than 300 suffixes, sorting them by length and counting the characters in each string by hand is tedious and error-prone.
An example will make this clear.
Input:
Quote:

आधी
इतक
इतपत
ईचना
ईचनात

ईना
ईन
Expected output:
Quote:

1

2
ईन
3
आधी
इतक
ईना
4
इतपत
ईचना
5
ईचनात
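A note on how the lengths above appear to be counted: each vowel sign (matra) counts as its own character, so आधी has length 3. That matches a count of Unicode code points, not bytes or visible syllables. A minimal sketch in Python (used here only for illustration in place of AWK or Perl; the helper names `code_point_length` and `utf8_byte_length` are hypothetical) shows the difference:

```python
def code_point_length(s):
    """Number of Unicode code points; a matra such as ी counts as one."""
    return len(s)


def utf8_byte_length(s):
    """Number of bytes in UTF-8; each Devanagari code point takes 3 bytes."""
    return len(s.encode("utf-8"))


for suffix in ["ईन", "आधी", "इतपत", "ईचनात"]:
    # Prints the code point count, the UTF-8 byte count, and the suffix.
    print(code_point_length(suffix), utf8_byte_length(suffix), suffix)
```

A tool that measures length in bytes (for example, an awk running in a non-UTF-8 locale) would report 9 for आधी instead of 3, so the counting unit is worth checking before grouping.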
Since handling such a large database by hand is laborious, is it possible to write a script in AWK or Perl that would produce the above output?
Your help would go a long way toward making Java-based stemmers for different languages available to the open-source community.
Many thanks in advance for your kind help.
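One possible approach, sketched in Python instead of AWK or Perl: group the suffixes by length, sort each group, and print a header line before each group. This is only a sketch under several assumptions: length means Unicode code points, "alphabetically" means plain code point order (not a language-specific collation), blank input lines are skipped, and a header is printed for every length from 1 up to the longest suffix even when a group is empty, as the header 1 in the expected output suggests. The function names `group_by_length` and `format_groups` are hypothetical.

```python
from collections import defaultdict


def group_by_length(suffixes):
    """Map each length (in code points) to a sorted list of suffixes."""
    groups = defaultdict(list)
    for raw in suffixes:
        s = raw.strip()
        if s:  # assumption: blank lines in the input are ignored
            groups[len(s)].append(s)
    # sorted() compares by code point, which is an assumption about
    # what "alphabetical" should mean for these scripts.
    return {n: sorted(groups[n]) for n in sorted(groups)}


def format_groups(groups, start=1):
    """Return output lines: a length header followed by that group's suffixes."""
    if not groups:
        return []
    lines = []
    for n in range(start, max(groups) + 1):
        lines.append(str(n))  # header is printed even for an empty group
        lines.extend(groups.get(n, []))
    return lines


sample = ["आधी", "इतक", "इतपत", "ईचना", "ईचनात", "", "ईना", "ईन"]
print("\n".join(format_groups(group_by_length(sample))))
```

To process a real suffix list, an open file can be passed in place of `sample`, for example `group_by_length(open("suffixes.txt", encoding="utf-8"))`, where `suffixes.txt` is a placeholder name for a file with one suffix per line.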
