Abstract: | A technique for information retrieval includes parsing a corpus to
identify a number of wordform instances within each document of the
corpus. A weighted morpheme-by-document matrix is generated based at
least in part on the number of wordform instances within each document of
the corpus and based at least in part on a weighting function. The
weighted morpheme-by-document matrix separately enumerates instances of
stems and affixes. Additionally or alternatively, a term-by-term
alignment matrix may be generated based at least in part on the number of
wordform instances within each document of the corpus. At least one
lower-rank approximation matrix is generated by factorizing the weighted
morpheme-by-document matrix and/or the term-by-term alignment matrix. |
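
The morpheme-by-document branch of the technique can be illustrated with a minimal sketch. The abstract names neither a morphological analyzer, a specific weighting function, nor a factorization method, so everything concrete below is an assumption made for illustration: the naive suffix list `SUFFIXES`, the helper names, log-entropy weighting, and truncated SVD as the factorization. The term-by-term alignment matrix is not covered.

```python
import numpy as np

# Hypothetical suffix list, for illustration only; a real system would use
# a proper morphological analyzer.
SUFFIXES = ("ing", "ed", "s")


def split_morphemes(word):
    """Naively split a wordform into a stem and, if present, one suffix."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return [word[: -len(suf)], "-" + suf]
    return [word]


def morpheme_by_document(docs):
    """Count stems and affixes as separate rows, one column per document."""
    parsed = [
        [m for w in d.lower().split() for m in split_morphemes(w)] for d in docs
    ]
    vocab = sorted({m for p in parsed for m in p})
    index = {m: i for i, m in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, p in enumerate(parsed):
        for m in p:
            counts[index[m], j] += 1
    return vocab, counts


def log_entropy_weight(counts):
    """Log-entropy weighting (an assumed choice of weighting function).

    Requires at least two documents, since it divides by log(n_docs).
    """
    n_docs = counts.shape[1]
    p = counts / counts.sum(axis=1, keepdims=True)
    # Treat 0 * log(0) as 0 without triggering a log(0) warning.
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    global_w = 1.0 + plogp.sum(axis=1, keepdims=True) / np.log(n_docs)
    return np.log1p(counts) * global_w


def low_rank(matrix, k):
    """Rank-k approximation via truncated SVD (an assumed factorization)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]


docs = ["walking walked walks", "talking talked talks", "walks talks"]
vocab, counts = morpheme_by_document(docs)
weighted = log_entropy_weight(counts)
approx = low_rank(weighted, k=2)
```

In this toy corpus the stems `walk` and `talk` and the affixes `-ing`, `-ed`, and `-s` each get their own row, so the three wordforms of "walk" share a single stem count instead of being treated as unrelated terms.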