WO2001046855A1

WO2001046855A1 - Method and apparatus for word stemming using spelling correction algorithms

Info

Publication number: WO2001046855A1
Application number: PCT/US2000/042767
Authority: WO
Inventors: Mark Kantrowitz; Vibhu Mittal
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 1999-12-16
Filing date: 2000-12-12
Publication date: 2001-06-28
Anticipated expiration: 2002-06-16
Also published as: AU4716901A; WO2001046855A8

Abstract

Disclosed is a method and apparatus for stemming a word entered into a computer utilizing a spelling correction algorithm (118). The method includes the steps of deriving a stem of the entered word through the use of a spelling correction algorithm and returning the stem. In the preferred embodiment, the spelling algorithm is a word similarity metric, for example, an edit distance metric, that compares the entered word with candidate stems in a lexicon. An input device (110) and screen monitor (112) coupled with processor (116), utilizes both an information retrieval algorithm (122) and a stemming algorithm (118) with lexicon (120) to produce a stem of an entered word.

Description

METHOD AND APPARATUS FOR WORD STEMMING USING SPELLING CORRECTION ALGORITHMS

BACKGROUND OF THE INVENTION

1. Field of the Invention This invention relates to word stemming and, more particularly, to stemming of words entered into a computer utilizing spelling correction algorithms.

2. Description of the Prior Art

Stemming algorithms perform language processing. Specifically, a stemming algorithm will take a word and process it until the stem (or root) of the word is derived. For example, an ideal stemmer will map "tries" to "try", "stripped" to "strip" , and "striped" to "stripe" . To do this, several different methods have been developed including suffix removal, strict truncation of character strings, word segmentation, letter bigrams, and linguistic morphology.

Stemming is commonly used in information retrieval (IR) programs, for example, web search engines. IR programs are used to locate documents in a system that relate to a user entered query. An IR program receives the user query, then searches a set of documents looking for documents that contain the words of the entered query. Thus, if a user enters "training cats", the program will retrieve any document in its search set that contains the words "training cats" . However, the basic program will not retrieve those documents that contain "train", "trains", "trained" , or "cat" unless the user enters these variations of the words in the query. To alleviate this problem, word stemming is used with the IR program. In such a case, the user does not need to enter the variations of the desired word(s) , rather the IR program with stemming automatically searches for them. Thus, searches of the entered word plus the variations of that word are performed.

Stemming algorithms accomplish this by reducing an entered word down to its stem. A search is performed either on the stem or on the entered word plus its variants that are included by the stemming algorithm, thereby resulting in a search on all variants of that stem. Stemming can occur either at the index creation or at the query stage. At the index creation stage, document terms are reduced to their stemmed equivalents. The search in this case is on the stem of the entered word. This saves space in storing the document and decreases processing time. At the query stage, the query is expanded by adding the variants of the stem to the query. Stemming at the query stage takes longer to process but can allow the user to refine the query by choosing which word variants to include in the query.

Furthermore, the stemming algorithm can be used in a hybrid IR program. In such a program, the query is executed both with and without stemming. The results are combined to produce a preference for documents that match the query more precisely, while still allowing the query to match word variants. A popular stemmer in use is the Porter stemmer described in M.F. Porter, "An algorithm for suffix stripping", Program 14(3) pp. 313-316 (July 1980). The Porter stemmer is an iterative process in which different suffixes are removed at each step using simple rules. Sometimes the roots produced by this method are not words, but just parts of a word (i.e., a pseudo-root) . This can lead to results that are confusing to the end user. Additionally, this εtemmer produces several errors, mainly due to inappropriately merging two equivalence classes or failing to stem two words to the same root. For example, "general", "generous", "generation", and "generic" are all conflated to the same root, while "recognize" and "recognition" are not conflated at all. For example, "hero" and "heroine" are conflated with "heroin" Therefore, a search engine using Porter's stemmer will include articles about illegal drug use with articles about heroic figures, and vice versa.

Variations of the Porter stemmer have been created to help reduce the amount of error in the results. In R. Krovetz, "Viewing morphology as an inference process" , Proceedings of the 16^th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 191-202, (1993), Krovetz discusses a modified Porter stemmer in which each of the stems is compared with a dictionary before each stage of the stemmer. If there is a match, the match is returned as the root .

Other methods of stemming use two- level rules that represent a grammar of the morphology of the language. See, for example, U.S. Patent Nos . 5,323,316; 5,475,587; 5,559,693; 5,594,641; and 5,625,554. These algorithms utilize stem and suffix finding means and inflectional generation and recognition routines. U.S. Patent No. 5,794,177 also includes parts of speech to the analysis.

Other methods generate multiple hypotheses for the stem and compare the possible stems with a lexicon of stems . These methods also take into account the meaning of a word by analyzing its morphology. See U.S. Patent Nos . 5,251,129 and 5,331,556.

In addition to using a lexicon of word roots, the prior art uses rules and/or requires knowledge of the morphology of the language.

SUMMARY OF THE INVENTION It is an object of the present invention to provide a method for stemming a word entered into a computer that has improved accuracy and speed in identifying the stem of a word.

It is also an object of this invention to provide a stemmer that conflates words into real word stems (not pseudo-roots) .

It is another object of the present invention to remove the need for specific rules in the stemming algorithm.

It is yet another object of this invention to provide a stemmer in which no knowledge of the morphology of the language is required, making the stemmer easier to extend.

Accordingly, we have developed a method and apparatus for stemming a word entered into a computer utilizing a spelling correction algorithm. The method includes the steps of deriving a stem of the entered word through the use of a spelling correction algorithm and returning the stem.

A benefit of utilizing a spelling correction algorithm to derive the stem of the entered word is realized when the entered word is misspelled. Since the spelling correction is integrated with the stemming of the word, the invention is insensitive to spelling errors in the entered word.

In the preferred embodiment, the spelling correction algorithm is a word similarity metric that compares the entered word with candidate stems in a lexicon. Preferably, the lexicon of candidate stems is restricted to only contain actual words, i.e., fragments of words (pseudo-roots) are not included in the lexicon. Additionally, the lexicon may contain part of speech information for each candidate stem. When part of speech information is used, the first step is to determine the part of speech for the entered word. The selection of a candidate stem is restricted to selecting a stem that has a part of speech that matches the part of speech of the entered word.

In the preferred embodiment, the word similarity metric is an edit distance metric. Preferably, if two candidate stems have equal edit distances, the shortest candidate stem is selected; alternatively, the candidate stem with the highest frequency may be selected.

In another embodiment, the edit distance metric calculates intermediate edit distances and the edit distance for the entered word and each candidate stem. These values can be cached for later use such that prior to executing the edit distance metric, the entered word and the lexicon candidate stem are compared to the cache. If the value for an intermediate edit distance or the edit distance is in the cache, that value is used; otherwise, the value is calculated. In another embodiment, the entered word is first compared with a list of exception words and corresponding stems. If a match is found between the entered word and an exception word, the corresponding stem for the exception word is returned as the stem.

Other embodiments of the invention include selecting the stem based upon a match of the entered word and the first few letters of the candidate stem, selecting the stem based upon a match of all but the last letter or letters of the candidate stem, and selecting the stem from a list of equivalence class prefixes in which the longest prefix is selected as the stem.

The present invention has application with IR programs, for example, web search engines, to expand a query by adding terms corresponding to the stemmed versions of the input query term or by adding terms corresponding to other words in the input query terms' equivalence classes. Furthermore, the size of indexes can be reduced by conflating terms that stem alike. As an alternate method of IR, a weighted combination of the document scores with and without stemming can be utilized. Additionally, the present invention has application with text classification, text clustering, dictionary storage, machine translation, and computing word co-occurrence statistics, for example.

Porter's stemmer and the present invention, utilizing an edit distance metric, were executed on a corpus of 81,673 English words grouped into 17,782 stemmed equivalence classes. Each equivalence class contained words that shared a common root, and the word that most closely matched the root was listed first. If the stemming algorithm identified an entered word's stem as the first word in the word's equivalence class, then it was considered to have stemmed the word correctly. Since Porter's stemmer returns a pseudo-root instead of a word, this algorithm was considered to have stemmed the word correctly if the word' s root matched the result of applying the stemmer to the first word in the word's equivalence class.

Porter's stemmer had an accuracy of 66.1% on the corpus, while the present invention had an accuracy of 93.4%. The accuracy of the present invention was 90.2% when the complex edits corresponding to common suffixes were assigned a zero cost. Additionally, a set of simple heuristics that deleted the most common suffixes (e.g., "s", "ing", "ed" , "by", and "ation") had an accuracy of 31.6%, while an s-deletion heuristic alone had an accuracy of 19.7%. BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram of an apparatus for stemming a word according to the present invention;

Fig. 2 is a flow diagram illustrating the general process according to the present invention; and Figs. 3A-3D are flow diagrams illustrating a computer program object for implementing one embodiment of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS According to the present invention, there is provided a method and apparatus for stemming a word entered into a computer accomplished by a stemming algorithm that incorporates a spelling correction algorithm. Referring to Fig. 1, the apparatus includes an input device 110, a screen monitor 112, and a computer 114. The computer 114 incorporates a processor 116, a stemming algorithm 118, and a lexicon 120. The computer may also have an IR algorithm 122.

As shown in Fig. 2, stemming of a word is achieved through the steps of obtaining a word 210, executing a stemming algorithm 212, and returning a stem

214 of the entered word. Once the stem is returned, it may be used by another program, for example, an IR program.

The stemming algorithm utilizes a spelling correction algorithm to derive the stem by comparing the entered word with words located in a lexicon. In the preferred embodiment, the lexicon contains possible stems. The lexicon of possible stems is preferably limited to real words, i.e., no parts of words are contained therein. The lexicon may also contain part of speech information associated with each possible stem. The lexicon of possible stems may be incorporated into a spelling correction dictionary by marking the dictionary words that are also possible stems.

The lexicon of possible stems may be implemented using tries, hash tables, finite state automata, or binary trees. A trie is a data structure invented by Ed Fredkin

(see Fredkin, Ed, Communications of the ACM, Vol. 3, pp.

490-500, (I960)) and described in detail by Donald Knuth

(see Knuth, Donald, The Art of Computer Programming, Vol. 3, Addison Wesley) . The word "trie" comes from

"information reTRIEval" , but the data structure is very useful for compact storage of vocabulary lists.

The spelling correction algorithm compares the entered word and the possible stems using word similarity metrics. The algorithm may select a possible stem based on a comparison of the first few letters of the possible stem with the entered word or based on a comparison of all but the last letters, for example, the last three, of the possible stem with the beginning letters of the entered word. In the preferred embodiment, the algorithm executes the comparison using edit distance methodology. Edit distance, also known as Damerau-Levenshtein edit distance, is a well-known technique for measuring word similarity. Given a pair of words, the edit distance algorithm counts the minimum number of edits (insertions, deletions, substitutions, and transpositions) necessary to transform one word into the other. Each type of edit is assigned a different "cost" (or modulus that has been preloaded into the algorithm) . The costs are summed to determine the edit distance. In the preferred embodiment, edits that reduce the length of the entered word and complex edits corresponding to common suffixes have lower costs .

To calculate edit distance for an entered word and a word in the lexicon, the words are split into two portions, intermediate edit distances are computed for the left and right portions, and the two edit distances are summed. These steps are performed recursively for each possible split point. The smallest sum of all the possible split points is selected as the edit distance for that pair of words .

The candidate stem with the lowest edit distance is selected as the stem. Preferably, if the stemming routine identifies two or more candidate stems with equal edit distances, the shortest candidate stem is returned as the stem; alternatively, the candidate stem having the highest frequency is returned.

In an embodiment where the lexicon has part of speech information associated with each candidate stem, the part of speech for the entered word is determined prior to stemming. In addition to the comparison by the spelling checking algorithm, a stem is selected based on a match of the part of speech for the candidate stem and the part of speech for the entered word. Since the number of edit distance calculations can be large, the intermediate edit distance computations are cached (stored in memory) for later use. The edit distance for a pair of words is also cached for later use. To take advantage of the cache, the entered word and the lexicon word are first compared with the cache. If an edit distance value for the pair of words is found in the cache, that value is returned as the intermediate edit distance or edit distance accordingly. Otherwise, the edit distance metric is executed for the pair of words . The cache has the potential to become large; therefore, it is flushed periodically. It is also possible to track the frequency of certain intermediate edit distances and edit distances and to keep those distances having a high frequency in the cache. In another embodiment, uhe cache can be preloaded by running the edit distance metric on a corpus of spelling errors and their corresponding corrections. The intermediate edit distances and the edit distances for each pair of these words are cached for later use as described above . A shortensd list of candidate stems can be created based on an iterative process of removing letters from the entered word and then looking up the new word in the lexicon. If a match is found, it is added to the list of candidate stems. Once the candidate stems are selected, the edit distance calculations are performed. This decreases processing time since only a shortened list is evaluated for edit distance.

One embodiment of the invention includes a list having exception words and corresponding stems wherein the entered word is compared to the exceptions prior to stemming. If a match is found within the exceptions, the corresponding stem is returned and the stemming algorithm does not continue to execute. If no match is found, stemming occurs as normal .

Other embodiments of the invention use a different lexicon for selecting the stem. The stem can be selected based on a comparison of the entered word with a lexicon of equivalence classes. In this lexicon, each equivalence class has a prefix. This prefix is the longest prefix common to each member of the class. The invention determines the longest prefix that matches the entered word and returns that prefix as the stem. Alternatively, the shortest or most frequent member of the equivalence class can be returned as the stem.

In another embodiment, a lexicon having exception and regular expressions is utilized. The expressions are regular n-grams (a string of characters that may comprise all or part of a word). First, the spelling correction algorithm determines if the entered word matches an exception expression. If there is no match, then the entered word is compared to the regular expression. If there is a match between the entered word and a regular expression, zero or more letters are deleted from the end of the entered word and zero or more letters are added to replace the deleted letters resulting in the stem. If the entered word matches an exception expression, the actions of deleting and replacing letters are prevented and the entered word is returned as the stem.

Figs. 3A-3D depict a flow diagram illustrating one embodiment of the present invention. Beginning with Fig. 3A, step 310 receives the entered word to be stemmed. Step 316 evaluates whether the word is contained in a list of exceptions. If the word is an exception, the stem corresponding to the exception is returned as the stem in steps 318 and 320. If the word is not in the list of exceptions, the last letter of the word is removed in step 322. In step 324, a match for the shortened word is sought in a lexicon of possible stems. If no match is found and the word has more than one letter, step 326 repeats steps 322 and 324. If no match is found and the word does not have more than one letter, steps 326, 328, and 330 return the entered word as the stem.

Referring now to Fig. 3B, if a match is found at step 324, the last letter of the word is removed at step 332. Step 334 determines if the new word is in the lexicon. If the word is not in the lexicon and the word has a length greater than one, step 338 repeats steps 332 and 334 one time. If the word does not have a length greater than one, steps 332 and 334 will not be repeated. If in step 334 the word is found in the lexicon, the word is selected as a candidate stem in step 336; steps 332 and 334 are repeated one time. The loop that selects candidate stems (332, 334, and 336) can be performed any number of times. For each candidate stem, steps 340 and 342 ensure that the candidate stem is not longer than the entered word. If a stem is longer than the entered word, the candidate stem is disregarded.

As shown in Fig. 3C, step 344 performs the edit distance calculations for the entered word and each candidate stem. As shown in Fig. 3D, steps 346 and 348 select the candidate stem having the lowest edit distance. If two or more candidate stems have the same edit distance, steps 348 and 350 select the candidate stem having the shortest length. Step 352 returns the stem. The following sets forth portions of computer code in the PERL language for implementing an embodiment of the stemming algorithm of this invention.

#! /usr/local/bin/perl

$roots = 'roots.txt';

$corpus = 'stem.txt';

$debug = 0;

$test = 1; $includestem = 1;

$testmostfreq = 0 ; $freqfile = ^■topl000.txt';

# Minimum number of characters in a prefix. $minprefix = 0 ; # Was 3

$ignoresmarts = 1; if-- I # Prints the errors made by the stemmer if set to 1. $printerrors = 0 ;

# Preference for deletions (which make the string shorter)

# over substitutions, transpositions, and insertions. $prefershorter = 1 ;

$shorterprefscore = 0.5;

# Set up the initial costs matrix.

%costs = ( ) ; if ($ignoresmarts) {

%coststail = ( ) ;

o *.

-

— -.

.. ) = oo 0 0)) • < *. - .. - -

_^ . •.• < <ϋϋ 0 = = - =

Q QJJ tn - - QJ . *. t = o o • • = •- •~ <

<u c rtr

• • Λ Λ = QJ rt - •» - - = • • *.

- - - — >* QJ =

•- o o tn - • • o » o .. • • n QJ

-. o *. - - rt = •> = = Cn • • o = o o = o - • • - ~ - - - U •• - O

*. QJ » *. 0) » = QJ o o σ • • - » QJ - • • o *. • • =

-= .. = .. = Λ • • _* _^ o o = cn = υ • • _^ _^

• • <D .. • • = Λ = = = - - • • = - ~ o 0 ~

• • ω τ) - = • • • • QJ τ) rϋ = QJ QJ o _^

= 0) X) u o 0 » • • T3 = •• • • 0 • • ~ QJ • • •» = _^

*. rt = » o QJ QJ - ω 0 = υ = o = QJ ~ = o X) rt _^ rt = *. u = •- A - •• - - • • • •

- rt = = o • • πi = = - - - QJ QJ σ = = _^ Cn > *.

= - *. » » Cn = • • » = - - - ~ >! - =

• • _^ *> o o = _^ Cn o •_* «. • • = = o • • •_^ _^ Cn • •

0J » -. - o ~ = Cn o • • » • • - <ϋ = QJ = o

-. • • = = = *. = • • • • *. •» . = • • *. *. <ϋ

*. • • • • QJ rt = A •H = = •• <ϋ 0) 0 • • = = =

• • • • = QJ •= • • • • Cn -i-l ._" = ^ • • _* •_^ 0 _^

X) = QJ ~ • • - QJ 0) = u - QJ QJ ! =

= i— 1 rt «. o o QJ = = QJ • • *. σ rt >-. «. _^

• • X! = .. Λ - <ϋ -. 3 » - CD o = - u = • • « =

*. o .. = = υ - o o QJ - = ~ ~ = ~ - = • • o QJ

= o » = - 0) 0 •• = - - - QJ = » = o > = _^ » • •

*. *. *. = • • • • TJ » o — = - - σ - • • *. *. « o QJ ω o = σ QJ .. QJ .. • • .. o _^ • • o = _^ Cn QJ

» - QJ *. • • = 0) Λ X) ~ = QJ T3 o - •• = • • - QJ • . o

.. = .« = = •= • • QJ _^ = • • rt υ υ = ..

• • 0) QJ = 0⁾ rt - - = • • , » = O =

.. • • rt .. = o o • • -_* 0) >! QJ =; • • 0⁾ .* ._*

- ^ CD = o rt - » -. T) σ QJ Cn - *. - . *. σ =

- rt - •_* = o = ω - » = - o 0 = = = - - - o QJ o = _^ „ QJ QJ υ •* = QJ •* •_* _^ _^ -r-l _^ •_^ =

- - • • o • • = • • » - = • • - • •

= A o «. • • QJ _^ *. - - = » QJ QJ -. _^ QJ

- rt - • • CL) = 0 > o = • • = • • = • • • • υ = • • • • 3 o = = U • • QJ _^ ω QJ «. • • QJ • • _^ QJ Cn 0 r .. = » • • > = υ = Ό = 0

*. = » o rt rt rϋ τs • • 0) QJ = X! υ » _^

O • • *. rt = = = = • • QJ QJ = Cn o QJ •= = = = -

- CQ - = -. .. = - - » QJ τ) = QJ = *. = = » = QJ - _^ -

= / = • • - - o o o U o -

• • s • • o -. ., o •_* *. = - _^ •- σ = o o _^ .. .. » _^ =

CQ - .- = - = = - o - *. *. - - _^ « ~ = • •

= O 0) Q 0) = QJ • • • • o - - = = • • = = _^ QJ • • T⁾ 0)

— - QJ - = _" •• QJ • • • • = QJ • •

= QJ s 0 > = • • > • • • • QJ QJ • • .. ..

II •• QJ U QJ πs ⁾ •• ω XI rt 3 0) CD Cn ! X! • • QJ τJ M A υ υ QJ QJ 0

H - rt rt rt (ϋ Λ Λ as (ϋ υ n QJ QJ QJ QJ QJ •4-1 tϋ

4J CO 4 CQ 0

QJ u

CQ o\° ω

lier: ",0, "lie;st : ", 0, "like: ",0, "liness: ",0, "ling: " , 0 ly: " , 0, "man: , 0 , "med : " , 0 , "men : " , 0 , "ment : " , 0 , ments: " , 0, " ing: " , 0, "nee _:. " ,, 0-,, " nces: " , 0, "ned: " , 0,

# Load the stem word dictionary, open (ROOTS, $roots) ;

while (<ROOTS>) { chomp ; $word = $_;

$length = length($word) ; $lexicon{$word} = 1;

# We only need to index the word from 0 to 3 characters chopped

# off the end because the stems tend to not have much chopped off

# Only the non-stemmed words tend to have a lot chopped off. for ($i = 0; $i <= 3; $i++) { if ($length - $i >= $minprefix) {

$subword = substr ($word, 0 , $length-$i) ; if ($index{$subword} eq "") {

$index{$subword} = "$word"; } else {

$index{$subword} .= ":$word";

} } }

} close (ROOTS) ;

# print $coststail{"ing: "} . "\n";

# print &stemword ( "walking" ) . "\n";

# print &stemword ( "walker" ) . "\n";

# print rSStemword ( "walked" ) . "\n";

# print &stemword ( "walk" ) . "\n";

sub stemword { local ($word) = @ ;

— local ($length, $ι, $j , $subword, $match, $score, $matches, ©matches, $tmp, $bestmatch, $bestscore) if ($lexicon{$word} ) {

return ($word) ; } else {

# Find the candidate stems list. $length = length ($word) ; $matches = " " ; for ($i = $length; $i >= 0; $i--) {

$subword = substr ($word, 0 , $i) ; $tmp = $index{$subword} ; if ($tmp ne "") { $matches = $tmp; for ($j = 1; $j <= 3 && $i > $j; $j++) { $tmp = $index{substr ($word, 0, $i-$j ) } ; $matches .= " : " . $tmp if ($tmp ne "");

} last ;

}

} print "Matches: $matches\n" if ($debug) ; ©matches = split (/:/, $matches) ;

# Run edit distance on each word. $bestscore = -1; $bestmatch = " " ; foreach $match (©matches) {

# Require the potential stem to be no longer

# than the word being stemmed. if (length ($match) <= length ($word) ) { $score = kedist ($word, $match) ; print "Score: $score $word, $match\n" if ($debug) ; if ($bestscore == -1 | | $score <= $bestscore) { if ( ($bestscore == -1) II

[$score < $bestscore)

($score == $bestscore) &&

(length ($match) <= length ($bestmatch) )) ) $bestscore = $score; $bestmatch = $match;

}

# Return the stem or stems with the best score . return ($bestmatch) ;

} } sub edist { local ($word, $stem) = @_; local ($mismatch, $wrest , $srest , $dist ) ; if ($word eq $stem) { return (0) ; } else { $mismatch = &find_mismatch ($word, $stem) ; # $common = substr ($word, 0 , $mismatch) ; $wrest = substr ($word, $mismatch) $srest = substr ($stem, $mismatch) $dist = &edist2 ($wrest , $srest , 1) return ($dist) ; } } sub edist2 { local ($word, $stem, $firsttime) = @_; local ($wlength, $slength, $i, $j , $wl, $wr, $sl, $sr) ; local ($mindist , $ldist , $rdist , $dist) ;

print "Edist2: $word, $stem, $firsttime\n" ; $wlength = length ($word) ; $slength = length ($stem) ; if ($wlength == 0 && $slength == 0) { return (0) ; } elsif ($wlength <= 1 && $slength <= 1) { if ($firsttime) {

$dist = $coststail{ "$word: $stem" } ; $dist = $costs{ "$word: $stem" } if ($dist eq ""); else {

$dist = $costs{ "$word:$stem" } ;

if ($dist ne "") { return ($dist) ; elsif ($prefershorter && $wlength > $slength) {

$costs{ "$word: $stem" } = $shorterprefscore ; return ($shorterprefscore) ; else- {

$costs{ "$word: $stem" } = 1 ; return (1) ;

} elsif ($wlength == 2 && $slength == 2) { if ($firsttime) {

$dist = $coststail{ "$word: $stem" } ;

$dist = $costs{ "$word:$stem" } if ($dist eq " " ) ,- } else {

$dist = $costs{ "$word: $stem" } ;

} if ($dist ne "") { return ($dist) ; } elsif (substr ($word, 0, 1) eq substr ($stem, 1) &&

QJ

4

CQ

-H

Xi = X XII

4-1 * — * cn CJ¹ ... Λ

*—»- β ω . — . — . • ^• -_". ^^

Q) cn . — . .

H T— •—^". . 4J β i r 4 4--11

,—. CQ CQ X) 4 ^ ^11 • •HH

H -H QJ *—--- Cn C CQQ

*. 3 X) -— -* β V> ^,^-

O + -— , QJ =

«. • - — — .— _^ + H ^H ε Xi ,-— , + + CQ • - 3 3 C CQQ

QJ 4-1 r 4-4 4-1 + ^■n <-£) _^~.

4J Cn ε -H -H -H ^■CΛ- ι •- •~ = = • •

CQ β •-. QJ • - II _^-_* -— H •— _^^-*

QJ _^—_^ 4 . . , — . r • *_* —• -H • - ^■n • •_* CQ rrHH rH _*—— CQ = r -H ss β • *. X o . — . ,— , -H

3 = ε ε Xi 4-1 -r-i «. -H - -r-i • • rtrt = =

4-1 ^•- ε .. QJ Q) ε 4 tn II υ> O O H 4 4JJ *— _^-*

CQ H —- QJ Xl 4 4-> 0) n β - - ~ - *. s C CQQ C CQQ

X! 4 CQ CQ 4-) β QJ — xi i ε ε 4 4--11 4 4--11 β II II CQ 0 CQ QJ r-H -r-i — QJ 0) = C CQQ C CQQ

CQ 3 • • • • H CQ -CΛ- 0 0 4J 4-1 * * 0 0 0 0

•• T3 x* 3 i 3 3 CQ CQ CQ _^-. U U cr = = 5 = .. 4 4 QJ

Qj ε ε •— — * 0 0 XI II U — ^•— ' CQ ε

QJ QJ o rH 3 3 II V β SH O -H II II

*-^■-. 4J 4-1 S -H 0 V o QJ 4J 4-> 4-1 4 υ 4

H CQ CQ <Λ- rt = = 3 ^•I— 1 H CQ CQ CQ CQ 4-1 4_> 4J

- -CΛ- r 4J *—»— *— * -H 3 Xi XI X! Xi CQ CQ CQ

X) ^{• •} • • -"-— CQ CQ CQ — _^-* •*. .» — <Λ- β β β β II -H -H

^ XJ Xi CQ -_^^ 4J 4 4-1 .. _^-, H • - CQ CQ CQ CQ -H X) XI

0 !-ι -U CQ CQ CQ ,— » 4-) 4-> • *. O -H II 4-⁾ 4-4 J-l

S 0 0 CQ . — . 0 0 0 CQ CQ O — • II II II II CQ V>

^■Λ- 3 • *- 3 O QJ υ υ υ -H -H II II -- -H — -

-— -w- ,-~ -CΛ- ε X) XI II ^• — - -H H ιH Xi

^ = H r -H QJ 4-J ^•r— i 3 3 CQ CQ H 4-1

4 — - — - — 4J II II II β r — CQ -H ^•CΛ- 4-1 — Λ- Λ-

CQ CQ β CQ β 4-⁾ β -H — - -H

X! 4-> ^ 4-1 CQ 4-) 4-1 4-1 4-) 4 XI β CQ β -~— * CQ β CQ CQ *— «— CQ CQ β β — *-"* β

CQ 0 4-⁾ 0 4-1 •—_^* -H •H ^•H -rH -H 4 ^•H o

U QJ QJ ϋ 0⁾ 4-1 τ* Xi 0) X! τs ^ QJ 0)

CQ !H ε 0 4-1 rH QJ v -CΛ- , ^ 4-1 rH CQ QJ H QJ ω 4-1 — -H } else {

$rdist = $costs{ "$wr :$sr" } ;

$ldist = &edist2 ($wl,$sl,0) if ($ldist eq "");

$rdist = &edist2 ($wr, $sr, $firsttime) if ($rdist eq " " ) ;

$dist = $ldist + $rdist;

$mindist = $dist if ($mindist == -1 | | $dist < $mindist) ;

} }

}

$costs{ "$word: $stem" } = $mindist; return ($mindist) ;

} } } sub find_mismatch { local ($wordl, $word2) = @_; local ( $wl , $w2 , ©wordl , @word2 , $count) ; ©wordl = split (//, $wordl) ; @word2 = split (//, $word2) ; for ($count=0 ; $count < &min (length ($wordl) , length ($word2) ); $count++) { $wl = shift (©wordl) ; $w2 = shift (@word2) ; if ($wl ne $w2) { last ;

>

> return (Scount) ;

^}

sub min { local ($vall, $val2) = @ if ($vall <= $val2) { return ($vall) ; } else { return ($val2) ;

}

if ($test) {

$correct = 0 ; $total = 0; if ($testmostfreq) { $includestem = 1; open(FREQ, $freqfile) ; while (<FREQ>) { chomp; $freq{$_} = 1;

} close (FREQ) ;

} open (CORPUS, $corpus) ; while (<CORPUS>) { chomp;

$firstw = $class[0];

} else {

}

$total++; $s = &stemword ($word) ; if ($s eq $firstw) {

$correct++;

} elsif ($printerrors) { printf "%s --> %s ! = %s\n", $word, $s, $firstw; } } }

} close (CORPUS) ; printf "%.2f\n", 100*$correct/$total ; } else {

$result = &stemword($ARGV[0] ) ; print "$result\n";

}

It will be understood by those skilled in the art that while the foregoing description sets forth in detail preferred embodiments of the present invention, modifications, additions, and changes may be made thereto without departing from the spirit and scope of the invention. Having thus described our invention with the detail and particularity required by the Patent Laws, what is desired to be protected by Letters Patent is set forth in the following claims.

Claims

WE CLAIM :

1. A computer implemented method for stemming a word entered into a computer, comprising the steps of: deriving a stem of the entered word utilizing a spelling correction algorithm; and returning the stem.

2. The computer implemented method for stemming a word entered into a computer according to claim 1, wherein the spelling correction algorithm includes the steps of : comparing the entered word with a lexicon of candidate stems using a word similarity metric; and selecting the stem based on the result of the word similarity metric.

3. The computer implemented method for stemming a word entered into a computer according to claim 2, wherein the stem is selected based on a match of the entered word and the first few letters of the candidate stem.

4. The computer implemented method for stemming a word entered into a computer according to claim 2, wherein the stem is selected based on a match of all but at least the last letter of the candidate stem.

5. The computer implemented method for stemming a word entered into a computer according to claim 2, wherein the lexicon of candidate stems is implemented from the group consisting of tries, hash tables, finite state automata, and binary trees.

6. The computer implemented method for stemming a word entered into a computer according to claim 2, wherein the lexicon of candidate stems is restricted to contain actual words.

7. The computer implemented method for stemming a word entered into a computer according to claim 2, wherein the spelling correction algorithm integrates correcting the spelling of and stemming a word, whereby the method is insensitive to spelling errors in the entered word.

8. The computer implemented method for stemming a word entered into a computer according to claim 2, further including the steps of: first comparing the entered word with a list of exception words and corresponding stems; and if a match is found, selecting the corresponding stem, otherwise continuing with the derivation of the stem.

9. The computer implemented method for stemming a word entered into a computer according to claim 2, the spelling correction algorithm further including the step of determining a part of speech for the entered word, wherein each candidate stem in the lexicon has a part of speech associated with it and the stem is additionally selected based on a match of the part of speech for the candidate stem and the part of speech for the entered word.

10. The computer implemented method for stemming a word entered into a computer according to claim 2, wherein the lexicon of candidate stems is merged with a spelling correction dictionary such that the words within the spelling correction dictionary which are also candidate stems are marked as such.

11. The computer implemented method for stemming a word entered into a computer according to claim 2 , wherein the word similarity metric is an edit distance metric .

12. The computer implemented method for stemming a word entered into a computer according to claim 11, wherein the edit distance metric includes a cost for each type of edit, edits that reduce the length of the entered word have a lower cost, and for each candidate stem, the edit distance metric includes the steps of : counting the minimum number of edits to transform the entered word into the candidate stem, assigning a cost to each edit, and summing the costs.

13. The computer implemented method for stemming a word entered into a computer according to claim 11, wherein the edit distance metric includes a cost for each type of edit, complex edits corresponding to common suffixes have a lower cost, and for each candidate stem, the edit distance metric includes the steps of: counting the minimum number of edits to transform the entered word into the candidate stem, assigning a cost to each edit, and summing the costs.

14. The computer implemented method for stemming a word entered into a computer according to claim 11, wherein if two or more candidate stems have equal edit distances, the shortest candidate stem is selected.

15. The computer implemented method for stemming a word entered into a computer according to claim 11, wherein if two or more candidate stems have equal edit distances, the highest frequency candidate stem is selected.

16. The computer implemented method for stemming a word entered into a computer according to claim 11, wherein: the edit distance metric calculates an edit distance for the entered word and each candidate stem in the lexicon and caches the edit distances, and for subsequently entered words, prior to executing the edit distance metric, the subsequently entered word and each candidate stem from the lexicon are compared to the cache, and if the edit distance for the entered word and the candidate stem from the lexicon is found in the cache, that edit distance is used, otherwise the edit distance metric is executed.

17. The computer implemented method for stemming a word entered into a computer .according to claim 16, wherein: the edit distance metric also calculates an intermediate edit distance for each break point within the entered word and caches the intermediate edit distance, for subsequently entered words, prior to calculating intermediate edit distance, the subsequently entered word and the candidate stem from the lexicon are compared to the cache, and if the intermediate edit distance for the entered word and the candidate stem from the lexicon is found in the cache, that intermediate edit distance is use, otherwise the intermediate edit distance is calculated.

18. The computer implemented method for stemming a word entered into a computer according to claim 17, further including the steps of: tagging the intermediate edit distances and the edit distances that are repeatedly stored in the cache as frequent distances; and periodically deleting the intermediate edit distances and edit distances stored in the cache except for the frequent distances.

19. The computer implemented method for stemming a word entered into a computer according to claim 17, wherein: the cache is pre-loaded with a corpus of misspelled words, each misspelled word being associated with a corresponding correction and an edit distance for the misspelled word and its corresponding correction, and prior to executing the edit distance metric, the entered word and the candidate stem from the lexicon are compared to the cache, if the edit distance for the entered word and the candidate stem from the lexicon is found in the cache, that edit distance is used, otherwise the edit distance metric is executed.

20. The computer implemented method for stemming a word entered into a computer according to claim 1, wherein the spelling correction algorithm further includes the steps of : a) removing a letter from the end of the word; b) looking for a match between the shortened word and a stem in a lexicon, wherein each stem in the lexicon is a whole word; c) if a match is not found, repeating steps a) and b) until a match is found or until one letter is left in the word whereby returning the entered word as the stem; d) if a match is found, i) repeating steps a) and b) at least two more times, and ii) merging the matched stems into a list of candidate stems; e) removing from the list any candidate stem that is longer than the entered word; f) computing the edit distance between the word and each candidate stem, wherein edits that reduce the length of the word have lower costs and complex edits corresponding to common suffixes have lower costs; g) selecting the candidate stem with the lowest edit distance, wherein if there is more than one candidate stem with the lowest edit distance, selecting the shortest candidate stem; and h) returning the selected candidate stem as the stem.

21. The computer implemented method for stemming a word entered into a computer according to claim 20, wherein in step g) if there is more than one candidate stem with the lowest edit distance, selecting the most frequent candidate stem.

22. A computer implemented method for stemming a word entered into a computer, comprising the steps of: determining which prefix from a list of equivalence class prefixes has the longest length of all prefixes that match the entered word, wherein each equivalence class prefix is comprised of the longest prefix common to all members of the equivalence class; and returning the prefix as the stem.

23. A computer implemented method for stemming a word entered into a computer, comprising the steps of: determining which prefix from a list of equivalence class prefixes has the longest length of all prefixes that match the entered word, wherein each equivalence class prefix is comprised of the longest prefix common to all members of the equivalence class; and returning the shortest member of the equivalence class as the stem.

24. A computer implemented method for stemming a word entered into a computer, comprising the steps of: determining which prefix from a list of equivalence class prefixes has the longest length of all prefixes that match the entered word, wherein each equivalence class prefix is comprised of the longest prefix common to all members of the equivalence class; and returning the most frequent member of the equivalence class as the stem.

25. A computer implemented method for stemming a word entered into a computer, comprising the steps of: a) determining whether the entered word matches an exception expression in a lexicon; b) if the entered word does not match an exception expression, i) determining whether the entered word matches a regular expression in the lexicon, ii) if the entered word matches a regular expression in the lexicon,

A) deleting zero or more letters from the end of the entered word,

B) substituting zero or more letters for any deleted letter resulting in an altered word, and C) returning the altered word as the stem, and iii) if the entered word does not match a regular expression in the lexicon, returning the entered word as the stem; and c) if the entered word matches an exception expression in the list, returning the entered word as the stem.

26. An apparatus for stemming a word, comprising: a computer processor; an input means coupled to the computer processor; a memory storage device coupled to the computer processor; a data structure stored in the memory storage device including a hierarchical tree with interconnected nodes, the nodes containing computer-manipulable information concerning stemming a word entered into the computer processor by the input means, wherein the information contains instructions on stemming the word through the use of an edit distance algorithm; a lexicon of candidate stems stored in the memory storage device, wherein each candidate stem is a whole word; and means, executed by the computer processor, for performing the stemming of the word by accessing and manipulating the data structure and the lexicon.