WO2001046855A1 - Method and apparatus for word stemming using spelling correction algorithms - Google Patents
Method and apparatus for word stemming using spelling correction algorithms Download PDFInfo
- Publication number
- WO2001046855A1 WO2001046855A1 PCT/US2000/042767 US0042767W WO0146855A1 WO 2001046855 A1 WO2001046855 A1 WO 2001046855A1 US 0042767 W US0042767 W US 0042767W WO 0146855 A1 WO0146855 A1 WO 0146855A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- entered
- stem
- stemming
- computer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
Definitions
- This invention relates to word stemming and, more particularly, to stemming of words entered into a computer utilizing spelling correction algorithms.
- Stemming algorithms perform language processing. Specifically, a stemming algorithm will take a word and process it until the stem (or root) of the word is derived. For example, an ideal stemmer will map “tries” to “try”, “stripped” to “strip” , and “striped” to “stripe” . To do this, several different methods have been developed including suffix removal, strict truncation of character strings, word segmentation, letter bigrams, and linguistic morphology.
- IR programs are used to locate documents in a system that relate to a user entered query.
- An IR program receives the user query, then searches a set of documents looking for documents that contain the words of the entered query.
- the program will retrieve any document in its search set that contains the words "training cats” .
- the basic program will not retrieve those documents that contain "train”, “trains”, “trained” , or "cat” unless the user enters these variations of the words in the query.
- word stemming is used with the IR program. In such a case, the user does not need to enter the variations of the desired word(s) , rather the IR program with stemming automatically searches for them. Thus, searches of the entered word plus the variations of that word are performed.
- Stemming algorithms accomplish this by reducing an entered word down to its stem.
- a search is performed either on the stem or on the entered word plus its variants that are included by the stemming algorithm, thereby resulting in a search on all variants of that stem.
- Stemming can occur either at the index creation or at the query stage.
- document terms are reduced to their stemmed equivalents.
- the search in this case is on the stem of the entered word. This saves space in storing the document and decreases processing time.
- the query is expanded by adding the variants of the stem to the query. Stemming at the query stage takes longer to process but can allow the user to refine the query by choosing which word variants to include in the query.
- the stemming algorithm can be used in a hybrid IR program.
- the query is executed both with and without stemming.
- the results are combined to produce a preference for documents that match the query more precisely, while still allowing the query to match word variants.
- a popular stemmer in use is the Porter stemmer described in M.F. Porter, "An algorithm for suffix stripping", Program 14(3) pp. 313-316 (July 1980).
- the Porter stemmer is an iterative process in which different suffixes are removed at each step using simple rules. Sometimes the roots produced by this method are not words, but just parts of a word (i.e., a pseudo-root) . This can lead to results that are confusing to the end user.
- this ⁇ temmer produces several errors, mainly due to inappropriately merging two equivalence classes or failing to stem two words to the same root. For example, “general”, “generous”, “generation”, and “generic” are all conflated to the same root, while “recognize” and “recognition” are not conflated at all. For example, “hero” and “heroine” are conflated with “heroin” Therefore, a search engine using Porter's stemmer will include articles about illegal drug use with articles aboutraum figures, and vice versa.
- the method includes the steps of deriving a stem of the entered word through the use of a spelling correction algorithm and returning the stem.
- a benefit of utilizing a spelling correction algorithm to derive the stem of the entered word is realized when the entered word is misspelled. Since the spelling correction is integrated with the stemming of the word, the invention is insensitive to spelling errors in the entered word.
- the spelling correction algorithm is a word similarity metric that compares the entered word with candidate stems in a lexicon.
- the lexicon of candidate stems is restricted to only contain actual words, i.e., fragments of words (pseudo-roots) are not included in the lexicon.
- the lexicon may contain part of speech information for each candidate stem.
- the first step is to determine the part of speech for the entered word. The selection of a candidate stem is restricted to selecting a stem that has a part of speech that matches the part of speech of the entered word.
- the word similarity metric is an edit distance metric.
- the shortest candidate stem is selected; alternatively, the candidate stem with the highest frequency may be selected.
- the edit distance metric calculates intermediate edit distances and the edit distance for the entered word and each candidate stem. These values can be cached for later use such that prior to executing the edit distance metric, the entered word and the lexicon candidate stem are compared to the cache. If the value for an intermediate edit distance or the edit distance is in the cache, that value is used; otherwise, the value is calculated.
- the entered word is first compared with a list of exception words and corresponding stems. If a match is found between the entered word and an exception word, the corresponding stem for the exception word is returned as the stem.
- Other embodiments of the invention include selecting the stem based upon a match of the entered word and the first few letters of the candidate stem, selecting the stem based upon a match of all but the last letter or letters of the candidate stem, and selecting the stem from a list of equivalence class prefixes in which the longest prefix is selected as the stem.
- the present invention has application with IR programs, for example, web search engines, to expand a query by adding terms corresponding to the stemmed versions of the input query term or by adding terms corresponding to other words in the input query terms' equivalence classes. Furthermore, the size of indexes can be reduced by conflating terms that stem alike. As an alternate method of IR, a weighted combination of the document scores with and without stemming can be utilized. Additionally, the present invention has application with text classification, text clustering, dictionary storage, machine translation, and computing word co-occurrence statistics, for example.
- Porter's stemmer and the present invention utilizing an edit distance metric, were executed on a corpus of 81,673 English words grouped into 17,782 stemmed equivalence classes. Each equivalence class contained words that shared a common root, and the word that most closely matched the root was listed first. If the stemming algorithm identified an entered word's stem as the first word in the word's equivalence class, then it was considered to have stemmed the word correctly. Since Porter's stemmer returns a pseudo-root instead of a word, this algorithm was considered to have stemmed the word correctly if the word' s root matched the result of applying the stemmer to the first word in the word's equivalence class.
- Fig. 1 is a block diagram of an apparatus for stemming a word according to the present invention
- Fig. 2 is a flow diagram illustrating the general process according to the present invention
- Figs. 3A-3D are flow diagrams illustrating a computer program object for implementing one embodiment of the present invention.
- the apparatus includes an input device 110, a screen monitor 112, and a computer 114.
- the computer 114 incorporates a processor 116, a stemming algorithm 118, and a lexicon 120.
- the computer may also have an IR algorithm 122.
- stemming of a word is achieved through the steps of obtaining a word 210, executing a stemming algorithm 212, and returning a stem
- the stem may be used by another program, for example, an IR program.
- the stemming algorithm utilizes a spelling correction algorithm to derive the stem by comparing the entered word with words located in a lexicon.
- the lexicon contains possible stems.
- the lexicon of possible stems is preferably limited to real words, i.e., no parts of words are contained therein.
- the lexicon may also contain part of speech information associated with each possible stem.
- the lexicon of possible stems may be incorporated into a spelling correction dictionary by marking the dictionary words that are also possible stems.
- the lexicon of possible stems may be implemented using tries, hash tables, finite state automata, or binary trees.
- a trie is a data structure invented by Ed Fredkin
- the spelling correction algorithm compares the entered word and the possible stems using word similarity metrics.
- the algorithm may select a possible stem based on a comparison of the first few letters of the possible stem with the entered word or based on a comparison of all but the last letters, for example, the last three, of the possible stem with the beginning letters of the entered word.
- the algorithm executes the comparison using edit distance methodology.
- Edit distance also known as Damerau-Levenshtein edit distance, is a well-known technique for measuring word similarity. Given a pair of words, the edit distance algorithm counts the minimum number of edits (insertions, deletions, substitutions, and transpositions) necessary to transform one word into the other.
- Each type of edit is assigned a different "cost" (or modulus that has been preloaded into the algorithm) .
- the costs are summed to determine the edit distance.
- edits that reduce the length of the entered word and complex edits corresponding to common suffixes have lower costs .
- the words are split into two portions, intermediate edit distances are computed for the left and right portions, and the two edit distances are summed. These steps are performed recursively for each possible split point. The smallest sum of all the possible split points is selected as the edit distance for that pair of words .
- the candidate stem with the lowest edit distance is selected as the stem.
- the stemming routine identifies two or more candidate stems with equal edit distances, the shortest candidate stem is returned as the stem; alternatively, the candidate stem having the highest frequency is returned.
- the part of speech for the entered word is determined prior to stemming.
- a stem is selected based on a match of the part of speech for the candidate stem and the part of speech for the entered word. Since the number of edit distance calculations can be large, the intermediate edit distance computations are cached (stored in memory) for later use. The edit distance for a pair of words is also cached for later use. To take advantage of the cache, the entered word and the lexicon word are first compared with the cache. If an edit distance value for the pair of words is found in the cache, that value is returned as the intermediate edit distance or edit distance accordingly.
- the edit distance metric is executed for the pair of words .
- the cache has the potential to become large; therefore, it is flushed periodically. It is also possible to track the frequency of certain intermediate edit distances and edit distances and to keep those distances having a high frequency in the cache.
- uhe cache can be preloaded by running the edit distance metric on a corpus of spelling errors and their corresponding corrections. The intermediate edit distances and the edit distances for each pair of these words are cached for later use as described above .
- a shortensd list of candidate stems can be created based on an iterative process of removing letters from the entered word and then looking up the new word in the lexicon. If a match is found, it is added to the list of candidate stems. Once the candidate stems are selected, the edit distance calculations are performed. This decreases processing time since only a shortened list is evaluated for edit distance.
- One embodiment of the invention includes a list having exception words and corresponding stems wherein the entered word is compared to the exceptions prior to stemming. If a match is found within the exceptions, the corresponding stem is returned and the stemming algorithm does not continue to execute. If no match is found, stemming occurs as normal .
- the stem can be selected based on a comparison of the entered word with a lexicon of equivalence classes.
- each equivalence class has a prefix. This prefix is the longest prefix common to each member of the class. The invention determines the longest prefix that matches the entered word and returns that prefix as the stem. Alternatively, the shortest or most frequent member of the equivalence class can be returned as the stem.
- a lexicon having exception and regular expressions is utilized.
- the expressions are regular n-grams (a string of characters that may comprise all or part of a word).
- the spelling correction algorithm determines if the entered word matches an exception expression. If there is no match, then the entered word is compared to the regular expression. If there is a match between the entered word and a regular expression, zero or more letters are deleted from the end of the entered word and zero or more letters are added to replace the deleted letters resulting in the stem. If the entered word matches an exception expression, the actions of deleting and replacing letters are prevented and the entered word is returned as the stem.
- Figs. 3A-3D depict a flow diagram illustrating one embodiment of the present invention.
- step 310 receives the entered word to be stemmed.
- Step 316 evaluates whether the word is contained in a list of exceptions. If the word is an exception, the stem corresponding to the exception is returned as the stem in steps 318 and 320. If the word is not in the list of exceptions, the last letter of the word is removed in step 322.
- step 324 a match for the shortened word is sought in a lexicon of possible stems. If no match is found and the word has more than one letter, step 326 repeats steps 322 and 324. If no match is found and the word does not have more than one letter, steps 326, 328, and 330 return the entered word as the stem.
- Step 334 determines if the new word is in the lexicon. If the word is not in the lexicon and the word has a length greater than one, step 338 repeats steps 332 and 334 one time. If the word does not have a length greater than one, steps 332 and 334 will not be repeated. If in step 334 the word is found in the lexicon, the word is selected as a candidate stem in step 336; steps 332 and 334 are repeated one time.
- the loop that selects candidate stems (332, 334, and 336) can be performed any number of times. For each candidate stem, steps 340 and 342 ensure that the candidate stem is not longer than the entered word. If a stem is longer than the entered word, the candidate stem is disregarded.
- step 344 performs the edit distance calculations for the entered word and each candidate stem.
- steps 346 and 348 select the candidate stem having the lowest edit distance. If two or more candidate stems have the same edit distance, steps 348 and 350 select the candidate stem having the shortest length.
- Step 352 returns the stem.
- $subword substr ($word, 0 , $length-$i) ; if ($index ⁇ $subword ⁇ eq "") ⁇
- $dist $coststail ⁇ "$word: $stem” ⁇ ;
- $dist $costs ⁇ "$word:$stem” ⁇ if ($dist eq " " ) ,- ⁇ else ⁇
- $rdist $costs ⁇ "$wr :$sr" ⁇ ;
- $rdist &edist2 ($wr, $sr, $firsttime) if ($rdist eq " " ) ;
- $result &stemword($ARGV[0] ) ; print "$result ⁇ n";
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
Description
Claims
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU47169/01A AU4716901A (en) | 1999-12-16 | 2000-12-12 | Method and apparatus for word stemming using spelling correction algorithms |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US46503299A | 1999-12-16 | 1999-12-16 | |
| US09/465,032 | 1999-12-16 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2001046855A1 true WO2001046855A1 (en) | 2001-06-28 |
| WO2001046855A8 WO2001046855A8 (en) | 2001-10-04 |
Family
ID=23846237
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2000/042767 Ceased WO2001046855A1 (en) | 1999-12-16 | 2000-12-12 | Method and apparatus for word stemming using spelling correction algorithms |
Country Status (2)
| Country | Link |
|---|---|
| AU (1) | AU4716901A (en) |
| WO (1) | WO2001046855A1 (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109308352B (en) * | 2018-08-01 | 2021-10-22 | 昆明理工大学 | A Shortest Path-Based Word Relevance Judgment Method |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5606690A (en) * | 1993-08-20 | 1997-02-25 | Canon Inc. | Non-literal textual search using fuzzy finite non-deterministic automata |
| US5765149A (en) * | 1996-08-09 | 1998-06-09 | Digital Equipment Corporation | Modified collection frequency ranking method |
| US5940624A (en) * | 1991-02-01 | 1999-08-17 | Wang Laboratories, Inc. | Text management system |
-
2000
- 2000-12-12 WO PCT/US2000/042767 patent/WO2001046855A1/en not_active Ceased
- 2000-12-12 AU AU47169/01A patent/AU4716901A/en not_active Abandoned
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5940624A (en) * | 1991-02-01 | 1999-08-17 | Wang Laboratories, Inc. | Text management system |
| US5606690A (en) * | 1993-08-20 | 1997-02-25 | Canon Inc. | Non-literal textual search using fuzzy finite non-deterministic automata |
| US5765149A (en) * | 1996-08-09 | 1998-06-09 | Digital Equipment Corporation | Modified collection frequency ranking method |
Also Published As
| Publication number | Publication date |
|---|---|
| AU4716901A (en) | 2001-07-03 |
| WO2001046855A8 (en) | 2001-10-04 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Eisenschlos et al. | Understanding tables with intermediate pre-training | |
| Ahmad et al. | Learning a spelling error model from search query logs | |
| Huang et al. | Applying machine learning to text segmentation for information retrieval | |
| EP1474759B1 (en) | System, method, and software for automatic hyperlinking of persons' names in documents to professional directories | |
| Keshava et al. | A simpler, intuitive approach to morpheme induction | |
| JPH09223161A (en) | Method and apparatus for generating a query response in a computer-based document retrieval system | |
| Snover et al. | Unsupervised learning of morphology using a novel directed search algorithm: Taking the first step | |
| Schaback et al. | Multi-level feature extraction for spelling correction | |
| Freitag | Morphology induction from term clusters | |
| McTait | Linguistic knowledge and complexity in an EBMT system based on translation patterns | |
| McTait | Translation patterns, linguistic knowledge and complexity in an approach to EBMT | |
| Torunoglu-Selamet et al. | Exploring spelling correction approaches for turkish | |
| WO2001046855A1 (en) | Method and apparatus for word stemming using spelling correction algorithms | |
| JP2009176148A (en) | Unknown word determining system, method and program | |
| Agichtein et al. | Combining Strategies for Extracting Relations from Text Collections. | |
| Üstün et al. | Incorporating word embeddings in unsupervised morphological segmentation | |
| Goldsmith et al. | From signatures to finite state automata | |
| JP3531222B2 (en) | Similar character string search device | |
| Gao et al. | Web-based citation parsing, correction and augmentation | |
| Moghadam et al. | Comparative study of various Persian stemmers in the field of information retrieval | |
| Ahlberg et al. | A best-first anagram hashing filter for approximate string matching with generalized edit distance | |
| Alfonseca et al. | A rote extractor with edit distance-based generalisation and multi-corpora precision calculation | |
| Gadri et al. | Developing a Multilingual Stemmer for the Requirement of Text Categorization and Information Retrieval | |
| Algan et al. | A use case: Reformulating query rewriting as a statistical machine translation problem | |
| Farkas et al. | Forest reranking through subtree ranking |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ CZ DE DE DK DK DM DZ EE EE ES FI FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
|
| AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
| AK | Designated states |
Kind code of ref document: C1 Designated state(s): AE AG AL AM AT AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ CZ DE DE DK DK DM DZ EE EE ES FI FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
|
| AL | Designated countries for regional patents |
Kind code of ref document: C1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
| CFP | Corrected version of a pamphlet front page |
Free format text: REVISED ABSTRACT RECEIVED BY THE INTERNATIONAL BUREAU AFTER COMPLETION OF THE TECHNICAL PREPARATIONS FOR INTERNATIONAL PUBLICATION |
|
| DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
| REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
| 122 | Ep: pct application non-entry in european phase | ||
| NENP | Non-entry into the national phase |
Ref country code: JP |