A fine-grained model for language identification
Hammarström, 2007
- Document ID: 4048992092697967456
- Author: Hammarström H
- Publication year: 2007
- Publication venue: Proceedings of iNEWS-07 Workshop at SIGIR 2007
Snippet
Existing state-of-the-art techniques to identify the language of a written text most often use a 3-gram frequency table as basis for 'fingerprinting' a language. While this approach performs very well in practice (99%-ish accuracy) if the text to be classified is of size, say, 100 …
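The 3-gram fingerprinting baseline mentioned in the snippet can be illustrated with a minimal Python sketch. It is not the paper's fine-grained model. The function names, the toy training sentences, and the use of cosine similarity to compare frequency tables are all assumptions made for illustration; the snippet does not say how tables are compared, and real tables are built from far larger corpora.

```python
from collections import Counter
from typing import Dict


def trigram_profile(text: str) -> Counter:
    """Count overlapping character 3-grams in the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (an assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sum(c * c for c in a.values()) ** 0.5
    norm_b = sum(c * c for c in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text: str, profiles: Dict[str, Counter]) -> str:
    """Return the label whose 3-gram profile is most similar to the text."""
    query = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(query, profiles[lang]))


# Hypothetical toy training samples, one sentence per language.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "swedish": "den snabba bruna räven hoppar över den lata hunden och sedan sover hunden i solen",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}

print(identify("the dog and the fox", PROFILES))
print(identify("hunden sover i solen", PROFILES))
```

Because the profile is a table of counts over the whole input, very short inputs yield only a handful of 3-grams to compare, which is consistent with the snippet's remark that the approach works well for texts of roughly 100 characters.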
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30634—Querying
- G06F17/30657—Query processing
- G06F17/30675—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2765—Recognition
- G06F17/2775—Phrasal analysis, e.g. finite state techniques, chunking
- G06F17/278—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2705—Parsing
- G06F17/2715—Statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2765—Recognition
- G06F17/277—Lexical analysis, e.g. tokenisation, collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30705—Clustering or classification
- G06F17/3071—Clustering or classification including class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2795—Thesaurus; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30613—Indexing
- G06F17/30619—Indexing indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30705—Clustering or classification
- G06F17/30707—Clustering or classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
Similar Documents
Publication | Title |
---|---|
Bojar et al. | Findings of the 2014 workshop on statistical machine translation |
Grishman | Information extraction |
Witten | Text Mining |
Canhasi et al. | Albanian fake news detection |
Bansal et al. | Coreference semantics from web features |
Muresan et al. | Combining linguistic and machine learning techniques for email summarization |
Hammarström | A fine-grained model for language identification |
Hahn et al. | Tabula nearly rasa: Probing the linguistic knowledge of character-level neural language models trained on unsegmented text |
Ahmed | The role of linguistic feature categories in authorship verification |
Khan et al. | Sentence embedding based semantic clustering approach for discussion thread summarization |
Patman et al. | Names: A new frontier in text mining |
Baron et al. | Who is Who and What is What: Experiments in cross-document co-reference |
Sapkota et al. | The use of orthogonal similarity relations in the prediction of authorship |
Sjöbergh et al. | Finding the correct interpretation of Swedish compounds, a statistical approach |
Merhben et al. | Lexical disambiguation of Arabic language: an experimental study |
Avancini et al. | Automatic expansion of domain-specific lexicons by term categorization |
Park et al. | Detecting experiences from weblogs |
Vodolazova et al. | Extractive text summarization: can we use the same techniques for any text? |
Hammarström | Unsupervised Learning of Morphology and the Languages of the World |
Kondrak | Cognates and word alignment in bitexts |
Milne et al. | A study in language identification |
Alfonseca et al. | German decompounding in a difficult corpus |
Kuba et al. | POS tagging of Hungarian with combined statistical and rule-based methods |
Merhbene et al. | An experimental study for some supervised lexical disambiguation methods of Arabic language |
Wan et al. | Using thematic information in statistical headline generation |