A fine-grained model for language identification

Hammarström, 2007

Document ID: 4048992092697967456
Author: Hammarström H
Publication year: 2007
Publication venue: Proceedings of iNEWS-07 Workshop at SIGIR 2007

Snippet

Existing state-of-the-art techniques to identify the language of a written text most often use a 3-gram frequency table as a basis for 'fingerprinting' a language. While this approach performs very well in practice (99%-ish accuracy) if the text to be classified is of size, say, 100 …
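
The 3-gram baseline mentioned in the snippet can be illustrated with a minimal sketch. This is not the fine-grained model the paper proposes, only the general idea of a character 3-gram frequency table used as a language fingerprint. The function names, the two tiny training samples, and the use of cosine similarity to compare tables are all assumptions made for illustration; the snippet does not say how the tables are compared.

```python
from collections import Counter
import math


def trigram_profile(text: str) -> Counter:
    """Count overlapping character 3-grams in lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two 3-gram count tables (assumed metric)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text: str, profiles: dict) -> str:
    """Return the language whose fingerprint is most similar to the text."""
    query = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Hypothetical toy training data; a real fingerprint needs far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "sv": "den snabba bruna räven hoppar över den lata hunden och katten sover i huset",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(identify("the dog and the cat", PROFILES))
    print(identify("hunden och katten", PROFILES))
```

With training samples this small, the sketch only separates inputs whose 3-grams overlap one sample and not the other. Very short inputs give the table little to match against, so the comparison becomes unreliable.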

Classifications

- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30634—Querying
- G06F17/30657—Query processing
- G06F17/30675—Query execution
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2765—Recognition
- G06F17/2775—Phrasal analysis, e.g. finite state techniques, chunking
- G06F17/278—Named entity recognition
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2705—Parsing
- G06F17/2715—Statistical methods
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2765—Recognition
- G06F17/277—Lexical analysis, e.g. tokenisation, collocates
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30705—Clustering or classification
- G06F17/3071—Clustering or classification including class or cluster creation or modification
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2795—Thesaurus; Synonyms
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30613—Indexing
- G06F17/30619—Indexing indexing structures
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30705—Clustering or classification
- G06F17/30707—Clustering or classification into predefined classes
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements

Similar Documents

Author(s)	Year	Title
Bojar et al.	2014	Findings of the 2014 workshop on statistical machine translation
Grishman	2010	Information extraction
Witten	2004	Text Mining.
Canhasi et al.	2022	Albanian fake news detection
Bansal et al.	2012	Coreference semantics from web features
Muresan et al.	2001	Combining linguistic and machine learning techniques for email summarization
Hahn et al.	2019	Tabula nearly rasa: Probing the linguistic knowledge of character-level neural language models trained on unsegmented text
Ahmed	2018	The role of linguistic feature categories in authorship verification
Khan et al.	2020	Sentence embedding based semantic clustering approach for discussion thread summarization
Patman et al.	2003	Names: A new frontier in text mining
Baron et al.	2008	Who is Who and What is What: Experiments in cross-document co-reference
Sapkota et al.	2013	The use of orthogonal similarity relations in the prediction of authorship
Sjöbergh et al.	2004	Finding the correct interpretation of Swedish compounds, a statistical approach.
Merhben et al.	2012	Lexical disambiguation of Arabic language: an experimental study
Avancini et al.	2006	Automatic expansion of domain-specific lexicons by term categorization
Park et al.	2010	Detecting experiences from weblogs
Vodolazova et al.	2013	Extractive text summarization: can we use the same techniques for any text?
Hammarström	2009	Unsupervised Learning of Morphology and the Languages of the World
Kondrak	2005	Cognates and word alignment in bitexts
Milne et al.	2012	A study in language identification
Alfonseca et al.	2008	German decompounding in a difficult corpus
Kuba et al.	2004	POS tagging of Hungarian with combined statistical and rule-based methods
Merhbene et al.	2013	An experimental study for some supervised lexical disambiguation methods of Arabic language
Wan et al.	2003	Using thematic information in statistical headline generation