Hammarström, 2007 - Google Patents

A fine-grained model for language identification

Document ID: 4048992092697967456
Author: Hammarström H
Publication year: 2007
Publication venue: Proceedings of iNEWS-07 Workshop at SIGIR 2007

Snippet

Existing state-of-the-art techniques to identify the language of a written text most often use a 3-gram frequency table as basis for 'fingerprinting' a language. While this approach performs very well in practice (99%-ish accuracy) if the text to be classified is of size, say, 100 …
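
To make the 3-gram "fingerprinting" baseline mentioned in the snippet concrete, here is a minimal Python sketch. It illustrates the generic frequency-table approach only, not the fine-grained model the paper proposes. The function names, the choice of cosine similarity as the comparison measure, and the two toy training sentences are all assumptions made for illustration; a real system would build its profiles from much larger corpora.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Return relative frequencies of character 3-grams in `text`."""
    # Pad with spaces so word-initial and word-final trigrams are counted.
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency tables."""
    dot = sum(weight * q.get(gram, 0.0) for gram, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def identify(text, profiles):
    """Pick the language whose trigram profile is closest to the text's."""
    query = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(query, profiles[lang]))


# Hypothetical toy training data; far too small for real use.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "swedish": "den snabba bruna räven hoppar över den lata hunden och sedan sover hunden",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}

if __name__ == "__main__":
    print(identify("the dog", PROFILES))
    print(identify("hunden sover", PROFILES))
```

With profiles this small, a short input contributes only a handful of trigrams, so the result depends on whether those few trigrams happen to occur in the training text. That is consistent with the snippet's remark that the reported accuracy is conditional on the size of the text being classified.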
Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/3061 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F17/30634 Querying
    • G06F17/30657 Query processing
    • G06F17/30675 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/27 Automatic analysis, e.g. parsing
    • G06F17/2765 Recognition
    • G06F17/2775 Phrasal analysis, e.g. finite state techniques, chunking
    • G06F17/278 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/27 Automatic analysis, e.g. parsing
    • G06F17/2705 Parsing
    • G06F17/2715 Statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/27 Automatic analysis, e.g. parsing
    • G06F17/2765 Recognition
    • G06F17/277 Lexical analysis, e.g. tokenisation, collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/3061 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F17/30705 Clustering or classification
    • G06F17/3071 Clustering or classification including class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/27 Automatic analysis, e.g. parsing
    • G06F17/2795 Thesaurus; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/3061 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F17/30613 Indexing
    • G06F17/30619 Indexing indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/3061 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F17/30705 Clustering or classification
    • G06F17/30707 Clustering or classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements

Similar Documents

Publication, Title
Bojar et al. Findings of the 2014 workshop on statistical machine translation
Grishman Information extraction
Witten Text Mining.
Canhasi et al. Albanian fake news detection
Bansal et al. Coreference semantics from web features
Muresan et al. Combining linguistic and machine learning techniques for email summarization
Hammarström A fine-grained model for language identification
Hahn et al. Tabula nearly rasa: Probing the linguistic knowledge of character-level neural language models trained on unsegmented text
Ahmed The role of linguistic feature categories in authorship verification
Khan et al. Sentence embedding based semantic clustering approach for discussion thread summarization
Patman et al. Names: A new frontier in text mining
Baron et al. Who is Who and What is What: Experiments in cross-document co-reference
Sapkota et al. The use of orthogonal similarity relations in the prediction of authorship
Sjöbergh et al. Finding the correct interpretation of Swedish compounds, a statistical approach.
Merhben et al. Lexical disambiguation of Arabic language: an experimental study
Avancini et al. Automatic expansion of domain-specific lexicons by term categorization
Park et al. Detecting experiences from weblogs
Vodolazova et al. Extractive text summarization: can we use the same techniques for any text?
Hammarström Unsupervised Learning of Morphology and the Languages of the World
Kondrak Cognates and word alignment in bitexts
Milne et al. A study in language identification
Alfonseca et al. German decompounding in a difficult corpus
Kuba et al. POS tagging of Hungarian with combined statistical and rule-based methods
Merhbene et al. An experimental study for some supervised lexical disambiguation methods of Arabic language
Wan et al. Using thematic information in statistical headline generation