Nguyen et al., 2020 - Google Patents
Neural machine translation with BERT for post-OCR error detection and correctionNguyen et al., 2020
View PDF- Document ID
- 17324260038114322473
- Author
- Nguyen T
- Jatowt A
- Nguyen N
- Coustaty M
- Doucet A
- Publication year
- Publication venue
- Proceedings of the ACM/IEEE joint conference on digital libraries in 2020
External Links
Snippet
The quality of OCR has a direct impact on information access, and an indirect impact on the performance of natural language processing applications, making fine-grained (eg, semantic) information access even harder. This work proposes a novel post-OCR approach …
- 230000001537 neural 0 title abstract description 13
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30634—Querying
- G06F17/30657—Query processing
- G06F17/30675—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30613—Indexing
- G06F17/30619—Indexing indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2765—Recognition
- G06F17/277—Lexical analysis, e.g. tokenisation, collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2705—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/28—Processing or translating of natural language
- G06F17/2809—Data driven translation
- G06F17/2827—Example based machine translation; Alignment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/62—Methods or arrangements for recognition using electronic means
- G06K9/68—Methods or arrangements for recognition using electronic means using sequential comparisons of the image signals with a plurality of references in which the sequence of the image signals or the references is relevant, e.g. addressable memory
- G06K9/6807—Dividing the references in groups prior to recognition, the recognition taking place in steps; Selecting relevant dictionaries
- G06K9/6842—Dividing the references in groups prior to recognition, the recognition taking place in steps; Selecting relevant dictionaries according to the linguistic properties, e.g. English, German
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30861—Retrieval from the Internet, e.g. browsers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/21—Text processing
- G06F17/22—Manipulating or registering by use of codes, e.g. in sequence of text characters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/62—Methods or arrangements for recognition using electronic means
- G06K9/6217—Design or setup of recognition systems and techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/36—Image preprocessing, i.e. processing the image information without deciding about the identity of the image
- G06K9/46—Extraction of features or characteristics of the image
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99933—Query processing, i.e. searching
- Y10S707/99936—Pattern matching access
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Nguyen et al. | Neural machine translation with BERT for post-OCR error detection and correction | |
Chiron et al. | ICDAR2017 competition on post-OCR text correction | |
Teufel et al. | Automatic classification of citation function | |
Peng et al. | Context sensitive stemming for web search | |
Cohen et al. | Exploiting dictionaries in named entity extraction: combining semi-markov extraction processes and data integration methods | |
Taslimipoor et al. | Shoma at parseme shared task on automatic identification of vmwes: Neural multiword expression tagging with high generalisation | |
Jatowt et al. | Post-OCR error detection by generating plausible candidates | |
Kettunen et al. | Analyzing and improving the quality of a historical news collection using language technology and statistical machine learning methods | |
Lund et al. | How well does multiple OCR error correction generalize? | |
Provatorova et al. | Named entity recognition and linking on historical newspapers: UvA. ILPS & REL at CLEF HIPE 2020 | |
Chou et al. | Boosted web named entity recognition via tri-training | |
Xu et al. | Exploiting lists of names for named entity identification of financial institutions from unstructured documents | |
Ekbal et al. | Active machine learning technique for named entity recognition | |
Daðason | Post-correction of Icelandic OCR text | |
Pal et al. | Vartani Spellcheck--Automatic Context-Sensitive Spelling Correction of OCR-generated Hindi Text Using BERT and Levenshtein Distance | |
Minkov et al. | NER systems that suit user’s preferences: adjusting the recall-precision trade-off for entity extraction | |
Barteld et al. | Token-based spelling variant detection in Middle Low German texts | |
Lütke | AnyGraphMatcher Submission to the OAEI Knowledge Graph Challenge 2019. | |
Hammarström et al. | Poor man's ocr post-correction: Unsupervised recognition of variant spelling applied to a multilingual document collection | |
JP2003263441A (en) | Keyword determination database creation method, keyword determination method, apparatus, program, and recording medium | |
Gashaw et al. | Enhanced amharic-arabic cross-language information retrieval system using part of speech tagging | |
Mei et al. | A novel unsupervised method for new word extraction | |
Villanova-Aparisi et al. | Reading Order Independent Metrics for Information Extraction in Handwritten Documents | |
Soni et al. | Correcting whitespace errors in digitized historical texts | |
Saad | Named entity recognition for biomedical patent text using Bi-LSTM variants |