WO1997038394A1 - Method for automatically evaluating an address written on a document after it has been converted into digital data - Google Patents

Method for automatically evaluating an address written on a document after it has been converted into digital data

Info

Publication number
WO1997038394A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
pattern
character string
determined
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/DE1997/000554
Other languages
German (de)
English (en)
Inventor
Hans-Ulrich Block
Thomas BRÜCKNER
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens AG
Siemens Corp
Original Assignee
Siemens AG
Siemens Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
1996-04-03
Filing date
1997-03-18
Publication date
1997-10-16
Application filed by Siemens AG, Siemens Corp
Priority to EP97916350A (published as EP0891599A1)
Priority to JP9535727A (published as JP2000508100A)
Publication of WO1997038394A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/26 Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262 Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Definitions

  • a system is known with which, for example, business letter documents can be categorized and then forwarded in electronic or paper form, or stored in a targeted manner.
  • the system contains a unit for layout segmentation of the document, a unit for optical text recognition, a unit for address recognition and a unit for content analysis and categorization.
  • a mixed bottom-up and top-down approach is used, which as the individual steps
  • the address recognition is carried out with a unification-based parser that works with an attributed context-free grammar for addresses. Parts of the text that are correctly parsed in the sense of the address grammar are accordingly treated as addresses. The contents of the addresses are determined using equations of the grammar. The procedure is described in [2]. Information retrieval techniques for automatic indexing of texts are used for content analysis and categorization. The details are as follows:
  • a new business letter is then categorized by comparing the index terms of this letter with the lists of significant words for all categories. Depending on their significance, the weights of the index terms contained in the letter are multiplied by a constant and summed. Dividing this sum by the number of index terms in the letter yields a probability for each class. The exact calculations are given in [3].
  • the result of the content analysis is then a list of hypotheses sorted according to probabilities.
  • the runtime of the content analysis is specified between half a second and two seconds of CPU time with a maximum number of 75 index terms per letter.
  • the object on which the invention is based is to specify a method by which the address recognition and address evaluation is improved. It is assumed that the address of the document already exists as digital data, which are then processed further. This object is achieved in accordance with the features of patent claim 1.
  • the method according to the invention is based on the technique of approximate string matching.
  • the method described by Bertossi et al. in [4] is used, which compares a word with a pattern and calculates the number of substitutions, omissions and insertions of letters in the word.
  • the pattern is selected which most closely corresponds to the word w to be examined.
  • a similarity or distance measure d is required for the two words, the pattern m and the word w to be examined.
  • the absolute number of errors is not suitable for this, since the patterns can be of different lengths. This problem can be shown using examples:
  • the reconstruction information of a letter is not a calculable measure. Therefore, according to the invention, the Markov entropy H-y- (N) is used as a model for this.
  • e_w is the number of errors in the word w to be examined.
  • Fig. 1 shows a system with which the address on a paper document is recognized and evaluated.
  • Fig. 2 shows a more precise representation of the system for evaluating the address.
  • a paper document Dok is scanned by a scanner SC and an image file BD is generated.
  • the part of the image which contains the address is segmented.
  • the layout segmentation is designated SG in FIG. 1.
  • the result is an image file that only contains the address part A-SG of the document.
  • This image data of the address is converted into ASCII data using OCR.
  • the address in ASCII data is designated ADR in Fig. 1.
  • the ASCII address file ADR still contains errors, so that by comparing this address file with stored patterns it is often not possible to identify the addressee uniquely.
  • the address recognition is designated ADR-E, a
  • the file can contain the addressees assigned to these patterns, both of which can be contained in a memory.
  • the address recognition ADR-E emits an address hypothesis for each pattern, which is called ADR-H and which represents the measure of the similarity.
  • the technique of "approximate string matching" is used in the exemplary embodiment.
  • the method described by Bertossi [4] is used, which compares a word with a pattern and calculates the number of substitutions, omissions and insertions of letters. According to Fig. 2, this takes place in the unit MA, to which the address ADR in ASCII code and the patterns m are supplied, one pattern for every possible addressee from the set of unique addressee names.
  • the patterns m_1 to m_n are thus compared with the address ADR, and a hypothesis ADR-H is formed for each pattern; that is, for each pattern the most similar word in the address (the hypothesis) is determined.
  • the distance measure d_inf is used for each pattern-hypothesis pair according to the above formula for the similarity of two
  • Patterns and the corresponding addressees are stored in one unit (memory).
  • the hypotheses for the individual patterns are contained in ADR-H; in the unit DIST the distance measurement is carried out for each
  • the distance measures d_inf are fed to a unit MIN for minimum calculation, which determines the minimum d_inf and subjects it to a threshold value check SW.
  • the threshold value check SW rejects an address as unassignable if d is above the threshold value (shown as rw); otherwise the addressee ADR-A corresponding to the pattern is output.
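
The processing chain described above (unit MA, distance calculation, unit MIN, threshold check SW) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the patent text: the error count is computed with a standard Levenshtein dynamic program rather than the method of [4]; the distance is a simple stand-in (errors divided by pattern length) and not the entropy-based measure d_inf, whose formula is not reproduced here; and all function names, the sample addressees and the threshold value are hypothetical.

```python
from typing import Dict, List, Optional


def edit_distance(word: str, pattern: str) -> int:
    """Minimum number of substitutions, omissions and insertions of letters
    needed to turn `word` into `pattern` (standard Levenshtein distance)."""
    prev = list(range(len(pattern) + 1))
    for i, wc in enumerate(word, start=1):
        curr = [i]
        for j, pc in enumerate(pattern, start=1):
            cost = 0 if wc == pc else 1
            curr.append(min(
                prev[j] + 1,         # extra letter in the word (insertion)
                curr[j - 1] + 1,     # pattern letter missing from the word (omission)
                prev[j - 1] + cost,  # match, or substitution of one letter
            ))
        prev = curr
    return prev[-1]


def best_hypothesis(address_words: List[str], pattern: str) -> str:
    """Role of unit MA: pick the address word most similar to the pattern."""
    return min(address_words, key=lambda w: edit_distance(w, pattern))


def stand_in_distance(word: str, pattern: str) -> float:
    """ASSUMPTION: errors per pattern letter, used only as a placeholder
    for the patent's entropy-based distance measure."""
    return edit_distance(word, pattern) / max(len(pattern), 1)


def assign_addressee(address: str,
                     addressees: Dict[str, str],
                     threshold: float) -> Optional[str]:
    """Return the addressee of the closest pattern, or None if the address
    is rejected as unassignable (distance above the threshold)."""
    words = address.split()
    if not words or not addressees:
        return None
    scored = []
    for pattern, addressee in addressees.items():
        hypothesis = best_hypothesis(words, pattern)
        scored.append((stand_in_distance(hypothesis, pattern), addressee))
    d_min, addressee = min(scored)                     # role of unit MIN
    return addressee if d_min <= threshold else None  # role of check SW


# Hypothetical pattern store: unique addressee name -> addressee record.
ADDRESSEES = {"Brueckner": "T. Brueckner, Dept. A", "Block": "H.-U. Block, Dept. B"}
print(assign_addressee("Hcrrn Bruckner Abt A", ADDRESSEES, threshold=0.25))
```

Dividing by the pattern length is only one way to make error counts comparable across patterns of different lengths; the patent instead motivates an information-theoretic measure for this purpose.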

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Character Discrimination (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

To identify and evaluate the character string contained in an address and to assign the address to an addressee, the character string of the address is compared with stored patterns that contain a defined address designation for each addressee. The pattern selected is the one that comes closest to the character string. For this purpose, a distance measure is formed that defines the similarity between the address and the pattern, and this distance measure is checked to determine whether it lies above or below a predetermined threshold. If the distance measure lies below the predetermined threshold, the addressee associated with this pattern is output.
PCT/DE1997/000554 1996-04-03 1997-03-18 Method for automatically evaluating an address written on a document after it has been converted into digital data Ceased WO1997038394A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP97916350A EP0891599A1 (fr) 1996-04-03 1997-03-18 Method for automatically evaluating an address written on a document after it has been converted into digital data
JP9535727A JP2000508100A (ja) 1996-04-03 1997-03-18 Method for automatically evaluating an address written on a document after the address has been converted into digital data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE19613401.3 1996-04-03
DE19613401 1996-04-03

Publications (1)

Publication Number Publication Date
WO1997038394A1 (fr) 1997-10-16

Family

ID=7790414

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/DE1997/000554 Ceased WO1997038394A1 (fr) Method for automatically evaluating an address written on a document after it has been converted into digital data 1996-04-03 1997-03-18

Country Status (3)

Country Link
EP (1) EP0891599A1 (fr)
JP (1) JP2000508100A (fr)
WO (1) WO1997038394A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1843276A1 (fr) * 2006-04-03 2007-10-10 Océ-Technologies B.V. Method for automated processing of text documents on paper
US7436979B2 (en) 2001-03-30 2008-10-14 Siemens Energy & Automation Method and system for image processing

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DOSTER W: "Contextual postprocessing system for cooperation with a multiple-choice character-recognition system", IEEE TRANSACTIONS ON COMPUTERS, NOV. 1977, USA, vol. C-26, no. 11, ISSN 0018-9340, pages 1090 - 1101, XP002035206 *
IMPEDOVO S ET AL: "Hand-written numeral recognition 'the organization degree measurement'", PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, MUNICH, WEST GERMANY, 19-22 OCT. 1982, 1982, NEW YORK, NY, USA, IEEE, USA, pages 40 - 43, vol.1, XP002035209 *
JUMARIE G: "New results in the information theory of patterns and forms", SYSTEMS ANALYSIS - MODELLING - SIMULATION, 1987, EAST GERMANY, vol. 4, no. 6, ISSN 0232-9298, pages 483 - 520, XP002035208 *
ROSENBAUM W S ET AL: "Multifont OCR postprocessing system", IBM JOURNAL OF RESEARCH AND DEVELOPMENT, JULY 1975, USA, vol. 19, no. 4, ISSN 0018-8646, pages 398 - 421, XP002035207 *

Also Published As

Publication number Publication date
JP2000508100A (ja) 2000-06-27
EP0891599A1 (fr) 1999-01-20

Similar Documents

Publication Publication Date Title
DE3889092T2 (de) Optical character reading device.
EP1665132B1 (fr) Method and system for capturing data from several machine-readable documents
DE69428590T2 (de) Handwriting recognition based on combined lexicon and character-string probability
DE69636057T2 (de) Speaker verification system
DE69814104T2 (de) Text segmentation and topic identification
DE69600461T2 (de) System and method for evaluating the image of a form
DE60204005T2 (de) Method and device for recognizing a handwritten pattern
DE69423692T2 (de) Speech coding apparatus and method using classification rules
DE2541204A1 (de) Method for error detection and device for carrying out the method
DE19511470C1 (de) Method for determining a reference handwriting sample from a set of sample handwriting specimens by the same writer
DE19705757A1 (de) Method and apparatus for designing a highly reliable pattern recognition system
DE19511472C1 (de) Method for dynamic verification of handwriting against a reference handwriting sample
DE2513566A1 (de) Binary reference matrix
DE3246631C2 (de) Character recognition device
DE19933984C2 (de) Method for creating and/or updating dictionaries for automatic address reading
EP0891599A1 (fr) Method for automatically evaluating an address written on a document after it has been converted into digital data
EP1076896B1 (fr) Method and device for computer recognition of at least one keyword in spoken language
DE69625649T2 (de) Method for verifying signatures
EP2259210A2 (fr) Method and device for analyzing a database
EP2273383A1 (fr) Method and device for automatically searching for documents in a data storage device
DE69901324T2 (de) Device, method and storage medium for speaker recognition
EP0731955B1 (fr) Method and device for automatically capturing and identifying recorded information
Sorensen et al. Black-White Differences in the Occurrence of Job Shifts.
EP1758688A1 (fr) Method for automatically determining operational performance data
DE102009013390A1 (de) Method and device for classifying a physical object by means of a parameterized classifier

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 1997916350

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1997916350

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1997916350

Country of ref document: EP