KR101977206B1

KR101977206B1 - Assonantic terms correction system

Info

Publication number: KR101977206B1
Application number: KR1020170061079A
Authority: KR
Inventors: 최보람
Original assignee: 주식회사 한글과컴퓨터
Priority date: 2017-05-17
Filing date: 2017-05-17
Publication date: 2019-06-18
Anticipated expiration: 2037-05-17
Also published as: KR20180126262A

Abstract

The present invention relates to a similar-word correction system and correction method that calculate the frequency of the words and phrases used in a document or in a sentence containing multiple words, use that frequency to determine whether a similar word has been used in place of the intended one, and thereby identify misused words or phrases, then either correct them or report them to the user.
The similar-word correction system according to the present invention comprises: an output unit that outputs an electronic document; and a determination unit that extracts the text contained in the electronic document, detects similar unit pairs, judges the unit with the relatively lower occurrence frequency in each similar unit pair to be a typo, and corrects or marks the unit judged to be a typo for output.

Description

ASSONANTIC TERMS CORRECTION SYSTEM

The present invention relates to a similar-word correction system and method, and more particularly to a similar-word correction system and correction method that calculate the frequency of the words and phrases used in a document or in a sentence containing multiple words, use that frequency to determine whether a similar word has been used, identify misused words or phrases, and then either correct them or report them to the user.

A wide variety of programs and devices are used for entering and posting text, such as text messages, web pages, messengers, and word processors, and users now communicate with one another through text as much as through voice.

When such text is entered, keystroke mistakes, misunderstood spelling rules, grammatical errors caused by using similar words, and typos occur frequently, and these typos, misspellings, and similar words often go unnoticed and are used as they are. In particular, even when the author rereads the text repeatedly, the limits of human perception make it difficult to spot the wrong characters.

To relieve this inconvenience, existing devices and programs maintain a database of spelling rules and standard vocabulary, analyze the written text against the stored data, and either flag typos and grammatical errors or correct them automatically.

Such services and devices improve user convenience by automatically detecting mistyped characters, such as typos and incorrect particles, and flagging or correcting them, but they are limited in that they can detect only errors of form.

Specifically, they cannot detect or correct cases in which a correctly spelled but wrong word has been chosen.

Korean Patent Laid-Open Publication No. 10-2005-0026732 (published March 16, 2005)

Accordingly, an object of the present invention is to provide a similar-word correction system and correction method that calculate the frequency of the words and phrases used in a document or in a sentence containing multiple words, use that frequency to determine whether a similar word has been used, identify misused words or phrases, and then either correct them or report them to the user.

To achieve the above object, the similar-word correction system according to the present invention comprises: an output unit that outputs an electronic document; and a determination unit that extracts the text contained in the electronic document, detects similar unit pairs, judges the unit with the relatively lower occurrence frequency in each similar unit pair to be a typo, and corrects or marks the unit judged to be a typo for output.

The unit is delimited as any one of a phrase, a word, an eojeol (a space-delimited word segment), an individual character, and a particle.

The determination unit comprises: a unit detection unit that extracts a plurality of the units from the electronic document in the predetermined unit of division; a statistics creation unit that creates frequency data, including occurrence-frequency statistics, for each of the units extracted by the unit detection unit; and a typo determination unit that uses the frequency data created by the statistics creation unit to determine whether units are similar to one another and checks the occurrence frequency of similar units to decide whether a unit is a typo.

The typo determination unit judges units whose match rate is equal to or greater than a preset similarity reference value to be similar units.

The typo determination unit judges a unit of lower occurrence frequency that has been found similar to a unit of higher occurrence frequency, with higher and lower divided by a preset frequency reference value, to be a typo.

In addition, the similar-word correction method according to the present invention comprises: an output step in which an electronic document is output through an output unit; a unit extraction step in which a determination unit extracts the text contained in the electronic document; a unit pair detection step in which the determination unit detects similar unit pairs among the extracted units; a typo determination step in which the determination unit judges the unit with the relatively lower occurrence frequency in each similar unit pair to be a typo; and a correction step in which the determination unit corrects or marks the unit judged to be a typo for output.

The unit extraction step through the typo determination step comprise: a step in which a unit detection unit extracts a plurality of the units from the electronic document in the predetermined unit of division; a step in which a statistics creation unit creates frequency data, including occurrence-frequency statistics, for each of the extracted units; and a step in which a typo determination unit uses the frequency data created by the statistics creation unit to determine whether units are similar to one another and checks the occurrence frequency of the similar units to decide whether a unit is a typo.

In the deciding step, the typo determination unit judges units whose match rate is equal to or greater than a preset similarity reference value to be similar units.

In the deciding step, the typo determination unit judges a unit of lower occurrence frequency that has been found similar to a unit of higher occurrence frequency, with higher and lower divided by a preset frequency reference value, to be a typo.

The similar-word correction system and correction method according to the present invention calculate the frequency of the words and phrases used in a document or in a sentence containing multiple words, use that frequency to determine whether a similar word has been used, identify misused words or phrases, and then either correct them or report them to the user. This makes it possible to detect wrongly used words and phrases in addition to simple typos.

FIG. 1 is a diagram illustrating the configuration of a similar-word correction system according to the present invention.
FIG. 2 is a diagram illustrating a typo determination method according to the present invention.
FIG. 3 is a diagram illustrating an example of occurrence-frequency statistics created for the electronic document of FIG. 2.
FIG. 4 is a diagram illustrating a similar-word correction method according to the present invention.

Hereinafter, preferred embodiments of the present invention are described with reference to the accompanying drawings so that a person of ordinary skill in the art can readily practice the invention. Note that, wherever possible, the same reference numeral is used for the same component across the drawings. In describing the present invention, detailed descriptions of related well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention. Some features shown in the drawings are enlarged, reduced, or simplified for ease of explanation, and the drawings and their components are not necessarily drawn to scale. Those skilled in the art will nevertheless readily understand these details.

FIG. 1 is a diagram illustrating the configuration of a similar-word correction system according to the present invention.

Referring to FIG. 1, the similar-word correction system according to the present invention comprises a unit detection unit 10, a statistics creation unit 20, a typo determination unit 30, a storage unit 40, and an output unit 50.

The unit detection unit 10 separates the words or phrases contained in an electronic document executed through the processing unit 60 and passes them to the statistics creation unit 20. It divides the text by a predetermined unit of division: in addition to meaning-bearing units such as words and phrases, it may use units suited to typo detection, such as eojeols (space-delimited word segments), individual characters, and particles, and it passes the resulting units to the statistics creation unit 20. For example, given the sentence "현재 위치는 강남구 삼성동이고, 이적 위치는 강남구 대치동입니다." ("The current location is Samseong-dong, Gangnam-gu, and the transfer location is Daechi-dong, Gangnam-gu."), with the word as the unit of division, the unit detection unit 10 extracts the words in the sentence, such as "현재" ("current") and "위치" ("location"), as units and passes them to the statistics creation unit 20. The unit detection unit 10 may detect units across every sentence from the beginning to the end of the electronic document, or may perform unit detection over fixed portions such as pages; the invention is not limited to these examples.
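As a minimal sketch of the unit extraction described above, the following Python splits text into units at two of the granularities named in the description. The function name `extract_units` and its `granularity` argument are illustrative assumptions, not part of the disclosure. Word-level extraction that strips particles (so that "위치는" yields "위치") would require a Korean morphological analyzer, which this sketch does not attempt.

```python
import re


def extract_units(text: str, granularity: str = "eojeol") -> list[str]:
    """Split text into units for frequency counting (hypothetical helper)."""
    if granularity == "eojeol":
        # Runs of word characters separated by spaces or punctuation.
        return re.findall(r"\w+", text)
    if granularity == "char":
        # Individual characters, ignoring spaces and punctuation.
        return [ch for ch in text if ch.isalnum()]
    raise ValueError(f"unsupported granularity: {granularity}")


sentence = "현재 위치는 강남구 삼성동이고, 이적 위치는 강남구 대치동입니다."
print(extract_units(sentence))
print(extract_units("현재 위치", "char"))
```

The same function could be applied to the whole document or page by page, matching the two scopes the description allows.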

The statistics creation unit 20 calculates the occurrence frequency of each unit passed from the unit detection unit 10. Specifically, it groups the units it receives, tallies how many times each unit appears within a predetermined section or across the whole document, and produces the result as data in a form such as a lookup table. It then stores the lookup-table data in the storage unit 40 or passes it to the typo determination unit 30.
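The lookup table described above can be approximated with a plain dictionary of counts. The sketch below is an assumption about one possible shape of that data; the helper names `build_frequency_table` and `build_tables_per_section` and the sample units are hypothetical.

```python
from collections import Counter


def build_frequency_table(units: list[str]) -> dict[str, int]:
    """Count how many times each unit occurs (hypothetical helper)."""
    return dict(Counter(units))


def build_tables_per_section(sections: list[list[str]]) -> list[dict[str, int]]:
    """Build one table per predetermined section, for example per page."""
    return [build_frequency_table(units) for units in sections]


# Illustrative units only; not taken from the figures.
units = ["구청", "이전", "계획", "이전", "이적", "이전"]
table = build_frequency_table(units)
print(table)
```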

The typo determination unit 30 performs typo determination using the occurrence frequencies created by the statistics creation unit 20. That is, it uses the occurrence-frequency data to look for units similar to frequently occurring units, and when it finds a unit that resembles a high-frequency unit but itself has a very low occurrence frequency, it judges that unit to be a typo. For example, if the word "이적" ("transfer") is found in an electronic document explaining "주소지 이전" ("change of address"), the typo determination unit 30 checks it against the occurrence frequencies and judges it to be a typo. To decide whether an occurrence frequency is high or low, the typo determination unit 30 may use an upper reference value and a lower reference value, comparing units at or below the lower reference value against units at or above the upper reference value; these reference values may be set in advance. Alternatively, only one of the two may be specified and the comparison made with the other left undefined; the invention is not limited to the arrangement described. When a typo is found, the typo determination unit 30 outputs information about the typo through the output unit 50, or marks the unit so that the user can recognize the typo and outputs the marked information for the user to review. Separately from the occurrence-frequency reference values, a similarity reference value may be set so that the typo determination unit 30 can decide whether units are similar words. The typo determination unit 30 judges units whose similarity, that is, match rate, is at or above the similarity reference value to be similar to each other, and applies the occurrence-frequency test described above to those similar units to identify typos.
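The decision rule above combines two thresholds on frequency with one on similarity. The sketch below is one hedged reading of it: the description does not define how the match rate is computed, so `difflib.SequenceMatcher.ratio()` from the Python standard library is used here purely as a stand-in, and the function names, default values, and sample counts are assumptions. Passing `None` for `upper` or `lower` models the case where only one reference value is specified.

```python
from difflib import SequenceMatcher


def match_rate(a: str, b: str) -> float:
    """Stand-in similarity measure in [0, 1]; not defined by the source."""
    return SequenceMatcher(None, a, b).ratio()


def find_typos(freq, upper=None, lower=None, similarity=0.5):
    """Map each suspected typo to the frequent unit it resembles."""
    frequent = [u for u, n in freq.items() if upper is None or n >= upper]
    rare = [u for u, n in freq.items() if lower is None or n <= lower]
    suspects = {}
    for r in rare:
        for f in frequent:
            # Only a strictly rarer unit can be the typo of a pair.
            if r == f or freq[r] >= freq[f]:
                continue
            if match_rate(r, f) >= similarity:
                best = suspects.get(r)
                if best is None or freq[f] > freq[best]:
                    suspects[r] = f
    return suspects


# Hypothetical counts for illustration.
freq = {"이전": 12, "이적": 1, "이점": 1, "구청": 9, "계획": 7}
print(find_typos(freq, upper=5, lower=2, similarity=0.5))
```

With these sample counts, "이적" and "이점" are both paired with "이전", since each shares one of two syllables with it and falls at or below the lower reference value.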

The storage unit 40 stores the occurrence-frequency statistics created by the statistics creation unit 20 and the reference values used by the typo determination unit 30 in its determination.

The output unit 50 outputs and presents the electronic document, and outputs the typo information produced by the typo determination unit 30 or the units marked so that typos can be recognized.

FIG. 2 is a diagram illustrating a typo determination method according to the present invention, and FIG. 3 is a diagram illustrating an example of occurrence-frequency statistics created for the electronic document of FIG. 2.

Referring to FIGS. 2 and 3, as described above, the similar-word correction system according to the present invention detects typos that a conventional spelling check does not catch, namely cases where a standard word has been used incorrectly because a similar word was used in its place, and corrects them so that wrong words and phrases can be fixed.

To this end, the similar-word correction system and method of the present invention check the occurrence frequency of each unit across the document and compare high-frequency units with low-frequency units to identify likely typos and correct them.

Specifically, when an electronic document titled "구청 이전 계획" ("District Office Relocation Plan") is written as in FIG. 2, the unit detection unit 10 detects units in the electronic document by a predetermined unit of division, such as phrase, word, eojeol, or individual character, and passes them to the statistics creation unit 20.

The statistics creation unit 20 groups the units detected by the unit detection unit 10, tallies their occurrence frequencies, and records the result as data. The data may take the form of a lookup table as shown in FIG. 3, but the invention is not limited to this form.

Once the statistics creation unit 20 has created the occurrence-frequency data for the units, the typo determination unit 30 uses the predetermined reference values to detect similar words, that is, similar units, that are judged to be typos.

For example, when "이전" ("relocation") and "이적" ("transfer") appear together as shown in FIG. 3, it is very likely that one of them is actually a typo; the typo determination unit 30 identifies this and either corrects the unit or marks it for output so that the user can easily check it.

More specifically, when an electronic document titled "구청 이전 계획" ("District Office Relocation Plan") is written or reviewed by a user as shown in FIG. 2, units in the form of the word "이전" ("relocation") naturally occur frequently in the document. If a correctly spelled word such as "이적" ("transfer") or "이점" ("advantage") is used instead, a conventional spelling check cannot detect it.

In the present invention, however, when the word "이점" or "이적" is detected, its frequency of use is checked, and the document is searched for a similar word that is used frequently.

Here, the frequently used word "이전" is identified as in FIG. 3, and its frequency is compared with that of "이점" or "이적". If they are typos, "이점" and "이적" will be found to occur only a few times, and the typo determination unit 30 judges units such as "이점" and "이적", which resemble a frequently used word but occur rarely, to be typos. As described above, the units identified as typos are then marked and output through the output unit 50, or the document is corrected, making it possible to find and fix typos that a spelling check does not catch.
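The description leaves open how the match rate between words such as "이전" and "이적" is measured. As a hedged illustration only, the sketch below contrasts two possibilities: comparing whole Hangul syllables, and comparing the letters (jamo) obtained by Unicode NFD decomposition. Both function names are hypothetical, and `difflib.SequenceMatcher.ratio()` is an assumed stand-in for the match rate.

```python
import unicodedata
from difflib import SequenceMatcher


def syllable_match_rate(a: str, b: str) -> float:
    """Compare whole syllable blocks (assumed measure)."""
    return SequenceMatcher(None, a, b).ratio()


def jamo_match_rate(a: str, b: str) -> float:
    """Compare decomposed jamo, so a single wrong letter costs less."""
    a_jamo = unicodedata.normalize("NFD", a)
    b_jamo = unicodedata.normalize("NFD", b)
    return SequenceMatcher(None, a_jamo, b_jamo).ratio()


for other in ("이적", "이점", "구청"):
    print(other, syllable_match_rate("이전", other), jamo_match_rate("이전", other))
```

At the syllable level "이전" and "이적" share one of two syllables, while at the jamo level they share four of five letters, so the choice of measure affects where a similarity reference value would sensibly be set.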

FIG. 4 is a diagram illustrating a similar-word correction method according to the present invention.

Referring to FIG. 4, the similar-word correction method according to the present invention comprises an output step S10, a unit extraction step S20, a unit pair detection step S30, a typo determination step S40, and a correction step S50.

The output step S10 is the step in which the electronic document is output through the output unit 50. In this step, an electronic document written by the user, or one the user wishes to review, is output. The electronic document contains a plurality of pieces of text.

The unit extraction step S20 is the step in which, as described above, the unit detection unit 10 extracts units from the electronic document by a predetermined unit of division and passes them to the statistics creation unit 20. The units extracted in step S20 may be the smallest meaning-bearing units, such as words or phrases, but may also be extracted by formal units such as individual characters, eojeols, or particles; the unit of division is set in advance. The extracted units are collected by the statistics creation unit 20, which calculates the occurrence frequency of each unit.

The unit pair detection step S30 is the step in which the typo determination unit 30, using a predetermined similarity reference value, compares the units above the frequency reference value in the occurrence frequencies created by the statistics creation unit 20 with the units below it and detects those that are similar. Once the occurrence-frequency data has been calculated, the typo determination unit 30 searches for units resembling high-frequency units, comparing low-frequency units first to detect unit pairs. Units whose match rate is at or above the preset similarity reference value are identified as similar units. By comparing only high-frequency units against low-frequency units in this way, similar units, and in particular units judged to have been written incorrectly, can be detected quickly without comparing every unit with every other.
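To illustrate why restricting the comparison to high-frequency versus low-frequency units saves work, the sketch below enumerates only those cross pairs and compares their number with the count of all unordered pairs. The function name `candidate_pairs`, the single `frequency_reference` value, and the sample counts are assumptions made for illustration.

```python
from itertools import combinations


def candidate_pairs(freq: dict[str, int], frequency_reference: int):
    """Pair each low-frequency unit with each high-frequency unit."""
    frequent = [u for u, n in freq.items() if n >= frequency_reference]
    rare = [u for u, n in freq.items() if n < frequency_reference]
    return [(r, f) for r in rare for f in frequent]


# Hypothetical counts for illustration.
freq = {"이전": 12, "구청": 9, "이적": 1, "이점": 1,
        "부지": 2, "예산": 3, "주민": 2, "일정": 1}
pairs = candidate_pairs(freq, frequency_reference=5)
all_pairs = list(combinations(freq, 2))
print(len(pairs), "cross pairs versus", len(all_pairs), "total pairs")
```

The saving depends on how the units split around the reference value; it is largest when only a few units clear it, which is the situation this sample models.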

The typo determination step S40 is the step in which the typo determination unit 30 checks the occurrence frequencies of similar units and judges one of them to be a typo. That is, among the units found similar to a unit whose occurrence frequency is at or above the upper reference value, the typo determination unit 30 judges the units whose occurrence frequency is at or below the lower reference value to be typos.

The correction step S50 is the step in which the typo determination unit 30 marks the unit judged to be a typo so that it is easy to spot in the electronic document, or corrects it automatically. The marking may change attributes such as color, size, shape, or background color so that the user can easily see it, but the invention is not limited to this: the units may instead be collected into a list, or confirmation may be requested through a separate window.
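The two outcomes of the correction step, marking or automatic correction, can be sketched on plain text as follows. The function name `apply_typo_handling` and the bracket notation are assumptions; the bracket marker merely stands in for the color, size, or background changes the description mentions. Note that this sketch matches substrings, so a suspect that appears inside a longer word is also rewritten.

```python
import re


def apply_typo_handling(text: str, suspects: dict[str, str],
                        auto_correct: bool = False) -> str:
    """Mark suspected typos, or replace them when auto_correct is set."""
    if not suspects:
        return text
    # Longest suspects first so a shorter one cannot pre-empt a longer one.
    ordered = sorted(suspects, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in ordered))

    def replace(match: re.Match) -> str:
        wrong = match.group(0)
        right = suspects[wrong]
        return right if auto_correct else f"[{wrong} -> {right}?]"

    return pattern.sub(replace, text)


suspects = {"이적": "이전"}  # hypothetical result of the determination step
print(apply_typo_handling("구청 이적 계획", suspects))
print(apply_typo_handling("구청 이적 계획", suspects, auto_correct=True))
```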

The technical idea of the present invention has been shown and described above by way of specific embodiments, but the invention is not limited to the same configurations and operations as those embodiments, and various modifications may be made without departing from the scope of the invention. Such modifications are therefore to be regarded as falling within the scope of the invention, and the scope of the invention is to be determined by the claims that follow.

10: unit detection unit
20: statistics creation unit
30: typo determination unit
40: storage unit
50: output unit

Claims

A similar-word correction system comprising:
an output unit that outputs an electronic document; and
a determination unit that extracts the text contained in the electronic document, detects similar unit pairs, judges the unit with the relatively lower occurrence frequency in each similar unit pair to be a typo, and corrects or marks the unit judged to be a typo for output,
wherein the determination unit comprises:
a unit detection unit that extracts a plurality of the units from the electronic document in a predetermined unit of division;
a statistics creation unit that creates frequency data, including occurrence-frequency statistics, for each of the units extracted by the unit detection unit; and
a typo determination unit that uses the frequency data created by the statistics creation unit to determine whether a typo is present, both for similar unit pairs containing a morphological error and for similar unit pairs that contain no morphological error and conform to spelling rules,
wherein the unit is delimited as any one of a phrase, a word, an eojeol, an individual character, and a particle.

(Deleted)

The system according to claim 1,
wherein the typo determination unit judges units whose match rate is equal to or greater than a preset similarity reference value to be similar units.

The system according to claim 1,
wherein the typo determination unit judges a unit of lower occurrence frequency that has been found similar to a unit of higher occurrence frequency, with higher and lower divided by a preset frequency reference value, to be a typo.

A similar-word correction method comprising:
an output step in which an electronic document is output through an output unit;
a unit extraction step in which a determination unit extracts the text contained in the electronic document;
a unit pair detection step in which the determination unit detects similar unit pairs among the extracted units;
a typo determination step in which the determination unit judges the unit with the relatively lower occurrence frequency in each similar unit pair to be a typo; and
a correction step in which the determination unit corrects or marks the unit judged to be a typo for output,
wherein the unit extraction step comprises
extracting a plurality of the units from the electronic document in a predetermined unit of division,
wherein the unit pair detection step comprises
determining whether the units are similar to one another based on frequency data, including occurrence-frequency statistics, created for each of the extracted units, and detecting similar unit pairs,
wherein the typo determination step comprises
checking the occurrence frequency of the similar units to determine whether a typo is present, both for similar unit pairs containing a morphological error and for similar unit pairs that contain no morphological error and conform to spelling rules, and
wherein the unit is delimited as any one of a phrase, a word, an eojeol, an individual character, and a particle.

(Deleted)

The method according to claim 6,
wherein the unit pair detection step
judges units whose match rate is equal to or greater than a preset similarity reference value to be similar units.

The method according to claim 6,
wherein the typo determination step
judges a unit of lower occurrence frequency that has been found similar to a unit of higher occurrence frequency, with higher and lower divided by a preset frequency reference value, to be a typo.