KR101004141B1

KR101004141B1 - Structural Calculation Document Checking Method through XSD Conversion and Schema Matching of Text File

Info

Publication number: KR101004141B1
Application number: KR1020090015496A
Authority: KR
Inventors: 이상호; 김봉근; 박상일; 안현정
Original assignee: 연세대학교 산학협력단
Priority date: 2009-02-24
Filing date: 2009-02-24
Publication date: 2010-12-27
Anticipated expiration: 2029-02-24
Also published as: KR20100096574A

Abstract

To review design items missing from a user-prepared structural calculation document, the present invention prepares a standard information model for structural calculation documents that reflects standardized design review items, defines the hierarchy level of each item in the user-prepared document, converts the document into an XSD (XML Schema Definition) file, and then applies an XML schema matching technique against the standard information model. This determines whether any design items are missing from a structural calculation document prepared in practice.

The proposed method is expected to improve the efficiency and accuracy of design review for structural calculation documents.

Schema matching, design items, civil engineering, structural calculation document, tree, conversion, XSD

Description

METHODS FOR ANALYZING AND SCHEMA MATCHING STRUCTURAL CALCULATION DOCUMENT CONVERTED INTO XML

The present invention relates to the field of construction technology information systems. More specifically, it relates to a method that converts the text of a user-prepared structural calculation document into a document in XML (Extensible Markup Language) or XSD (XML Schema Definition) format and performs schema matching against the XML data of a standardized structural calculation document, thereby checking whether any items are missing from the user-prepared document.

A bridge's structural calculation document is a key design information document that contains the design conditions and the results of the structural checks performed during design. It is therefore the basic reference for evaluating the safety of the structure, and it remains highly useful for the subsequent upkeep, management, and repair of the bridge. For this reason the Korea Infrastructure Safety and Technology Corporation has archived design document information under the 'Special Act on Safety Management of Facilities' (Ministry of Construction and Transportation, 1995), and has recently begun using STEP and XML technologies to make that information more usable (Korea Infrastructure Safety and Technology Corporation, 2004).

In addition, various programs for bridge design automation, such as AASHTOWare, ANSYS CivilFEM Bridge, ARoad, MIDAS, and LUSAS Bridge, have recently been developed in Korea and abroad and are widely used. These programs embed the design criteria referenced in bridge design in database form, so that during design they can present the relevant information to the user when design criteria are reviewed, or serve as recommendations when the values of design variables are chosen.

However, such review follows procedures programmed in advance; design documents prepared without such programs are still reviewed manually.

Because construction informatization generally aims to raise work efficiency by ensuring a smooth flow of information between the business processes of a construction project, it is important that each item be organized according to standardized guidelines for preparing structural calculation documents.

At present, however, each company builds and uses its own structural calculation review checklist suited to its own circumstances, and the items, formats, and criteria differ from company to company. Finding missing items by comparing the individual items of structural calculation documents is therefore very difficult in itself.

Accordingly, the present invention aims to find missing items in a user-prepared structural calculation document automatically, by converting the document to XSD (XML Schema Definition) and running a schema matching process against a predefined standard information model for structural calculation documents.

To this end, the present invention first proposes a conversion module that automatically converts a user-prepared structural calculation document in text form into a document in XML or XSD format.

It further proposes a comparative analysis module that takes the XSD produced by the conversion module from the user-prepared document and extracts the missing items using a schema matching technique, with the standard structural calculation document data as the reference.

To achieve the above object, the present invention provides a method of checking a structural calculation document in text file format, where the document is an ordered, finite set of strings divided into a finite number of lines and the string set S^i of the i-th line is defined by Equation 1. The method comprises:

a document conversion step that converts the document to XSD (XML Schema Definition) by: sequentially storing, from the string set S^i, the string information of each line in a temporary table, separated into heading symbols, heading, content, and reference; a level assignment step of using the heading-symbol information in the stored temporary table to assign each heading the level it occupies in the document's tree structure; and an XML file generation step of generating an XML file in depth-first order of the tree, using the level information and the information stored in the temporary table; and

a missing-item determination step of determining whether items are missing from the structural calculation document by performing a schema matching process that compares the XSD data with the XML data of the standardized structural calculation document.
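The missing-item determination can be pictured as a comparison of heading paths between two trees. The following is a minimal sketch under stated assumptions, not the matching procedure of the invention: both documents are assumed to be available as XML trees built from a hypothetical `<item title="...">` element, and the sample headings are made up for illustration. It reports every heading path that the standard model contains but the user's document lacks.

```python
import xml.etree.ElementTree as ET


def title_paths(root):
    """Return the set of heading paths (tuples of titles) below root."""
    paths = set()

    def walk(node, prefix):
        for child in node.findall("item"):
            path = prefix + (child.get("title"),)
            paths.add(path)
            walk(child, path)

    walk(root, ())
    return paths


def missing_items(standard_xml, user_xml):
    """List heading paths present in the standard but absent from the user document."""
    standard = title_paths(ET.fromstring(standard_xml))
    user = title_paths(ET.fromstring(user_xml))
    return sorted(standard - user)


# Hypothetical standard model and user document.
STANDARD = """<doc>
  <item title="Design conditions">
    <item title="Materials"/>
    <item title="Dead load"/>
  </item>
  <item title="Slab design"/>
</doc>"""

USER = """<doc>
  <item title="Design conditions">
    <item title="Materials"/>
  </item>
  <item title="Slab design"/>
</doc>"""

if __name__ == "__main__":
    print(missing_items(STANDARD, USER))
```

Comparing whole paths rather than bare titles keeps a heading that is present under the wrong parent from masking a genuinely missing item.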

The level assignment step of the document conversion step preferably includes: determining whether a heading symbol exists for the heading; and determining whether the pure-heading string within the heading qualifies as a heading.

The heading string h^i is preferably defined by Equations 2, 3a, and 3b.

Let HS_ID be the set of predefined strings of group ID used to mark a heading. When HS_ID ⊂ Σ^+ for all ID, the condition for hs^i ≠ ∅ is preferably defined by Equation 4.

Let X_f be the set of prohibited characters that must not appear f or more times in the string S^i. For x_f ∈ X_f, the condition for hc^i = ∅ is preferably defined by Equation 5.
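The prohibited-character condition can be read as a simple count threshold. Equation 5 itself is not reproduced in this text, so the sketch below only illustrates the stated idea, that a line has no pure heading when some prohibited character occurs f or more times. The character set and thresholds in `PROHIBITED` are hypothetical.

```python
# Hypothetical prohibited characters X_f: each character maps to its threshold f.
PROHIBITED = {"=": 1, ",": 3}


def pure_heading_is_empty(line, prohibited=PROHIBITED):
    """Return True when some prohibited character occurs f or more times in line."""
    return any(line.count(ch) >= f for ch, f in prohibited.items())


if __name__ == "__main__":
    for sample in ["1.1 Dead load", "Mu = 1.3 Md"]:
        print(sample, "->", pure_heading_is_empty(sample))
```

A threshold of 1 on "=" treats any line containing an equation as content rather than as a heading.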

Let B_0 be the set of characters representing left brackets, with B_0 ⊂ HS_ID; let B_c be the set of characters representing right brackets; and let <a, b> be the set of matching bracket pairs of the same kind, with a ∈ B_0 and b ∈ B_c. Let the set of delimiters marking the end of a heading be D_e = {B_c, C_e}, where C_e is the set of characters used to separate heading from content and B_c ∩ C_e = ∅. Then l is preferably defined by Equation 6.

The reference string r^i is preferably defined by Equations 7 and 8.

Let RN be the set whose elements are predefined strings naming references. For an arbitrary string set trn^i_{ab} = s_a s_{a+1} ... s_b with trn^i_{ab} ⊂ (S^i)^* and 2 ≤ a ≤ b ≤ n, rn^i is preferably defined by Equation 9.

Let RS be the set of predefined reference start delimiters. When rn^i ≠ ∅, with trs^i_β = s_{α-β} and β = min(δ), m+1 is preferably defined by Equation 10.

When rn^i ≠ ∅, let rp^i = s_{b+1} s_{b+2} ... s_c, and let RE be the set of predefined delimiters marking the end of a reference. Then c is preferably defined by Equation 11.

The reference heading-symbol set BS^d = {bs^d_1, bs^d_2, ..., bs^d_n, ...} is ordered, and each bs^d_n appears only once in the document. When the heading symbol of a table-of-contents entry belonging to BS that appears on an arbitrary line i is denoted bs^i_n, n always increases as i increases. Each BS must match exactly one fixed depth, and when several BS are defined, their depths increase sequentially. Let BS^d denote a group defined as a reference heading-symbol set, and let D_c denote the level that the elements of BS occupy in the document. Then the level D_i that the i-th table-of-contents entry occupies in the tree is preferably defined by Equations 12 and 13.

The present invention provides a method for converting the unstructured text of a structural calculation document, a representative engineering document in civil engineering, into a semi-structured XML document in tree form.

In civil engineering, a structural calculation document records the process of designing a facility. For bridges, which have long service lives, it is important material that is consulted continually during maintenance. The extraction method of the present invention can therefore be significant not only in the design and construction stages of a structure but also in the maintenance stage.

Document structure can be defined from several viewpoints. Broadly, it divides into the structure exhibited by the document itself and the structure according to what the text paragraphs mean.

FIG. 1 is a conceptual diagram of this division of document structure.

The structure of the document itself can in turn be divided into physical structure and logical structure (Summers, 1998; Worring and Smeulders, 1999).

Physical structure refers to geometric features such as the placement of text, drawings, and tables, the page layout, and the weight of the typeface. Logical structure refers to the organizational scheme of the document, such as author, document title, abstract, body, and references, together with the structure of the detailed headings and contents that the author has set out within the body.

Wang et al. (2005) defined only the organizational scheme of the document as logical structure and defined semantic structure separately, dividing it further into apparent semantic structure and latent semantic structure.

Apparent semantic structure refers to the structure of headings and contents within the body described above, and latent semantic structure is defined as the division of meaning implied by the paragraphs of the document.

The present invention focuses on conversion into a semi-structured document that follows the apparent semantic structure.

Extracting the apparent semantic structure from an imaged document proceeds by segmenting each text paragraph in the image into blocks while storing their physical properties, and then classifying the hierarchy of the body content using the stored physical properties and a predefined knowledge base (Summers, 1998; Altamura et al., 2001; Klink and Kieninger, 2001; Anjewierden, 2001).

Research on building semantic structure mostly focuses on extracting the latent semantic structure, and two broad approaches are used.

One, as in Adelberg (1998), has the user decompose the document by hand through a GUI and apply markup that carries specific knowledge-level meaning. The other, as in Salton et al. (1997), vectorizes term frequencies.

Wang et al. (2005) extracted the apparent semantic structure by classifying levels using the heading symbols that appear in headings within the body together with a predefined knowledge base.

However, these methods for extracting the apparent semantic structure have several problems that prevent their direct application to the structural calculation documents targeted by the present invention. These problems are described below together with the characteristics of such documents.

Like engineering documents in general, a structural calculation document contains drawings and tables in addition to text.

As shown by Worring and Smeulders (1999) for drawings in documents and by Embley et al. (2006) and Kawanaka et al. (2007) for tables, extracting information from drawings and tables requires additional methods that take a different viewpoint from research on analyzing document structure.

The present invention therefore analyzes document structure using the text content only.

Moreover, rather than offering a complete replacement for document structure extraction methods based on physical structure, the present invention is meant as a complementary method that overcomes the problems existing techniques can encounter when applied to structural calculation documents.

Among the structural calculation documents of the various technical fields, the preferred embodiment of the present invention was applied to those for steel girder bridges, and the detailed description below centers on them.

To analyze the characteristics of structural calculation documents prepared in practice, structural calculation files for the superstructures of steel girder bridges were collected with the cooperation of domestic engineering companies. Because the programs and document styles used may differ from company to company, 22 previously designed documents from six representative engineering companies were analyzed.

Based on this analysis of structural calculation documents prepared and used in practice at domestic engineering companies, their structural characteristics can be summarized as follows.

Like an ordinary document, the body of a structural calculation document divides broadly into headings and contents.

A heading tells the reader which category the content appearing under it belongs to.

The logical structure of the headings defined in a document can be represented as a tree.

A tree is a kind of graph: a directed graph with a single vertex, called the root, that has no incoming edge. The level of a node in the tree is the number of edges on the path from the root to that node.
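As a small illustration of levels in a heading tree, the sketch below takes the level of a node to be the number of edges on the path from the root to it and computes levels with a breadth-first walk. The table of contents in `TOC` is a made-up example, and the adjacency-dictionary representation is an assumption chosen for brevity.

```python
from collections import deque


def node_levels(children, root):
    """Map each node to its level, i.e. the number of edges from the root."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            level[child] = level[node] + 1
            queue.append(child)
    return level


# Hypothetical table of contents: parent heading -> child headings.
TOC = {
    "document": ["1. Design conditions", "2. Slab design"],
    "1. Design conditions": ["1.1 Materials", "1.2 Loads"],
    "1.2 Loads": ["(1) Dead load"],
}

if __name__ == "__main__":
    for name, lvl in node_levels(TOC, "document").items():
        print(lvl, name)
```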

In a long document such as a structural calculation document, the author generally attaches some symbol to each heading to tell the reader the level at which the heading sits in the document's tree structure.

The symbols used to mark table-of-contents headings are therefore the most important clue for extracting the logical structure.

However, the existing methods described above have three major problems when applied to the structural calculation documents targeted by the present invention.

The first problem is the assumption that identical heading symbols within one document represent the same level in the apparent semantic structure (Wang et al., 2005).

This assumption holds for most ordinary documents. In structural calculation documents, however, the same symbol often denotes different levels even within a single document, and because each company uses different symbols, a particular symbol cannot be taken to denote a particular level.

The second problem is that the knowledge-base and machine-learning methods used to process imaged documents mostly focus on features of the document format.

In most engineering practice, however, there is no standard document format for preparing structural calculation documents, so the formats of documents produced by different companies are of limited use for extracting the apparent semantic structure.

Finally, in structural calculation documents many items that could be regarded as part of the apparent logical structure appear on the same line as content, with nothing in the physical format to distinguish them.

In such cases, analyzing the document with rules on how strings are composed, rather than by document format, may be better suited to structural calculation documents.

Because XML supports user-defined markup and is platform-independent, it remains widely used for exchanging information between heterogeneous systems, not only for document information but in many other fields.

An element used for markup in an XML document consists of the element name, the attributes assigned to the element and their values, and the element content; the content may in turn consist of text containing further elements (Bray et al., 2006).

An element that appears inside another is thus a child of the enclosing element, and in this way the document content can be represented as a tree.
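The element model described above (name, attributes with values, and content that may nest further elements) can be illustrated with Python's standard `xml.etree.ElementTree` module. The element name `item`, the attributes `symbol` and `title`, and the sample text are illustrative assumptions, not names taken from the invention or from the Construction CALS/EC schema.

```python
import xml.etree.ElementTree as ET

# Build a two-level heading tree; each child element nests inside its parent.
root = ET.Element("document")
chapter = ET.SubElement(root, "item", {"symbol": "1.", "title": "Design conditions"})
section = ET.SubElement(chapter, "item", {"symbol": "1.1", "title": "Materials"})
section.text = "illustrative content"


def outline(elem, depth=0):
    """Return the titles below elem, indented by nesting depth."""
    lines = []
    for child in elem:
        lines.append("  " * depth + child.get("title"))
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(outline(root)))
    print(ET.tostring(root, encoding="unicode"))
```

Because the same element name is reused at every depth, the nesting itself carries the level, so the tree can grow to any depth without new element names.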

In Korea, the Construction CALS/EC Electronic Document Pool (Korea Institute of Construction Technology, 2004) was established as a group standard as part of the Construction CALS/EC project (Ministry of Construction and Transportation, 2003) for the smooth sharing and exchange of information in the construction industry.

It provides XML schemas for the structure of 220 document types, including structural calculation documents.

For the structural calculation document targeted by the present invention, the schema includes markup for document management, such as the name of the ordering agency, the name of the authoring organization, and the submission date. As elements for representing the document's tree structure, it provides markup for four tiers: part, chapter, section, and clause.

However, this chapter- and section-oriented markup can represent the depth of the document tree only to a limited extent.

The present invention therefore focuses on representing, as is, every level of hierarchy at which the document's headings can appear.

When the bridge type and the design method are the same, the document structure of structural calculation documents follows nearly the same pattern.

The preferred embodiment of the present invention is described concretely below using steel girder bridge structural calculation documents as the example.

FIG. 2 shows part of the headings of a steel girder bridge structural calculation document as an example.

As FIG. 2 shows, what the collected steel girder bridge structural calculation documents state as headings falls into three broad types: "activity names", "part names", and "variable names".

An "activity name" denotes a detailed activity involved in performing the structural calculation, such as "slab design", "main girder design", "splice design", or "section check". It is mostly used together with a "part name" or appears repeatedly under a "part name".

A "part name" denotes an element, space, or type that physically makes up the structure, such as "cantilever part", "center of the first span", or "splice - 1". Part names appear repeatedly, at least once.

Finally, a "variable name" mostly appears when describing the conditions given for performing an activity or the final result after performing it, and includes "bridge specifications", "materials used", "dead load", and "live load". Variable names appear repeatedly under "part names" when the parts are of the same type.

This pattern appeared in common across all the steel bridge documents examined for the present invention.

The table of contents used in steel girder bridge structural calculation documents can therefore be used to give meaning to the detailed text.

That is, the table-of-contents information itself provides a clue for finding which activity or which part a detailed piece of text belongs to. For example, it distinguishes whether the same word "length" means the length of the bridge or the length of a particular girder.

FIG. 3 shows the process of generating a semi-structured XML document from a text document.

To exclude errors that arise when recognizing characters from an imaged document, the present invention uses as input a document saved as a text file from the word-processing program used in practice.

As FIG. 3 shows, the input text file is converted into a semi-structured XML document in three main steps.

First, following the document model extracted from the collected steel girder bridge structural calculation documents, the string information of each line of the text file is stored sequentially in a temporary table, separated into heading symbol, heading, content, and reference.

Second, the heading-symbol information in the stored temporary table is used to assign each heading the level it occupies in the document's tree structure. Finally, an XML file is generated from the level information and the information stored in the temporary table.
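The three steps can be sketched end to end as follows. This is a simplified illustration rather than the procedure of the invention: the heading-symbol patterns, the fixed mapping from pattern to level, the ":" heading delimiter, the bracketed reference syntax, and the `<item>` element are all assumptions made for the example, and the sample text is invented.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical heading-symbol patterns; list position + 1 is the assumed level.
SYMBOL_PATTERNS = [
    re.compile(r"^\d+\.\s+"),      # "1. "  -> level 1
    re.compile(r"^\d+\.\d+\s+"),   # "1.1 " -> level 2
    re.compile(r"^\(\d+\)\s*"),    # "(1)"  -> level 3
]
REFERENCE = re.compile(r"\[([^\]]+)\]\s*$")  # hypothetical "[...]" at line end


def parse_lines(text):
    """Step 1: store each line as a row of symbol, title, content, reference."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        row = {"symbol": "", "title": "", "content": "", "reference": "", "kind": None}
        ref = REFERENCE.search(line)
        if ref:
            row["reference"] = ref.group(1)
            line = line[:ref.start()].rstrip()
        row["content"] = line
        for kind, pattern in enumerate(SYMBOL_PATTERNS):
            match = pattern.match(line)
            if match:
                title, _, content = line[match.end():].partition(":")
                row.update(symbol=match.group().strip(), title=title.strip(),
                           content=content.strip(), kind=kind)
                break
        rows.append(row)
    return rows


def assign_levels(rows):
    """Step 2: give every heading row a level; content-only rows get None."""
    for row in rows:
        row["level"] = None if row["kind"] is None else row["kind"] + 1
    return rows


def build_xml(rows):
    """Step 3: emit nested <item> elements in document (depth-first) order."""
    root = ET.Element("document")
    stack = [root]  # currently open elements, root first
    for row in rows:
        if row["level"] is None:
            # Content-only line: append to the most recently opened heading.
            stack[-1].text = (stack[-1].text or "") + row["content"] + "\n"
            continue
        del stack[row["level"]:]  # close headings at this level or deeper
        elem = ET.SubElement(stack[-1], "item", symbol=row["symbol"], title=row["title"])
        if row["reference"]:
            elem.set("reference", row["reference"])
        if row["content"]:
            elem.text = row["content"] + "\n"
        stack.append(elem)
    return root


SAMPLE = """\
1. Design conditions
1.1 Materials: steel grade A [Design Code 3.2]
1.2 Loads
(1) Dead load: value B
2. Slab design
"""

if __name__ == "__main__":
    tree = build_xml(assign_levels(parse_lines(SAMPLE)))
    print(ET.tostring(tree, encoding="unicode"))
```

The fixed pattern-to-level mapping is the weakest assumption here, since the text notes that one symbol may denote different levels in real documents; keeping level assignment in its own function makes that step replaceable without touching parsing or XML generation.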

The document model and the hierarchy rules used in this process are defined below.

The following describes the document model defined for parsing the text of a steel bridge structural calculation document.

First, some notation is defined as follows (Lopresti, 2000; Linz, 2005).

1) A string set S = s_1 s_2 ... s_n is a finite, ordered set of characters, where s_k ∈ Σ for 1 ≤ k ≤ n and Σ is the set of all characters, including whitespace, handled in the given document.

2) A superscript * after the name of a character set denotes the power set of that character set (e.g., Σ^* is the power set of Σ).

3) A superscript + after the name of a character set denotes the power set of that character set excluding the empty string λ.

4) n_{s_k}(S) denotes the number of occurrences of the character s_k in the string set S.

5) In A ::= B, the symbol "::=" means that the former (A) is expressed by the latter (B).

6) The symbol "|" means "or".

7) The symbol "|S|" denotes the number of characters in the string S.
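Two of these notations map directly onto built-in string operations. As a small illustration (the sample string is made up, and the helper name `n` simply mirrors the notation), n_{s_k}(S) corresponds to counting a character and |S| to the string length:

```python
S = "1.1 Dead load"  # a made-up line of text


def n(ch, s):
    """n_{s_k}(S): number of occurrences of the character ch in the string s."""
    return s.count(ch)


if __name__ == "__main__":
    print(n("1", S), n(".", S), len(S))
```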

수집된 강교 구조계산서 분석을 통해 구조계산서의 텍스트 문서정보의 특성을 정의하면 다음과 같다. The characteristics of the text document information of the structural statement through the analysis of the collected steel bridge structural statements are as follows.

강거더교 구조계산서의 텍스트 정보의 구성요소는 다음의 정의 1과 같이 정 의될 수 있다. The components of the textual information in the structure of the girder bridge can be defined as in definition 1 below.

Definition 1. (components of text information) A structural calculation text document is a finite ordered string divided into a finite number of lines, and the string Sⁱ of the i-th line is composed as follows.

Sⁱ ::= hⁱ cⁱ rⁱ

Here hⁱ is the heading string, hⁱ = s₁s₂...s_l; cⁱ is the content string, cⁱ = s_{l+1}s_{l+2}...s_m; and rⁱ is the reference string, rⁱ = s_{m+1}s_{m+2}...s_n, with 0≤l≤m≤n.

As Definition 1 shows, the text information consists of a heading, content, and a reference. These appear in five combinations: a line consisting of a heading only, a line consisting of content only, a line in which heading, content, and reference appear together, a line consisting of a heading and a reference, and a line consisting of content and a reference.

The content of a document can be laid out in many different forms, and finding rules for these layouts is very difficult.

Therefore, to identify the components, rules are defined that recognize the heading and the reference in an input text line, and any string that satisfies neither rule is classified as content.

The rules for identifying headings and references are given in Definition 2 and Definition 3, respectively.

The heading string defined in Definition 1 is in turn composed as in Definition 2 below.

Definition 2. (components of headings) The heading string hⁱ defined in Definition 1 is composed as follows.

Here hsⁱ is the string used to mark the heading (the heading symbol), hsⁱ = s₁s₂...s_o, hsⁱ ⊂ ∑⁺; hcⁱ is the string of the heading proper, hcⁱ = s_{o+1}s_{o+2}...s_p; and hdⁱ is the delimiter that marks the end of hcⁱ, hdⁱ = s_l. The relationship among o, p, and l is as follows.

As Definition 2 states, the necessary and sufficient condition for hⁱ≠ø is set to hsⁱ≠ø ∧ hcⁱ≠ø. By De Morgan's law, hⁱ=ø whenever hsⁱ=ø or hcⁱ=ø in a given string, and this is used to decide whether a heading string is present.

The condition for hsⁱ≠ø and the condition for hcⁱ=ø are defined in Definition 2a and Definition 2b, respectively.

Definition 2a. (heading symbols) Let HS_ID be the set of predefined string groups, indexed by group ID, that are used to mark headings, with HS_ID ⊂ ∑⁺ for every ID. The condition for hsⁱ≠ø is as follows.

Definition 2b. (heading contents) Let X_f be the set of forbidden characters that must not appear f or more times in a string, and let x_f ∈ X_f. The condition for hcⁱ=ø is as follows.

Definition 2a gives the condition for deciding whether a heading symbol is present, and Definition 2b gives the condition for deciding whether the string of the heading proper actually qualifies as a heading.

When a string that resembles a heading-symbol pattern, such as "1.2" or "2.3", appears at the start of a line, the presence of a heading symbol alone cannot decide whether the line is a heading.

Definition 2b covers this case. It captures the general observation that, in a structural calculation document, a line beginning with a decimal number is usually followed by arithmetic expressions.

Tables 1 and 2 list the heading-symbol sets and the forbidden-character sets, respectively, that were predefined from the analysis of the collected documents for the application in Chapter 5.

| Category | ID | HS_ID |
|---|---|---|
| Number group 1 | 1 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, … |
| Number group 2 | 2 | 1.1, 1.2, …, 2.1, 2.2, …, 3.1, 3.2, … |
| Number group 3 | 3 | 1.1.1, 1.1.2, …, 1.2.1, 1.2.2, …, 2.1.1, 2.1.2, … |
| Number group 4 | 4 | 1.1.1.1, 1.1.1.2, …, 1.1.2.1, 1.1.2.2, …, 1.2.1.1, 1.2.1.2, … |
| Number group 5 | 5 | 1-1, 1-2, …, 2-1, 2-2, …, 3-1, 3-2, … |
| Number group 6 | 6 | 1-1-1, 1-1-2, …, 1-2-1, 1-2-2, …, 2-1-1, 2-1-2, … |
| Number group 7 | 7 | 1-1-1-1, 1-1-1-2, …, 1-1-2-1, 1-1-2-2, …, 2-1-1-1, … |
| Number group 8 | 8 | Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ, Ⅵ, Ⅶ, Ⅷ, Ⅸ, Ⅹ, … |
| Number group 9 | 9 | ⅰ, ⅱ, ⅲ, ⅳ, ⅴ, ⅵ, ⅶ, ⅷ, ⅸ, ⅹ, … |
| English symbol group 1 | 10 | A, B, C, D, …, Z |
| English symbol group 2 | 11 | a, b, c, d, …, z |
| Hangul symbol group 1 | 12 | 가, 나, 다, 라, …, 하 |
| Hangul symbol group 2 | 13 | ㄱ, ㄴ, ㄷ, ㄹ, …, ㅎ |
| Circled character group 1 | 14 | ①, ②, ③, ④, ⑤, ⑥, ⑦, ⑧, ⑨, ⑩, ⑪, ⑫, ⑬, ⑭, ⑮, … |
| Circled character group 2 | 15 | ⓐ, ⓑ, ⓒ, ⓓ, …, ⓩ |
| Circled character group 3 | 16 | ⓐ, ⓑ, ⓒ, ⓓ, …, ⓩ |
| Circled character group 4 | 17 | ㉮, ㉯, ㉰, ㉱, …, ㉻ |
| Circled character group 5 | 18 | ㉠, ㉡, ㉢, ㉣, …, ㉭ |
| Parenthesis group 1 | 19~31 | The number, English, and Hangul symbol groups with ")" appended to each string. Example: HS₁₉ = {1), 2), 3), …}, HS₂₉ = {i), ii), iii), …} |
| Parenthesis group 2 | 32~44 | The number, English, and Hangul symbol groups with "(" and ")" placed on either side of each string. Example: HS₁₉ = {(1), (2), (3), …}, HS₂₉ = {(a), (b), (c), …, (z)} |
| Parenthesis group 3 | 45~47 | Opening bracket symbols {[}, {(}, {<} |
| Other groups | 48~59 | {-}, {*}, {◁}, {◀}, {▷}, {▶}, {⊙}, {◈}, {◎}, {·}, {○}, {"} |
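As an illustration of how a leading heading symbol might be mapped to a Table 1 group ID, the following minimal sketch uses regular expressions for a handful of the groups. The function name `heading_group`, the pattern list, and the choice of covered groups are assumptions made for this example; the text does not specify how the lookup is implemented.

```python
import re

# Ordered from most to least specific so that "1.1.1" is not taken for "1.1".
# Only some Table 1 groups are covered; the IDs follow Table 1.
HEADING_PATTERNS = [
    (4, re.compile(r"\d+\.\d+\.\d+\.\d+")),
    (3, re.compile(r"\d+\.\d+\.\d+")),
    (2, re.compile(r"\d+\.\d+")),
    (7, re.compile(r"\d+-\d+-\d+-\d+")),
    (6, re.compile(r"\d+-\d+-\d+")),
    (5, re.compile(r"\d+-\d+")),
    (1, re.compile(r"\d+")),
    (8, re.compile(r"[Ⅰ-Ⅹ]")),   # Roman numeral characters
    (14, re.compile(r"[①-⑮]")),  # circled numbers
]


def heading_group(line):
    """Return (group_id, symbol) for a leading heading symbol, else None."""
    text = line.lstrip()
    for group_id, pattern in HEADING_PATTERNS:
        match = pattern.match(text)
        if match:
            return group_id, match.group(0)
    return None
```

A line such as "2.3 x 4 = 9.2" is also reported as group 2 by this check, which is why the symbol test alone is not sufficient and a second test on the heading text is needed.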

| f | X_f |
|---|---|
| 1 | {+, *, ×, ÷, =, ≤, ≥, ∑, !, ±, %, ?, .} |
| 2 | {[, ], , , <, >, /} |
| 3 | {(, ), -} |
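The forbidden-character test of Definition 2b can be sketched as below. Because the formal condition is not reproduced in the text, the sketch assumes that a candidate heading is rejected as soon as any character of X_f occurs f or more times; the names `FORBIDDEN` and `is_heading_content` are hypothetical.

```python
# Table 2: a character in X_f disqualifies a candidate heading once it
# occurs f or more times (assumed reading of Definition 2b).
# The two blank entries in the f = 2 row of Table 2 are left out.
FORBIDDEN = {
    1: set("+*×÷=≤≥∑!±%?."),
    2: set("[]<>/"),
    3: set("()-"),
}


def is_heading_content(candidate):
    """Return False when any forbidden character reaches its limit f."""
    for limit, chars in FORBIDDEN.items():
        for ch in chars:
            if candidate.count(ch) >= limit:
                return False
    # An empty or whitespace-only candidate is not a heading either.
    return bool(candidate.strip())
```

With these sets, the remainder of a line like "2.3 x 4 = 9.2" is rejected because "=" appears once, while a heading containing a single pair of parentheses is still accepted.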

To decide whether a heading string is present according to Definitions 2a and 2b, the string to be analyzed must first be extracted.

As distinguished in Definition 1, when a heading string coexists with content or reference strings, the extent of the provisional string that may be a heading can be found using Definition 2c below.

Definition 2c. (location of heading end) Let B₀ be the set of characters representing left brackets, with B₀ ⊂ HS_ID; let B_c be the set of characters representing right brackets; and let ＜a,b＞ be a matching pair of brackets of the same kind, with a∈B₀ and b∈B_c. Let the set of delimiters marking the end of a heading be D_e = {B_c, C_e}, where C_e is the set of characters used to separate heading from content and B_c∩C_e=ø. Then, when hsⁱ≠ø ∧ hcⁱ≠ø, l in Definition 2 is as follows.

In a steel girder bridge structural calculation document, references indicate the basis for steps in the design of the structure. In the collected documents, a reference, where present, was located at the end of the line and was accompanied by a specific page number or section number of the cited work.

The composition of the reference string defined in Definition 1, and the definitions used to extract each of its components, are as follows.

Definition 3. (component of reference) The reference string rⁱ defined in Definition 1 is composed as follows.

Here rsⁱ is the delimiter that sets off the reference string, rnⁱ is the string giving the name of the reference, reⁱ is the end-point symbol of the reference information, and rpⁱ is the string giving additional information such as a page number of the reference. The condition for rⁱ≠ø is as follows.

Definition 3a. (reference name) Let RN be the set whose elements are the predefined strings that name references, and let trnⁱ_ab = s_a s_{a+1}...s_b be an arbitrary string with trnⁱ_ab ⊂ (Sⁱ)^* and 2≤a≤b≤n. Then rnⁱ is as follows.

Depending on the author of the document, the name of a reference may be abbreviated.

The set RN of predefined reference names in Definition 3a must therefore also contain such abbreviations as elements.

In Definition 3a, trnⁱ is defined as a substring of Sⁱ that may contain spaces (λ). This prevents references from being misidentified because of spacing errors by the author, a problem that can arise in documents written in Korean.

When testing trnⁱ∈RN, it is therefore more consistent to strip the spaces from trnⁱ before comparing it with each element of RN, and the reference names predefined in RN must accordingly also be stored without spaces.

For English text, where words are separated by spaces, this treatment is meaningless.
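The space-insensitive comparison described above might look like the following sketch. The contents of `RN` (a full name and an abbreviation) and the function name `find_reference_name` are hypothetical examples, not values given in the text.

```python
# Hypothetical contents of RN, stored without spaces; one entry is an abbreviation.
RN = {"도로교설계기준", "도설"}


def find_reference_name(line, names=RN):
    """Return (name, start, end) of a known reference name, ignoring spaces.

    start and end are indices into the original line (end is inclusive).
    """
    # Remember where each non-space character sits in the original line.
    kept = [(i, ch) for i, ch in enumerate(line) if not ch.isspace()]
    compact = "".join(ch for _, ch in kept)
    # Try longer names first so a full name wins over its abbreviation.
    for name in sorted(names, key=len, reverse=True):
        pos = compact.find(name)
        if pos != -1:
            return name, kept[pos][0], kept[pos + len(name) - 1][0]
    return None
```

Keeping the original indices lets the caller cut the matched span out of the unmodified line even when the author inserted a space inside the name.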

Definition 3b. (start location of reference) Let RS be the set of predefined reference start delimiters, with trsⁱ_β = s_{α-β} and β = min(δ). If rnⁱ≠ø and the reference information starts at position m+1, then m+1 is as follows.

Definition 3b identifies the delimiter that precedes a reference, so that stray characters are not absorbed into the text content.

In the collected documents the delimiters most often used to open a reference were "[", "(", and "-", so the application described later also uses RS={[,(,-} as the set of recognized reference start delimiters.

Definition 3c. (reference page) When rnⁱ≠ø, let rpⁱ = s_{b+1}s_{b+2}...s_c, and let RE be the set of predefined delimiters marking the end of a reference. Then c is as follows.

Definition 3c is used to extract the string that identifies a specific page or section of the reference.

When rⁱ≠ø in the string Sⁱ of Definition 1 and the last character s_n is a delimiter marking the end of the reference string, the string corresponding to the reference page is as given in Definition 3c. The present invention uses RE={],),.}.
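A rough sketch of how Definitions 3b and 3c could be applied once a reference name has been located is given below. It simplifies the definitions: only the character immediately before the name is tested against RS, and only the last character of the line is tested against RE. The function name `split_reference` and the abbreviation "KHBDC" used in the usage note are hypothetical.

```python
RS = "[(-"   # reference start delimiters used in the text
RE = "]),."  # reference end delimiters used in the text


def split_reference(line, name):
    """Split a line into (content, reference_name, page) around a known name."""
    a = line.find(name)
    if a == -1:
        return line, None, None
    b = a + len(name)
    # Definition 3b (simplified): a start delimiter right before the name
    # belongs to the reference, not to the content.
    start = a
    before = line[:a].rstrip()
    if before and before[-1] in RS:
        start = len(before) - 1
    # Definition 3c (simplified): what follows the name, minus a closing
    # delimiter, is the page or section part.
    tail = line[b:].strip()
    if tail and tail[-1] in RE:
        tail = tail[:-1]
    return line[:start].rstrip(), name, tail.strip()
```

For example, `split_reference("Check of web stress [KHBDC 3.4.2]", "KHBDC")` separates the content "Check of web stress" from the reference name and the section "3.4.2".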

FIG. 4 shows the process of parsing an input text file using the definitions above.

As shown in FIG. 4, the text file is read line by line, and the heading and the reference in each line are identified using Definition 2 and Definition 3, respectively.

The identified results are used to infer the body content according to Definition 1, and the final results are stored in a temporary table.

A real text file may contain lines that consist only of blank characters.

In an actual implementation, such lines should skip the processing below: the next line is read, and only lines with valid characters are processed. The temporary table then holds only valid strings, which makes the subsequent step of assigning level labels more efficient.
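The line-by-line loop, including the skipping of blank lines, can be sketched as follows. The single regular expression `HEADING` is only a stand-in for the heading rules of Definition 2, and the record layout of the temporary table is an assumption for illustration.

```python
import re

# Stand-in for Definition 2: a numeric heading symbol followed by text.
HEADING = re.compile(r"^\s*(\d+(?:[.-]\d+)*)[.)]?\s+(\S.*)$")


def parse_lines(text):
    """Return the temporary table: one record per non-blank line."""
    table = []
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            continue  # blank lines are skipped and never stored
        match = HEADING.match(raw)
        if match:
            table.append({"line": number, "symbol": match.group(1),
                          "heading": match.group(2).strip(), "content": ""})
        else:
            table.append({"line": number, "symbol": "", "heading": "",
                          "content": raw.strip()})
    return table
```

Original line numbers are kept in each record so that later steps can still refer back to the source file even though blank lines were dropped.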

As noted above, documents place heading symbols before table-of-contents entries so that the level of each entry can be inferred.

However, the pattern of heading symbols that indicates the hierarchical position of an entry differs between authors and companies. A method that assigns each heading-symbol group to one fixed level therefore depends on the symbol pattern used in each document.

To infer the levels in a more general way, the present invention assigns each entry its level from the relative level difference between entries.

Under this approach, the rules for assigning levels to lines recognized as entries are defined as follows.

Definition 4. (order of subtitles) The order in which table-of-contents entries appear in a text document matches the depth-first order of the tree.

There are two ways to arrange the nodes of a tree in a linear order: depth-first and breadth-first, as shown in FIG. 5.

FIG. 5 is a conceptual diagram of the orderings produced by the depth-first and breadth-first methods.

In general, the headings of a document appear in depth-first order.
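The difference between the two orderings can be seen on a small example. The table of contents `TOC` below is hypothetical; the two functions are the standard depth-first (pre-order) and breadth-first traversals.

```python
from collections import deque

# Hypothetical table of contents as nested (title, children) pairs.
TOC = ("Report", [
    ("1", [("1.1", []), ("1.2", [])]),
    ("2", [("2.1", [])]),
])


def depth_first(node):
    """Visit a node, then each of its subtrees in turn."""
    title, children = node
    order = [title]
    for child in children:
        order.extend(depth_first(child))
    return order


def breadth_first(node):
    """Visit all nodes of one level before moving to the next level."""
    order, queue = [], deque([node])
    while queue:
        title, children = queue.popleft()
        order.append(title)
        queue.extend(children)
    return order
```

The depth-first result lists the entries in the order a reader meets them in the document, which is the property Definition 4 relies on.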

Therefore, if the document is written in the correct order in the sense of Definition 4, converting the table-of-contents structure into a tree while reading the text sequentially reduces to estimating the depth at which each entry sits in the tree.

To identify the relative depth difference between headings, a reference heading symbol and the depth it occupies in the tree are needed first.

Most simply, the first heading-symbol group recognized in the document can serve as the reference; alternatively, the user can directly specify particular heading symbols whose level does not change within the document, together with their depths.

The heading symbols used for the reference entries must, however, have the following properties.

Definition 5. (base-symbol group) For d=1,2,..., a base heading-symbol set BS^d = bs^d₁, bs^d₂, ..., bs^d_n, ... must be ordered, and each bs^d_n must appear only once in the document. If bs_nⁱ denotes the heading symbol of an entry belonging to BS that appears on line i of the document, then n always increases as i increases.

Definition 5a. (depth of base-symbol group) Each BS must correspond to exactly one depth, and when several BS are defined their depths must increase sequentially.

By Definition 5, the heading-symbol groups in Table 1 that could serve as a base heading-symbol set are those with IDs 1 to 44.

In ordinary documents, including structural calculation documents, however, it is the heading symbols with IDs 1 to 8 that satisfy the condition of appearing only once in the document.

Once the base heading-symbol groups are fixed in this way, Definition 4 allows the level of each heading that appears in sequence to be defined relative to the levels already assigned.

The rule for this is as follows.

Definition 6. (depth of headings) Let BS^d be the groups defined as base heading-symbol sets, and let D_c denote the level that an element of BS occupies in the document. Then the level D_i that the i-th entry occupies in the tree is as follows.

Here g(hsⁱ) is a function that returns the group ID of Table 1 for the heading symbol hsⁱ; j=i-1 denotes the previous heading; k=max(K), with k=0 if K=ø; and E(i,j,k), which tests whether a virtual node is needed, is given by Equation 13.

Here md denotes the group ID of the base group with the highest level among the predefined levels, and the remaining term is given by Equation 14.

Here e=max(L), and BE denotes the set of heading symbols that come first in each ordered heading-symbol group.

FIG. 6 shows an example in which headings are converted into the tree structure of a document using Equation 12.

As shown in FIG. 6, the case D_j+2 in Equation 12 arises when a different heading symbol appears between the reference levels.

In this case the present invention raises the level of the line by 2 relative to the previous level and, when the document is converted into the actual tree, creates a new virtual node named '(parent name)_add', so that consistency with the given reference levels is maintained.
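The tree-building step with virtual nodes can be sketched as follows. The sketch assumes the depth of each heading has already been determined (by Equation 12, which is not reproduced here) and only shows how a skipped level is filled with a node named after its parent plus "_add". The function name `build_tree` and the dictionary layout are assumptions.

```python
def build_tree(headings):
    """Build a tree from (depth, title) pairs in document order, depth >= 1."""
    root = {"title": "root", "children": []}
    path = [root]  # path[d] is the most recent node at depth d
    for depth, title in headings:
        del path[depth:]  # climb back up to the parent level
        while len(path) < depth:
            # A level was skipped: insert a virtual node under the current parent.
            parent = path[-1]
            virtual = {"title": parent["title"] + "_add", "children": []}
            parent["children"].append(virtual)
            path.append(virtual)
        node = {"title": title, "children": []}
        path[-1]["children"].append(node)
        path.append(node)
    return root
```

With the input `[(1, "A"), (3, "x"), (2, "A.1")]`, the heading "x" two levels below "A" is attached under a virtual node "A_add", and "A.1" then becomes a regular child of "A".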

FIG. 7 shows the process by which, following the method above, the headings in the temporary table holding the parsed text information are classified according to the level of the tree structure assigned to them.

FIGS. 8 to 10 are screen captures from an embodiment of the present invention. FIG. 8 shows an example of an input structural calculation text document (saved as a text document from Excel), FIG. 9 shows an example of the document conversion module in operation, and FIG. 10 shows an example of an XML document produced by the module.

The document conversion module has been described in detail so far. The following describes in detail the missing-item determination step, which determines whether design items are missing from the XML data produced by the document conversion module.

Before this step is carried out, the XML data of the standardized structural calculation document must naturally be defined in advance, so that the items of the user-written structural calculation document can be checked against it.

The schema matching technique used to extract the items missing from the design review proceeds in two main stages. The first stage uses the similarity measure method proposed by Tversky and Shafir (2004) and Yi et al. (2005) and modified by Lee et al. (2006). Equation 15 gives the similarity between sets of element names, and Equation 16 gives the final similarity, which combines the element itself with its parent, sibling, and child elements.

In these equations, i and j denote the elements being compared, N denotes the set of element definitions, and S is the value of the matching similarity.

In Equation 16, the four weights apply to the element itself, its parent elements, its sibling elements, and its child elements, respectively. The intersection term denotes the minimized words common to the two sets, and N, P, B, and C denote the element's own items, its parent's items, its siblings' items, and its children's items, respectively.
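The idea of combining name, parent, sibling, and child similarity with weights can be illustrated with the sketch below. It is not Equation 15 or 16, which are not reproduced in the text: a plain word-overlap ratio stands in for the name similarity, and the weights (0.4, 0.2, 0.2, 0.2) are arbitrary illustrative values assumed to sum to 1.

```python
def overlap(a, b):
    """Share of common words between two lists of labels, between 0 and 1."""
    wa = {w for label in a for w in label.lower().split()}
    wb = {w for label in b for w in label.lower().split()}
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def similarity(x, y, weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted similarity of two elements.

    x and y are dicts holding label lists under the keys
    'name', 'parent', 'siblings', and 'children'.
    """
    keys = ("name", "parent", "siblings", "children")
    return sum(w * overlap(x[k], y[k]) for w, k in zip(weights, keys))
```

Two elements with the same name but different parents score lower than two elements that also share the parent, which is the effect the structural terms are meant to provide.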

The second stage uses the relaxation labeling method proposed by Yi et al. (2005) and modified by Kim Bong-Geun et al. (2006).

Equation 17 expresses the relaxation labeling method. It involves the confidence, at calculation step t, of matching the i-th item of the provisional standard to the k-th item of the collected document, together with a support function.

Equation 17 can also be expressed as Equation 18 below.

Here m is the number of elements in the provisional standard document and n is the number of elements in the collected document. The remaining quantity is given by Equation 19 below.

Here e is an edge, defined on a pair of elements as e(i,j): e(i,j)=1 if elements i and j are directly connected, and e(i,j)=0 otherwise. e~(i,j) denotes a path: e~(i,j)=1 if the elements between i and j form an unbroken chain, and e~(i,j)=0 otherwise. d_ij is the number of edges between elements i and j when e~(i,j)=1.
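The quantities e(i,j), e~(i,j), and d_ij can be computed on a schema tree as in the sketch below. It assumes that a path means an unbroken ancestor-descendant chain, which is one reading of the description; the tree `PARENT` and the function names are hypothetical.

```python
# Hypothetical schema tree given as child -> parent.
PARENT = {"web": "stiffeners", "flange": "stiffeners",
          "vertical": "web", "horizontal": "web"}


def edge(i, j):
    """e(i, j): 1 when i and j are directly connected, else 0."""
    return int(PARENT.get(i) == j or PARENT.get(j) == i)


def distance(i, j):
    """d_ij: number of edges on the ancestor chain between i and j, or None."""
    for low, high in ((i, j), (j, i)):
        steps, node = 0, low
        while node is not None:
            if node == high:
                return steps
            node, steps = PARENT.get(node), steps + 1
    return None


def path(i, j):
    """e~(i, j): 1 when an unbroken chain of edges links i and j, else 0."""
    return int(distance(i, j) is not None)
```

Under this reading, siblings such as "web" and "flange" have no path between them, while "vertical" and "stiffeners" are linked by a path of two edges.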

The confidence matrix P^(t) produced by schema matching in this way is given by Equation 20. Its values are used to extract the items that are present in the predefined standard information model for structural calculation documents but missing from the actual structural calculation document.

FIG. 11 is a flowchart that conceptually illustrates the technical features of the present invention. The part enclosed in the dotted box is the document conversion module, which converts the user-written structural calculation document into an XSD file (source) with defined levels so that XML schema matching can be performed. The figure also shows the comparative analysis module, which uses the schema matching module to match the XSD file (source) against the standard schema data (the optimal model defined as the standard) and thereby extracts and identifies the missing items of the structural calculation document.

FIG. 12 is a flowchart of the comparative analysis module that identifies items missing from the design review of a structural calculation document. It can be summarized as follows.

First, the standard information model for structural calculation documents is defined and converted into XSD, and this is set as the target (defined as the standard). The structural calculation document is converted into an XSD file with the document conversion module described above and used as the source. The similarity of the elements is then measured by pairwise comparison, and relaxation labeling adds values for the structural relationships to produce the full array of similarity values. An item whose resulting value falls below the threshold is judged not to match the target and is treated as a missing item.

FIG. 13 is a matching diagram showing how the items of the "Design of stiffeners" part of an actual (source) structural calculation document are matched to an arbitrarily defined standard schema. Table 3 gives the final values for the items of FIG. 13 after the similarity measure and relaxation labeling stages. In Table 3, the rows are the items of the predefined standard XML schema (target) and the columns are the items of the actual structural calculation document (source).

Source items (columns): S1 Design of stiffeners; S2 Web stiffeners; S3 Design of vertical stiffeners; S4 Design of horizontal stiffeners; S5 Support stiffeners; S6 Flange stiffeners; S7 Design of longitudinal ribs; S8 Design of transverse ribs; S9 Jack-up stiffeners.

| Target | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 |
|---|---|---|---|---|---|---|---|---|---|
| Design of stiffeners | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Flange stiffeners | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Design of longitudinal ribs (support section) | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Design of transverse ribs (support section) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Web stiffeners | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Design of horizontal stiffeners (support section) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Design of longitudinal ribs (general section) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Design of transverse ribs (general section) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Design of horizontal stiffeners (general section) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Design of vertical stiffeners | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Design of horizontal stiffeners | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Support stiffeners | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| End support | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Intermediate support | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Jack-up stiffeners | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| End support | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Intermediate support | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

XML 스키마의 각 항목이 표로 나열되는 순서는 도 14에 나타낸 것처럼 제일 상위 요소에서부터 그 자식 요소로 내려가면서 나열된다. 즉, 하나의 요소에 대해 자식 요소에 대한 항목을 모두 나열한 후에, 형제 요소로 넘어가고, 그 형제 요소의 자식 요소의 순서로 표에 나열되는 것이다.The order in which each item of the XML schema is listed in a table is listed descending from the top element to its child elements as shown in FIG. That is, after listing all the items for a child element for an element, it goes to the sibling element, and is listed in the table in the order of the child elements of the sibling element.

도 13에서 좌,우 같은 항목은 그 항목의 글자가 같기 때문에 같은 위치에 매칭될 것으로 예상할 수 있지만, 실제 구조계산서에서 '종방향 리브의 설계' 및 '횡방향 리브의 설계'는 임시 표준 스키마에서 '플랜지 보강재' 부분과 '웹 보강재'부분에 각각 포함되어 있어 글자의 비교만으로는 그 매칭이 어렵다. 하지만 표 3에서 알 수 있듯이, 그 글자는 정확히 한 곳에만 매칭이 되는데, 그 이유는 위에서 설명한 Relaxation labeling을 통해 구조적인 위치까지 계산했기 때문이다. In Figure 13, items such as left and right may be expected to match at the same position because the letters of the items are the same, but in the actual structural statement, the design of the longitudinal ribs and the design of the horizontal ribs are temporary standard schemas. Is included in the 'flange reinforcement' part and the 'web reinforcement' part, so that the matching of the letters is difficult. However, as can be seen in Table 3, the letters are matched exactly in one place, because the structural position is calculated through the relaxation labeling described above.

따라서 실제 구조계산서의 '플랜지 보강재'의 하위 항목인 '종방향 리브의 설계'와 '횡방향 리브 설계'는 표준 스키마의 '웹 보강재'의 하위 항목인 '종방향 리브의 설계 일반부'와 '황방향 리브 설계 일반부'에 매칭이 되는 것이 아니라, '플랜지 보강재'의 하위 항목인 '종방향 리브의 설계 지점부'와 '횡방향 리브 설계 지점부'에 매칭이 되는 것이다. 이런 방법으로 보면, 표 3에서 임시 표준 스키마의 '수평 보강재의 설계(지점부)', '종방향 리브의 설계(일반부)', '횡방향 리브의 설계(일반부)', '수평 보강재의 설계(일반부)', '단지점부', '중간 지점부', '단지점부', '중간 지점부'의 8개 항목은 매칭되는 값이 없다. Therefore, the design of the longitudinal ribs and the design of the longitudinal ribs, which are sub-items of the flange reinforcement in the actual structural statement, are the design general sections of the longitudinal ribs and the yellow, which are sub-items of the web reinforcement in the standard schema. It is not matched with the directional rib design general part, but is matched with the design point part of the longitudinal rib and the lateral rib design point part, which are sub-items of the flange reinforcement. In this way, in Table 3, the design of the horizontal reinforcement (branch), the design of the longitudinal ribs (general part), the design of the horizontal ribs (general part), and the design of the horizontal reinforcement of the temporary standard schema are shown in Table 3. (Normal part) ',' End point part ',' Middle point part ',' End point part ',' Middle point part 'have no matching values.

From this matching result, the items for which no matching value exists can be regarded as items that are absent from the actual (source) structural statement, and they are therefore determined to be missing items.
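The decision rule above can be sketched in Python as follows. The sketch assumes that the total similarity values have already been computed and are held in a nested dictionary keyed by target (standard schema) item. The function name `find_missing_items`, the reference value of 0.5, and all similarity values are hypothetical and chosen only for illustration.

```python
def find_missing_items(total_similarity, reference_value=0.5):
    """Return target (standard schema) items whose best total similarity
    against every source item is below the reference value."""
    missing = []
    for target_item, scores in total_similarity.items():
        # An item with no candidate scores at all counts as unmatched.
        best = max(scores.values(), default=0.0)
        if best < reference_value:
            missing.append(target_item)
    return missing


# Hypothetical total similarity values: target item -> {source item: value}.
total_similarity = {
    "Design of longitudinal ribs (support section)": {
        "Design of longitudinal ribs": 0.83,
        "Design of transverse ribs": 0.41,
    },
    "Design of longitudinal ribs (general section)": {
        "Design of longitudinal ribs": 0.33,
        "Design of transverse ribs": 0.20,
    },
    "End support section": {},
}

missing = find_missing_items(total_similarity)
```

With these made-up values, the items reported as missing are the general-section rib design and the end support section, since neither reaches the reference value against any source item.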

FIG. 15 shows an embodiment of the screens of a program that directly applies the method of the present invention, showing the upload of a structural statement, the hierarchical visualization of the XML file produced by the XML conversion process, and the schema matching result.

The foregoing describes only some of the preferred embodiments that can be implemented according to the present invention. Accordingly, the scope of the present invention should not be construed as limited to the embodiments above, and every technical idea that shares its basis with the technical idea of the present invention described above falls within the scope of the present invention.

FIGS. 1 to 10 illustrate an embodiment of the document conversion step of the present invention:

FIG. 1 is a conceptual diagram of the division of the document structure;

FIG. 2 is a table showing part of the titles of a structural statement for a steel girder bridge;

FIG. 3 is a block diagram showing the process of generating a semi-structured XML document from a text document;

FIG. 4 is a block diagram showing the process of parsing an input text file using the definitions;

FIG. 5 is a conceptual diagram of the ordering produced by the depth-first and breadth-first methods;

FIG. 6 is a conceptual diagram showing an example in which each title is converted into the tree structure of the document using Equation 12;

FIG. 7 is a block diagram of the process of classifying which layer of the tree structure is assigned to each title in the temporary table that holds the parsed text information;

FIGS. 8 to 10 are screen captures of an embodiment of the present invention, in which FIG. 8 is the input structural statement text document (saved as a text document from the Excel program), FIG. 9 is the operation screen of the document conversion module, and FIG. 10 is the XML document converted by the module.
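As a rough illustration of the document conversion depicted in FIGS. 3 to 7, the following Python sketch turns a list of titles with already assigned layers into a nested XML tree. It assumes that the temporary table has been reduced to (layer, title) rows in document order, which is also the depth-first order of the tree. The function name `build_xml`, the element name `item`, and the attribute names are hypothetical, and the sketch does not insert the virtual nodes examined by Equation 13.

```python
import xml.etree.ElementTree as ET


def build_xml(rows, root_tag="document"):
    """Build a nested XML tree from (layer, title) rows in document order.

    Layer 1 is the top layer. A row is attached to the nearest preceding
    row with a shallower layer.
    """
    root = ET.Element(root_tag)
    stack = [(0, root)]  # path of (layer, element) from the root downward
    for layer, title in rows:
        # Pop until the top of the stack is shallower than this row.
        while stack[-1][0] >= layer:
            stack.pop()
        node = ET.SubElement(
            stack[-1][1], "item", {"title": title, "layer": str(layer)}
        )
        stack.append((layer, node))
    return root


# Illustrative rows; the titles are examples, not data from the patent.
rows = [
    (1, "Reinforcement design"),
    (2, "Flange reinforcement"),
    (3, "Design of longitudinal ribs"),
    (3, "Design of transverse ribs"),
    (2, "Web reinforcement"),
]

root = build_xml(rows)
xml_text = ET.tostring(root, encoding="unicode")
```

Because the rows arrive in depth-first order, a single stack holding the current path is enough to place every title under its parent without building an intermediate tree.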

FIGS. 11 to 14 illustrate the schema matching process of the present invention, which compares the source XML schema data with the target schema data:

FIG. 11 is a flowchart conceptually illustrating the overall flow of the technical features of the present invention;

FIG. 12 is a simplified diagram of the comparison analysis module that performs the missing item determination step of the present invention;

FIG. 13 is an explanatory diagram showing, as an embodiment of the present invention, an example of schema matching between the reinforcement design items of the temporary standard schema (left) and those of the actual structural statement (right); and

FIG. 14 shows the order in which the elements of an XML schema are displayed.

FIG. 15 shows an embodiment of the screens of a program that directly applies the method of the present invention, showing the upload of a structural statement, the XML conversion and hierarchical visualization, and the schema matching result.

Claims

1. A structural statement checking method comprising:

a document conversion step of converting a structural statement in text file format into an XML Schema Definition (XSD), the structural statement being a finite set of strings divided into a finite number of ordered lines in which the string set $S^i$ of the i-th line is defined by Equation 1, the document conversion step including: a step of sequentially storing the string information of each line of the string set $S^i$ in a temporary table, divided into a head symbol, a title, content, and a reference; a hierarchical information assigning step of assigning, using the head symbol information in the stored temporary table, hierarchical information indicating where each title is located in the tree structure of the document; and an XML file generation step of generating an XML file in depth-first order of the tree using the hierarchical information and the information stored in the temporary table; and

a missing item determination step of determining whether an item of the structural statement is missing by performing a schema matching process that compares the XSD data with the XML data of a standardized structural statement.

[Equation 1]

(Here, $h^i = s_1 s_2 \ldots s_l$ is the string set for the title, $c^i = s_{l+1} s_{l+2} \ldots s_m$ is the string set for the content, and $r^i = s_{m+1} s_{m+2} \ldots s_n$ is the string set for the reference, where $0 \le l \le m \le n$.)

2. The method of claim 1, wherein the missing item determination step comprises:

a similarity calculation step according to Equations 16 and 15;

a total similarity arrangement step of arranging total similarity values obtained by adding the structural relationship values calculated by the relaxation labeling process defined by Equation 17 to the similarity values between schema elements calculated in the similarity calculation step; and

a step of determining that a specific item has no matching value when the result value for that item is lower than a reference value.

[Equation 16]

(Here, the intersection means the minimal set of words included in common in each set, and N, P, B, and C denote the item itself, its parent item, its sibling items, and its child items, respectively.)

[Equation 15]

(here,

)

[Equation 17]

(here,

Is the support function.)

3. The method of claim 1, wherein, in the document conversion step, the title string $h^i$ is defined by Equations 2, 3a, and 3b.

[Equation 2]

(Here, $hs^i = s_1 s_2 \ldots s_o$ is the set of strings (the head symbol) used to denote the title, with $hs^i \subset \Sigma^+$; $hc^i = s_{o+1} s_{o+2} \ldots s_p$ is the set of strings for the pure title; $hd^i = s_l$ is the delimiter symbol that marks the end of $hc^i$; and the relationship between $o$, $p$, and $l$ is given by Equations 3a and 3b.)

[Equation 3a]

[Equation 3b]

4. The method of claim 3, wherein, in the document conversion step, the reference symbol set $BS^d$ is the sequence $bs^d_1, bs^d_2, \ldots, bs^d_n, \ldots$, in which each $bs^d_n$ appears only once in the document; when the head symbol of the table-of-contents entry on any line $i$ of the document that uses $BS$ is $bs^i_n$, $n$ always increases as $i$ increases; one $BS$ must correspond to one predetermined depth, and when several $BS$ are defined their depths increase sequentially; and, where $d$ is the number of head symbol groups $BS^d$ defined in the document and $d_c$ is the layer that an element of one $BS$ occupies in the document, the layer $d_i$ that the i-th table-of-contents entry occupies in the tree is defined by Equations 12, 13, and 14.

[Equation 12]

(Here, $g(hs^i)$ is a function that returns the group ID according to Table 1 for an input head symbol $hs^i$; $j = i - 1$ denotes the preceding title; $k = \max(K)$, with $k = 0$ if $K = \emptyset$; and $E(i, j, k)$, which examines whether a virtual node is needed, is given by Equation 13.)

[Equation 13]

(Where md means the group ID of the highest order level primary group at the predefined order level,

Is the same as Equation 14.)

[Equation 14]

(Where e = max (L) ,

, BE refers to the first set of headings in an ordered group of headings.)

5. The method of claim 4, wherein, in the document conversion step, where $HS_{ID}$ denotes the predefined set of strings, for group $ID$, that is used to denote titles, for every $ID$, with $HS_{ID} \subset \Sigma^+$, the condition for $hs^i = \emptyset$ is defined by Equation 4.

[Equation 4]

6. The method of claim 5, wherein, in the document conversion step:

where $X_f$ is the set of kinsoku (prohibited) characters that must not appear more than $f$ times in the string set, and $x_f \in X_f$, the condition for $hc^i = \emptyset$ is defined by Equation 5;

where $B_0$ is the set of characters representing left parentheses, with $B_0 \subset HS_{ID}$; $B_c$ is the set of characters representing right parentheses; $\langle a, b \rangle$ denotes a pair of parentheses of the same type, with $a \in B_0$ and $b \in B_c$; $D_e = \{B_c, C_e\}$ is the set of delimiters that mark the end of a title; and $C_e$ is the set of characters used to separate the title from the content, with $B_c \cap C_e = \emptyset$, $l$ is defined by Equation 6; and

the reference string $r^i$ in Equation 6 is defined by Equation 7.

[Equation 5]

[Equation 6]

[Equation 7]

(Here, $rs^i$ is the delimiter used to identify the reference, $rn^i$ is the set of strings representing the name of the reference, $re^i$ is the set of end symbols for the reference information, and $rp^i$ is the set of strings representing additional description, such as the page number of the reference; and the condition for $r^i \neq \emptyset$ is

Is the same.)