KR100816934B1

KR100816934B1 - Clustering System and Method Using Document Search Results

Info

Publication number: KR100816934B1
Application number: KR1020060033659A
Authority: KR
Inventors: 차완규
Original assignee: 엘지전자 주식회사
Priority date: 2006-04-13
Filing date: 2006-04-13
Publication date: 2008-03-26
Anticipated expiration: 2026-04-13
Also published as: CN101055585B; KR20070102034A; CN101055585A

Abstract

A clustering system according to an embodiment of the present invention includes: input means for a user to enter a search term; search means for accessing a database in which documents are stored and retrieving documents that contain the entered search term; output means for displaying the search results to the user; analysis means for storing the documents of the output search results in a cluster storage space of the database, deriving features from the documents, vectorizing them, and deriving the similarity between documents; calculating means for computing a virtual center among the documents based on the inter-document similarity derived by the analysis means; and clustering means for clustering the documents in the cluster storage space with reference to the virtual center among the documents.

Cluster, document, database

Description

Clustering system and method using document search results {Clustering system and method using search result document}

FIG. 1 is a block diagram illustrating a clustering system according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating how the features of a document are vectorized according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating the input interface and the output interface provided to the user according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating a method for clustering according to an embodiment of the present invention.

FIG. 5 is a diagram showing a screen of clustered results.

The present invention relates to a system for clustering the search results for a search term entered by a user, and more particularly to a clustering system and method using document search results in which documents are clustered based on the inter-document similarity obtained under a vector space model, and in which similar documents are updated according to the established clusters.

As the exchange of information over the Internet has become commonplace, the amount of information has grown rapidly. However, it has become relatively harder to find the most relevant information a user wants, and storing and managing the necessary documents takes considerable effort.

Various methods have been proposed for storing documents retrieved through a web server and clustering them. In these methods, however, the work is generally considered complete once the documents classified by a given classification means have been stored separately.

Consequently, it is difficult for a user to edit the structured clusters, and when a new document matching a cluster's conditions appears, the cluster cannot be updated with it.

The present invention is proposed to solve the problems described above, and its object is to propose a clustering system and method using document search results that cluster documents based on the similarity between the documents retrieved at a user's request.

A further object is to propose a clustering system and method using document search results that organize the clustered documents into a classification scheme and enable machine learning based on that scheme, so that the clusters can be updated with new documents supplied from outside.

To achieve the above objects, a clustering system according to an embodiment of the present invention includes: input means for a user to enter a search term; search means for accessing a database in which documents are stored and retrieving documents that contain the entered search term; output means for displaying the search results to the user; analysis means for storing the documents of the output search results in a cluster storage space of the database, deriving features from the documents, vectorizing them, and deriving the similarity between documents; calculating means for computing a virtual center among the documents based on the inter-document similarity derived by the analysis means; and clustering means for clustering the documents in the cluster storage space with reference to the virtual center among the documents.

A clustering method according to another aspect of the present invention includes: reading from a database, and outputting, the documents that contain a search term entered by a user; storing the documents read from the database in a cluster storage space of the database; deriving features from the documents stored in the cluster storage space and vectorizing them to derive the similarity between documents; calculating a virtual center among the documents based on the inter-document similarity; and clustering the documents with reference to the virtual center among the documents.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. The spirit of the present invention is not, however, limited to the embodiments presented. Those skilled in the art who understand the spirit of the invention may readily propose other embodiments within the scope of the same spirit by adding, changing, or deleting components, and these too fall within the scope of the spirit of the present invention.

The term 'document' as used below means either a document that results from a document search requested by a client against the documents stored in a database, or a document stored in the database.

FIG. 1 is a block diagram illustrating a clustering system according to an embodiment of the present invention, and FIG. 2 is a diagram illustrating how the features of a document are vectorized according to an embodiment of the present invention.

Referring to FIGS. 1 and 2, the clustering system according to the present invention comprises a clustering server (2), which performs document searches and clusters documents using the search results, and a plurality of clients (1) connected to the clustering server (2) over a network.

In detail, the client (1) is provided with input means (110) through which the user sends a search term to the clustering server (2), and output means (120) on which the document search results and the like sent from the clustering server (2) are output.

The input means (110) may be a user interface through which the user enters a search term, and the search term may be entered either as a keyword or as a whole sentence.

When the search term is entered as a sentence, the search means (220) extracts keywords from the entered sentence and retrieves documents that contain the extracted keywords or words similar to them.
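The sentence-query handling described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the stop-word list, the tokenizer, and the function names `extract_keywords` and `search` are all assumptions, and matching on similar words is left out.

```python
import re

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "or", "to", "in", "on", "with"}


def extract_keywords(query: str) -> list[str]:
    """Split a sentence-style query into unique lowercase keywords, dropping stop words."""
    keywords = []
    for token in re.findall(r"[a-z0-9]+", query.lower()):
        if token not in STOP_WORDS and token not in keywords:
            keywords.append(token)
    return keywords


def search(query: str, documents: dict[str, str]) -> list[str]:
    """Return the ids of documents that contain at least one extracted keyword."""
    keywords = set(extract_keywords(query))
    hits = []
    for doc_id, text in documents.items():
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        if keywords & words:  # any keyword present in the document
            hits.append(doc_id)
    return hits
```

A real search means would more likely use an inverted index and some form of stemming or synonym lookup; the linear scan here only shows the flow from sentence to keywords to matching documents.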

A plurality of clients (1) may be connected to the clustering server (2) over an IP network. Although the drawing shows the clustering server (2) and the client (1) as separate components, the input means (110) and the output means (120) may instead be formed within the clustering server (2).

The clustering server (2) includes: a database (210) in which a plurality of documents are stored; search means (220) for retrieving, from the database (210) or another web server, documents corresponding to the search term requested by the client (1); analysis means (230) for storing the documents of the search results output through the output means (120) in a cluster storage space of the database (210), deriving features from the documents, vectorizing them, and determining the similarity between documents; calculating means (240) for computing a virtual center among the documents based on the inter-document similarity derived by the analysis means (230); clustering means (250) for clustering the documents stored in the cluster storage space with reference to the virtual center among the documents; and learning means (260) for performing machine learning based on the classification scheme under which documents are classified in the cluster storage space.

In more detail, a plurality of documents are stored in the database (210), and these documents may be materials such as patent documents or papers. The database (210) may be connected, through a network interface (not shown), to a web server (not shown) capable of providing a plurality of documents, and documents provided by the connected web server may be stored in the database (210).

For example, the search means (220) may access the databases of the Korean Intellectual Property Office, the United States Patent and Trademark Office, or the World Intellectual Property Organization (WIPO) and download patent documents in Hypertext Transfer Protocol form, and these documents may be stored in the database (210).

The search means (220) uses the search term entered through the input means (110) to retrieve documents from the database (210) or from a network-connected web server. It may retrieve documents that contain the entered search term or documents that contain keywords related to it.

The documents retrieved by the search means (220) are provided to the output means (120) as bibliographic information, through which the user can check information on the retrieved documents.

The analysis means (230) derives features from the document search results provided to the output means (120), that is, from the documents retrieved by the search means (220), and vectorizes them. It then determines the similarity between documents based on the derived features.

The analysis means (230) may also select a representative document from among the documents of the search results, using the inter-document similarity formed from features extracted from the documents, for example keywords that can serve as technical features. In this case, the representative document selected by the analysis means (230) may be related to the virtual center described later.
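One plausible way to select a representative document from pairwise similarities is to take the document that is, in total, most similar to all the others. The source does not state the selection rule, so the criterion below and the function name `pick_representative` are assumptions made for illustration.

```python
def pick_representative(similarity: list[list[float]]) -> int:
    """Return the index of the document with the highest total similarity to the others.

    similarity[i][j] is the similarity between documents i and j.
    Returns -1 when the matrix is empty.
    """
    best_index, best_score = -1, float("-inf")
    for i, row in enumerate(similarity):
        # Sum similarities to every other document, skipping the diagonal.
        score = sum(s for j, s in enumerate(row) if j != i)
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

With the matrix `[[1.0, 0.9, 0.2], [0.9, 1.0, 0.8], [0.2, 0.8, 1.0]]` the totals are 1.1, 1.7, and 1.0, so the second document (index 1) is chosen.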

The vector of document features derived by the analysis means (230) takes as its elements pairs consisting of a word that represents a feature of the document and the weight of that word, and the number of elements in the vector may differ from document to document.

For example, referring to FIG. 2, for the features derived from the search term entered through the input means (110), document 1 contains the first feature 19 times, the second feature 35 times, and the last feature 15 times.

Feature vectors can be formed in the same way for every document to be analyzed.
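A feature vector of this kind can be sketched as a mapping from feature word to weight. In the sketch below the weight is simply the raw occurrence count, as in the FIG. 2 example; the tokenizer and the function name `term_vector` are assumptions, and a real system might use a different weighting such as TF-IDF.

```python
import re
from collections import Counter


def term_vector(text: str, features: list[str]) -> dict[str, int]:
    """Count how often each feature word occurs in the text.

    Features that do not occur are omitted, so the number of
    elements can differ from document to document.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return {term: counts[term] for term in features if counts[term] > 0}
```

For instance, `term_vector("Cluster the cluster server", ["cluster", "server", "vector"])` yields `{"cluster": 2, "server": 1}`: the absent feature "vector" contributes no element.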

The analysis means (230) can then determine the similarity between documents based on the derived vectors.

The vectorization of document features by the analysis means (230) is performed in a vector space model: texts and categories are represented as weight vectors of index terms, and the similarity between them can be calculated by, for example, the cosine of the two vectors. A number identifying the keyword may be assigned to each feature derived from a document, for example each weighted keyword, and in this case the calculating means (240) can compute the virtual center among the documents using the numbers assigned to the keywords.
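The cosine measure mentioned above is the dot product of two weight vectors divided by the product of their lengths. A minimal sketch over sparse vectors stored as dictionaries follows; the dictionary representation and the function name `cosine_similarity` are illustrative choices, not taken from the source.

```python
import math


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse weight vectors (0.0 if either is empty)."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the measure normalizes by vector length, a long document and a short one with the same proportions of keywords score close to 1, while documents with no keywords in common score 0.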

Because text documents stored in the database (210) are generally unstructured, the analysis means (230) also serves to convert them into structured data by means of a text mining engine.

The calculating means (240) computes a virtual center from the vectors of the individual documents formed by the analysis means (230), and the virtual center formed by the calculating means (240) can be set by the user through the input means (110). This is described later with reference to the accompanying drawings.

The virtual center formed by the calculating means (240) may be obtained with reference to the features derived from each document and the numbers assigned to those features.
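The source does not give a formula for the virtual center. A common reading, assumed here, is the centroid: the component-wise mean of the document vectors. The sketch keys each vector by the number assigned to a feature, as the text describes; the function name `virtual_center` is hypothetical.

```python
def virtual_center(vectors: list[dict[int, float]]) -> dict[int, float]:
    """Component-wise mean of sparse vectors keyed by feature number.

    A feature missing from a document counts as weight 0 for that document.
    """
    if not vectors:
        return {}
    totals: dict[int, float] = {}
    for vector in vectors:
        for feature_id, weight in vector.items():
            totals[feature_id] = totals.get(feature_id, 0.0) + weight
    count = len(vectors)
    return {feature_id: total / count for feature_id, total in totals.items()}
```

For two documents `{1: 2.0}` and `{1: 4.0, 2: 2.0}`, the center is `{1: 3.0, 2: 1.0}`. The center need not coincide with any actual document, which is consistent with calling it "virtual".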

The clustering means (250) clusters the documents stored in a cluster storage space (not shown) in the database (210) with reference to the virtual center among the documents. To this end, a cluster storage space partitioned off as a separate area may be formed in the database (210), and when clustering is requested for the document search results displayed on the output means (120), the documents in those results are stored in the cluster storage space.

The clustering means (250) may also organize the documents stored in the cluster storage space into a classification scheme, in which case it simultaneously serves as classification means.

The clustering means (250) may classify the documents stored in the cluster storage space according to a preset classification code or according to the user's designation through the input means (110).

In this case, the classification code may be a map or table in which words (or keywords) are arranged according to their similarity. A plurality of words are connected by nodes and structured from lower levels, where the similarity between words is high, up to higher levels, where the similarity between words is relatively low.

The learning means (260) machine-learns the classification scheme under which the documents in the cluster storage space were classified and, based on the learned result, causes new documents provided from a web server to be stored in the database (210) according to the classification scheme.

The classification of new documents by the learning means (260) may be performed periodically or in real time, depending on the user's settings.
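The source does not name a learning algorithm. As one simple stand-in, the sketch below learns a term-frequency profile for each category from already classified documents and assigns a new document to the category whose profile it overlaps most. The functions `train` and `classify`, and the scoring rule, are assumptions made for illustration.

```python
from collections import Counter
from typing import Optional


def train(labeled_docs: list[tuple[str, list[str]]]) -> dict[str, Counter]:
    """Build one term-frequency profile per category from (category, tokens) pairs."""
    profiles: dict[str, Counter] = {}
    for category, tokens in labeled_docs:
        profiles.setdefault(category, Counter()).update(tokens)
    return profiles


def classify(tokens: list[str], profiles: dict[str, Counter]) -> Optional[str]:
    """Return the best-matching category, or None if no category shares any term."""
    best_category, best_score = None, 0.0
    for category, profile in profiles.items():
        total = sum(profile.values())
        # Share of the profile's mass covered by the new document's tokens.
        score = sum(profile[t] for t in tokens) / total if total else 0.0
        if score > best_score:
            best_category, best_score = category, score
    return best_category
```

Running `classify` on each newly downloaded document, either on a schedule or as documents arrive, corresponds to the periodic and real-time modes mentioned above.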

FIG. 3 is a diagram illustrating the input interface and the output interface provided to the user according to an embodiment of the present invention, FIG. 4 is a diagram illustrating a method for clustering according to an embodiment of the present invention, and FIG. 5 is a diagram showing a screen of clustered results.

Referring to FIGS. 3 to 5, the user is shown an input interface (111), provided by the input means (110) for entering a search term, and an output interface (121), provided by the output means (120) for displaying the search results.

Through the input interface (111), the user can enter document search conditions, and can also enter a document number or a keyword.

When a search is run on the search term entered through the input interface (111), the results of the document search by the search means (220) are output on the output interface (121).

The user can review the documents in the search results shown on the output interface (121) and select particular documents to cluster. In this case, the user selects specific documents from the items displayed on the output interface (121) and then presses a clustering request key (shown, as an example, as 'Send to My Project'), which causes the analysis means (230), the calculating means (240), and the clustering means (250) to cluster the documents.

As part of the clustering procedure using the document search results, the screen shown in FIG. 4 is presented to the user. Here the user can specify the number of clusters and the number of documents per cluster, and the virtual centers among the documents are computed according to the number of clusters specified by the user.
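Clustering around as many virtual centers as the user requests resembles a k-means procedure. The sketch below is one such procedure under stated assumptions: centers are centroids, similarity is the cosine measure, and the first k documents seed the centers. None of these details, nor the function names, come from the source, and the per-cluster document limit is not modeled.

```python
import math

Vector = dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors: list[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of sparse vectors."""
    totals: Vector = {}
    for vec in vectors:
        for term, w in vec.items():
            totals[term] = totals.get(term, 0.0) + w
    return {t: s / len(vectors) for t, s in totals.items()}


def cluster(docs: list[Vector], k: int, max_iter: int = 20) -> list[int]:
    """Return, for each document, the index of the virtual center it is assigned to."""
    if not 1 <= k <= len(docs):
        raise ValueError("k must be between 1 and the number of documents")
    centers = [dict(d) for d in docs[:k]]  # assumption: seed with the first k documents
    labels: list[int] = []
    for _step in range(max_iter):
        # Assign every document to its most similar center.
        new_labels = [max(range(k), key=lambda c: cosine(doc, centers[c])) for doc in docs]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each center from its current members.
        for c in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:
                centers[c] = mean(members)
    return labels
```

Seeding with the first k documents keeps the sketch deterministic; a practical system would usually choose the seeds more carefully, since the result of this kind of procedure depends on the starting centers.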

As shown in FIG. 5, the clustered results may be presented with the feature values used for clustering displayed separately, and the cluster list resulting from the clustering may be displayed divided into major and minor categories. In this case, when the user selects a particular entry in the cluster list, for example 'Korea 1 (9)', bibliographic information on the nine clustered documents is output.

According to the embodiment of the present invention described above, documents are clustered using document search results, so that the user can choose the documents used for clustering and the clusters can be updated.

According to the proposed embodiment, documents are clustered based on the similarity between the documents retrieved at the user's request, the user can easily review the clustered results, and the structured clusters are easy to edit.

In addition, by organizing the clustered documents into a classification scheme and enabling machine learning based on that scheme, the clusters can be updated with new documents supplied from outside.

Claims

A clustering system comprising:

input means for a user to enter a search term;

search means for accessing a database in which documents are stored and retrieving documents that contain the entered search term;

analysis means for expressing keywords extracted from the retrieved documents as weight vectors and determining the similarity between the retrieved documents using the vectors;

calculating means for calculating virtual centers of the retrieved documents using the weight vectors of the documents expressed by the analysis means; and

clustering means for clustering the documents of the search results with reference to the virtual centers calculated by the calculating means,

wherein the number of virtual centers calculated by the calculating means is determined according to the number of clusters designated by the user.

The clustering system of claim 1, wherein the calculating means extracts a particular representative document from the documents of the search results based on the similarity between documents determined by the analysis means.

(deleted)

The clustering system of claim 1, wherein the clustering means classifies the documents stored in the cluster storage space according to a pre-stored classification code or the user's designation.

The clustering system of claim 4, further comprising learning means connected to the clustering means and configured to perform machine learning based on the classification code.

The clustering system of claim 5, wherein documents provided from a web server are automatically classified based on the learning result of the learning means.

The clustering system of claim 1, wherein the clustering of documents by the clustering means is updated with documents provided from a web server connected to the clustering means through a network interface.

The clustering system of claim 7, wherein, when the clustering of the documents stored in the cluster storage space is updated with documents provided from the web server, the clustering means notifies the user of the updated information by mail.

A clustering method comprising:

retrieving from a database, according to a search term entered by a user, documents that contain the search term;

expressing keywords extracted from the retrieved documents as weight vectors and determining the similarity between the retrieved documents using the vectors;

calculating virtual centers of the retrieved documents using the weight vectors of the documents; and

clustering the documents of the search results with reference to the calculated virtual centers,

wherein the number of calculated virtual centers is determined according to the number of clusters designated by the user.

(deleted)

The method of claim 9, wherein clustering the documents includes classifying the virtual centers among the documents according to a preset classification scheme code or the user's setting, thereby clustering the documents connected to the virtual centers.

The method of claim 9, further comprising, after the clustering of the documents is performed, grouping the clustered results into the cluster storage space.

The method of claim 9, wherein, when a new document is provided to the database after the clustering of the documents is performed, the provided document is automatically classified by learning means that performs machine learning based on a predetermined classification scheme.

The method of claim 13, wherein information on the documents automatically classified by the learning means is transmitted to the user using a mail service.

The method of claim 9, wherein, when new documents are downloaded to the database, information on the new documents is periodically provided to the user.