KR102540939B1

KR102540939B1 - System and method for improving the adequacy of natural language searches

Info

Publication number: KR102540939B1
Application number: KR1020220127139A
Authority: KR
Inventors: 고형석; 곽효승; 이홍재
Original assignee: (주)유알피
Priority date: 2022-10-05
Filing date: 2022-10-05
Publication date: 2023-06-08
Anticipated expiration: 2042-10-05

Abstract

The present invention proposes a system and a method for improving the relevance of natural language search, in which each document to be matched against a search term is assigned a score that adds a phrase score to the token score, and the documents are sorted in descending order of the assigned score and presented to the user in that order. The system includes a tokenization module, a document filter module, a phrase weighting module, and a document score calculation module.

Description

System and method for improving the adequacy of natural language searches

The present invention relates to a natural language search system, and more particularly to a system and a method for improving the relevance of natural language search, in which each document to be matched against a search term is assigned the sum of a token score and a phrase score, and the documents are sorted in descending order of the assigned score.

Natural language refers to the language that humans use to communicate in everyday life, as distinguished from the programming languages and machine language used by computers. Natural language processing (NLP) refers to the process of interpreting natural language so that a computer can understand and process it.

When a search is performed with natural language as the search term, the relevance between the given query (that is, the natural language entered as the search term) and each document is evaluated by assessing how often the terms in the query appear in that document.

Korean Registered Patent No. 10-2256007 (May 18, 2021), 'System and method for searching documents and providing responses through natural language queries', describes a technique that tokenizes a natural language query entered by a user and selects documents by performing a similarity check between the tokenized query and every document in the database.

However, it does not consider the order in which the selected documents are served, that is, which of the selected documents should be presented to the user first and which should follow.

If the order of the served documents is not determined, a generic document may be presented first instead of the document that contains the best answer to the query, which increases the user's search time.

Moreover, even if some ordering principle is set, a principle that is not the best one carries the same disadvantage as having no defined order at all.

The technical problem to be solved by the present invention is to propose a system for improving the relevance of natural language search that assigns each document to be matched against a search term a score that adds a phrase score to the token score, sorts the documents in descending order of the assigned score, and presents them to the user in that order.

Another technical problem to be solved by the present invention is to propose a method for improving the relevance of natural language search that assigns each document to be matched against a search term a score that adds a phrase score to the token score, sorts the documents in descending order of the assigned score, and presents them to the user in that order.

The technical problems to be achieved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

To achieve the above technical problem, the system for improving the relevance of natural language search according to the present invention includes a tokenization module, a document filter module, a phrase weighting module, and a document score calculation module.

The tokenization module collects the natural language search terms included in a query entered by a user and separates them into tokens. The document filter module filters, from among a plurality of documents, the documents that contain the tokens separated by the tokenization module. The phrase weighting module generates phrases by pairing two of the tokens and assigns a weight to each phrase. The document score calculation module calculates a token score and a phrase score using the search term entered by the user, the filtered documents, the phrases, and the weights.

To achieve the other technical problem, the method for improving the relevance of natural language search according to the present invention is performed by the system for improving the relevance of natural language search according to claim 1, and includes: a step in which the tokenization module collects the natural language search terms included in a query entered by a user and separates them into tokens; a step in which the document filter module filters, from among a plurality of documents, the documents that contain the tokens separated by the tokenization module; a step in which the phrase weighting module generates phrases by pairing two of the tokens and assigns a weight to each phrase; and a step in which the document score calculation module calculates a token score and a phrase score using the search term entered by the user, the filtered documents, the phrases, and the weights.

The technical problems to be achieved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

By considering the phrase score together with the token score, the system and method for improving the relevance of natural language search according to the present invention can search, and serve the retrieved information, more effectively than existing approaches that improve search relevance using the token score alone.

The effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 is an embodiment of a system for improving the relevance of natural language search according to the present invention.
FIG. 2 is an embodiment of a method for improving the relevance of natural language search according to the present invention.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate exemplary embodiments of the present invention, and to the contents described in those drawings.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. Like reference numerals in the drawings denote like elements.

FIG. 1 is an embodiment of a system for improving the relevance of natural language search according to the present invention.

Referring to FIG. 1, the system 100 for improving the relevance of natural language search according to the present invention includes a natural language search module 110, a tokenization module 120, a document filter module 130, a phrase weighting module 140, a document score calculation module 150, and a document sorting module 160.

The natural language search module 110 searches for the natural language included in a user query.

For convenience of explanation, the following assumes the example query Q = [자연어 검색에서 적절도를 향상시키는 방법 및 시스템] ("method and system for improving relevance in natural language search").

The tokenization module 120 separates the natural language search terms included in the query entered by the user into tokens.

For example, the query may be tokenized on the spaces in the search sentence, giving the tokenized query Q = [자연어, 검색에서, 적절도를, 향상시키는, 방법, 및, 시스템] (natural language, in search, relevance, improving, method, and, system).
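The whitespace tokenization described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation; the function name `tokenize` is hypothetical, and the query string is the example query from the description.

```python
def tokenize(query: str) -> list[str]:
    """Split a query into tokens on runs of whitespace."""
    # str.split() with no argument splits on any whitespace and drops empty strings.
    return query.split()


# Example query from the description.
query = "자연어 검색에서 적절도를 향상시키는 방법 및 시스템"
tokens = tokenize(query)
print(tokens)
# ['자연어', '검색에서', '적절도를', '향상시키는', '방법', '및', '시스템']
```

Splitting on spaces keeps Korean particles attached to their nouns (for example 검색에서), so only documents that contain the exact same surface form will match a token.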

The document filter module 130 filters, from among a plurality of documents, the documents that contain the tokens separated by the tokenization module. It is possible to filter (that is, select) every document that contains any of the tokens regardless of how many it contains, but it is preferable to filter the documents that contain at least a preset reference number of tokens.
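One way to read this filtering step is sketched below. It assumes that the "reference number of tokens" counts distinct query tokens found in a document and that documents are tokenized on whitespace; the patent does not spell out either detail. The names `filter_documents` and `min_matches` are hypothetical.

```python
def filter_documents(
    docs: dict[str, str],
    query_tokens: list[str],
    min_matches: int = 1,
) -> dict[str, str]:
    """Keep documents containing at least `min_matches` distinct query tokens."""
    wanted = set(query_tokens)
    kept = {}
    for doc_id, text in docs.items():
        # Assumption: documents are tokenized the same way as the query (on whitespace).
        matched = wanted & set(text.split())
        if len(matched) >= min_matches:
            kept[doc_id] = text
    return kept


docs = {
    "d1": "natural language search ranking",
    "d2": "search engines",
    "d3": "cooking recipes",
}
selected = filter_documents(docs, ["natural", "language", "search"], min_matches=2)
print(sorted(selected))
```

With `min_matches=1` the function selects every document that contains any query token, which corresponds to the non-preferred variant mentioned in the description.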

The phrase weighting module 140 generates phrases by pairing two of the tokens and assigns a weight to each phrase.
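A minimal sketch of phrase generation follows. It pairs each token with its immediate successor, which the description elsewhere names as the preferred choice. The weighting rule is an assumption made for illustration: the patent does not state how weights are chosen, so this sketch simply gives earlier phrases larger weights, counting down to 1.0. The name `build_phrases` is hypothetical.

```python
def build_phrases(tokens: list[str]) -> list[tuple[tuple[str, str], float]]:
    """Pair adjacent tokens into phrases and attach a weight to each phrase."""
    pairs = list(zip(tokens, tokens[1:]))  # (t0, t1), (t1, t2), ...
    count = len(pairs)
    # Assumed rule: the first phrase gets weight `count`, the last gets 1.0.
    return [(pair, float(count - index)) for index, pair in enumerate(pairs)]


tokens = ["자연어", "검색에서", "적절도를", "향상시키는", "방법", "및", "시스템"]
for (first, second), weight in build_phrases(tokens):
    print(first, second, weight)
```

For seven tokens this yields six phrases; any other weighting scheme could be substituted without changing the rest of the scoring.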

For example, the phrase group can be expressed as P = [{(자연어 검색에서), 6.0}, {(검색에서 적절도를), 5.0}, {(적절도를 향상시키는), 4.0}, {(향상시키는 방법), 3.0}, {(방법 및), 2.0}, {(및 시스템), 1.0}], which corresponds to the phrases "in natural language search", "relevance in search", "improving relevance", "method of improving", "method and", and "and system".

In this scheme the first phrase, "in natural language search", is given a weight of 6.0, the second phrase, "relevance in search", is given a weight of 5.0, and so on.

The document score calculation module 150 calculates a token score and a phrase score using the search term entered by the user, the filtered documents, the phrases, and the weights.

Equation 1 below is the equation for calculating the token score.

In Equation 1, $D$ is the document to be matched, $Q$ is the search term entered by the user, $q_i$ is the $i$-th token in the query ($i$ is an index variable), $f(q_i, D)$ is the number of matched tokens in the retrieved document, $k_1$ is the term saturation parameter (TSP), set to 1.2 by default, $b$ is a constant with a default value of 0.75, $dl$ is the length of the retrieved document, $avgdl$ is the average field length over all documents, $docCount$ is the total number of documents, $IDF(q_i)$ is the inverse document frequency of the $i$-th query token $q_i$, and $n(q_i)$ is the number of documents that contain the token $q_i$.
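The image of Equation 1 is not reproduced in this text. The variables defined for it (a term saturation parameter of 1.2, a constant b of 0.75, dl, avgdl, docCount, and an inverse document frequency) match the widely used Okapi BM25 scoring function, so under that assumption the token score would take the form below. The name TokenScore and the particular IDF smoothing shown, which is the one used by Apache Lucene, are assumptions and are not taken from the patent.

```latex
\mathrm{TokenScore}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \cdot \dfrac{dl}{avgdl}\right)},
\qquad
\mathrm{IDF}(q_i) = \ln\!\left(1 + \frac{docCount - n(q_i) + 0.5}{n(q_i) + 0.5}\right)
```

Here $n$ is the number of tokens in the query, $k_1 = 1.2$ and $b = 0.75$ by default, and the remaining symbols are as defined for Equation 1.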

Here dl is the number of tokens obtained when the document is tokenized, and avgdl is the average number of tokens per document when all documents in the index are tokenized.

Whereas document frequency indicates how often a particular term (token) appears, inverse document frequency reflects an attempt to lower the weight of frequently occurring words by treating them as less important.

Referring to Equation 1, the token score serves as an indicator of how often the terms (tokens) in the query appear in each document.

In other words, the token score of Equation 1 lowers the weight of frequently occurring words, accounts for the number of tokens relative to the length of the document, and converts how often the query keywords appear in the document into a score.
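The three properties just listed can be made concrete with a small Python sketch. It assumes that Equation 1 is the standard BM25 formula with Lucene-style IDF smoothing, which is consistent with the variable definitions but is not confirmed by the text; the function names and the toy corpus are hypothetical.

```python
import math


def idf(doc_count: int, doc_freq: int) -> float:
    """Inverse document frequency: rarer tokens receive larger values."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def token_score(
    query_tokens: list[str],
    doc_tokens: list[str],
    corpus: list[list[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """BM25-style token score of one tokenized document against a query."""
    dl = len(doc_tokens)                                # length of this document
    avgdl = sum(len(d) for d in corpus) / len(corpus)   # average document length
    score = 0.0
    for q in query_tokens:
        f = doc_tokens.count(q)                         # occurrences of q in the document
        if f == 0:
            continue
        n_q = sum(1 for d in corpus if q in d)          # documents containing q
        norm = f + k1 * (1.0 - b + b * dl / avgdl)      # length-normalized saturation
        score += idf(len(corpus), n_q) * f * (k1 + 1.0) / norm
    return score


corpus = [["a", "b"], ["a", "c", "c"], ["d"]]
print(token_score(["a", "c"], corpus[1], corpus))
print(token_score(["a", "c"], corpus[0], corpus))
```

The second document scores higher than the first because it matches both query tokens and contains the rarer token "c" twice, even though it is longer than average.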

In addition to the token score of Equation 1, the present invention calculates the phrase score described below in order to improve the relevance of natural language search.

Equation 2 below is the equation for calculating the phrase score.

In Equation 2, the count term for the j-th phrase (j is an index variable) is the number of occurrences of that phrase in the document for which the distance between its tokens is less than or equal to a given distance, and weight is the weight assigned to the phrase.
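The image of Equation 2 is likewise not reproduced in this text. Based only on the two quantities it is said to use, a per-phrase count and a per-phrase weight, one plausible form is a weighted sum over the m phrases. The symbols $p_j$, $c(p_j, D)$, and $weight_j$ below are assumed notation, not the patent's.

```latex
\mathrm{PhraseScore}(D, P) = \sum_{j=1}^{m} c(p_j, D) \cdot weight_j
```

Here $p_j$ is the $j$-th phrase in the phrase group $P$, $c(p_j, D)$ is the number of occurrences of $p_j$ in document $D$ whose token-to-token distance is at most the given distance, and $weight_j$ is the weight assigned to $p_j$.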

In Equations 1 and 2, n and m are natural numbers.
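A minimal Python sketch of a phrase score consistent with the description of Equation 2 is given below. It assumes the score is a weighted sum of proximity-limited phrase counts, and it assumes that "distance" means the number of token positions between the first token and a later occurrence of the second token. Both are interpretations, and the names `count_phrase`, `phrase_score`, and `max_distance` are hypothetical.

```python
def count_phrase(
    doc_tokens: list[str], first: str, second: str, max_distance: int = 1
) -> int:
    """Count occurrences of `first` followed by `second` within `max_distance` positions."""
    count = 0
    for i, tok in enumerate(doc_tokens):
        if tok != first:
            continue
        # Look at the next `max_distance` tokens after this occurrence of `first`.
        window = doc_tokens[i + 1 : i + 1 + max_distance]
        if second in window:
            count += 1
    return count


def phrase_score(
    doc_tokens: list[str],
    phrases: list[tuple[tuple[str, str], float]],
    max_distance: int = 1,
) -> float:
    """Weighted sum of proximity-limited phrase counts."""
    return sum(
        weight * count_phrase(doc_tokens, first, second, max_distance)
        for (first, second), weight in phrases
    )


doc = ["x", "a", "b", "y", "a", "z", "b"]
phrases = [(("a", "b"), 2.0)]
print(phrase_score(doc, phrases, max_distance=1))  # only the directly adjacent "a b" counts
print(phrase_score(doc, phrases, max_distance=2))  # "a z b" now counts as well
```

Raising `max_distance` makes matching more tolerant of intervening words, at the cost of rewarding looser co-occurrences.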

The document sorting module 160 sorts the documents in descending order of the per-document score obtained by summing the token score and the phrase score. The per-document score allows documents to be served in order of how heavily they contain the search terms the user is looking for.
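The sorting step reduces to summing two numbers per document and ordering by the result. A minimal sketch, assuming the two scores are already available as dictionaries keyed by document id (the function name and the data layout are hypothetical):

```python
def rank_documents(
    token_scores: dict[str, float], phrase_scores: dict[str, float]
) -> list[tuple[str, float]]:
    """Sum token and phrase scores per document and sort in descending order."""
    totals = {
        doc_id: token_scores[doc_id] + phrase_scores.get(doc_id, 0.0)
        for doc_id in token_scores
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_documents({"d1": 1.0, "d2": 0.5}, {"d1": 0.0, "d2": 3.0})
print(ranking)  # [('d2', 3.5), ('d1', 1.0)]
```

In this toy case d2 has the lower token score but overtakes d1 once its phrase score is added, which is the effect the combined score is meant to produce.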

FIG. 2 is an embodiment of a method for improving the relevance of natural language search according to the present invention.

Referring to FIG. 2, the method 200 for improving the relevance of natural language search according to the present invention includes a natural language search step 210, a tokenization step 220, a document filter step 230, a phrase weighting step 240, a document score calculation step 250, and a document sorting step 260.

In the natural language search step 210, the natural language included in a user query is searched for.

In the tokenization step 220, the tokenization module 120 collects the natural language search terms included in the query entered by the user and separates them into tokens.

In the document filter step 230, the document filter module 130 filters, from among a plurality of documents, the documents that contain the tokens separated by the tokenization module.

In the phrase weighting step 240, the phrase weighting module 140 generates phrases by pairing two of the tokens and assigns a weight to each phrase.

In the document score calculation step 250, the document score calculation module 150 calculates a token score and a phrase score using the search term entered by the user, the filtered documents, the phrases, and the weights.

In the document sorting step 260, the document sorting module 160 sorts the documents in descending order of the per-document score obtained by summing the token score and the phrase score.

Preferably, the tokenization step 220 tokenizes the search term on the spaces in the search sentence, the document filter step 230 filters the documents that contain at least a predetermined reference number of tokens, and the phrase weighting step 240 generates each phrase from two mutually adjacent tokens.

The token score is calculated using the search term entered by the user, the document to be matched against the tokens, the number of matched tokens in the filtered document, the length of the retrieved document, the average length of the retrieved documents, and the inverse document frequency of the tokens included in the query.

The phrase score is preferably calculated using the number of matching phrases in the document and the number of phrases for which the distance between tokens in the document is less than or equal to a predetermined distance.
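Steps 220 through 260 can be chained into a single end-to-end sketch. This is an illustration under several assumptions rather than the patented implementation: the token score is assumed to be standard BM25 with Lucene-style IDF, phrase weights are assumed to count down to 1.0, the phrase score is assumed to be a weighted sum of proximity-limited counts, and the reference token count is assumed to apply to distinct query tokens. The function name `search` and all parameter names are hypothetical.

```python
import math


def search(query, docs, min_matches=1, max_distance=1, k1=1.2, b=0.75):
    """Rank `docs` (id -> text) against `query`; returns [(doc_id, score), ...]."""
    # Step 220: tokenize the query and the documents on whitespace.
    q_tokens = query.split()
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}

    # Step 230: keep documents with at least `min_matches` distinct query tokens.
    kept = {
        doc_id: toks
        for doc_id, toks in tokenized.items()
        if len(set(q_tokens) & set(toks)) >= min_matches
    }
    if not kept:
        return []

    # Step 240: adjacent token pairs with assumed decreasing weights.
    pairs = list(zip(q_tokens, q_tokens[1:]))
    phrases = [(pair, float(len(pairs) - i)) for i, pair in enumerate(pairs)]

    # Step 250: token score (assumed BM25) plus phrase score per document.
    doc_count = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized.values()) / doc_count
    results = []
    for doc_id, toks in kept.items():
        token_score = 0.0
        for q in q_tokens:
            f = toks.count(q)
            if f:
                n_q = sum(1 for other in tokenized.values() if q in other)
                idf = math.log(1 + (doc_count - n_q + 0.5) / (n_q + 0.5))
                token_score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(toks) / avgdl))
        phrase_score = 0.0
        for (first, second), weight in phrases:
            hits = sum(
                1
                for i, tok in enumerate(toks)
                if tok == first and second in toks[i + 1 : i + 1 + max_distance]
            )
            phrase_score += weight * hits
        results.append((doc_id, token_score + phrase_score))

    # Step 260: sort by combined score, highest first.
    return sorted(results, key=lambda item: item[1], reverse=True)


docs = {
    "a": "natural language search ranking",
    "b": "search language natural",
    "c": "cooking recipes",
}
for doc_id, score in search("natural language search", docs):
    print(doc_id, round(score, 3))
```

Documents "a" and "b" contain the same three query tokens, but only "a" preserves their order, so its phrase score lifts it above "b"; "c" contains no query token and is filtered out.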

The technical idea of the present invention has been described above together with the accompanying drawings, but this is an illustrative description of preferred embodiments and does not limit the present invention. It is also evident that anyone of ordinary skill in the art to which the present invention pertains can make various modifications and variations without departing from the scope of the technical idea of the present invention.

110: natural language search module
120: tokenization module
130: document filter module
140: phrase weighting module
150: document score calculation module
160: document sorting module

Claims

A system for improving the relevance of natural language search, comprising:
a tokenization module that collects the natural language search terms included in a query entered by a user and tokenizes them on spaces;
a document filter module that filters, from among a plurality of documents, the documents containing at least a predetermined reference number of the tokens separated by the tokenization module;
a phrase weighting module that generates phrases by pairing two mutually adjacent tokens from among the plurality of tokens and assigns a weight to each phrase;
a document score calculation module that calculates a token score and a phrase score using the search term entered by the user, the filtered documents, the phrases, and the weights; and
a document sorting module that sorts the documents in descending order of the per-document score obtained by summing the token score and the phrase score,
wherein the token score is calculated using the search term entered by the user, the document to be matched against the tokens, the number of matched tokens in the filtered document, the length of the retrieved document, the average length of the retrieved documents, and the inverse document frequency of the tokens included in the query, such that the number of matched tokens in each individual filtered document is weighted according to document length, reflecting the ratio of that document's length to the average length of the filtered documents, and the inverse document frequency is applied, and
wherein the phrase score is calculated using the number of matching phrases in the document and the number of phrases for which the distance between tokens in the document is less than or equal to a given distance.

(Deleted)

A method for improving the relevance of natural language search, performed by the system for improving the relevance of natural language search according to claim 1, the method comprising:
collecting, by the tokenization module, the natural language search terms included in a query entered by a user and tokenizing them on spaces;
filtering, by the document filter module, from among a plurality of documents, the documents containing at least a predetermined reference number of the tokens separated by the tokenization module;
generating, by the phrase weighting module, phrases by pairing two mutually adjacent tokens from among the plurality of tokens and assigning a weight to each phrase;
calculating, by the document score calculation module, a token score and a phrase score using the search term entered by the user, the filtered documents, the phrases, and the weights; and
sorting, by the document sorting module, the documents in descending order of the per-document score obtained by summing the token score and the phrase score,
wherein the token score is calculated using the search term entered by the user, the document to be matched against the tokens, the number of matched tokens in the filtered document, the length of the retrieved document, the average length of the retrieved documents, and the inverse document frequency of the tokens included in the query, such that the number of matched tokens in each individual filtered document is weighted according to document length, reflecting the ratio of that document's length to the average length of the filtered documents, and the inverse document frequency is applied, and
wherein the phrase score is calculated using the number of matching phrases in the document and the number of phrases for which the distance between tokens in the document is less than or equal to a given distance.

(Deleted)