KR102260646B1

KR102260646B1 - Natural language processing system and method for word representations in natural language processing

Info

Publication number: KR102260646B1
Application number: KR1020190080967A
Authority: KR
Inventors: 이상근; 김예찬
Original assignee: 고려대학교 산학협력단
Priority date: 2018-10-10
Filing date: 2019-07-04
Publication date: 2021-06-07
Anticipated expiration: 2039-07-04
Also published as: KR20200040652A

Abstract

The present invention relates to a natural language processing system and a word representation method in natural language processing. The word representation method, performed by a natural language processing system, comprises: a) providing a vocabulary dictionary dataset that includes a vocabulary of at least one word and pre-trained word embedding information for each word; b) when a vocabulary based on the vocabulary dictionary dataset is provided as input data, extracting subword information for the words present in the input data using a word representation model, and calculating a word embedding vector from the subword information; and c) matching the calculated word embedding vector with the pre-trained word embedding information of the corresponding word, thereby replacing the pre-trained word embedding information with the calculated word embedding vector and learning a word representation for the word. The word representation model includes a convolution module, based on a convolutional neural network, that calculates subword feature vectors from the subword information, and a highway module, based on a highway network, that adaptively combines the subword feature vectors calculated by the convolution module to calculate the word embedding vector of the corresponding word.

Description

NATURAL LANGUAGE PROCESSING SYSTEM AND METHOD FOR WORD REPRESENTATIONS IN NATURAL LANGUAGE PROCESSING

The present invention relates to a natural language processing system, and to a word representation method in natural language processing, that generate word representations for all words, including out-of-vocabulary (OOV) words, based on supervised learning from pre-trained word embeddings.

As deep learning has driven major advances in computer vision systems, research on natural language processing systems that use deep learning is also progressing rapidly. To represent words numerically, a natural language processing system based on deep learning must embed each word as a low-dimensional vector.

Natural language processing is the study of syntactic and semantic representations that allow a computer to understand human (natural) language. Word embedding is a technique derived from neural-network-based language models that expresses lexical meaning by placing similar words close together in a vector space.

A language model computes the probability of the next word given the preceding words in a sentence. Because a language model computes the probability that a sentence actually occurs, it can determine how grammatically or semantically appropriate a given sentence is.

Language models can be applied to fields such as speech recognition, question answering (QA), natural language processing, predictive text, translation, and interpretation. In an open-vocabulary environment, however, words that do not belong to the vocabulary of the training data, i.e., out-of-vocabulary (OOV) words, must be handled. Conventional approaches to OOV handling either treat an OOV word as a pseudo word or generate a temporary representation for the OOV word from the context in which it is used.

These conventional methods can be a good solution when OOV words are rare, but when the number of OOV words grows, the natural language processing system can no longer analyze text properly.

This problem is especially evident in social media. Words used on social media often deviate from standard word forms, for example abbreviations, compound words, neologisms, and typos, and such words are very likely to be missing from the word embeddings held by a natural language processing system.

Accordingly, for an open-vocabulary environment in which the natural language processing system must handle a large number of words or in which neologisms occur frequently, there is a need for a language model that can generate a representation for each OOV word and infer the unique meaning of that word.

Korean Patent No. 10-1799681 (Title of the invention: Apparatus and method for homograph discrimination using a lexical semantic network and word embedding)

The present invention is intended to solve the problems described above. Conventional approaches infer the meaning of an OOV word from its surrounding words without considering any information intrinsic to the word itself. An object of an embodiment of the present invention is instead to generate a unique word representation from the subword information of a word, based on supervised learning from pre-trained word embeddings, so that the unique meaning of every word, not only OOV words, can be inferred.

However, the technical problem to be solved by the present embodiment is not limited to the one described above, and other technical problems may exist.

As a technical means for achieving the above technical object, a word representation method in natural language processing according to an embodiment of the present invention is performed by a natural language processing system and comprises: a) providing a vocabulary dictionary dataset that includes a vocabulary of at least one word and pre-trained word embedding information for each word; b) when a vocabulary based on the vocabulary dictionary dataset is provided as input data, extracting subword information for the words present in the input data using a word representation model, and calculating a word embedding vector from the subword information; and c) matching the calculated word embedding vector with the pre-trained word embedding information of the corresponding word, thereby replacing the pre-trained word embedding information with the calculated word embedding vector and learning a word representation for the word. The word representation model includes a convolution module, based on a convolutional neural network, that calculates subword feature vectors from the subword information, and a highway module, based on a highway network, that adaptively combines the subword feature vectors calculated by the convolution module to calculate the word embedding vector of the corresponding word.

A natural language processing system according to another embodiment of the present invention is a natural language processing system for distributed representation of words included in a vocabulary, comprising: a memory in which a program for performing a word representation method in natural language processing is recorded; and a processor for executing the program. By executing the program, the processor receives, as input data, a vocabulary based on a vocabulary dictionary dataset that includes a vocabulary of at least one word and pre-trained word embedding information for each word; extracts subword information for the words present in the input data using a word representation model; calculates a word embedding vector from the subword information; and matches the calculated word embedding vector with the pre-trained word embedding information of the corresponding word, thereby replacing the pre-trained word embedding information with the calculated word embedding vector and learning a word representation for the word. The word representation model includes a convolution module, based on a convolutional neural network, that calculates subword feature vectors from the subword information, and a highway module, based on a highway network, that adaptively combines the subword feature vectors calculated by the convolution module to calculate the word embedding vector of the corresponding word.

According to the means described above, the unique meaning of a word can be extracted accurately by using the subword information of every word, not only OOV words, and the system can operate effectively in an open-vocabulary environment containing many OOV words.

In addition, the present invention does not require lengthy training on a large corpus to generate new word embeddings. Because it can generate embeddings for OOV words from the word embeddings that an existing natural language processing system already holds, both the efficiency and the effectiveness of word embedding generation can be improved.

FIG. 1 is a diagram showing the configuration of a natural language processing system for generating word representations according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a word representation model according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the convolution module, which is a component of FIG. 2.
FIG. 4 is a diagram illustrating the highway module, which is a component of FIG. 2.
FIG. 5 is a flowchart illustrating the training process of the word representation model in a word representation method in natural language processing according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating the inference process of the word representation model in a word representation method in natural language processing according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the nearest neighbors of word representations for in-vocabulary words and OOV words.
FIG. 8 is a diagram showing word similarity results over the entire dataset, used to compare a word representation method in natural language processing according to an embodiment of the present invention with other learning models.
FIG. 9 is a diagram showing experimental results of applying a word representation method in natural language processing according to an embodiment of the present invention to an actual natural language processing system.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. To describe the present invention clearly, parts of the drawings that are irrelevant to the description are omitted, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed between them. In addition, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise, and it should be understood that this does not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In this specification, a 'terminal' may be a wireless communication device with guaranteed portability and mobility, for example any kind of handheld wireless communication device such as a smartphone, tablet PC, or notebook computer. A 'terminal' may also be a wired communication device, such as a PC, that can connect to another terminal or a server through a network. A network refers to a connection structure that allows information to be exchanged between nodes such as terminals and servers, and includes a local area network (LAN), a wide area network (WAN), the Internet (WWW: World Wide Web), wired and wireless data communication networks, telephone networks, and wired and wireless television networks.

Examples of wireless data communication networks include, but are not limited to, 3G, 4G, 5G, 3GPP (3rd Generation Partnership Project), LTE (Long Term Evolution), WiMAX (Worldwide Interoperability for Microwave Access), Wi-Fi, Bluetooth communication, infrared communication, ultrasonic communication, visible light communication (VLC), and LiFi.

The following embodiments are detailed descriptions intended to aid understanding of the present invention and do not limit its scope. Accordingly, inventions of the same scope that perform the same function as the present invention also fall within the scope of the present invention.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram showing the configuration of a natural language processing system for generating word representations according to an embodiment of the present invention, FIG. 2 is a diagram illustrating a word representation model according to an embodiment of the present invention, FIG. 3 is a diagram illustrating the convolution module, which is a component of FIG. 2, and FIG. 4 is a diagram illustrating the highway module, which is a component of FIG. 2.

Referring to FIG. 1, the natural language processing system 100 includes a communication module 110, a memory 120, a processor 130, and a database 140.

In detail, the communication module 110 works with a communication network to provide the communication interface needed to exchange signals between the natural language processing system 100 and a user terminal in the form of packet data. The communication module 110 may also receive a data request from the user terminal and transmit data in response.

Here, the communication module 110 may be a device that includes the hardware and software needed to transmit and receive signals, such as control signals or data signals, over a wired or wireless connection with other network devices.

The memory 120 stores a program for performing the word representation method in natural language processing. It also temporarily or permanently stores data processed by the processor 130. The memory 120 may include volatile storage media or non-volatile storage media, but the scope of the present invention is not limited thereto.

The processor 130 controls the entire process of providing the word representation method in natural language processing. That is, the processor 130 stores a variety of vocabulary and the pre-trained word embedding vector of each word in the vocabulary dictionary dataset; trains a word representation model that generates word representations based on supervised learning from the word embeddings stored in the vocabulary dictionary dataset; and, after generating word representations for both in-vocabulary words and OOV words with the trained model, can infer the unique meaning of an OOV word through nearest neighbor search. Each operation performed by the processor 130 is described in more detail later.
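The nearest neighbor search mentioned above can be illustrated with a minimal sketch. The description does not state which similarity measure is used, so cosine similarity is an assumption here, and the function names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, embeddings, top_k=2):
    """Return the top_k in-vocabulary words closest to the query vector."""
    ranked = sorted(embeddings, key=lambda w: cosine(query, embeddings[w]), reverse=True)
    return ranked[:top_k]

# Toy pre-trained embeddings (illustrative values only).
registered = {
    "happy": [0.9, 0.1, 0.0],
    "glad": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
# Hypothetical vector generated by the model for an OOV word.
oov_vector = [0.85, 0.15, 0.05]
neighbors = nearest_neighbors(oov_vector, registered)
print(neighbors)
```

The in-vocabulary words returned for a generated OOV vector give a rough indication of the meaning the model has assigned to that word.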

As shown in FIG. 2, the word representation model 200 learns to generate word representations based on supervised learning from pre-trained word embeddings. The word representation model 200 includes a convolution module 210, a highway module 220, and an optimization module 230.

The convolution module 210 extracts character-based subword features through a convolutional neural network. The highway module 220 uses a highway network to adaptively combine the subword features extracted by the convolution module 210 and calculate a word embedding vector. The optimization module 230 performs optimization so that the word embedding vector calculated by the highway module 220 becomes similar to the pre-trained word embedding.

Because a convolutional neural network can extract local features in natural language processing, the convolution module 210 applies one to the character sequence to extract subword information. As shown in FIG. 3, the convolution module 210 includes filters that each extract a different feature from the character sequence, and produces feature maps, i.e., the matrices computed by each filter. Once the feature maps are extracted, a non-linear function (tanh, the hyperbolic tangent) is applied to convert each value into a non-linear indication of whether the corresponding feature is present.

In general, the performance of a learned word representation model improves as its depth increases. However, a deeper model is harder to optimize and therefore harder to train. A highway network makes it possible to deepen the word representation model while controlling the flow of information and maximizing trainability.

As shown in FIG. 4, the highway module 220 takes the subword feature vectors received from the convolution module 210 as its input y and additionally applies a non-linear transform (T) and a carry (C). Because they express how much of the output z is obtained by transforming the input y and how much is carried over from it, T is called the transform gate and C is called the carry gate.
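As a rough illustration of the transform and carry gates, the following sketch implements a single highway layer in plain Python. It assumes the common formulation z = T * H(y) + C * y with C = 1 - T, a sigmoid gate, and a tanh transform H; these choices, and all names and weights, are assumptions for illustration rather than details given in this description.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, y):
    """Compute W @ y + b for a list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, y)) + bias for row, bias in zip(W, b)]

def highway_layer(y, W_h, b_h, W_t, b_t):
    h = [math.tanh(a) for a in affine(W_h, b_h, y)]  # non-linear transform H(y) (tanh is an assumption)
    t = [sigmoid(a) for a in affine(W_t, b_t, y)]    # transform gate T
    c = [1.0 - g for g in t]                         # carry gate C = 1 - T (assumption)
    return [ti * hi + ci * yi for ti, hi, ci, yi in zip(t, h, c, y)]

y = [0.5, -0.25]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# Strongly negative gate bias: T is close to 0, so the input is carried through.
z_carry = highway_layer(y, identity, [0.0, 0.0], zeros, [-20.0, -20.0])
# Strongly positive gate bias: T is close to 1, so the output is the transformed input.
z_transform = highway_layer(y, identity, [0.0, 0.0], zeros, [20.0, 20.0])
print(z_carry, z_transform)
```

The two calls show the extremes of the gate: one passes y through almost unchanged, the other returns almost exactly tanh(y).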

Meanwhile, the processor 130 may include any kind of device capable of processing data. Here, a 'processor' may refer to a data processing device embedded in hardware that has physically structured circuitry for performing functions expressed as code or instructions contained in a program. Examples of such a hardware-embedded data processing device include a microprocessor, a central processing unit (CPU), a processor core, a multiprocessor, an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA), but the scope of the present invention is not limited thereto.

The database 140 stores the data accumulated while the word representation method in natural language processing is performed. For example, the database 140 may store the vocabulary dictionary dataset, the word representation model, and the embedding vectors of in-vocabulary and OOV words.

FIG. 5 is a flowchart illustrating the training process of the word representation model in a word representation method in natural language processing according to an embodiment of the present invention.

Referring to FIG. 5, in the word representation method in natural language processing, the natural language processing system 100 provides a vocabulary dictionary dataset that includes a vocabulary of at least one word and pre-trained word embedding information (S110).

When a vocabulary based on the vocabulary dictionary dataset is provided as input data (S120), the word representation model extracts subword information for the words present in the input data (S130), generates subword feature vectors from the extracted subword information, and then combines the subword feature vectors to calculate a word embedding vector (S140).

The word representation model matches the calculated word embedding vector with the pre-trained word embedding information of the corresponding word (S150), and learns a word representation for the word by replacing the pre-trained word embedding with the calculated word embedding vector (S160).
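The matching in steps S150 and S160 can be read as a supervised objective that pulls each generated vector toward its pre-trained target. The sketch below uses the mean squared Euclidean distance as that objective; the specific loss is an assumption, since this passage does not name one, and the function names and values are hypothetical.

```python
def squared_distance(generated, pretrained):
    """Squared Euclidean distance between a generated and a pre-trained vector."""
    return sum((g - p) ** 2 for g, p in zip(generated, pretrained))

def batch_loss(generated_by_word, pretrained_by_word):
    """Average distance over all words that have a pre-trained target (assumed loss)."""
    words = list(pretrained_by_word)
    total = sum(squared_distance(generated_by_word[w], pretrained_by_word[w]) for w in words)
    return total / len(words)

# Toy values: "dog" already matches its target, "cat" is off by 1 in one dimension.
generated = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
pretrained = {"cat": [1.0, 1.0], "dog": [0.0, 1.0]}
loss = batch_loss(generated, pretrained)
print(loss)
```

Minimizing such a loss over the vocabulary is what would make the generated vectors usable in place of the pre-trained ones.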

Subwords can be defined at various granularities, such as roots, characters, or n-grams, but the word representation model 200 uses the character as the subword unit. Accordingly, the word representation model splits every word in the vocabulary into characters and uses one-hot encoding to represent each character. A one-hot encoding has the value 1 at the index of the corresponding character and 0 elsewhere, and a word is formed by concatenating these character representations.

Concatenating the character representations that make up each word yields the character sequence representation shown in Equation 1.

[Equation 1]

$$ r = c_1 \oplus c_2 \oplus \cdots \oplus c_n $$

In Equation 1, $r$ is the sequence representation of a word, with $r \in \mathbb{R}^{|V_c| \times n}$, where $V_c$ is the vocabulary of characters and $n$ is the length of the word; $\oplus$ is the concatenation operator; and $c_i$ is the character representation (i.e., the one-hot encoding) of the $i$-th character of the word.

Here, zero-padding is inserted into r so that the same number of convolutions is performed on every character; the amount of zero-padding is (h-1)/2 on each side of r, where h is the filter width.
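A minimal sketch of building the padded character sequence representation follows. It stores r as a list of one-hot columns, assumes a lowercase Latin alphabet as the character vocabulary and an odd filter width h, and uses hypothetical function names.

```python
def one_hot(ch, alphabet):
    """One-hot column for a character: 1 at the character's index, 0 elsewhere."""
    vec = [0] * len(alphabet)
    vec[alphabet.index(ch)] = 1
    return vec

def char_sequence(word, alphabet, h):
    """Concatenate one-hot columns and add (h - 1) // 2 zero columns on each side."""
    pad = (h - 1) // 2
    zero = [0] * len(alphabet)
    columns = [one_hot(ch, alphabet) for ch in word]
    return [zero[:] for _ in range(pad)] + columns + [zero[:] for _ in range(pad)]

alphabet = "abcdefghijklmnopqrstuvwxyz"  # assumed character vocabulary
r = char_sequence("cat", alphabet, h=3)
print(len(r), len(r[0]))
```

With h = 3 and a three-letter word, one zero column is added on each side, so a width-3 filter can be centered on every character.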

The convolution module 210 of the word representation model 200 extracts subword information by performing a convolution on the padded character sequence representation r of Equation 1, using Equation 2 below.

[Equation 2]

$$ s_i = \tanh\left( F \cdot r_{i:i+h-1} + b \right) $$

In Equation 2, $F$ is a filter used in the convolutional neural network, with $F \in \mathbb{R}^{|V_c| \times h}$; $h$ is the width of the filter $F$; $r_{i:i+h-1}$ is the sequence representation spanning the characters $c_i$ to $c_{i+h-1}$; $b$ is the bias; and $\tanh$ is the non-linear function applied to the convolution results.
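The convolution of one filter over the padded sequence, s_i = tanh(F · r_{i:i+h-1} + b), can be sketched as follows. Both r and F are stored as lists of columns, the product is taken as an element-wise multiply-and-sum over the window, and the two-character alphabet and filter values are toy assumptions.

```python
import math

def convolve(r, F, b):
    """Slide filter F (h columns) over r and apply tanh to each window response."""
    h = len(F)
    out = []
    for i in range(len(r) - h + 1):
        window = r[i:i + h]
        total = sum(f * x for f_col, r_col in zip(F, window) for f, x in zip(f_col, r_col))
        out.append(math.tanh(total + b))
    return out

# Word "ab" over the toy alphabet {a, b}, padded with one zero column per side (h = 3).
r_padded = [[0, 0], [1, 0], [0, 1], [0, 0]]
# Toy filter that responds only when 'a' sits at the center of the window.
F_center_a = [[0, 0], [1, 0], [0, 0]]
s = convolve(r_padded, F_center_a, b=0.0)
print(s)
```

Because of the padding, the output has one value per character of the word: a positive response at the position of 'a' and zero at the position of 'b'.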

After extracting the subword information intrinsic to a word, the convolution module 210 applies a max pooling operation to the extracted subword information so that only the significant subword features are retained, as shown in Equation 3 below.

[Equation 3]

$$ e_i = \max\left( s_{ki:k(i+1)-1} \right) $$

In Equation 3, $s$ is the subword feature vector, $k$ is the length of the stride, and $s_{ki:k(i+1)-1}$ is the portion of $s$ covered by one stride. Because max pooling with a stride produces features whose length depends on the length of the word, the elements of the vector $e$ are simply summed to produce a fixed-size feature.
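The strided max pooling and the final summation can be sketched as below for a single filter. The windows are non-overlapping segments of length k; letting the last segment be shorter when the length of s is not a multiple of k is an assumption of this sketch, as are the function names and values.

```python
def strided_max_pool(s, k):
    """Take the maximum of each consecutive segment of length k (last one may be shorter)."""
    return [max(s[i:i + k]) for i in range(0, len(s), k)]

def pooled_feature(s, k):
    """Sum the pooled values so that every word yields one value per filter."""
    return sum(strided_max_pool(s, k))

s = [0.1, 0.9, 0.3, 0.4, 0.2]   # toy feature map for a five-character word
e = strided_max_pool(s, k=2)
feature = pooled_feature(s, k=2)
print(e, feature)
```

The length of e grows with the word, while the summed feature is a single number, so stacking one such number per filter gives a vector whose size does not depend on word length.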

In the convolution module 210, several filters traverse the input data at a specified interval while computing the convolution; this interval is called the stride. When the input data has multiple channels, a filter traverses each channel, computes the convolution to produce a feature map per channel, and combines (concatenates) the per-channel feature maps into the final feature map that is returned.

The resulting vector s has the same length as the input word because of the zero-padding, and max pooling is used together with a stride so that the subword information covers prefixes, roots, and suffixes. Consequently, each feature derived from strided max pooling summarizes the subword information in one part of the character sequence.

In this way, unlike conventional approaches that learn word representations from a large corpus by unsupervised learning, the word representation model learns word representations for every word in the vocabulary using a vocabulary dictionary dataset that contains pre-trained word embedding information.

The highway module 220 of the word representation model 200 routes the subword feature vector using the gate mechanism of Equation 4 below. It is useful for convolutional neural networks because it adaptively combines the local features of the individual filters. That is, the highway module 220 adaptively combines the subword feature vectors derived from the convolution module 210 and relates them to the pre-trained word embedding information.

[Equation 4]

In Equation 4, H is the nonlinear transformation of the input e, T is the transform gate, and C is the carry gate.

In Equation 4, the carry gate C can be simplified to (1 - T), which yields Equation 5 below.

[Equation 5]

In Equation 5, W_t and W_h are square matrices, b_t and b_h are biases, and tanh is the nonlinear function applied to the result.
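The gate mechanism can be sketched as a single highway layer. Since the formulas of Equations 4 and 5 are not legible in this text, the code assumes the usual highway form y = T * tanh(W_h e + b_h) + (1 - T) * e with T = sigmoid(W_t e + b_t). The sigmoid gate is an assumption (the text names only tanh), and the function names are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway(e, W_h, b_h, W_t, b_t):
    """y = T * tanh(W_h e + b_h) + (1 - T) * e, with T = sigmoid(W_t e + b_t)."""
    H = [math.tanh(a + b) for a, b in zip(matvec(W_h, e), b_h)]
    T = [sigmoid(a + b) for a, b in zip(matvec(W_t, e), b_t)]
    return [t * h + (1.0 - t) * x for t, h, x in zip(T, H, e)]


e = [0.5, -1.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# A strongly negative gate bias closes the transform gate, so e is carried through.
carried = highway(e, zeros, [0.0, 0.0], zeros, [-50.0, -50.0])
# A strongly positive gate bias opens it, so the output is tanh(W_h e + b_h).
transformed = highway(e, zeros, [1.0, 1.0], zeros, [50.0, 50.0])
```

Because W_t and W_h are square, the output has the same dimension as the input, which is what allows the carry term (1 - T) * e to be added element-wise.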

The highway module 220 applies the linear transformation of Equation 6 below to the calculated word embedding vector so that the resulting word embedding vector has the same size as the pre-trained word embedding information.

[Equation 6]

In Equation 6, w is the final word representation produced by the word representation model, W and b are the parameters of the linear transformation, and y is the result vector. For optimization, the linear transformation of Equation 6 brings the result vector y to the same size as the pre-trained word embedding.
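The size-matching projection can be sketched as below, assuming Equation 6 has the affine form w = W y + b suggested by its symbol definitions. The function name `linear` and the toy dimensions (a 2-dimensional y mapped to a 3-dimensional w) are hypothetical.

```python
def linear(y, W, b):
    """Compute w = W y + b, where W has one row per output dimension."""
    return [sum(wij * yj for wij, yj in zip(row, y)) + bi
            for row, bi in zip(W, b)]


y = [2.0, 3.0]                            # toy highway output
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 x 2 projection matrix
b = [0.0, 0.0, 1.0]
w = linear(y, W, b)
print(w)  # [2.0, 3.0, 6.0]
```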

In this way, the word representation model 200 generates word representations that take subword information into account, using the convolution module 210 and the highway module 220. In addition, the optimization module 230 of the word representation model 200 uses the squared Euclidean distance of Equation 7 below as the objective function, so that the calculated word embedding vector is trained to become similar to the pre-trained word embedding vector. The Euclidean distance is also called the L2 distance.

[Equation 7]

In Equation 7, V_w is the vocabulary in the vocabulary dictionary dataset, and w_v is the calculated word embedding vector, which is compared with the pre-trained word embedding vector of the same word.
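The objective can be sketched as a sum of squared Euclidean distances over the vocabulary. The formula of Equation 7 is not legible in this text, so the summation over all words in V_w is an assumption based on its symbol definitions; the function name and the toy vectors are hypothetical.

```python
def squared_l2_loss(generated, pretrained):
    """Sum over the vocabulary of ||w_v - pretrained_v||^2.

    Both arguments map each word to its embedding vector.
    """
    return sum(
        sum((a - b) ** 2 for a, b in zip(generated[word], pretrained[word]))
        for word in pretrained
    )


generated = {"cat": [1.0, 2.0], "dog": [0.0, 0.0]}   # toy model outputs
pretrained = {"cat": [1.0, 0.0], "dog": [3.0, 4.0]}  # toy target embeddings
print(squared_l2_loss(generated, pretrained))  # 29.0
```

The loss is zero exactly when every generated vector equals its pre-trained target, which is the sense in which the model learns to reproduce the pre-trained embeddings from characters.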

The optimization module 230 computes the similarity between the calculated word embedding vector and the pre-trained word embedding information using the squared Euclidean distance, or L2 loss function, of Equation 7. Cosine similarity, the L1 distance (Manhattan distance), and similar measures may also be used to compute the similarity between the two vectors.
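The alternative measures named above can be sketched with their standard definitions. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def l1_distance(u, v):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


u, v = [3.0, 4.0], [4.0, 3.0]
cos_uv = cosine_similarity(u, v)  # 24 / 25
l1_uv = l1_distance(u, v)
print(l1_uv)  # 2.0
```

Cosine similarity grows as vectors become more alike, whereas the L1 and L2 distances shrink, so a training objective would maximize the former or minimize the latter.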

The optimization module 230 matches the word embedding vector calculated for a word with its pre-trained word embedding information, thereby replacing the pre-trained word embedding with the calculated word embedding vector and learning the word representation for that word.

Referring again to FIG. 1, when the word 'uncovered' from the vocabulary dictionary dataset is provided as input data, the convolution module 210 is assumed to use three filters of width h = 5 with two zero-padding positions, together with filters of other widths, and a stride of 3 for pooling. The highway module 220 uses a single highway network.
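The sizes in this example can be checked with a few lines of arithmetic. The sketch assumes the two zero-padding positions are added on each side of the word and that pooling windows do not overlap; the grouping of characters printed at the end shows which filter centres fall into each pooling segment and is only an illustration.

```python
word = "uncovered"
h, k = 5, 3                            # filter width and pooling stride from the example

pad = (h - 1) // 2                     # zero-padding positions per side
padded_len = len(word) + 2 * pad       # length of the padded sequence r
conv_len = padded_len - h + 1          # number of convolution outputs in s
segments = [word[i:i + k] for i in range(0, conv_len, k)]

print(pad, padded_len, conv_len)  # 2 13 9
print(segments)                   # ['unc', 'ove', 'red']
```

Each width-5 filter therefore yields three pooled values for 'uncovered', one for each third of the word.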

The word representation model 200 learns a word representation close to the pre-trained word embedding information of 'uncovered', and it can also represent a word that is lexically related to the input data but absent from the pre-trained word embedding information (for example, 'uncovering'). Because 'uncovered' and 'uncovering' share similar character sequence representations, the word representation model 200 can reconstruct the pre-trained word embeddings from characters and present the nearest neighbor words of the learned word representation.

In this manner, the pre-trained word embedding information for every word in the vocabulary dictionary dataset is generalized into the word representations newly generated by the word representation model. Because the model splits each word into subwords at the character level, it can generate word representations for all words, including out-of-vocabulary (OOV) words.

FIG. 6 is a flowchart illustrating the inference process of the word representation model in the word representation method in natural language processing according to an embodiment of the present invention, and FIG. 7 is a diagram illustrating the nearest neighbors of word representations for in-vocabulary and OOV words.

Referring to FIG. 6, when an OOV word is provided as input data (S210), the word representation model 200, which has learned word representations for registered and OOV words in the vocabulary dictionary dataset, splits the OOV word into characters, generates a character sequence representation, and then generates an OOV word embedding vector through the convolution module 210 and the highway module 220 (S220).

The word representation model 200 can infer the intrinsic meaning of the OOV word through a nearest-neighbor search on the OOV word embedding vector (S230).
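Step S230 can be sketched as a brute-force nearest-neighbor search ranked by cosine similarity. The function names, the tiny embedding table, and the query vector are toy values chosen for illustration and are not results from the patent.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_neighbors(query, embeddings, top_k):
    """Return the top_k vocabulary words in descending order of cosine similarity."""
    ranked = sorted(embeddings, key=lambda w: cosine(query, embeddings[w]), reverse=True)
    return ranked[:top_k]


embeddings = {"pants": [1.0, 0.1], "bicycles": [0.0, 1.0], "jacket": [0.9, 0.2]}
oov_vector = [1.0, 0.0]  # stands in for the vector generated for an OOV word
print(nearest_neighbors(oov_vector, embeddings, top_k=2))  # ['pants', 'jacket']
```

A real vocabulary would normally call for an indexed or approximate search, but the ranking criterion is the same.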

Table 1 shows the neighboring words in the vector space for the word representations of OOV words obtained through the inference process of the word representation model.

[Table 1]

As Table 1 shows, the inference process of the word representation model expresses the intrinsic meaning of OOV words well. That is, the word representations generated by the word representation model 200 represent word variants and compound words well (computerization, bluejacket), capture words related across different word forms (global-globally-globalization), are robust to misspelled words (vehicals), and capture semantically related words for most OOV words (bluejacket-pants, vehicals-bicycles).

As shown in FIG. 7, a qualitative analysis based on nearest neighbor words is performed to evaluate the word representation model (GWR): the nearest neighbor words of English word representations learned by word2vec and by GWR are extracted and sorted in descending order of cosine similarity.

In addition, the nearest neighbors of in-vocabulary words are semantically or syntactically related words (taxicab-minibus, teach-teaching). Moreover, GWR makes the representations of lexically related words more similar. For example, the inflected neighbors of "connect" (connects-connected-connecting) lie closer together in the vector space than they do under the pre-trained word embeddings. Other words (teach-teaching, computes-compute-computable) show a similar trend, which satisfies the expected property of words with respect to lexical relatedness. This is because lexically related words share a large part of their character sequences.

FIG. 8 is a diagram showing word similarity results on the full datasets, for a comparative evaluation of the word representation method in natural language processing according to an embodiment of the present invention against other learning models.

Word similarity is measured by computing a correlation coefficient over the cosine similarities between words, and it is used to evaluate the capability of each model.
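One common way to carry out such an evaluation is sketched below. The text does not say which correlation coefficient is used or what the cosine similarities are correlated with; this sketch assumes Spearman's rank correlation against the human ratings of a word similarity benchmark. The function names and all scores are toy values.

```python
import math


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Rank positions (0 = smallest); ties are not averaged in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order):
        result[i] = float(rank)
    return result


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


model_scores = [0.9, 0.1, 0.5]  # cosine similarities from a model (toy values)
human_scores = [9.0, 2.0, 4.0]  # ratings for the same word pairs (toy values)
rho = spearman(model_scores, human_scores)
```

Here the model orders the three pairs exactly as the human ratings do, so rho is approximately 1.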

First, word similarity datasets are collected for six languages, Arabic (Ar), German (De), English (En), Spanish (Es), French (Fr), and Russian (Ru), and word2vec embeddings pre-trained for the six languages are used.

Here, word2vec is a model that treats each word as an atomic unit and adopts a window-based learning technique. Because word2vec is a word-based approach, it cannot represent OOV words. As the default for OOV words, a zero vector (null vector) is used in the word similarity task, and a randomly initialized vector that is then fine-tuned is used in the language modeling task.
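The zero-vector default for the word similarity task amounts to a table lookup with a fallback, as in the following sketch. The function name `lookup` and the toy embedding table are hypothetical.

```python
def lookup(word, embeddings, dim):
    """Return the stored vector, or a zero vector when the word is OOV."""
    return embeddings.get(word, [0.0] * dim)


embeddings = {"uncovered": [0.2, -0.1, 0.7]}  # toy 3-dimensional table
print(lookup("uncovered", embeddings, 3))   # [0.2, -0.1, 0.7]
print(lookup("uncovering", embeddings, 3))  # [0.0, 0.0, 0.0]
```

Every OOV word receives the same uninformative vector under this scheme, which is the limitation a character-based model avoids.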

The FastText method is an extension of word2vec that treats character n-grams as the atomic unit and is trained in a manner similar to word2vec. The Mimick method is a character-based model that learns a function from characters to word embeddings in order to represent words, and it uses the skip-gram version of word2vec for the pre-trained word embeddings.
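Character n-gram extraction of the kind FastText relies on can be sketched as follows. The boundary markers `<` and `>` follow the commonly described FastText convention and are not mentioned in this text; the function name is hypothetical.

```python
def char_ngrams(word, n):
    """Return the character n-grams of `word` wrapped in boundary markers."""
    marked = "<" + word + ">"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


print(char_ngrams("where", 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

A word vector is then typically composed from the vectors of its n-grams, whereas the model described in this document composes it from individual characters through convolution filters.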

The word representation model is denoted GWR (generated word representation) and uses word2vec for the pre-trained word embeddings.

As shown in FIG. 8, the subword-based learning methods (FastText, Mimick, GWR) outperform the word-based method (word2vec) on the word similarity datasets for most languages. This indicates that taking subword information into account is effective for representing words and is useful in many languages. Among the subword-based methods, GWR outperforms Mimick in every language except English.

The word similarity results in FIG. 8 show that the convolution module of the word representation model is useful for extracting local features, that is, subword information. They also show that GWR, which generates word representations through supervised learning on pre-trained word embedding information, achieves better performance than FastText, which learns word representations from a large corpus.

To confirm that the word representations GWR derives for OOV words actually improve performance, GWR is compared with a variant that does not generate representations for OOV words (GWR^-), and GWR performs considerably better. This indicates that GWR effectively extends vocabulary coverage and generates representations of OOV words very well. GWR^- in turn performs slightly better than word2vec, which shows the effect of taking subword information into account in word representations.

FIG. 9 is a diagram showing experimental results of applying the word representation method in natural language processing according to an embodiment of the present invention to an actual natural language processing system.

Referring to FIG. 9, the natural language processing system performs language modeling with an external model to confirm the usefulness of GWR. The performance measure is perplexity (PPL); the lower the perplexity, the stronger the model. Perplexity is an intrinsic metric for evaluating a language model, and here PPL on the language modeling task serves as an extrinsic evaluation of the word representations.
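Perplexity can be computed from the probabilities a language model assigns to the tokens of a test sequence, as in the following sketch of the standard definition. The function name and the probabilities are toy values.

```python
import math


def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)


# A model that is uniformly unsure among 4 choices at every step
# has a perplexity of about 4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower values mean the model assigns higher probability to the observed text, which is why a lower perplexity indicates a stronger model.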

The language modeling task is performed using datasets for six languages: Czech (Cs), German (De), English (En), Spanish (Es), French (Fr), and Russian (Ru). Relative to the performance of the word2vec embeddings used to train GWR, performance improves markedly in morphologically rich languages.

For example, the morphologically rich Slavic languages (Cs, Ru), which are morphologically more complex than the other languages, show perplexity reductions of 15% and 22% for Czech and Russian, respectively. This indicates that GWR is more useful and effective in morphologically rich languages.

Thus, the word representation model handles OOV words more effectively than conventional methods that represent OOV words with a pseudo-word.

The word representation model uses the highway module 220, based on a highway network, rather than the multi-layer perceptron (MLP) widely used with CNNs. A highway network converges faster than an MLP in terms of training loss, so training proceeds more quickly, and a 2-layer highway network converges faster than a 1-layer one. Exploiting this property of stacked highway layers makes it possible to train a deeper word representation model while improving the semantic similarity of words.

As described above, the present invention provides a character-based word representation method, denoted GWR. The word representation model, which uses a CNN and a highway network, can generate high-quality representations of OOV words by taking subword information into account. The model can also be applied to other areas such as text classification and named entity recognition.

The word representation method in natural language processing according to the embodiments of the present invention described above may also be implemented in the form of a recording medium containing computer-executable instructions, such as a program module executed by a computer. Such a recording medium includes computer-readable media, which may be any available media that can be accessed by a computer and include volatile and nonvolatile media as well as removable and non-removable media. Computer-readable media also include computer storage media, which include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present invention is illustrative, and those of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: natural language processing system
110: communication module 120: memory
130: processor 140: database

Claims

1. A word representation method in natural language processing performed by a natural language processing system, the method comprising:
a) providing a vocabulary dictionary dataset including a vocabulary of at least one word and pre-trained word embedding information for each word;
b) when the vocabulary based on the vocabulary dictionary dataset is provided as input data, extracting subword information for the words in the input data using a word representation model, and calculating a word embedding vector from the subword information;
c) matching the calculated word embedding vector with the pre-trained word embedding information of the corresponding word, thereby replacing the pre-trained word embedding information with the calculated word embedding vector and learning a word representation for the word;
d) when an out-of-vocabulary (OOV) word is provided as input data to the trained word representation model, extracting subword information for the OOV word and calculating a word embedding vector of the OOV word from the extracted subword information; and
e) calculating similarities between word embedding vectors through a vector operation based on the calculated word embedding vector of the OOV word, extracting neighboring words of the OOV word, and inferring the intrinsic meaning of the OOV word,
wherein the word representation model includes a convolution module, based on a convolutional neural network, that calculates subword feature vectors using the subword information, and a highway module, based on a highway network, that adaptively combines the subword feature vectors calculated by the convolution module to calculate the word embedding vector of the corresponding word.

2. The method of claim 1, wherein step b) comprises:
b-1) extracting subwords from every word in the input data, assigning an individual code to each extracted subword, and concatenating the codes that make up each word to generate a sequence representation;
b-2) extracting, in the convolution module, the subword information present in the word through convolution with the sequence representation;
b-3) extracting significant subword features by applying a pooling operation to the extracted subword information; and
b-4) concatenating all of the extracted subword features to calculate the subword feature vector produced by the convolution.

3. The method of claim 2, wherein in step b), when the subword is set to a character, a character-based sequence representation is generated for every word, including out-of-vocabulary (OOV) words, and the word embedding vector of a registered word and the word embedding vector of an OOV word are calculated using the character-based sequence representation.

4. The method of claim 3, wherein in step c), the word embedding vector of the registered word is matched with the pre-trained word embedding information of that registered word, so that the pre-trained word embedding is replaced with the word embedding vector of the registered word and the word representation for the registered word is learned, and the word representation of the OOV word is learned using the word embedding vector of the OOV word.

5. The method of claim 2, wherein in b-1), one-hot encoding is applied and the sequence representation is expressed by Equation 1 below.
[Equation 1]

r: word representation
V_c: vocabulary
c_i: one-hot encoding representing the i-th character in the word; the c_i are joined by the concatenation operator

6. The method of claim 2, wherein in b-2), the subword information is extracted by convolution with the sequence representation according to Equation 2 below.
[Equation 2]

F: convolution filter
V_c: vocabulary
h: width of F
r_{i:i+h-1}: sequence representation spanning characters c_i through c_{i+h-1}
b: bias
tanh: nonlinear function applied to the convolution results

7. The method of claim 2, wherein in b-3), the subword feature vectors are extracted by applying a strided pooling operation according to Equation 3 below, and the subword feature vectors cover a suffix, a root, and a prefix.
[Equation 3]

S: subword feature vector
k: stride length
s_{ki:k(i+1)-1}: segment of s covered by one stride

8. The method of claim 1, wherein the highway module routes the subword feature vector using a gate mechanism according to Equation 4 below.
[Equation 4]

H: nonlinear transformation of the input e
T: transform gate
C: carry gate

9. The method of claim 8, wherein the carry gate C in Equation 4 is simplified to (1 - T), giving Equation 5 below.
[Equation 5]

W_t, W_h: square matrices
b_t, b_h: biases
tanh: nonlinear function applied to the result

10. The method of claim 1, wherein the word representation model performs a linear transformation on the calculated word embedding vector according to Equation 6 below to generate a word embedding vector having the same size as the pre-trained word embedding information.
[Equation 6]

w: final word representation
W, b: parameters of the linear transformation
y: result vector

11. The method of claim 1, wherein in step c), the matching uses any one similarity calculation method among cosine similarity, L1 distance (Manhattan distance), and L2 distance (Euclidean distance) between the calculated word embedding vector and the pre-trained word embedding information.

12. The method of claim 11, wherein in step c), the similarity between the calculated word embedding vector and the pre-trained word embedding information is calculated using an L2 loss function according to Equation 7 below.
[Equation 7]

V_w: vocabulary in the vocabulary dictionary dataset
w_v: calculated word embedding vector, which Equation 7 compares with the pre-trained word embedding vector of the same word

13. (Deleted)

14. The method of claim 1, wherein in step d), when the subword is set to a character, a character-based sequence representation is generated for the out-of-vocabulary (OOV) word, and the word embedding vector of the OOV word is calculated using the character-based sequence representation.

15. The method of claim 14, wherein in the word representation model, the convolution module extracts the subword information present in the OOV word through convolution with the character-based sequence representation, applies a pooling operation to the extracted subword information to extract at least one subword feature, and concatenates the subword features to calculate a subword feature vector, and the highway module relates the calculated subword feature vector of the OOV word to the pre-trained word embedding information using a gate mechanism.

16. The method of claim 1, wherein step e) extracts the neighboring words using a nearest-neighbor search algorithm and sorts the word representations of the neighboring words in descending order of similarity between the word embedding vectors.

17. A natural language processing system for distributed representation of words included in a vocabulary, the system comprising:
a memory storing a program for performing a word representation method in natural language processing; and
a processor for executing the program,
wherein the processor, by executing the program, receives as input data a vocabulary based on a vocabulary dictionary dataset that includes a vocabulary of at least one word and pre-trained word embedding information for each word; extracts subword information for the words in the input data using a word representation model; calculates a word embedding vector from the subword information; matches the calculated word embedding vector with the pre-trained word embedding information of the corresponding word, thereby replacing the pre-trained word embedding information with the calculated word embedding vector and learning a word representation for the word; when an out-of-vocabulary (OOV) word is provided as input data to the trained word representation model, extracts subword information for the OOV word and calculates a word embedding vector of the OOV word from the extracted subword information; and calculates similarities between word embedding vectors through a vector operation based on the calculated word embedding vector of the OOV word, extracts neighboring words of the OOV word, and infers the intrinsic meaning of the OOV word, and
wherein the word representation model includes a convolution module, based on a convolutional neural network, that calculates subword feature vectors using the subword information, and a highway module, based on a highway network, that adaptively combines the subword feature vectors calculated by the convolution module to calculate the word embedding vector of the corresponding word.

18. The natural language processing system of claim 17, wherein the word representation model further includes an optimization module that performs the matching using any one similarity calculation method among cosine similarity, L1 distance (Manhattan distance), and L2 distance (Euclidean distance) between the calculated word embedding vector and the pre-trained word embedding information, and thereby reconstructs the pre-trained word embedding information as the calculated word embedding vector.

19. (Deleted)