KR102815217B1

KR102815217B1 - Learning method of neural network model for language generation and apparatus for the learning method

Info

Publication number: KR102815217B1
Application number: KR1020200110295A
Authority: KR
Inventors: 정의석; 김현우; 송화전; 오유리; 유병현; 한란
Original assignee: 한국전자통신연구원
Priority date: 2019-09-20
Filing date: 2020-08-31
Publication date: 2025-06-02
Anticipated expiration: 2040-08-31
Also published as: KR20210034486A

Abstract

The present invention proposes a new learning method that strengthens the regularization of existing models by using adversarial learning. In addition, existing techniques depend heavily on word embeddings and, in particular, suffer from the problem that each word embedding carries only a single meaning; the present invention addresses this problem by applying a self-attention model.

Description

{Learning method of neural network model for language generation and apparatus for the learning method}

The present invention relates to neural network-based language generation technology.

Recently, research has been actively conducted on technology for generating language (or natural language) using neural networks (hereinafter, 'neural network-based language generation' or 'neural language generation').

In a neural network model for neural network-based language generation, a softmax function is used to classify the output values of the neural network, that is, to normalize those output values and compute language generation probabilities.

However, computing language generation probabilities with the softmax function requires a large amount of computation, and this softmax cost is a major factor that degrades the learning speed and performance of neural network models for neural network-based language generation.
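As a rough illustration of why the softmax step is costly, the sketch below scores a hidden vector against every row of an output weight matrix before normalizing. The names `next_word_probs`, `hidden` and `output_weights` are hypothetical and are not taken from the patent; the point is only that the work grows linearly with the vocabulary size.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_probs(hidden, output_weights):
    """Score every vocabulary entry against the hidden vector, then normalize.

    One dot product is needed per vocabulary row, so the cost per generated
    word is proportional to (vocabulary size) x (hidden dimension).
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]
    return softmax(logits)

# Toy example: a 3-word vocabulary and a 2-dimensional hidden vector (assumed values).
hidden = [0.5, -1.0]
output_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = next_word_probs(hidden, output_weights)
```

With a realistic vocabulary of tens of thousands of words, the loop over `output_weights` dominates each decoding step, which is the cost the invention aims to avoid.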

An object of the present invention is to improve the learning speed and performance of a neural network model for neural network-based language generation.

Specifically, to improve learning speed, the present invention aims to solve the problem of the softmax, which derives language generation probabilities. In addition, to improve performance, the present invention aims to provide an attention model that considers the context of a sentence at the time of language generation.

The above and other objects, advantages and features of the present invention, and the methods for achieving them, will become apparent with reference to the embodiments described in detail below together with the accompanying drawings.

To achieve the above objects, a learning method of a neural network model for language generation according to one aspect of the present invention includes: estimating an adversarial perturbation value for setting, to a predetermined level, the distance value between a value obtained by transforming, through a recurrent neural network, an input word embedding value that represents an input word as a vector, and a target word embedding value that represents, as a vector, the correct word to appear next after the input word; adding, in an adder, the estimated adversarial perturbation value to each of the input word embedding value and the target word embedding value, and outputting a modified input word embedding value and a modified target word embedding value, respectively; generating, in a recurrent neural network cell, a hidden value for the modified input word embedding value; performing, in an attention model, a self-attention operation on the generated hidden value to project onto the hidden value context information indicating a change in meaning according to context; and performing, in a distance minimization operator, an operation that minimizes the distance value between the hidden value onto which the context information is projected and the modified target word embedding value, thereby carrying out adversarial learning for the neural network model.

A computing device for learning a neural network model for language generation according to another aspect of the present invention includes: first operation logic that adds an adversarial perturbation value to each of an input word embedding value, which represents an input word as a vector, and a target word embedding value, which represents, as a vector, the correct word to appear next after the input word; second operation logic that performs a recurrent neural network operation on the input word embedding value to which the adversarial perturbation value has been added, to calculate a hidden value; third operation logic that performs a self-attention operation on the calculated hidden value to project onto it context information about the words surrounding the input word; and fourth operation logic that performs adversarial learning on the neural network model through an operation that minimizes the distance value between the hidden value onto which the context information has been projected and the target word embedding value to which the adversarial perturbation value has been added.

According to the present invention, the softmax operation, which is a limiting factor in improving the learning speed of a neural network model for neural network-based language generation, is avoided.

In addition, the neural network model for neural network-based language generation of the present invention provides a technique for reflecting context at the point where the target word vector and the output vector are compared, thereby enabling the generation of words with multiple meanings.

In addition, according to the present invention, the robustness of a neural network model for neural network-based language generation is improved by using an adversarial training technique, so that the expressive power of the neural network model can be improved.

FIG. 1 is a block diagram showing the internal configuration of a neural network model for neural network-based language generation according to an embodiment of the present invention.
FIG. 2 is a diagram schematically illustrating a learning process for neural network-based language generation according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a learning method of a neural network model for language generation according to an embodiment of the present invention.

Various embodiments of the present invention may be modified in various ways and may take various forms; specific embodiments are illustrated in the drawings and described in detail. However, this is not intended to limit the various embodiments of the present invention to particular forms, and it should be understood that all modifications, equivalents and substitutes falling within the spirit and technical scope of the various embodiments are included. In the description of the drawings, similar reference numerals are used for similar components.

Expressions such as "includes" or "may include", which may be used in various embodiments of the present invention, indicate the presence of the disclosed function, operation or component, and do not limit one or more additional functions, operations or components. In addition, in various embodiments of the present invention, terms such as "include" or "have" are intended to specify the presence of a feature, number, step, operation, component, part or combination thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts or combinations thereof.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. Before that, to aid understanding of the present invention, several published studies related to neural network-based language generation are introduced.

The paper presented at ICLR 2019, titled 'VON MISES-FISHER LOSS FOR TRAINING SEQUENCE TO SEQUENCE MODELS WITH CONTINUOUS OUTPUTS', by Sachin Kumar & Yulia Tsvetkov (hereinafter, Kumar's paper), deals with a technique for continuous outputs using the von Mises-Fisher (vMF) loss.

Kumar's paper proposes an approach in which the output step of a sequence-to-sequence model directly generates a word embedding value instead of generating a probability distribution over the vocabulary.

Specifically, Kumar's paper proposes a learning process that minimizes the distance between the pre-trained word embedding vector of the target word, that is, the correct word, and the output vector of the neural network.

In Kumar's paper, at test time the vector generated by the model, that is, the output vector value, is used as a key to search for the nearest neighbor in the target embedding space.
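The test-time lookup described above can be sketched as a nearest-neighbor search over a table of pre-trained embeddings. This is a minimal illustration that assumes cosine similarity as the closeness measure and non-zero vectors; the table contents and the names `nearest_word` and `embedding_table` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_word(output_vector, embedding_table):
    """Return the word whose embedding is closest to the model's output vector."""
    return max(
        embedding_table,
        key=lambda word: cosine_similarity(output_vector, embedding_table[word]),
    )

# Toy pre-trained embeddings (assumed values, for illustration only).
embedding_table = {
    "cat": [1.0, 0.1],
    "dog": [0.8, 0.6],
    "car": [-1.0, 0.2],
}
predicted = nearest_word([0.9, 0.2], embedding_table)
```

A linear scan is used here for clarity; a real system with a large vocabulary would more likely use an approximate nearest-neighbor index.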

The present invention proposes a new vMF loss by introducing an adversarial training technique (or adversarial learning technique) as a new regularization of the vMF loss.

In addition, the present invention proposes a language generation approach that considers the context of a sentence by using a self-attention model when computing the distance between the pre-trained word embedding value and the output vector value.

In relation to neural network-based language generation, the paper presented at ICLR 2015, titled 'EXPLAINING AND HARNESSING ADVERSARIAL EXAMPLES', by Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy (hereinafter, Goodfellow's paper), introduces the fast gradient sign method (FGSM), which improves the robustness of a model by introducing a worst-case perturbation into the input data of the learning model.
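For reference, the core FGSM step perturbs an input by a small amount in the direction of the sign of the loss gradient with respect to that input. The sketch below applies it to a toy linear model with a squared-error loss whose gradient can be written by hand; the model, the loss and all values are assumptions chosen for illustration, and this is the method from Goodfellow's paper, not the perturbation scheme of the present invention.

```python
def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def fgsm_perturb(x, grad, epsilon):
    """FGSM step: x_adv = x + epsilon * sign(gradient of the loss w.r.t. x)."""
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]

def loss(weights, x, y):
    """Squared-error loss of a toy linear model (assumed for illustration)."""
    pred = sum(w * xi for w, xi in zip(weights, x))
    return 0.5 * (pred - y) ** 2

def loss_grad_wrt_input(weights, x, y):
    """Analytic gradient of the loss above with respect to the input x."""
    pred = sum(w * xi for w, xi in zip(weights, x))
    return [(pred - y) * w for w in weights]

weights = [2.0, -1.0]
x = [1.0, 1.0]
y = 0.0

grad = loss_grad_wrt_input(weights, x, y)
x_adv = fgsm_perturb(x, grad, epsilon=0.1)
clean_loss = loss(weights, x, y)
adversarial_loss = loss(weights, x_adv, y)
```

Training on `x_adv` alongside `x` is what makes the model more robust: the perturbed input is, to first order, the nearby point where the loss increases the most.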

Building on the approach of Goodfellow's paper, the present invention proposes a method that, in addition to introducing a perturbation into the input data of the learning model, estimates an adversarial perturbation value for the output vector value and thereby combines the vMF loss with an adversarial training (or adversarial learning) technique.

In relation to neural network-based language generation, the paper presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017), titled 'Attention Is All You Need', by Ashish Vaswani et al. (hereinafter, Vaswani's paper), introduces a multi-head attention model composed of self-attention and a sequence-to-sequence structure.

Vaswani's paper introduces an approach that strengthens the attention capability of an attention model by increasing its number of parameters.

The present invention proposes an approach that can replace the multi-head attention model, based on the approach introduced in the paper titled 'Deep Residual Output Layers for Neural Language Generation', authored by Nikolaos Pappas, in the Proceedings of the 36th International Conference on Machine Learning.

In existing attention models, no parameters are shared between attention items, so the attention values are generated independently of one another.

This independent attention generation is resolved by the new deep residual attention model provided by the present invention.

As described above, in neural language generation, an approach has been proposed (Kumar's paper) in which the output layer of the decoder outputs a word embedding value instead of generating a probability distribution.

The approach of Kumar's paper depends heavily on word embeddings. That is, it inherits the problem that a word embedding carries only a single meaning. To solve this problem, the present invention proposes a method of integrating a context-aware attention model into that approach.

In addition, the present invention provides a new learning method that strengthens the regularization of the existing model by using adversarial learning.

In addition, the present invention provides a new deep residual attention model to overcome the limitations of the multi-head attention model, namely an excessive number of parameters and the absence of parameters shared among attention targets. This model can be used not only in language generation but also in various other approaches that rely on attention models.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the internal configuration of a neural network model for neural network-based language generation according to an embodiment of the present invention.

Referring to FIG. 1, the neural network model for neural network-based language generation according to an embodiment of the present invention may be, for example, a sequence-to-sequence model (300).

The sequence-to-sequence model (300) may be implemented as a software module executed by a computing device, a hardware module, or a combination thereof.

When the sequence-to-sequence model (300) is implemented as a software module, it may take the form of an algorithm that is loaded into a memory of the computing device and executed by at least one processor of the computing device. Here, the processor may be at least one CPU, at least one GPU, or a combination thereof.

When the sequence-to-sequence model (300) is implemented as a hardware module, it may be implemented as circuit logic within at least one processor of the computing device.

The sequence-to-sequence model (300) is a model that produces, from an input sequence, an output sequence in a different domain, and can be applied in various fields such as chatbots, machine translation, text summarization and speech-to-text (STT).

As illustrated in FIG. 1, the sequence-to-sequence model (300) may broadly be configured to include an encoder (50) and a decoder (100).

The encoder (50) sequentially receives all the words of an input sentence and then encodes them into a single vector. The vector encoded by the encoder (50) may be called a context vector.

Once all the words of the input sentence have been encoded into one context vector, the encoder (50) feeds that context vector to the decoder (100).

The decoder (100) sequentially outputs translated words one at a time based on the context vector received from the encoder (50).

Although not particularly limited thereto, the language generation process according to the present invention is assumed to be applied to the decoder (100) of the sequence-to-sequence model (300).

The language generation process according to the present invention applied to the decoder (100) provides a new approach that combines the vMF loss approach with an adversarial training technique and a self-attention technique.

The language generation process according to the present invention applied to the decoder (100) also resolves a limitation of existing attention models, namely that attention items are handled independently of one another.

FIG. 2 is a diagram schematically illustrating the learning process for language generation in the decoder shown in FIG. 1.

Referring to FIG. 2, for its training, the decoder (100) includes two adder blocks (110, 120), a recurrent neural network (RNN) block (130), a self-attention model (140) and a distance minimization operator (150).

Each of the components (110, 120, 130, 140, 150) may be implemented as a software module executed by a processor of a computing device, or as circuit logic (a hardware module) embedded in the processor. Alternatively, each of the components (110, 120, 130, 140 and 150) may be implemented as a combination of software and hardware modules.

The adder block (110) includes a plurality of adders (11, 12, 13, 14). Each adder adds, for example, an input word embedding value (w_{i-1} of 101), which represents an input word as a vector, and the adversarial perturbation value estimated by the RNN cell (33), thereby changing the input word embedding value (w_{i-1} of 101) into an input word embedding value (115) that reflects the adversarial perturbation value.

The adder block (120) includes a plurality of adders (21, 22, 23, 24). Each adder adds, for example, a target word embedding value (w_i of 102), which represents as a vector the correct word to appear next after the input word, and the adversarial perturbation value estimated by the RNN cell (33), thereby changing the target word embedding value (w_i of 102) into a target word embedding value that reflects the adversarial perturbation value.

The RNN block (130) includes a plurality of recurrent neural network (RNN) cells (31, 32, 33, 34) and outputs hidden values (132) that reflect the adversarial perturbation value. For example, the RNN cell (33) performs an RNN operation (or hidden-layer operation) on the input word embedding value (115) that reflects the adversarial perturbation value, and generates a hidden value (of 132) that reflects the adversarial perturbation value.

In addition, the RNN block (130) estimates the adversarial perturbation value before outputting the hidden values (132) that reflect it.

To estimate the adversarial perturbation value, for example, the RNN cell (33) performs decoding inference (or an RNN operation) on the input word embedding value (w_{i-1} of 101) that has bypassed the adder (13) unchanged, generates an initial hidden value, and outputs (or estimates) that initial hidden value as the adversarial perturbation value.

Thereafter, the adder (13) adds the adversarial perturbation value estimated by the RNN cell (33) to the word embedding value (w_{i-1}), and the adder (23) adds the adversarial perturbation value estimated by the RNN cell (33) to the word embedding value (w_i).
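The two-pass flow through the RNN cell (33) and the adders (13, 23) can be sketched as follows. This is only an illustration under several assumptions that the text does not state: an Elman-style tanh cell, a hidden dimension equal to the embedding dimension (so that the hidden value can be added to the embeddings), no scaling of the perturbation, and arbitrary toy weights. All function and variable names are hypothetical.

```python
import math

def rnn_cell(x, h_prev, w_x, w_h):
    """Elman-style cell (assumed form): h = tanh(W_x x + W_h h_prev)."""
    return [
        math.tanh(
            sum(a * b for a, b in zip(row_x, x))
            + sum(a * b for a, b in zip(row_h, h_prev))
        )
        for row_x, row_h in zip(w_x, w_h)
    ]

def add_vectors(u, v):
    """Element-wise sum, standing in for an adder such as (13) or (23)."""
    return [a + b for a, b in zip(u, v)]

# Toy weights and embeddings (assumed values).
w_x = [[0.5, 0.0], [0.0, 0.5]]
w_h = [[0.1, 0.0], [0.0, 0.1]]
h_prev = [0.0, 0.0]
input_embedding = [1.0, -1.0]   # plays the role of w_{i-1}
target_embedding = [0.5, 0.5]   # plays the role of w_i

# Pass 1: the input embedding bypasses the adder; the initial hidden value
# is taken as the adversarial perturbation value.
perturbation = rnn_cell(input_embedding, h_prev, w_x, w_h)

# Adders: the same perturbation is added to both embeddings.
perturbed_input = add_vectors(input_embedding, perturbation)
perturbed_target = add_vectors(target_embedding, perturbation)

# Pass 2: the RNN cell runs again on the perturbed input embedding.
perturbed_hidden = rnn_cell(perturbed_input, h_prev, w_x, w_h)
```

The second pass produces the hidden value that is later passed to the self-attention model, while `perturbed_target` is the value it is compared against.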

The self-attention model (140) performs a self-attention operation on the hidden values (132) that reflect the adversarial perturbation value, and projects onto each hidden value (of 132) context information indicating a change in meaning according to context.

The context information refers to the words preceding and following the current word that is the target of the self-attention operation. If the current word is taken to be the hidden value (of 132) estimated (or output) by the RNN cell (33), the preceding words correspond to the earlier hidden values of 132 and the following words correspond to the later hidden values of 132.
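To make the idea of projecting context onto a hidden value concrete, the sketch below uses plain scaled dot-product self-attention in which the hidden values serve directly as queries, keys and values. This is a deliberately simplified stand-in: it has no learned projections and is not the deep residual attention model that the invention proposes. The function names and toy values are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(hidden_states):
    """Mix each hidden value with all others (preceding and following positions).

    Each output row is a weighted average of every hidden value, with weights
    given by the softmax of scaled dot products against the current position.
    """
    dim = len(hidden_states[0])
    contextual = []
    for query in hidden_states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in hidden_states
        ]
        weights = softmax(scores)
        contextual.append(
            [
                sum(w * value[j] for w, value in zip(weights, hidden_states))
                for j in range(dim)
            ]
        )
    return contextual

# Three toy hidden values, one per position in the sentence (assumed values).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual_states = self_attention(hidden_states)
```

Because every position attends to every other position, the output for the current word depends on both the earlier and the later hidden values, which is the sense in which context is reflected in it.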

The present invention differs from the prior literature in that, before the distance minimization operator (150) described below compares the output of the RNN block (130), for example the hidden value (of 132) to which the adversarial perturbation value has been applied, with the target word embedding value that reflects the adversarial perturbation value, the self-attention model (140) is used to project the context-dependent change in meaning (context information) onto the hidden value (of 132).

The distance minimization operator (150) performs an operation that minimizes the distance value between the hidden value (142) onto which the context information is projected and the target word embedding value that reflects the adversarial perturbation value, thereby carrying out adversarial learning for the neural network model (decoder).
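The role of the distance minimization operator (150) can be illustrated with a simple distance and one descent step. The sketch assumes cosine distance as the distance value, which is a simpler stand-in for the vMF loss discussed in this document, and it updates the vector directly with a finite-difference gradient, whereas real training would update the model parameters through backpropagation. All names and values are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def numeric_grad(func, vector, eps=1e-6):
    """Central finite-difference gradient of func at vector."""
    grad = []
    for i in range(len(vector)):
        up = list(vector)
        down = list(vector)
        up[i] += eps
        down[i] -= eps
        grad.append((func(up) - func(down)) / (2 * eps))
    return grad

# Context-projected hidden value and perturbed target embedding (assumed values).
projected_hidden = [1.0, 0.0]
perturbed_target = [0.6, 0.8]

def distance_to_target(vector):
    return cosine_distance(vector, perturbed_target)

distance_before = distance_to_target(projected_hidden)

# One gradient-descent step with an assumed learning rate of 0.5.
grad = numeric_grad(distance_to_target, projected_hidden)
updated_hidden = [h - 0.5 * g for h, g in zip(projected_hidden, grad)]
distance_after = distance_to_target(updated_hidden)
```

No softmax over the vocabulary appears anywhere in this loss, which is how the approach sidesteps the normalization cost described in the background.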

Each of the components (110, 120, 130, 140 and 150) is described in more detail below.

The decoder is modeled as a model that, given the preceding context in the course of generating a sentence, can predict the probability of the word that may appear next.

The input of the decoder (100) is an input word embedding value, which is assumed to be pre-trained (pre-learned). Here, the input word embedding value may be referred to as an input word embedding vector. Similarly, the target word embedding value described below may be referred to as a target word embedding vector.

The RNN (recurrent neural network) block (130) is configured to include a plurality of RNN cells (31 to 34), and each RNN cell has the recurrent neural network structure commonly used in neural network-based language models. The RNN block (130) may be replaced with another neural network model, such as a transformer.

The output value of each RNN cell is the hidden value that, in a typical neural network-based language model, is used as the input to the softmax.

본 발명에서 적대적 교란값이 반영된 은닉값()을 이용하여 적대적 훈련(adversarial training) 또는 적대적 학습(adversarial learning)이 수행된다.In the present invention, a hidden value (reflecting an adversarial disturbance value) ) is used to perform adversarial training or adversarial learning.

적대적 학습을 수행하기 위해, 본 발명은 적대적 교란값()를 추정하여 계산하고, 입력 워드 임베딩값()과 타겟 워드 임베딩값()에 적대적 교란값(adversarial perturbation value)을 합산한다.To perform adversarial learning, the present invention uses adversarial perturbation ( ) is estimated and calculated, and the input word embedding value ( ) and target word embedding value ( ) and add the adversarial perturbation value.

적대적 학습은 크게 적대적 교란값을 추정하는 디코딩 과정과 추정된 적대적 교란값을 이용하여 다시 디코딩 과정을 수행하는 학습 과정으로 나눌 수 있다.Adversarial learning can be broadly divided into a decoding process that estimates adversarial perturbations and a learning process that performs the decoding process again using the estimated adversarial perturbations.

적대적 교란값을 추정하는 디코딩 과정은 적대적 교란값 없이, 디코딩 추론(Decoding inference) 또는 RNN 연산을 수행하여, 초기 은닉값을 생성하고, 그 초기 은닉값을 적대적 교란값(adversarial perturbation value)으로 추정하는 과정이다. The decoding process for estimating adversarial perturbation values is the process of generating an initial hidden value by performing decoding inference or RNN operation without adversarial perturbation values, and estimating the initial hidden value as an adversarial perturbation value.

The learning process based on the estimated adversarial perturbation value includes adding the estimated value to the input word embedding value at the input stage and to the target word embedding value, respectively.
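The two passes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the toy tanh cell `rnn_cell`, the helper `rescale`, the bound `eps`, and all weights are hypothetical, the embedding size is assumed to equal the hidden size so that a hidden value can be added to an embedding, and the same perturbation vector is reused for the input and target embeddings for brevity.

```python
import math
import random


def rnn_cell(x, h_prev, W_x, W_h):
    """Toy tanh RNN cell: h = tanh(W_x @ x + W_h @ h_prev)."""
    return [
        math.tanh(
            sum(wx * xj for wx, xj in zip(W_x[i], x))
            + sum(wh * hj for wh, hj in zip(W_h[i], h_prev))
        )
        for i in range(len(h_prev))
    ]


def rescale(v, eps):
    """Rescale v to L2 norm eps (an assumed normalisation, not the patent's Equation 4)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [eps * c / norm for c in v] if norm > 0 else list(v)


def two_pass_step(w_in, w_target, h_prev, W_x, W_h, eps=0.1):
    # Pass 1: decode with the clean input embedding to get the initial hidden value.
    h_init = rnn_cell(w_in, h_prev, W_x, W_h)
    # Take the (rescaled) initial hidden value as the perturbation estimate.
    r = rescale(h_init, eps)
    # Pass 2: add the perturbation to both embeddings, then decode again.
    w_in_adv = [a + b for a, b in zip(w_in, r)]
    w_target_adv = [a + b for a, b in zip(w_target, r)]
    h_adv = rnn_cell(w_in_adv, h_prev, W_x, W_h)
    return r, w_in_adv, w_target_adv, h_adv


random.seed(0)
DIM = 4  # embedding size == hidden size (assumption)
W_x = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]
W_h = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]
w_in = [random.uniform(-1, 1) for _ in range(DIM)]
w_target = [random.uniform(-1, 1) for _ in range(DIM)]
h0 = [0.0] * DIM

r, w_in_adv, w_target_adv, h_adv = two_pass_step(w_in, w_target, h0, W_x, W_h)
print("perturbation norm:", math.sqrt(sum(c * c for c in r)))
```

The perturbed hidden value `h_adv` and the perturbed target `w_target_adv` are the two quantities whose distance the training then minimizes.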

The decoder (100) is trained so that the distance between the hidden value and the target word embedding value is minimized. This training is performed by the distance minimization operator (150).

Training that minimizes the distance between these two vector values is not the traditional softmax-based training method; it is based on the continuous-output training method introduced in Kumar's paper.

Kumar's paper proposes several loss functions based on the distance between the two vector values. Its weakness is that it assumes the target word embedding value is a pre-trained word embedding value, and therefore does not account for changes in meaning that depend on context.

To solve this problem and make use of word embedding values that reflect context-dependent changes in meaning, the present invention proposes the following approach (training method). An attention model (140) projects context information, which represents the context-dependent change in meaning, onto the hidden vector value. The distance between the vector value onto which the context information has been projected and the target word embedding value reflecting the adversarial perturbation value is then minimized.

This approach can be described as a machine-learning loss function that takes vector distance into account, and it is explained in two parts. The first is a training function that uses the distance between two vectors through an adversarial vMF loss; the second is a deep residual attention model.

Adversarial vMF Loss

Kumar's paper introduces a loss function, given as Equation 1 below, that uses the von Mises-Fisher (vMF) distribution to compute the similarity between two word embedding vectors.
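For reference, the NLLvMF loss as published in Kumar's paper can be written as below. This is a reconstruction from that publication rather than from the patent's own equation; the symbols \hat{e} (the RNN output vector), m (the embedding dimension), and I_v (the modified Bessel function of the first kind of order v) are notation assumed here.

```latex
\mathrm{NLLvMF}\bigl(\hat{e};\, e(w)\bigr)
  = -\log C_m\bigl(\lVert \hat{e} \rVert\bigr) - \hat{e}^{\top} e(w),
\qquad
C_m(\kappa) = \frac{\kappa^{\,m/2-1}}{(2\pi)^{m/2}\, I_{m/2-1}(\kappa)},
\qquad
\kappa = \lVert \hat{e} \rVert .
```

The second term rewards alignment between the output vector and the target embedding, while the first term depends only on the magnitude of the output vector.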

Equation 1 uses the negative log-likelihood of the vMF distribution so that the loss becomes smaller as the target word embedding e(w) and the RNN output value become more similar.

Here, κ is the concentration constant: when κ is close to 0 the distribution approaches a uniform distribution, and as κ tends to infinity it approaches a point distribution.
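A minimal numerical sketch of this loss follows, assuming the published form of NLLvMF in which the concentration equals the norm of the output vector. The function names are hypothetical, the Bessel function is evaluated with a plain power series that is only suitable for small dimensions and moderate concentrations, and the target vectors are assumed to have unit norm.

```python
import math


def log_bessel_i(v, x, terms=200):
    """log I_v(x) from the power series sum_k (x/2)^(2k+v) / (k! * Gamma(k+v+1))."""
    logs = [
        (2 * k + v) * math.log(x / 2.0) - math.lgamma(k + 1) - math.lgamma(k + v + 1)
        for k in range(terms)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(t - top) for t in logs))


def log_c_m(m, kappa):
    """Log of the vMF normaliser C_m(kappa) for dimension m."""
    v = m / 2.0 - 1.0
    return v * math.log(kappa) - (m / 2.0) * math.log(2.0 * math.pi) - log_bessel_i(v, kappa)


def nll_vmf(e_hat, e_target):
    """NLLvMF(e_hat; e(w)) = -log C_m(||e_hat||) - e_hat . e(w)."""
    kappa = math.sqrt(sum(c * c for c in e_hat))
    dot = sum(a * b for a, b in zip(e_hat, e_target))
    return -log_c_m(len(e_hat), kappa) - dot


e_hat = [1.2, 0.9, 0.0]        # hypothetical RNN output, norm 1.5
aligned = [0.8, 0.6, 0.0]      # unit-norm target pointing the same way
opposite = [-0.8, -0.6, 0.0]   # unit-norm target pointing the other way
print(nll_vmf(e_hat, aligned) < nll_vmf(e_hat, opposite))  # prints True
```

For a fixed output vector, the loss is lower for the target that points in the same direction, which is the behaviour the text attributes to Equation 1.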

Kumar's paper proposes two heuristic regularization approaches, whereas the present invention applies an adversarial learning technique to the loss function NLLvMF() above.

Because the model in Kumar's paper has no softmax layer, an adversarial learning technique cannot be applied directly to the loss function NLLvMF().

Accordingly, the present invention modifies the loss function NLLvMF() based on the fast gradient sign method (FGSM) introduced in Goodfellow's paper. FGSM shifts the input data linearly in the direction of the gradient of the loss function to generate adversarial data, which strengthens the robustness of the model.
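The generic FGSM step can be illustrated as follows. This is a sketch of the general technique only: a toy quadratic loss stands in for NLLvMF, and the matrix `W`, the target `y`, the step size `eps`, and all function names are hypothetical.

```python
def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def loss(W, x, y):
    """Toy stand-in loss: squared distance between W @ x and the target y."""
    return sum((p - q) ** 2 for p, q in zip(matvec(W, x), y))


def grad_wrt_input(W, x, y):
    """Analytic gradient of the toy loss with respect to x: 2 * W^T (W x - y)."""
    residual = [p - q for p, q in zip(matvec(W, x), y)]
    return [2.0 * sum(W[i][j] * residual[i] for i in range(len(W))) for j in range(len(x))]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(x, grad, eps):
    """Move each input coordinate by eps in the direction of the gradient's sign."""
    return [xj + eps * sign(g) for xj, g in zip(x, grad)]


W = [[1.0, 0.5], [-0.3, 2.0]]
x = [0.2, -0.4]
y = [1.0, 0.0]

x_adv = fgsm(x, grad_wrt_input(W, x, y), eps=0.05)
print(loss(W, x, y), loss(W, x_adv, y))  # the second value is larger
```

Because the toy loss is convex, a step along the sign of its gradient cannot decrease it, so the perturbed input is at least as hard for the model as the clean one.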

Language generation training follows Equation 2 below.

Under Equation 2, training proceeds so that the negative log-likelihood of x_t, given the context x_{1:t-1}, is minimized. Here, w_j is the word embedding value of x_t and r_j is the perturbation value applied to that embedding value. The maximization is taken over r_j so that adversarial noise is generated. The index l denotes the sentence index.

The maximizing value of the adversarial perturbation in Equation 2 is given by Equation 3 below, which also expresses the language generation model in terms of NLLvMF.

Equation 4 below shows the estimated perturbation value.

The estimated perturbation value is composed of the output value of the decoder (100) and takes a normalized form. NLLvMF represents the distance between the output value of the decoder (100) and the target embedding value, and the purpose of Equation 4 is to find the r_j that makes this distance larger to a certain degree.

Equation 5 below is the new loss function NLLvMF() according to the present invention.

Whereas Kumar's paper introduces a heuristic approach that controls the training process through the magnitude of the hidden value, the present invention performs regularization by combining adversarial learning with the loss function NLLvMF() using the estimated adversarial perturbation value.

Self-attention model

One of the major differences between the present invention and Kumar's paper is that, before the RNN output value is compared with the target word vector (the distance minimization process), the self-attention model (140 in FIG. 2) reflects, or projects, context information onto the RNN output value.

The present invention performs the self-attention operation on the output sentence of the RNN, that is, the output values (132) of the RNN block (130), using the multi-head attention mechanism introduced in Vaswani's paper.

To perform the self-attention operation, the self-attention model (140) shown in FIG. 2 first projects the current context (the input word embedding sequence) to convert it into Q (Query), K (Key), and V (Value) matrices.

The projection converts the output values (132) of the RNN block (130), which form a word embedding sequence, into a Q matrix composed of query parameters, a K matrix composed of key parameters, and a V matrix composed of value parameters.

The self-attention model (140) then takes the dot product of Q and K and applies softmax to obtain probability values representing the similarity between the current word and the context words.

The context words are the words surrounding the current word, that is, the words before it and the words after it. The similarity between the current word and the context words therefore includes the similarity between the current word and each preceding word and the similarity between the current word and each following word.

For example, suppose the RNN block (130) outputs three values (a word embedding sequence): one for the preceding word, one for the current word (the word that is the current target of the self-attention operation), and one for the following word. The similarity between the current word and the context words then comprises the similarity between the current output and the preceding output and the similarity between the current output and the following output.

The self-attention model (140) then uses the calculated probability values (similarities) as weights to combine (sum) the word embedding values V of the context words, and takes the result as the attention value. For example, let a be the probability value representing how similar the current output is to the preceding output, c the probability value representing how similar it is to the following output, and b (= a + c) the probability value representing how similar it is to both. The attention value for the current word can then be calculated as a sum of three terms weighted by a, b, and c.
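The projection, similarity, and weighted-sum steps above can be sketched for a single attention head as follows. This is an illustration under assumptions rather than the patented model: the projection matrices `W_q`, `W_k`, `W_v` and the three hidden vectors in `H` are hypothetical, and the scores are divided by the square root of the key dimension as in the scaled dot-product attention of Vaswani's paper, a detail the text above does not state.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(H, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of hidden vectors H."""
    # Project the hidden sequence into query, key, and value matrices.
    Q, K, V = matmul(H, W_q), matmul(H, W_k), matmul(H, W_v)
    d_k = len(K[0])
    # Similarity of every position to every position, turned into probabilities.
    weights = [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K])
        for q_row in Q
    ]
    # Attention value: probability-weighted sum of the value vectors.
    return weights, matmul(weights, V)


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # preceding, current, following (toy values)
I2 = [[1.0, 0.0], [0.0, 1.0]]              # identity projections, for readability
weights, context = self_attention(H, I2, I2, I2)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every attention value in `context` is a convex combination of the value vectors. A multi-head variant would repeat this with several learned projection triples and concatenate the results.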

The self-attention model (140) then adds the calculated attention value to the existing vector value and normalizes the sum. Here, the existing vector value refers to the hidden values (132, a word embedding sequence) output from the RNN block (130), in which the adversarial perturbation value is reflected.
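The add-then-normalize step can be sketched as below. The text does not say which normalization is used; this sketch assumes layer normalization without learnable gain or bias, as in the residual blocks of Vaswani's transformer, and the function name and the sample vectors are hypothetical.

```python
import math


def add_and_norm(hidden, attention, eps=1e-5):
    """Residual connection followed by layer normalisation (assumed form)."""
    summed = [h + a for h, a in zip(hidden, attention)]
    mean = sum(summed) / len(summed)
    var = sum((s - mean) ** 2 for s in summed) / len(summed)
    return [(s - mean) / math.sqrt(var + eps) for s in summed]


hidden = [0.5, -1.0, 2.0, 0.1]      # hypothetical perturbed hidden value
attention = [0.2, 0.3, -0.5, 1.0]   # hypothetical attention value for the same position
new_hidden = add_and_norm(hidden, attention)
print([round(v, 3) for v in new_hidden])
```

The output has approximately zero mean and unit variance across its components, and plays the role of the new hidden value handed to the distance minimization step.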

Through this normalization, the self-attention model (140) calculates a new vector value reflecting self-attention, that is, a new hidden value (142 in FIG. 2) onto which the context information has been projected, and passes this new hidden value (142) to the distance minimization operator (150).

Multi-head attention improves attention capability through multiple heterogeneous projections, each of which must be learned. The updated vector value, that is, the new hidden value (142), is used in the operation that minimizes its distance from the target embedding vector.

Experimental results of the present invention

The embodiment of the present invention was evaluated on French/English machine translation. The evaluation set was that of the International Workshop on Spoken Language Translation (IWSLT16), which consists of 40,000 words in 2,369 sentence pairs.

The training set was a parallel text comprising 3.83 million words in 220,000 sentences for English and 3.92 million words in 220,000 sentences for French. Word embeddings trained with fastText were used.

These resources are the ones provided with Kumar's paper.

Table 1 below shows the results for six configurations. IN-adv applies the perturbation value to the input layer, and OUT-adv applies the perturbation value to the output layer.

ATT applies the attention model. Among the configurations, OUT-adv, which applies the perturbation value to the output layer, gave the best average result.

                Experiment 1   Experiment 2   Experiment 3   Average
Baseline [1]    30.59          30.08          30.25          30.31
IN-adv          30.26          30.74          30.48          30.49
OUT-adv         30.50          30.41          30.78          30.56
IN-OUT          30.47          30.29          30.15          30.30
ATT             30.31          30.44          30.36          30.37
IN-adv+ATT      30.15          30.02          30.13          30.10
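The averages in Table 1 can be re-derived from the three runs of each configuration. The short check below copies the scores from the table; the variable names are illustrative only.

```python
# Scores for Experiments 1 to 3, copied from Table 1.
scores = {
    "Baseline [1]": [30.59, 30.08, 30.25],
    "IN-adv": [30.26, 30.74, 30.48],
    "OUT-adv": [30.50, 30.41, 30.78],
    "IN-OUT": [30.47, 30.29, 30.15],
    "ATT": [30.31, 30.44, 30.36],
    "IN-adv+ATT": [30.15, 30.02, 30.13],
}
# Averages as reported in the last column of Table 1.
reported = {
    "Baseline [1]": 30.31,
    "IN-adv": 30.49,
    "OUT-adv": 30.56,
    "IN-OUT": 30.30,
    "ATT": 30.37,
    "IN-adv+ATT": 30.10,
}

means = {name: sum(runs) / len(runs) for name, runs in scores.items()}
best = max(means, key=means.get)

for name, value in means.items():
    print(f"{name:13s} computed {value:.2f}  reported {reported[name]:.2f}")
print("best average:", best)
```

The recomputed means agree with the reported column to two decimal places, and OUT-adv has the highest average.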

As described above, the best evaluation result for the embodiment was obtained when adversarial learning was applied to the output stage of the decoder. Applying the adversarial learning method to both the input stage and the output stage is nevertheless expected to yield the same or better results.

FIG. 3 is a flowchart illustrating a method for training a neural network model for language generation according to an embodiment of the present invention.

The method for training a neural network model for language generation according to an embodiment of the present invention is performed by a computing device or by at least one processor (CPU, GPU) within the computing device.

Referring to FIG. 3, in step 320, the adder block (110) adds adversarial perturbation values (see FIG. 2) to the input word embedding value (e.g., w_{i-1} in FIG. 2), which represents an input word as a vector, and to the target word embedding value (w_i in FIG. 2), which represents as a vector the correct word that should appear after the input word.

So that the word embedding values (e.g., w_{i-1} and w_i in FIG. 2) and the adversarial perturbation values can be added, step 320 is preceded by step 310, in which the adversarial perturbation values are calculated (estimated).

Here, the adversarial perturbation values serve to make the distance between the value obtained by transforming the input word embedding value (e.g., w_{i-1} in FIG. 2) through the recurrent neural network and the target word embedding value (w_i in FIG. 2) larger than a certain level. In other words, the adversarial perturbation values are used as information that intentionally lowers the similarity between the input word embedding value and the target word embedding value.

The adversarial perturbation values may be calculated (estimated) in the recurrent neural network block (130). To do so, the adder block (110) first passes the input word embedding value (e.g., w_{i-1} in FIG. 2) to the recurrent neural network block (130) unchanged, without performing any addition on it.

The recurrent neural network block (130) then performs a recurrent neural network operation on the input word embedding value (e.g., w_{i-1} in FIG. 2) received from the adder block (110) to calculate an initial hidden value, and takes the calculated initial hidden value as the adversarial perturbation value.

When the calculation of the adversarial perturbation values in step 310 is complete, the calculated values are fed back to the adder blocks (110, 120). The adder blocks (110, 120) then add the adversarial perturbation values fed back from the recurrent neural network block (130) to the input word embedding value (e.g., w_{i-1} in FIG. 2) and to the target word embedding value (w_i in FIG. 2), respectively.

Next, in step 330, the recurrent neural network block (130) performs a recurrent neural network operation on the input word embedding value to which the adversarial perturbation value has been added, and thereby calculates a hidden value (see FIG. 2).

Next, in step 340, the self-attention model (140) performs a self-attention operation on the hidden value calculated in step 330, projecting (applying) context information about the words surrounding the input word onto that hidden value. The self-attention operation may be, for example, a multi-head attention operation.

Before the self-attention operation can project (apply) the context information about the surrounding words onto the calculated hidden value, two processes are performed. First, the adder block (110) adds, to each surrounding word embedding value corresponding to a word surrounding the input word (e.g., w_0, w_1, and w_{n-1} in FIG. 2), the adversarial perturbation value corresponding to that embedding value. Second, the recurrent neural network block (130) performs a recurrent neural network operation on the surrounding word embedding values to which the corresponding adversarial perturbation values have been added, thereby calculating surrounding hidden values (see FIG. 2).

Once the surrounding hidden values have been calculated, the self-attention operation of step 340 uses them as the context information and applies (projects) them onto the calculated hidden value.

Here, the surrounding words include the words before and after the input word that is the target of the self-attention operation. The surrounding word embedding values accordingly include the preceding word embedding values (w_0 and w_1 in FIG. 2) corresponding to the preceding words and the following word embedding value (w_{n-1} in FIG. 2) corresponding to the following word.

The surrounding hidden values include the preceding hidden values for the preceding word embedding values (w_0 and w_1 in FIG. 2) and the following hidden value for the following word embedding value (w_{n-1} in FIG. 2). The adversarial perturbation value has been applied to each of the preceding and following hidden values by way of the adder block (110).

The process of projecting the calculated surrounding hidden values onto the calculated hidden value is described in more detail below.

First, probability values representing the degree of similarity between the calculated hidden value and the calculated surrounding hidden values are computed. For example, the probability values include the similarity between the calculated hidden value and each of the surrounding hidden values shown in FIG. 2.

As described above, the probability values are calculated by converting the output values of the RNN block (130), which form a word embedding sequence, into a Q matrix composed of query parameters, a K matrix composed of key parameters, and a V matrix composed of value parameters. Softmax is then applied to the dot product of the Q and K matrices to obtain the similarity between the current word (the current hidden value in FIG. 2) and the context words (the surrounding hidden values in FIG. 2).

The calculated hidden value and the calculated surrounding hidden values are then summed using the probability values as weights, and the result of the summation is normalized. Through this process, the surrounding hidden values, that is, the context information, are projected (applied) onto the calculated hidden value.

Next, in step 350, to perform adversarial learning on the neural network model, the distance minimization operator (150) performs an operation that minimizes the distance between the hidden value onto which the context information has been projected and the target word embedding value to which the adversarial perturbation value has been added.

The distance between the hidden value onto which the context information has been projected and the target word embedding value may be minimized using, for example, a negative log-likelihood as the loss function. Here, the loss function may be a function based on the von Mises-Fisher (vMF) distribution.

Each step of the training method described above may be implemented as a hardware module, a software module, or a combination of the two, executed by a processor. The entities that perform the steps, namely the adder block, the recurrent neural network block, the self-attention model, and the distance minimization operator, may be implemented as first to fourth operation logic units inside the processor, respectively.

A software module may reside in a storage medium (that is, memory and/or storage) such as RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable disk, or a CD-ROM.

The storage medium is, for example, coupled to the processor so that the processor can read information from the storage medium and write information to it. Alternatively, the storage medium may be integral to the processor.

The processor and the storage medium may reside within an application-specific integrated circuit (ASIC). The ASIC may reside within a user terminal. Alternatively, the processor and the storage medium may reside as discrete components within the user terminal.

Although the exemplary methods of the present disclosure are presented as a series of operations for clarity of description, this is not intended to limit the order in which the steps are performed; where necessary, the steps may be performed simultaneously or in a different order.

To implement a method according to the present disclosure, other steps may be added to the illustrated steps, some steps may be omitted while the remaining steps are kept, or some steps may be omitted and other steps added.

The various embodiments of the present disclosure are not an exhaustive list of all possible combinations but are intended to illustrate representative aspects of the present disclosure, and the matters described in the various embodiments may be applied independently or in combinations of two or more.

In addition, the various embodiments of the present disclosure may be implemented in hardware, firmware, software, or a combination thereof. A hardware implementation may use one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, and the like.

The scope of the present disclosure includes software or machine-executable instructions (e.g., an operating system, an application, firmware, a program) that cause operations according to the methods of the various embodiments to be executed on a device or computer, and a non-transitory computer-readable medium that stores such software or instructions and is executable on a device or computer.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains can make various modifications and variations without departing from the essential characteristics of the present invention.

Accordingly, the embodiments described herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of the present invention is not limited by these embodiments. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within a scope equivalent to them should be interpreted as falling within the scope of the present invention.

Claims

A method for training a neural network model for language generation, performed by at least one processor in a computing device, the method comprising:
a step of, in an adder block, passing an input word embedding value unchanged to a recurrent neural network block without performing any addition on the input word embedding value, then, in the recurrent neural network block, performing a recurrent neural network operation on the input word embedding value received from the adder block to calculate an initial hidden value, and using the calculated initial hidden value as an adversarial perturbation value;
a step of, in the adder block, adding the adversarial perturbation value to the input word embedding value, which represents an input word as a vector, and to a target word embedding value, which represents as a vector the correct word to appear next after the input word;
a step of, in the recurrent neural network block, calculating a hidden value by performing a recurrent neural network operation on the input word embedding value to which the adversarial perturbation value has been added;
a step of, in a self-attention model, performing a self-attention operation on the calculated hidden value to project context information about surrounding words of the input word onto the calculated hidden value; and
a step of, in a distance minimization operator, performing adversarial learning on the neural network model through an operation of minimizing a distance value between the hidden value onto which the context information has been projected and the target word embedding value to which the adversarial perturbation value has been added.
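For illustration only, the sequence of steps recited above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the tanh recurrent cell, the single-head dot-product attention with a residual sum, the cosine distance used as the distance value, and all names (`rnn_forward`, `self_attention`, `cosine_distance`, `adversarial_loss`, `W_in`, `W_rec`) are hypothetical choices that the claim does not specify. The embedding and hidden dimensions are assumed equal so that the initial hidden value can be added to the embeddings.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rnn_forward(embeddings, W_in, W_rec):
    """Assumed tanh RNN cell; returns one hidden vector per position."""
    h = [0.0] * len(W_rec)
    hiddens = []
    for x in embeddings:
        pre = [a + b for a, b in zip(matvec(W_in, x), matvec(W_rec, h))]
        h = [math.tanh(p) for p in pre]
        hiddens.append(h)
    return hiddens

def self_attention(hiddens):
    """Assumed single-head dot-product attention over all positions,
    added back onto each hidden vector (residual sum)."""
    out = []
    for q in hiddens:
        scores = [sum(a * b for a, b in zip(q, k)) for k in hiddens]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        ctx = [sum(e / z * k[i] for e, k in zip(exps, hiddens))
               for i in range(len(q))]
        out.append([a + c for a, c in zip(q, ctx)])
    return out

def cosine_distance(a, b):
    """Assumed stand-in for the distance value: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb + 1e-12)

def adversarial_loss(inputs, targets, W_in, W_rec):
    # Pass 1: the adder passes the embeddings through unchanged; the
    # resulting initial hidden values serve as the perturbation.
    perturbation = rnn_forward(inputs, W_in, W_rec)
    # Adder: add the perturbation to input and target embeddings.
    adv_inputs = [[x + r for x, r in zip(e, p)]
                  for e, p in zip(inputs, perturbation)]
    adv_targets = [[t + r for t, r in zip(e, p)]
                   for e, p in zip(targets, perturbation)]
    # Pass 2: recurrent operation on the perturbed inputs, then attention.
    contextual = self_attention(rnn_forward(adv_inputs, W_in, W_rec))
    # Quantity that training would minimize (mean distance over positions).
    dists = [cosine_distance(h, t) for h, t in zip(contextual, adv_targets)]
    return sum(dists) / len(dists)

if __name__ == "__main__":
    inputs = [[0.5, -0.2], [0.1, 0.3]]    # toy input word embeddings
    targets = [[0.1, 0.3], [-0.4, 0.2]]   # toy next-word embeddings
    W_in = [[1.0, 0.0], [0.0, 1.0]]
    W_rec = [[0.5, 0.0], [0.0, 0.5]]
    print(adversarial_loss(inputs, targets, W_in, W_rec))
```

The sketch only evaluates the quantity to be minimized; an actual training loop would update the model parameters from its gradient, for example with an automatic differentiation library.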

The method of claim 1, further comprising, prior to the adding step, a step of estimating, in the recurrent neural network block, the adversarial perturbation value, which serves to make the distance value between the input word embedding value transformed by the recurrent neural network and the target word embedding value greater than a certain level.

The method of claim 2, wherein the step of estimating the adversarial perturbation value comprises:
a step of, in the adder block, outputting the input word embedding value to the recurrent neural network block without adding the adversarial perturbation value to it; and
a step of, in the recurrent neural network block, calculating an initial hidden value by performing a recurrent neural network operation on the input word embedding value, and estimating the calculated initial hidden value as the adversarial perturbation value.

The method of claim 1, further comprising:
a step of, in the adder block, adding a surrounding word embedding value corresponding to a surrounding word of the input word and an adversarial perturbation value corresponding to the surrounding word embedding value; and
a step of, in the recurrent neural network block, calculating a surrounding hidden value by performing a recurrent neural network operation on the surrounding word embedding value to which the corresponding adversarial perturbation value has been added,
wherein the step of projecting the context information about the surrounding words of the input word onto the hidden value comprises
a step of projecting the calculated surrounding hidden value onto the calculated hidden value by using the calculated surrounding hidden value as the context information.

The method of claim 4, wherein the step of projecting the calculated surrounding hidden value onto the calculated hidden value comprises:
a step of calculating a probability value representing the degree of similarity between the calculated hidden value and the calculated surrounding hidden value;
a step of adding the calculated hidden value and the calculated surrounding hidden value using the probability value as a weight; and
a step of normalizing the result obtained by adding the calculated hidden value and the calculated surrounding hidden value, thereby projecting the calculated surrounding hidden value onto the calculated hidden value.
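As a hedged illustration, the three projection steps recited above (similarity probability, weighted addition, normalization) can be sketched in plain Python. The sketch assumes dot-product similarity passed through a softmax as the probability value and L2 normalization as the normalizing step; the claim names neither choice, and the function names `softmax`, `dot`, `l2_normalize`, and `project_context` are hypothetical.

```python
import math

def softmax(scores):
    """Turn similarity scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v, eps=1e-12):
    """Assumed normalization: scale the vector to (almost exactly) unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / (norm + eps) for x in v]

def project_context(hidden, surrounding):
    """Project surrounding hidden values onto `hidden`.

    hidden:      hidden vector of the input word
    surrounding: list of hidden vectors of the surrounding words
    """
    # Step 1: probability values expressing similarity (assumed: softmax
    # over dot products).
    weights = softmax([dot(hidden, s) for s in surrounding])
    # Step 2: add the surrounding hidden values to the hidden value,
    # using the probabilities as weights.
    context = [sum(w * s[i] for w, s in zip(weights, surrounding))
               for i in range(len(hidden))]
    combined = [h + c for h, c in zip(hidden, context)]
    # Step 3: normalize the sum.
    return l2_normalize(combined)

if __name__ == "__main__":
    print(project_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

In this sketch the surrounding vector that is more similar to the hidden value receives the larger weight, so the projected result stays closest to that direction while still carrying information from the other surrounding words.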

The method of claim 1, wherein the step of performing the adversarial learning comprises
a step of performing adversarial learning on the neural network model by minimizing, using the negative log-likelihood of a loss function, the distance value between the hidden value onto which the context information has been projected and the target word embedding value to which the adversarial perturbation value has been added.

The method of claim 6, wherein the loss function is a function related to the von Mises-Fisher (vMF) distribution.
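One way such a loss could look, offered as a sketch rather than as the claimed function: the vMF density on unit vectors in p dimensions is C_p(κ)·exp(κ μᵀx), with C_p(κ) = κ^(p/2−1) / ((2π)^(p/2) · I_(p/2−1)(κ)), where I is the modified Bessel function of the first kind. The code below assumes that the norm of the model's output vector plays the role of the concentration κ, so the negative log-likelihood becomes −log C_p(‖ê‖) − êᵀe; the claim does not state this parameterization. The function names and the truncated series used for the Bessel function (adequate for moderate κ only) are hypothetical choices.

```python
import math

def log_bessel_i(v, x, terms=200):
    """log I_v(x) for x > 0 via the power series
    I_v(x) = sum_k (x/2)^(2k+v) / (k! * Gamma(k+v+1)), summed in log space."""
    log_terms = [(2 * k + v) * math.log(x / 2.0)
                 - math.lgamma(k + 1) - math.lgamma(k + v + 1)
                 for k in range(terms)]
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

def log_vmf_normalizer(p, kappa):
    """log C_p(kappa) for the vMF distribution in p dimensions (kappa > 0)."""
    v = p / 2.0 - 1.0
    return (v * math.log(kappa)
            - (p / 2.0) * math.log(2.0 * math.pi)
            - log_bessel_i(v, kappa))

def vmf_nll(output, target):
    """Negative log-likelihood of a unit-length `target` embedding under a
    vMF distribution whose mean direction and concentration are both taken
    from `output` (assumption: kappa = ||output||, which must be nonzero)."""
    kappa = math.sqrt(sum(x * x for x in output))
    p = len(output)
    alignment = sum(o * t for o, t in zip(output, target))
    return -log_vmf_normalizer(p, kappa) - alignment

if __name__ == "__main__":
    # The loss is lower when the output points toward the target embedding.
    print(vmf_nll([2.0, 0.0, 0.0], [1.0, 0.0, 0.0]))
    print(vmf_nll([2.0, 0.0, 0.0], [-1.0, 0.0, 0.0]))
```

Minimizing this quantity pulls the output vector toward the direction of the target word embedding, which is the sense in which it acts as a distance between the two.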

The method of claim 1, wherein the self-attention operation is a multi-head attention operation.

The method of claim 1, wherein the neural network model is a sequence-to-sequence model that includes an encoder and a decoder, and
the step of performing the adversarial learning is a step of performing the adversarial learning on the decoder.

A computing device that performs training of a neural network model, the computing device comprising a storage medium storing the neural network model and a processor connected to the storage medium and executing the neural network model stored in the storage medium,
wherein the processor comprises:
a first operation logic that adds an adversarial perturbation value to an input word embedding value, which represents an input word as a vector, and to a target word embedding value, which represents as a vector the correct word to appear next after the input word;
a second operation logic that calculates a hidden value by performing a recurrent neural network operation on the input word embedding value to which the adversarial perturbation value has been added;
a third operation logic that performs a self-attention operation on the calculated hidden value to project context information about surrounding words of the input word onto the calculated hidden value; and
a fourth operation logic that performs adversarial learning on the neural network model by minimizing a distance value between the hidden value onto which the context information has been projected and the target word embedding value to which the adversarial perturbation value has been added,
and wherein, in an adder block executed by the processor, the input word embedding value is passed unchanged to a recurrent neural network block without any addition being performed on it, and the recurrent neural network block then performs a recurrent neural network operation on the input word embedding value received from the adder block to calculate an initial hidden value and uses the calculated initial hidden value as the adversarial perturbation value.

The computing device of claim 10, wherein the second operation logic further calculates the adversarial perturbation value by setting the distance value between the input word embedding value transformed by a recurrent neural network and the target word embedding value to a certain level.

The computing device of claim 10, wherein the second operation logic performs a recurrent neural network operation on the input word embedding value to which the adversarial perturbation value has not been added, calculates an initial hidden value, and generates the calculated initial hidden value as the adversarial perturbation value.

The computing device of claim 10, wherein:
the first operation logic adds a surrounding word embedding value corresponding to a surrounding word of the input word and an adversarial perturbation value corresponding to the surrounding word embedding value;
the second operation logic calculates a surrounding hidden value by performing a recurrent neural network operation on the surrounding word embedding value to which the corresponding adversarial perturbation value has been added; and
the third operation logic performs an operation of projecting the calculated surrounding hidden value onto the calculated hidden value by using the calculated surrounding hidden value as the context information.

The computing device of claim 13, wherein, in order to project the calculated surrounding hidden value onto the calculated hidden value, the third operation logic calculates a probability value representing the degree of similarity between the calculated hidden value and the calculated surrounding hidden value, adds the calculated hidden value and the calculated surrounding hidden value using the probability value as a weight, and normalizes the result of the addition.

The computing device of claim 10, wherein, in order to perform adversarial learning on the neural network model, the fourth operation logic performs an operation of minimizing, using the negative log-likelihood of the von Mises-Fisher (vMF) distribution, the distance value between the hidden value onto which the context information has been projected and the target word embedding value to which the adversarial perturbation value has been added.