KR102532658B1

KR102532658B1 - Neural architecture search

Info

Publication number: KR102532658B1
Application number: KR1020227011808A
Authority: KR
Inventors: Barret Zoph; Quoc V. Le
Original assignee: Google LLC
Priority date: 2016-10-28
Filing date: 2017-10-27
Publication date: 2023-05-15
Anticipated expiration: 2037-10-27
Also published as: JP7210531B2; JP2021064390A; JP2023024993A; KR20220047688A; US11829874B2; JP6817431B2; KR102386806B1; DE102017125256A1; US20210295163A1; US20230368024A1; US20190251439A1; CN108021983A; DE202017106532U1; KR20190052143A; US11030523B2; WO2018081563A9; JP2019533257A; WO2018081563A1; JP7516482B2

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining neural network architectures are disclosed. One of the methods includes generating, using a controller neural network, a batch of output sequences, each output sequence in the batch defining a respective architecture of a child neural network that is configured to perform a particular neural network task; for each output sequence in the batch: training a respective instance of the child neural network having the architecture defined by the output sequence, and evaluating the performance of the trained instance on the particular neural network task to determine a performance metric for the trained instance; and using the performance metrics for the trained instances to adjust the current values of the controller parameters of the controller neural network.

Description

Neural Architecture Search

This specification relates to modifying neural network architectures.

Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

Some neural networks are recurrent neural networks. A recurrent neural network is a neural network that receives an input sequence and generates an output sequence from the input sequence. In particular, a recurrent neural network can use some or all of the internal state of the network from a previous time step in computing an output at the current time step. An example of a recurrent neural network is a long short-term memory (LSTM) neural network that includes one or more LSTM memory blocks. Each LSTM memory block can include one or more cells that each include an input gate, a forget gate, and an output gate that allow the cell to store previous states for the cell, e.g., for use in generating a current activation or to be provided to other components of the LSTM neural network.

This specification describes how a system implemented as computer programs on one or more computers in one or more locations can determine, using a controller neural network, an architecture for a child neural network that is configured to perform a particular neural network task.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. The system can effectively and automatically, i.e., without user intervention, select a neural network architecture that will result in a high-performing neural network for a particular task. The system can effectively determine novel neural network architectures that are adapted to a particular task, allowing the resulting child neural network to have improved performance on the task. Because the system determines the architecture by training a controller neural network through reinforcement learning, the system can effectively explore a large space of possible architectures to identify an architecture for the child neural network that is adapted to the particular task.

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

FIG. 1 shows an example neural architecture search system.
FIG. 2A is a diagram of an example of the controller neural network generating an output sequence.
FIG. 2B is a diagram of an example of the controller neural network generating an output sequence that defines an architecture that includes skip connections.
FIG. 2C is a diagram of an example of the controller neural network generating an output sequence that defines an architecture for a recurrent cell.
FIG. 3 is a flow diagram of an example process for updating current values of the controller parameters.
Like reference numbers and designations in the various drawings indicate like elements.

This specification describes a system implemented as computer programs on one or more computers in one or more locations that determines, using a controller neural network, an architecture for a child neural network that is configured to perform a particular neural network task.

The child neural network can be configured to receive any kind of digital data input and to generate any kind of score, classification, or regression output based on the input.

For example, if the inputs to the child neural network are images or features that have been extracted from images, the output generated by the child neural network for a given image may be a score for each of a set of object categories, with each score representing an estimated likelihood that the image contains an image of an object belonging to the category.

As another example, if the inputs to the child neural network are Internet resources (e.g., web pages), documents, or portions of documents, or features extracted from Internet resources, documents, or portions of documents, the output generated by the child neural network for a given Internet resource, document, or portion of a document may be a score for each of a set of topics, with each score representing an estimated likelihood that the Internet resource, document, or document portion is about the topic.

As another example, if the inputs to the child neural network are features of an impression context for a particular advertisement, the output generated by the child neural network may be a score that represents an estimated likelihood that the particular advertisement will be clicked on.

As another example, if the inputs to the child neural network are features of a personalized recommendation for a user, e.g., features characterizing the context for the recommendation, such as features characterizing previous actions taken by the user, the output generated by the child neural network may be a score for each of a set of content items, with each score representing an estimated likelihood that the user will respond favorably to being recommended the content item.

As another example, if the input to the child neural network is a sequence of text in one language, the output generated by the child neural network may be a score for each of a set of pieces of text in another language, with each score representing an estimated likelihood that the piece of text in the other language is a proper translation of the input text into the other language.

As another example, if the input to the child neural network is a sequence representing a spoken utterance, the output generated by the child neural network may be a score for each of a set of pieces of text, with each score representing an estimated likelihood that the piece of text is the correct transcript for the utterance.

FIG. 1 shows an example neural architecture search system 100. The neural architecture search system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.

The neural architecture search system 100 is a system that obtains training data 102 for training a neural network to perform a particular task and a validation set 104 for evaluating the performance of the neural network on the particular task, and uses the training data 102 and the validation set 104 to determine an architecture for a child neural network that is configured to perform the particular task. The architecture defines the number of layers in the child neural network, the operations performed by each of the layers, and the connectivity between the layers in the child neural network, i.e., which layers receive inputs from which other layers in the child neural network.

Generally, the training data 102 and the validation set 104 both include a set of neural network inputs and, for each network input, a respective target output that should be generated by the child neural network to perform the particular task. For example, a larger set of training data may have been randomly partitioned to generate the training data 102 and the validation set 104.

The system 100 can receive the training data 102 and the validation set 104 in any of a variety of ways. For example, the system 100 can receive training data as an upload from a remote user of the system over a data communication network, e.g., using an application programming interface (API) made available by the system 100, and randomly divide the uploaded data into the training data 102 and the validation set 104. As another example, the system 100 can receive an input from a user specifying which data that is already maintained by the system 100 should be used for training the neural network, and then divide the specified data into the training data 102 and the validation set 104.
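
The random division into training data and a validation set can be sketched as follows. This is a minimal illustration only; the function name, the 20% validation fraction, and the fixed seed are assumptions and do not come from the specification.

```python
import random

def split_train_validation(examples, validation_fraction=0.2, seed=0):
    """Randomly partition examples into (training, validation) lists.

    The fraction and seed are illustrative assumptions.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

# Each example pairs a network input with its target output.
examples = [(x, x % 2) for x in range(10)]
train, validation = split_train_validation(examples)
```

Every example lands in exactly one of the two sets, so the validation set measures performance on inputs the child neural network was not trained on.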

The neural architecture search system 100 includes a controller neural network 110, a training engine 120, and a controller parameter update engine 130.

The controller neural network 110 is a neural network that has parameters, referred to in this specification as "controller parameters," and that is configured to generate output sequences in accordance with the controller parameters. Each output sequence generated by the controller neural network 110 defines a respective possible architecture for the child neural network.

In particular, each output sequence includes a respective output at each of multiple time steps, and each time step in the output sequence corresponds to a different hyperparameter of the architecture of the child neural network. Thus, each output sequence includes, at each time step, a respective value of the corresponding hyperparameter. Collectively, the values of the hyperparameters in a given output sequence define an architecture for the child neural network. Generally, a hyperparameter is a value that is set before training of the child neural network begins and that affects the operations performed by the child neural network. Output sequences and possible hyperparameters are described in more detail below with reference to FIGS. 2A-2C.

Generally, the system 100 determines the architecture for the child neural network by training the controller neural network 110 to adjust the values of the controller parameters.

In particular, during an iteration of the training procedure, the system 100 generates a batch of sequences 112 using the controller neural network 110 in accordance with the current values of the controller parameters. For each output sequence in the batch 112, the training engine 120 trains an instance of the child neural network that has the architecture defined by the output sequence on the training data 102 and evaluates the performance of the trained instance on the validation set 104. The controller parameter update engine 130 then uses the results of the evaluations for the output sequences in the batch 112 to update the current values of the controller parameters so as to improve the expected performance on the task of the architectures defined by the output sequences generated by the controller neural network 110. Evaluating the performance of trained instances and updating the current values of the controller parameters are described in more detail below with reference to FIG. 3.
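
The control flow of one such training iteration can be sketched as below. This is only a skeleton under stated assumptions: the four callables and the toy stand-ins passed to them are hypothetical placeholders for the controller neural network 110, the training engine 120, and the controller parameter update engine 130, and the actual update rule is not shown.

```python
import random

def controller_training_iteration(sample_sequence, train_child, evaluate,
                                  update_controller, batch_size):
    """Run one iteration: sample a batch, train and score each child, update."""
    batch = [sample_sequence() for _ in range(batch_size)]
    metrics = []
    for sequence in batch:
        child = train_child(sequence)      # train on the training data
        metrics.append(evaluate(child))    # score on the validation set
    update_controller(batch, metrics)      # adjust the controller parameters
    return batch, metrics

# Toy stand-ins (hypothetical): a "child" is just its sequence, and its
# "validation metric" is a made-up function of the chosen values.
rng = random.Random(0)
update_log = []
batch, metrics = controller_training_iteration(
    sample_sequence=lambda: [rng.choice([1, 3, 5, 7]) for _ in range(3)],
    train_child=lambda sequence: tuple(sequence),
    evaluate=lambda child: 1.0 / (1.0 + sum(child)),
    update_controller=lambda b, m: update_log.append((len(b), len(m))),
    batch_size=4,
)
```

The update step receives the whole batch together with one performance metric per output sequence, which mirrors how the evaluation results for the batch 112 are passed to the update engine.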

By repeatedly updating the values of the controller parameters in this manner, the system 100 can train the controller neural network 110 to generate output sequences that result in child neural networks that have increased performance on the particular task, i.e., to maximize the expected accuracy on the validation set 104 of the architectures proposed by the controller neural network 110.

Once the controller neural network 110 has been trained, the system 100 can select the architecture that performed best on the validation set 104 as the final architecture of the child neural network, or it can generate a new output sequence in accordance with the trained values of the controller parameters and use the architecture defined by the new output sequence as the final architecture of the child neural network.
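
The first of these two options, keeping the architecture that performed best on the validation set, can be illustrated with a short sketch. The function name, the pair representation, and the example metrics are assumptions made for illustration.

```python
def select_final_architecture(evaluated):
    """Return the output sequence with the highest validation metric.

    evaluated: list of (output_sequence, validation_metric) pairs.
    """
    best_sequence, _ = max(evaluated, key=lambda pair: pair[1])
    return best_sequence

# Made-up sequences and metrics, for illustration only.
evaluated = [
    ([3, 3, 1, 1, 24], 0.71),
    ([5, 7, 2, 1, 64], 0.83),
    ([1, 1, 3, 3, 36], 0.64),
]
final = select_final_architecture(evaluated)
```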

The neural architecture search system 100 can then output architecture data 150 that specifies the architecture of the child neural network, i.e., data specifying the layers that are part of the child neural network, the connectivity between the layers, and the operations performed by the layers. For example, the neural architecture search system 100 can output the architecture data 150 to the user who submitted the training data. In some cases, the data 150 also includes trained values of the parameters of the child neural network from the training of the trained instance of the child neural network that had the architecture.

In some implementations, instead of or in addition to outputting the architecture data 150, the system 100 trains an instance of the neural network having the determined architecture, e.g., either from scratch or by fine-tuning the parameter values generated as a result of training the instance of the child neural network having the architecture, and then uses the trained neural network to process requests received from users, e.g., through the API provided by the system. That is, the system 100 can receive inputs to be processed, use the trained child neural network to process the inputs, and provide the outputs generated by the trained neural network, or data derived from the generated outputs, in response to the received inputs.

In some implementations, the system 100 trains the controller neural network in a distributed manner. That is, the system 100 includes multiple replicas of the controller neural network. In some of these implementations where the training is distributed, each replica has a dedicated training engine that generates performance metrics for batches of output sequences output by the replica and a dedicated controller parameter update engine that determines updates to the controller parameters using the performance metrics. Once the controller parameter update engine has determined an update, it can transmit the update to a central parameter update server that is accessible to all of the controller parameter update engines. The central parameter update server can update the values of the controller parameters that are maintained by the server and send the updated values to the controller parameter update engine. In some cases, each of the multiple replicas and its corresponding training engine and parameter update engine can operate asynchronously from every other set of training engines and parameter update engines.
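
The asynchronous arrangement can be sketched with threads standing in for replicas. This is a toy illustration under stated assumptions: the class and function names are hypothetical, the updates are fixed made-up vectors rather than values derived from performance metrics, and a real deployment would run replicas on separate machines.

```python
import threading

class CentralParameterServer:
    """Holds the controller parameter values shared by all replicas."""

    def __init__(self, initial_values):
        self._values = list(initial_values)
        self._lock = threading.Lock()

    def apply_update(self, update):
        """Add an update to the stored values and return the new values."""
        with self._lock:
            self._values = [v + u for v, u in zip(self._values, update)]
            return list(self._values)

def run_replica(server, replica_id, num_steps):
    for _ in range(num_steps):
        # Stand-in for: sample a batch, train and evaluate the children,
        # and derive an update from the resulting performance metrics.
        update = [0.01 * (replica_id + 1), -0.01]
        server.apply_update(update)  # a replica would continue from the result

server = CentralParameterServer([0.0, 0.0])
threads = [threading.Thread(target=run_replica, args=(server, i, 5))
           for i in range(3)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
final_values = server.apply_update([0.0, 0.0])
```

The lock makes each update atomic, so replicas can push updates in any order without waiting for one another.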

FIG. 2A is a diagram 200 of an example of the controller neural network 110 generating an output sequence.

In particular, the diagram 200 depicts the processing performed by the controller neural network 110 for seven example time steps 202-214 during the generation of an output sequence. As described in more detail below, each of the seven time steps 202-214 corresponds to a different hyperparameter of the neural network architecture.

The controller neural network 110 is a recurrent neural network that includes one or more recurrent neural network layers, e.g., layers 220 and 230, that are configured to, for each time step, receive as input the value of the hyperparameter corresponding to the preceding time step in the given output sequence and to process the input to update the current hidden state of the recurrent neural network. For example, the recurrent layers in the controller neural network 110 can be long short-term memory (LSTM) layers or gated recurrent unit (GRU) layers. In the example of FIG. 2A, at time step 208, the layers 220 and 230 receive as input the value of the hyperparameter from the preceding time step 206 and update the hidden states of the layers 220 and 230 from time step 206 to generate as output an updated hidden state 232.

The controller neural network 110 also includes a respective output layer for each time step in the output sequence, e.g., output layers 242-254 for time steps 202-214, respectively. Each output layer is configured to receive an output layer input that includes the updated hidden state at the time step and to generate an output for the time step that defines a score distribution over the possible values of the hyperparameter at the time step. For example, each output layer can first project the output layer input into the appropriate dimensionality for the number of possible values of the corresponding hyperparameter and then apply a softmax to the projected output layer input to generate a respective score for each of the possible values of the hyperparameter at the time step. For example, the output layer 248 for time step 208 is configured to receive an input that includes the hidden state 232 and to generate a respective score for each of the possible values of the stride height hyperparameter.
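
The projection followed by a softmax can be sketched in plain Python. The weights, biases, and hidden state below are made-up numbers chosen for illustration, and a linear projection with a bias term is an assumption about the form of the projection.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def output_layer(hidden_state, weights, biases):
    """Project the hidden state to one logit per possible value, then softmax."""
    logits = [sum(w * h for w, h in zip(row, hidden_state)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

hidden_state = [0.5, -1.0, 0.25]
# One weight row per possible value, e.g. three candidate stride heights.
weights = [[0.1, 0.2, 0.3], [0.0, -0.5, 0.4], [0.3, 0.3, 0.3]]
biases = [0.0, 0.0, 0.0]
scores = output_layer(hidden_state, weights, biases)
```

The number of weight rows equals the number of possible values of the hyperparameter, so each time step can use an output layer of a different width.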

Thus, to generate the hyperparameter value for a given time step in an output sequence, the system 100 provides as input to the controller neural network the value of the hyperparameter at the preceding time step in the output sequence, and the controller neural network generates an output for the time step that defines a score distribution over the possible values of the hyperparameter at the time step. For the very first time step in the output sequence, because there is no preceding time step, the system 100 can instead provide a predetermined placeholder input. The system 100 then samples from the possible values in accordance with the score distribution to determine the value of the hyperparameter at the time step in the output sequence. The possible values that a given hyperparameter can take are fixed prior to training, and the number of possible values can differ between hyperparameters.
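
The step-by-step sampling described above can be sketched as follows. The `controller_step` interface, the `PLACEHOLDER` value, and the uniform toy controller are all hypothetical; a real controller would be the recurrent network with its per-step output layers.

```python
import random

PLACEHOLDER = None  # assumed stand-in for the predetermined first-step input

def sample_output_sequence(controller_step, possible_values, rng):
    """Sample one value per time step, feeding each value into the next step.

    possible_values: one list of candidate values per time step.
    controller_step: callable (previous_value, state, num_candidates)
        returning (scores, new_state); a hypothetical interface.
    """
    sequence = []
    state = None
    previous = PLACEHOLDER
    for candidates in possible_values:
        scores, state = controller_step(previous, state, len(candidates))
        value = rng.choices(candidates, weights=scores, k=1)[0]
        sequence.append(value)
        previous = value
    return sequence

def uniform_controller_step(previous_value, state, num_candidates):
    """Toy controller: ignores its input and scores every value equally."""
    return [1.0 / num_candidates] * num_candidates, state

rng = random.Random(0)
possible_values = [[1, 3, 5, 7], [1, 3, 5, 7], [1, 2, 3], [1, 2, 3],
                   [24, 36, 48, 64]]
sequence = sample_output_sequence(uniform_controller_step, possible_values, rng)
```

Sampling, rather than always taking the highest-scoring value, is what lets different output sequences in a batch define different architectures.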

Generally, the number of layers to be included in the architecture defined by a given output sequence is fixed prior to generating the sequence. In some implementations, each architecture defined by the output sequences generated during the training of the controller neural network has the same number of layers. In other implementations, the system uses a schedule that increases the number of layers in the child neural networks as training progresses. As one example, the system can start at 6 layers and increase the depth by one or more layers every 1,600 samples during training.
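
Such a depth schedule can be written as a small function. The text says the depth grows by one or more layers every 1,600 samples; the default of one layer per increase below is an assumption, as is the function name.

```python
def num_layers_for_sample(samples_so_far, initial_layers=6,
                          layers_per_increase=1, samples_per_increase=1600):
    """Number of child-network layers after a given number of samples.

    layers_per_increase=1 is an illustrative choice.
    """
    increases = samples_so_far // samples_per_increase
    return initial_layers + layers_per_increase * increases
```

For example, under these defaults the first 1,600 sampled architectures have 6 layers and the next 1,600 have 7.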

In the example of FIG. 2A, the child neural network is a convolutional neural network, and the hyperparameters include hyperparameters for each convolutional neural network layer in the child neural network. In particular, in FIG. 2A, time step 202 corresponds to a hyperparameter of convolutional layer N-1 of the child neural network, time steps 204-212 correspond to hyperparameters of convolutional layer N, and time step 214 corresponds to a hyperparameter of convolutional layer N+1. For example, the convolutional layers may be arranged in a stack, with layer N receiving as input the output generated by layer N-1 and generating an output that is provided as input to layer N+1.

In the example of FIG. 2A, for a convolutional layer, the hyperparameters that define the operations performed by the layer are the number of filters of the layer, the filter height for each filter, the filter width for each filter, the stride height for applying each filter, and the stride width for each filter. In other examples, some of these can be removed, e.g., certain of these hyperparameters can be assumed to be fixed; other hyperparameters, e.g., the type of activation function or whether the convolutions are dilated or masked, can be added; or both.

In one example implementation, the possible values for the filter height are [1, 3, 5, 7], the possible values for the filter width are [1, 3, 5, 7], the possible values for the number of filters are [24, 36, 48, 64], and the possible values for the stride height and width are [1, 2, 3].
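
The example value sets can be collected into a small search space, and a flat output sequence can be grouped into per-layer hyperparameters. This is a sketch only: the dictionary keys, the `decode` helper, and the per-layer ordering are assumptions (the ordering is chosen so that the third hyperparameter of a layer is the stride height, matching time step 208), and the last filter-count value is taken to be 64.

```python
import math

SEARCH_SPACE = {
    "filter_height": [1, 3, 5, 7],
    "filter_width": [1, 3, 5, 7],
    "stride_height": [1, 2, 3],
    "stride_width": [1, 2, 3],
    "num_filters": [24, 36, 48, 64],
}
HYPERPARAMETER_ORDER = list(SEARCH_SPACE)  # assumed per-layer ordering

def decode(sequence):
    """Group a flat list of sampled values into one dict per layer."""
    per_layer = len(HYPERPARAMETER_ORDER)
    if len(sequence) % per_layer != 0:
        raise ValueError("sequence length must be a multiple of %d" % per_layer)
    layers = []
    for start in range(0, len(sequence), per_layer):
        values = sequence[start:start + per_layer]
        layer = dict(zip(HYPERPARAMETER_ORDER, values))
        for name, value in layer.items():
            if value not in SEARCH_SPACE[name]:
                raise ValueError("%r is not a possible %s" % (value, name))
        layers.append(layer)
    return layers

architecture = decode([3, 3, 1, 1, 24, 5, 7, 2, 1, 64])
choices_per_layer = math.prod(len(v) for v in SEARCH_SPACE.values())
```

With these value sets a single layer already has 4 × 4 × 3 × 3 × 4 = 576 possible configurations, so the number of architectures grows quickly with depth.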

In the example of FIG. 2A, the configuration of the layers in the child neural network, i.e., which layers receive inputs from which other layers, is fixed. In other examples, however, the hyperparameters include hyperparameters that define the connectivity between the layers in the child neural network.

FIG. 2B is a diagram 250 of an example of the controller neural network 110 generating an output sequence that defines an architecture that includes skip connections.

In particular, in the example of FIG. 2B, for one or more of the layers in the child neural network, the hyperparameters include a skip connection hyperparameter that defines which earlier layers have a skip connection to the layer. More specifically, the time steps in the output sequence include a respective anchor point time step for each of the one or more layers, at which the hyperparameter is a skip connection hyperparameter, e.g., time step 252 for layer N-1 and time step 254 for layer N.

소정의 계층의 소정의 앵커 포인트 시간 단계에 대한 출력 계층은 차일드 신경망에서 현재 계층보다 이전의 각 계층에 대응하는 개별 노드를 포함한다. 각각의 노드는 (i) 앵커 포인트 단계에 대한 업데이트된 은닉 상태, 및 (ii) 대응하는 이전 계층, 즉 노드에 대응하는 이전 계층의 앵커 포인트 시간 단계에 대한 업데이트된 은닉 상태를, 상기 이전 계층이 차일드 신경망의 현재 계층에 연결될 가능성을 나타내는 스코어를 생성하도록 하는 파라미터들의 세트에 따라 처리하도록 구성된다. 예를 들어, 이전 계층 j에 대응하는 계층 i의 출력 계층에 있는 노드는 수학식 1을 만족하는 상기 대응하는 이전 계층에 대한 스코어를 생성할 수 있다.The output layer for a given anchor point time step of a given layer contains individual nodes corresponding to each layer prior to the current layer in the child neural network. Each node has (i) an updated hidden state for the anchor point step, and (ii) an updated hidden state for the anchor point time step of the corresponding previous layer, i.e., the previous layer corresponding to the node. It is configured to process according to a set of parameters to generate a score representing a likelihood of being connected to the current layer of the child neural network. For example, a node in the output layer of layer i corresponding to previous layer j may generate a score for the corresponding previous layer that satisfies Equation 1.

$$P(\text{layer } j \text{ is an input to layer } i) = \operatorname{sigmoid}\left(v^{T} \tanh\left(W_{prev}\, h_j + W_{curr}\, h_i\right)\right) \quad \text{(Equation 1)}$$

where $v^{T}$, $W_{prev}$, and $W_{curr}$ are the parameters for the node, $h_j$ is the updated hidden state for the anchor point time step of the corresponding previous layer j, and $h_i$ is the updated hidden state for the anchor point time step of layer i.
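As a minimal sketch of how one output-layer node could turn two anchor point hidden states into a connection score, the snippet below assumes the score has the form sigmoid(v^T tanh(W_prev h_j + W_curr h_i)). The function names and the toy parameter values are hypothetical and chosen only for illustration; a real controller would learn these parameters.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def skip_score(v, W_prev, W_curr, h_j, h_i):
    """Probability-like score that previous layer j feeds layer i.

    Assumed form: sigmoid(v^T tanh(W_prev h_j + W_curr h_i)).
    """
    pre = [math.tanh(a + b)
           for a, b in zip(matvec(W_prev, h_j), matvec(W_curr, h_i))]
    logit = sum(vk * pk for vk, pk in zip(v, pre))
    return 1.0 / (1.0 + math.exp(-logit))


# Toy values (hypothetical): two-dimensional hidden states.
v = [0.5, -0.25]
W_prev = [[1.0, 0.0], [0.0, 1.0]]
W_curr = [[0.0, 1.0], [1.0, 0.0]]
h_j = [0.2, -0.1]   # anchor point hidden state of previous layer j
h_i = [0.3, 0.4]    # anchor point hidden state of current layer i
score = skip_score(v, W_prev, W_curr, h_j, h_i)
```

The sigmoid keeps the score strictly between 0 and 1, so it can be used directly as the probability of sampling "yes" for the skip connection.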

The system 100 then determines, for each previous layer, whether that layer is connected to the given layer with a skip connection by sampling either "yes" or "no" in accordance with the score generated by the node corresponding to the previous layer. If the system determines that multiple layers should be connected to the given layer, the outputs generated by all of the multiple layers are concatenated in the depth dimension to generate the input to the given layer. To ensure that skip connections do not cause "compilation failures", where one layer is not compatible with another layer, and that the network does not include any layer that has no input or no output, (i) if a layer is not connected to any input layer, the network input is used as the input to the layer, (ii) at the final layer, the system takes all layer outputs that have not been connected and concatenates them before sending the final concatenated output to the output layer of the network, and (iii) if the inputs to be concatenated have different sizes, the system pads the smaller layers with zeros so that the inputs to be concatenated have the same size.
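The sampling step and the three repair rules can be sketched as follows. This is an illustrative simplification, not the patented implementation: feature maps are represented as nested lists indexed `[position][channel]`, the string `"network_input"` is a hypothetical placeholder for the network input, and all function names are assumptions.

```python
import random


def sample_skip_connections(scores, rng):
    """scores[i][j] is the probability that previous layer j feeds layer i (j < i)."""
    return [[rng.random() < p for p in row] for row in scores]


def resolve_inputs(connections):
    """Apply rules (i) and (ii) to a sampled connection pattern."""
    inputs, used = [], set()
    for row in connections:
        sources = [j for j, connected in enumerate(row) if connected]
        used.update(sources)
        # Rule (i): a layer with no incoming connection reads the network input.
        inputs.append(sources if sources else ["network_input"])
    # Rule (ii): outputs that no layer consumed are concatenated and sent
    # to the output layer of the network.
    dangling = [j for j in range(len(connections)) if j not in used]
    return inputs, dangling


def pad_and_concat_depth(feature_maps):
    """Rule (iii): zero-pad shorter maps, then concatenate along the depth axis.

    Each feature map is a non-empty list of positions; each position is a
    list of channel values.
    """
    length = max(len(f) for f in feature_maps)
    merged = []
    for pos in range(length):
        channels = []
        for f in feature_maps:
            depth = len(f[0])
            channels.extend(f[pos] if pos < len(f) else [0.0] * depth)
        merged.append(channels)
    return merged


# Four layers; row i holds the scores for previous layers 0..i-1.
scores = [[], [1.0], [0.0, 0.0], [0.0, 1.0]]
connections = sample_skip_connections(scores, random.Random(0))
layer_inputs, output_sources = resolve_inputs(connections)
```

With scores of exactly 0.0 and 1.0 the sampling is deterministic, which makes the example easy to check by hand; intermediate scores would give a different pattern on each draw.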

In some examples, the child neural network includes multiple different layer types. For example, the child neural network can be a convolutional neural network that also includes other types of neural network layers, e.g., fully connected layers, pooling layers, depth concatenation layers, local contrast normalization, batch normalization, and an output layer, e.g., a softmax layer or another classifier layer.

In some of these examples, the positions and hyperparameters of these other layers are fixed, and the output sequence includes only hyperparameter values for the convolutional neural network layers in the child neural network. For example, the position of the output layer can be fixed as the last layer in the child neural network, and some or all of the convolutional layers can be followed or preceded by batch normalization layers.

In others of these examples, the hyperparameters defined by the output sequence include, for each layer, a value corresponding to the type of the layer. Because different types of layers have different hyperparameters, in these examples the system determines dynamically, during generation of the output sequence, which hyperparameters correspond to which time steps, based on which type of neural network layer is selected for a given layer. That is, the output layer that the system uses at a given time step depends on the value of the most recently sampled layer type hyperparameter.
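The dependence of later time steps on the sampled layer type can be illustrated with a small sketch. The search space below (`HYPERPARAMS_BY_TYPE`, its layer types, and the candidate values) is entirely hypothetical, and uniform random choice stands in for the controller's per-time-step output layers.

```python
import random

# Hypothetical search space: the hyperparameters that follow each layer type.
HYPERPARAMS_BY_TYPE = {
    "conv": {"filter_height": [1, 3, 5, 7],
             "filter_width": [1, 3, 5, 7],
             "num_filters": [24, 36, 48, 64]},
    "pool": {"pool_size": [2, 3], "stride": [1, 2]},
    "fully_connected": {"num_units": [64, 128, 256]},
}


def sample_layer(rng):
    """Sample the layer type first, then only the hyperparameters it needs."""
    layer_type = rng.choice(sorted(HYPERPARAMS_BY_TYPE))
    layer = {"type": layer_type}
    # Which time steps come next depends on the type that was just sampled.
    for name, choices in HYPERPARAMS_BY_TYPE[layer_type].items():
        layer[name] = rng.choice(choices)
    return layer


layer = sample_layer(random.Random(0))
```

The number of time steps spent on a layer therefore varies with its type: three for a convolutional layer in this toy space, two for pooling, and one for a fully connected layer.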

In some examples, the child neural network is a recurrent neural network. In these cases, the output sequence can define the architecture of a recurrent cell, and the recurrent cell can be repeated multiple times within the child neural network to generate the architecture for the neural network. As described above, in some cases the number of repetitions is fixed throughout training, while in other cases the system increases the number of repetitions as training progresses.

FIG. 2C is a diagram 270 of an example of the controller neural network 110 generating an output sequence that defines an architecture for a recurrent cell.

In particular, as shown in FIG. 2C, the output sequence includes a respective computation step for each node of a tree of computation steps that represents the computation performed by the recurrent cell. The recurrent cell receives two inputs: the input for the current time step, i.e., the input to the child network at the time step or an output generated by another component of the child network, and the output of the cell from the previous time step. The recurrent cell processes these two inputs to generate a cell output. As described below, in some cases the recurrent cell also receives a third input, namely a memory state.

More specifically, in FIG. 2C, the output sequence defines settings for three nodes of the tree: one leaf node at tree index 0, another leaf node at tree index 1, and an internal node at tree index 2.

Each node in the tree merges two inputs to generate an output, and, for each node, the output sequence includes (i) data identifying a combination method for combining the two inputs and (ii) an activation function to be applied to the combination of the two inputs to generate the output. Generally, the leaf nodes of the cell first apply a respective parameter matrix to each of the two inputs to the cell, while the internal nodes do not have any parameters. As described above, the combination method is selected from a set of possible combination methods, e.g., add or element-wise multiply, and the activation function is selected from a set of possible activation functions, e.g., identity, tanh, sigmoid, or ReLU.

For example, for the leaf node at tree index 0, because the system selected "add" as the combination function and "tanh" as the activation function, the leaf node at tree index 0 of the cell can perform the operation of Equation 2 to generate an output $a_0$:

$$a_0 = \tanh\left(W_1 \, x_t + W_2 \, h_{t-1}\right) \quad \text{(Equation 2)}$$

where $W_1$ and $W_2$ are the parameter matrices of the node, $x_t$ is the input to the cell at the time step, and $h_{t-1}$ is the output of the cell at the previous time step.

For the node at tree index 2, when the cell does not have a memory state, the example of FIG. 2C specifies that the two inputs to the node, i.e., the outputs of the two leaf nodes, are multiplied element-wise and that an element-wise sigmoid function is applied to the output of the element-wise multiplication to generate the output of the internal node, i.e., the output of the cell.
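The memory-free cell described above can be sketched as a small tree evaluator. The text fixes the settings for tree index 0 ("add", tanh) and tree index 2 (element-wise multiply, sigmoid); the setting used here for tree index 1 (element-wise multiply, ReLU) is an assumption made only to complete the example, as are the `spec` layout, the function names, and the identity parameter matrices.

```python
import math

COMBINE = {
    "add": lambda a, b: [x + y for x, y in zip(a, b)],
    "elem_mult": lambda a, b: [x * y for x, y in zip(a, b)],
}
ACTIVATE = {
    "identity": lambda z: z,
    "tanh": math.tanh,
    "sigmoid": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "relu": lambda z: max(0.0, z),
}


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def node_output(combine, activation, left, right):
    """Combine two inputs, then apply the activation element-wise."""
    return [ACTIVATE[activation](z) for z in COMBINE[combine](left, right)]


def cell_step(spec, leaf_params, x_t, h_prev):
    """One step of a two-leaf recurrent cell without a memory state."""
    leaf_outputs = []
    for (combine, activation), (W1, W2) in zip(spec["leaves"], leaf_params):
        # Leaf nodes apply their own parameter matrices to the cell inputs.
        leaf_outputs.append(
            node_output(combine, activation, matvec(W1, x_t), matvec(W2, h_prev)))
    # The internal node has no parameters; it merges the two leaf outputs.
    combine, activation = spec["internal"]
    return node_output(combine, activation, leaf_outputs[0], leaf_outputs[1])


spec = {
    "leaves": [("add", "tanh"), ("elem_mult", "relu")],  # index 1 is assumed
    "internal": ("elem_mult", "sigmoid"),
}
identity = [[1.0, 0.0], [0.0, 1.0]]
h_t = cell_step(spec, [(identity, identity), (identity, identity)],
                x_t=[0.5, -1.0], h_prev=[0.25, 2.0])
```

Keeping the combination methods and activations in lookup tables mirrors how the output sequence selects one option from a fixed set at each time step.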

Optionally, the architecture of the cell may include receiving a preceding memory state as an input. In these cases, the output sequence also includes values that define how the memory state of the cell is injected into the cell, i.e., how the preceding memory state is updated and which nodes have outputs that are modified using the preceding memory state before being passed to the next node in the tree.

In particular, the output sequence also includes two cell inject values that specify how the preceding memory state is combined with the output of one of the nodes in the tree to generate an updated output for the node, i.e., the combination method and the activation function for the combination, and two cell index values that specify (i) the node whose output is updated using the memory state and (ii) the node whose output is to be set as the updated memory state (before the activation function for the node is applied).

In the example of FIG. 2C, because the value generated for the second cell index is 0, the combination method for the injection is "add", and the activation function is ReLU, the cell can add the preceding cell state and the output of the node at tree index 0 (denoted $a_0$) and then apply ReLU to the sum to generate the updated output of the node at tree index 0. The cell can then provide the updated output as an input to the node at tree index 2.

Because the value generated for the first cell index is 1, the cell sets the updated memory state to the output of the node at tree index 1 before the activation function is applied.
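The two memory-related operations in this example can be sketched in a few lines, assuming vectors are plain Python lists. The function name `memory_step` and its argument names are hypothetical; the injection setting ("add" followed by ReLU at tree index 0) and the choice of tree index 1 as the source of the new memory state follow the example in the text.

```python
def memory_step(a0, pre_activation_1, c_prev):
    """Inject the preceding memory state and produce the updated one.

    a0:               output of the node at tree index 0
    pre_activation_1: output of the node at tree index 1 before its activation
    c_prev:           preceding memory state
    """
    # Injection: "add" the preceding memory state to a0, then apply ReLU.
    a0_updated = [max(0.0, a + c) for a, c in zip(a0, c_prev)]
    # Updated memory state: the pre-activation output at tree index 1.
    c_new = list(pre_activation_1)
    return a0_updated, c_new


a0_updated, c_new = memory_step(a0=[0.5, -2.0],
                                pre_activation_1=[0.125, -0.5],
                                c_prev=[0.25, 1.0])
```

The updated output `a0_updated` would then replace the original output of tree index 0 as an input to the internal node at tree index 2.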

While FIG. 2C depicts an example in which the tree includes two leaf nodes for ease of exposition, in practice the number of leaf nodes may be larger, e.g., 4, 8, or 16.

FIG. 3 is a flow diagram of an example process 300 for updating current values of the controller parameters. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a neural architecture search system, e.g., the neural architecture search system 100 of FIG. 1, appropriately programmed, can perform the process 300.

The system can repeatedly perform the process 300 to train the controller neural network, i.e., to determine trained values of the controller parameters from initial values of the controller parameters.

The system generates a batch of output sequences using the controller neural network and in accordance with the current values of the controller parameters as of the iteration (step 302). Each output sequence in the batch defines a respective architecture of the child neural network. In particular, because, as described above, the system samples from a score distribution when generating each hyperparameter value in an output sequence, the sequences in the batch will generally differ from one another even though they are all generated in accordance with the same controller parameter values. The batch generally includes a predetermined number of output sequences, e.g., 8, 16, 32, or 64 sequences.
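Step 302 can be illustrated with a deliberately simplified controller. In this sketch the scores at each time step are fixed logits rather than the output of a recurrent network conditioned on earlier samples, which is an assumption made to keep the example short; `step_logits` and the function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(step_logits, rng):
    """Sample one hyperparameter index per time step from its score distribution."""
    sequence = []
    for logits in step_logits:
        probs = softmax(logits)
        sequence.append(rng.choices(range(len(probs)), weights=probs)[0])
    return sequence


def sample_batch(step_logits, batch_size, rng):
    """All sequences share the same parameters but are sampled independently."""
    return [sample_sequence(step_logits, rng) for _ in range(batch_size)]


# Two time steps: three candidate values, then two candidate values.
step_logits = [[0.0, 1.0, 2.0], [0.5, 0.5]]
batch = sample_batch(step_logits, batch_size=8, rng=random.Random(0))
```

Because each value is sampled rather than chosen greedily, the eight sequences can differ from one another even though `step_logits` never changes within the batch.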

For each output sequence in the batch, the system trains an instance of the child neural network having the architecture defined by the output sequence to perform the particular neural network task (step 304). That is, for each output sequence in the batch, the system instantiates a neural network having the architecture defined by the output sequence and trains the instance on the received training data to perform the particular neural network task using a conventional machine learning training technique that is appropriate for the task, e.g., stochastic gradient descent with backpropagation or backpropagation-through-time. In some implementations, the system parallelizes the training of the child neural networks to reduce the overall training time for the controller neural network. The system can train each child neural network for a specified amount of time or for a specified number of training iterations.

For each output sequence in the batch, the system evaluates the performance of the corresponding trained instance of the child neural network on the particular neural network task to determine a performance metric for the trained instance on the particular neural network task (step 306). For example, the performance metric can be the accuracy of the trained instance on a validation set, as measured by an appropriate accuracy measure. For example, the accuracy measure can be perplexity when the outputs are sequences or classification error rate when the task is a classification task. As another example, the performance metric can be the average or the maximum of the accuracies of the instance over the last two, five, or ten epochs of the training of the instance.
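The last option, summarizing the final few epochs, could look like the following sketch. The function name, its parameters, and the accuracy values are hypothetical and are not results from the patent.

```python
def performance_metric(val_accuracies, last_n=5, mode="max"):
    """Summarize validation accuracy over the last `last_n` training epochs."""
    tail = val_accuracies[-last_n:]
    if not tail:
        raise ValueError("at least one epoch of accuracy is required")
    if mode == "max":
        return max(tail)
    if mode == "mean":
        return sum(tail) / len(tail)
    raise ValueError(f"unknown mode: {mode}")


# Illustrative per-epoch validation accuracies for one child network.
accuracies = [0.10, 0.50, 0.70, 0.60, 0.80, 0.75]
best_of_last_two = performance_metric(accuracies, last_n=2, mode="max")
mean_of_last_two = performance_metric(accuracies, last_n=2, mode="mean")
```

Aggregating over several epochs makes the metric less sensitive to noise in any single evaluation than using the final epoch alone.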

The system uses the performance metrics for the trained instances to adjust the current values of the controller parameters (step 308).

In particular, the system adjusts the current values by training the controller neural network, using a reinforcement learning technique, to generate output sequences that result in child neural networks having increased performance metrics. More specifically, the system trains the controller neural network to generate output sequences that maximize a received reward that is determined based on the performance metrics of the trained instances. In particular, the reward for a given output sequence is a function of the performance metric for the trained instance. For example, the reward can be one of: the performance metric, the square of the performance metric, the cube of the performance metric, or the square root of the performance metric.
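The four example reward shapes can be written as a small lookup table. The dictionary keys and the function name are hypothetical labels chosen for this sketch.

```python
import math

# Candidate mappings from performance metric to reward named in the text.
REWARD_FUNCTIONS = {
    "identity": lambda metric: metric,
    "square": lambda metric: metric ** 2,
    "cube": lambda metric: metric ** 3,
    "sqrt": math.sqrt,
}


def reward(metric, kind="identity"):
    """Map a performance metric to the reward used to train the controller."""
    return REWARD_FUNCTIONS[kind](metric)


squared = reward(0.5, "square")
cubed = reward(0.5, "cube")
rooted = reward(0.25, "sqrt")
```

For metrics between 0 and 1, the square and cube widen the relative gap between strong and weak architectures, while the square root narrows it.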

In some cases, the system trains the controller neural network to maximize the expected reward using a policy gradient technique. For example, the policy gradient technique can be a REINFORCE technique or a Proximal Policy Optimization (PPO) technique. For example, the system can estimate the gradient of the expected reward with respect to the controller parameters using an estimator of the gradient that satisfies Equation 3.

$$\frac{1}{m}\sum_{k=1}^{m}\sum_{t=1}^{T}\nabla_{\theta_c}\log P\left(a_t \mid a_{(t-1):1};\, \theta_c\right)\left(R_k - b\right) \quad \text{(Equation 3)}$$

where m is the number of sequences in the batch, T is the number of time steps in each sequence in the batch, $a_t$ is the output at time step t in a given output sequence, $R_k$ is the reward for output sequence k, $\theta_c$ are the controller parameters, and b is a baseline function, e.g., the exponential moving average of previous architecture accuracies.
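A REINFORCE-style estimate with a baseline can be sketched for a toy controller. The sketch assumes the estimator averages, over the m sequences, the sum over time steps of the gradient of log P(a_t) weighted by (R_k - b). It further assumes the controller parameters are simply independent logits per time step, so the gradient of log-softmax has the closed form "one-hot minus probabilities"; a real controller would be a recurrent network differentiated by a framework. All names are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(step_logits, sequences, rewards, baseline):
    """Estimate the gradient of the expected reward w.r.t. the logits."""
    m = len(sequences)
    grad = [[0.0] * len(logits) for logits in step_logits]
    for sequence, reward_k in zip(sequences, rewards):
        advantage = reward_k - baseline          # (R_k - b)
        for t, action in enumerate(sequence):
            probs = softmax(step_logits[t])
            for a, p in enumerate(probs):
                # d log P(action) / d logit_a = 1[a == action] - p
                indicator = 1.0 if a == action else 0.0
                grad[t][a] += (indicator - p) * advantage / m
    return grad


def update_baseline(baseline, rewards, decay=0.95):
    """Exponential moving average of previously observed rewards."""
    for r in rewards:
        baseline = decay * baseline + (1.0 - decay) * r
    return baseline


# One time step with two choices; choice 0 earned the higher reward.
grad = reinforce_gradient(step_logits=[[0.0, 0.0]],
                          sequences=[[0], [1]],
                          rewards=[1.0, 0.0],
                          baseline=0.5)
```

A gradient ascent step along `grad` raises the logit of the choice that beat the baseline and lowers the other; subtracting the baseline reduces the variance of the estimate without changing its expectation.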

In some implementations, the system trains the controller neural network in a distributed manner. That is, the system maintains multiple replicas of the controller neural network and updates the parameter values of the replicas asynchronously during training. In other words, the system can perform steps 302-306 asynchronously for each replica and can update the controller parameters using the gradients determined for each of the replicas.

This specification uses the term "configured" in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.

The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program, which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

In this specification, the term "database" is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations. Thus, for example, an index database can include multiple collections of data, each of which may be organized and accessed differently.

Similarly, in this specification the term "engine" is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit), or by a combination of special purpose logic circuitry and one or more programmed computers.

Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback, and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user, for example, by sending web pages to a web browser on the user's client device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.

Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing the common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.

Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework, a Microsoft Cognitive Toolkit framework, an Apache Singa framework, or an Apache MXNet framework.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., a data server; a middleware component, e.g., an application server; a front-end component, e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described in this specification; or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network ("LAN") and a wide area network ("WAN"), e.g., the Internet.

The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the invention have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

Claims

1. A method performed by one or more computers, the method comprising:
determining, using a controller, a plurality of architectures for a child neural network that is configured to perform a particular neural network task;
by a training engine, for each architecture of the plurality of architectures:
training an instance of the child neural network having the architecture to perform the particular neural network task on a plurality of training inputs, each associated with a respective target training output; and
after the training, determining a performance metric for the trained instance of the child neural network on the particular neural network task by evaluating the performance of the trained instance on the particular neural network task using an appropriate performance measure; and
adjusting, by a controller update engine, the controller using the performance metric for the trained instance of the child neural network.
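
The three operations recited in the claim above (determine architectures, train and evaluate each one, adjust the controller) can be sketched as a single search round. This is only an illustrative sketch, not the patented implementation: the names `search_round`, `propose`, `train_and_evaluate`, and `update` are hypothetical, the "architecture" is reduced to one integer (a layer count), the scoring function is a synthetic placeholder for real training, and the preference-nudging update is a simplification chosen for brevity.

```python
import random

def search_round(propose, train_and_evaluate, update, batch_size):
    """One round: determine architectures, train/evaluate each, adjust the controller."""
    architectures = [propose() for _ in range(batch_size)]
    metrics = [train_and_evaluate(arch) for arch in architectures]
    update(architectures, metrics)
    return list(zip(architectures, metrics))

# --- Toy stand-ins (all hypothetical, for illustration only) ---
rng = random.Random(0)
NUM_LAYER_CHOICES = [1, 2, 3, 4]
weights = [1.0] * len(NUM_LAYER_CHOICES)  # controller state: unnormalized preferences

def propose():
    # The "controller" samples a layer count according to its current preferences.
    return rng.choices(NUM_LAYER_CHOICES, weights=weights, k=1)[0]

def train_and_evaluate(num_layers):
    # Placeholder for training a child network; pretends that 3 layers is best.
    return 1.0 - 0.25 * abs(num_layers - 3)

def update(architectures, metrics):
    # Raise the preference of choices that scored above the batch average.
    baseline = sum(metrics) / len(metrics)
    for arch, metric in zip(architectures, metrics):
        idx = NUM_LAYER_CHOICES.index(arch)
        weights[idx] = max(0.05, weights[idx] + 0.5 * (metric - baseline))

for _ in range(20):
    history = search_round(propose, train_and_evaluate, update, batch_size=8)
```

Passing the three stages in as callables keeps the loop independent of how the controller, the training engine, and the controller update engine are actually realized.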

2. The method of claim 1, wherein:
the controller is implemented using a controller neural network having a plurality of controller parameters;
determining the plurality of architectures for the child neural network comprises generating a batch of output sequences using the controller neural network in accordance with current values of the plurality of controller parameters, each output sequence in the batch defining a respective architecture of the child neural network; and
adjusting the controller using the performance metric for the trained instance of the child neural network comprises adjusting the current values of the plurality of controller parameters.

3. The method of claim 2, wherein adjusting the current values of the controller parameters of the controller neural network using the performance metric for the trained instance of the child neural network comprises:
training the controller neural network, using a reinforcement learning technique, to generate output sequences that result in child neural networks having improved performance metrics.

4. The method of claim 3, wherein the reinforcement learning technique is a policy gradient (PG) technique.

5. The method of claim 4, wherein the reinforcement learning technique is a REINFORCE technique.
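
As a rough illustration of a REINFORCE-style policy gradient update, the sketch below adjusts the logits of a single categorical decision using sampled actions and their rewards (the performance metrics). It is a minimal sketch under stated assumptions: the controller is reduced to one softmax over three options, the mean batch reward is used as a baseline (an assumption of this example, not something the claims recite), and `softmax` and `reinforce_step` are hypothetical helper names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, actions, rewards, lr=0.5):
    """Return updated logits after one REINFORCE step with a mean-reward baseline.

    For a softmax policy, d log p(a) / d logit_k = 1[k == a] - p_k.
    """
    probs = softmax(logits)
    baseline = sum(rewards) / len(rewards)
    grad = [0.0] * len(logits)
    for action, reward in zip(actions, rewards):
        for k in range(len(logits)):
            indicator = 1.0 if k == action else 0.0
            grad[k] += (reward - baseline) * (indicator - probs[k]) / len(actions)
    # Gradient ascent on expected reward.
    return [x + lr * g for x, g in zip(logits, grad)]

# Option 2 was sampled twice and earned the highest reward, so its logit rises.
new_logits = reinforce_step([0.0, 0.0, 0.0], actions=[0, 1, 2, 2],
                            rewards=[0.1, 0.2, 0.9, 0.9])
```

Subtracting a baseline does not change the expected gradient but usually reduces its variance, which is why it is included here.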

6. The method of claim 2, wherein determining each architecture of the child neural network comprises determining a value for each hyperparameter of the child neural network.

7. The method of claim 6, wherein the controller neural network is a recurrent neural network comprising:
one or more recurrent neural network layers configured to, for a given output sequence and at each time step, receive as input the value of the hyperparameter at the preceding time step in the given output sequence and process the input to update a current hidden state of the recurrent neural network; and
a respective output layer for each time step, wherein each output layer is configured to, for the given output sequence, receive an output layer input comprising the updated hidden state at the time step and generate an output for the time step that defines a score distribution over possible values of the hyperparameter at the time step.

8. The method of claim 7, wherein generating the batch of output sequences using the controller neural network having the plurality of controller parameters, in accordance with the current values of the controller parameters, comprises, for each output sequence in the batch and for each of a plurality of time steps:
providing the value of the hyperparameter at the preceding time step in the output sequence as input to the controller neural network to generate an output for the time step that defines a score distribution over possible values of the hyperparameter at the time step; and
determining the value of the hyperparameter at the time step in the output sequence by sampling from the possible values in accordance with the score distribution.
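
The autoregressive sampling described in the two claims above can be sketched with a very small recurrent controller. This is a hedged, illustrative sketch only: the class `TinyController`, its randomly initialized weights, the one-hot input encoding with a start token, and the plain tanh recurrence are all assumptions made for this example. It shows a recurrent layer that consumes the previously sampled value, a separate output layer per time step, and sampling from the resulting score distribution.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class TinyController:
    """Hypothetical toy RNN controller with one output layer per time step."""

    def __init__(self, choices_per_step, hidden_size=4, seed=0):
        self.rng = random.Random(seed)
        self.choices_per_step = choices_per_step
        self.hidden_size = hidden_size
        max_choices = max(len(c) for c in choices_per_step)
        # Input is a one-hot of the previous choice index; slot 0 is a start token.
        self.w_in = self._rand(hidden_size, max_choices + 1)
        self.w_hh = self._rand(hidden_size, hidden_size)
        self.w_out = [self._rand(len(c), hidden_size) for c in choices_per_step]

    def _rand(self, rows, cols):
        return [[self.rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    def sample_sequence(self):
        hidden = [0.0] * self.hidden_size
        prev_index = 0  # start token
        sequence = []
        for step, choices in enumerate(self.choices_per_step):
            one_hot = [0.0] * len(self.w_in[0])
            one_hot[prev_index] = 1.0
            # Recurrent layer: update the hidden state from the previous value.
            pre = [a + b for a, b in zip(matvec(self.w_in, one_hot),
                                         matvec(self.w_hh, hidden))]
            hidden = [math.tanh(x) for x in pre]
            # Output layer for this time step: score distribution over values.
            probs = softmax(matvec(self.w_out[step], hidden))
            idx = self.rng.choices(range(len(choices)), weights=probs, k=1)[0]
            sequence.append(choices[idx])
            prev_index = idx + 1
        return sequence

search_space = [[1, 3, 5], [16, 32, 64], [1, 2]]  # e.g. filter size, filter count, stride
controller = TinyController(search_space)
batch = [controller.sample_sequence() for _ in range(4)]
```

Each sampled list in `batch` is one output sequence, i.e. one candidate architecture expressed as a value per hyperparameter.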

9. The method of claim 7, wherein the child neural network is a convolutional neural network and the hyperparameters include hyperparameters for each convolutional neural network layer of the child neural network.

10. The method of claim 9, wherein the hyperparameters for each of the convolutional neural network layers include at least one of: a number of filters, a filter height for each filter, a filter width for each filter, a stride height for each filter, or a stride width for each filter.
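
To make the per-layer hyperparameters concrete, the sketch below decodes a flat output sequence into one record per convolutional layer. The fixed five-values-per-layer ordering, the `ConvLayerSpec` type, and the `decode_conv_layers` function are assumptions of this example; the claims do not prescribe any particular encoding.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass(frozen=True)
class ConvLayerSpec:
    num_filters: int
    filter_height: int
    filter_width: int
    stride_height: int
    stride_width: int

FIELDS_PER_LAYER = 5  # assumed ordering: the five fields of ConvLayerSpec

def decode_conv_layers(sequence: Sequence[int]) -> List[ConvLayerSpec]:
    """Split a flat sequence of hyperparameter values into per-layer specs."""
    if len(sequence) % FIELDS_PER_LAYER != 0:
        raise ValueError("sequence length must be a multiple of 5")
    return [
        ConvLayerSpec(*sequence[i:i + FIELDS_PER_LAYER])
        for i in range(0, len(sequence), FIELDS_PER_LAYER)
    ]

# Two layers: 24 filters of 3x3 with stride 1x1, then 36 filters of 5x7 with stride 2x1.
layers = decode_conv_layers([24, 3, 3, 1, 1, 36, 5, 7, 2, 1])
```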

11. The method of claim 7, wherein the child neural network includes a plurality of layer types, and the hyperparameters include, for each layer, a value corresponding to the type of the layer.

12. The method of claim 11, wherein, for each of one or more of the layers, the hyperparameters include a skip connection hyperparameter that defines which earlier layers have a skip connection to the layer.

13. The method of claim 12, wherein:
the plurality of time steps includes a respective anchor point time step for each of the one or more layers for which the hyperparameters include a skip connection hyperparameter;
for the anchor point time step for a current layer, the output layer includes a respective node corresponding to each layer that precedes the current layer in the child neural network; and
each node is configured to process the updated hidden state for the anchor point time step for the current layer and the updated hidden state for the anchor point time step for the corresponding previous layer, in accordance with current values of a set of parameters, to generate a score representing a likelihood that the corresponding previous layer is connected to the current layer of the child neural network.
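
One way to picture the anchor point scoring is as a small function that combines the hidden state at the current layer's anchor point with the hidden state at each earlier layer's anchor point and squashes the result into a connection probability. The specific form used below, sigmoid(v · tanh(W_prev h_prev + W_curr h_curr)), is an assumption of this sketch, as are the names `skip_connection_probs`, `w_prev`, `w_curr`, and `v`; the claim only requires that both hidden states be processed according to a set of parameters to produce a likelihood score.

```python
import math

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def skip_connection_probs(prev_anchor_states, current_anchor_state, w_prev, w_curr, v):
    """Return one connection probability per earlier layer (hypothetical form)."""
    curr_term = matvec(w_curr, current_anchor_state)
    probs = []
    for h_prev in prev_anchor_states:
        prev_term = matvec(w_prev, h_prev)
        hidden = [math.tanh(a + b) for a, b in zip(prev_term, curr_term)]
        score = sum(vi * hi for vi, hi in zip(v, hidden))
        probs.append(1.0 / (1.0 + math.exp(-score)))  # sigmoid
    return probs

identity = [[1.0, 0.0], [0.0, 1.0]]
probs = skip_connection_probs(
    prev_anchor_states=[[1.0, 0.0], [0.0, 0.0]],  # anchor states of two earlier layers
    current_anchor_state=[0.0, 1.0],
    w_prev=identity, w_curr=identity, v=[1.0, 1.0],
)
```

Sampling each probability independently then yields the set of earlier layers that feed the current layer through skip connections.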

14. The method of claim 6, wherein the child neural network is a recurrent neural network, and wherein determining each architecture of the child neural network comprises determining an architecture for a recurrent cell in the recurrent neural network.

15. The method of claim 14, wherein the method comprises:
determining a respective computation step for each node in a tree of computation steps that represents the computations performed by the recurrent cell.

16. The method of claim 15, wherein:
each node in the tree generates an output by merging two inputs; and
for each node, determining each architecture of the child neural network comprises determining a combination method for combining the two inputs and an activation function to be applied to the combination of the two inputs to generate the output.
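
A single tree node of the kind described above can be sketched as a function that merges two input vectors with a chosen combination method and then applies a chosen activation. The menus of options (`add`, `elem_mult`; `identity`, `tanh`, `sigmoid`, `relu`) and the helper `apply_node` are illustrative assumptions, and learned weights on the inputs are omitted to keep the example short.

```python
import math

# Hypothetical menus the controller could choose from for each node.
COMBINERS = {
    "add": lambda a, b: a + b,
    "elem_mult": lambda a, b: a * b,
}
ACTIVATIONS = {
    "identity": lambda x: x,
    "tanh": math.tanh,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "relu": lambda x: max(0.0, x),
}

def apply_node(left, right, combiner, activation):
    """Merge two equal-length vectors elementwise, then apply the activation."""
    combine = COMBINERS[combiner]
    act = ACTIVATIONS[activation]
    return [act(combine(a, b)) for a, b in zip(left, right)]

# A three-node tree: two leaf nodes feed one root node.
x_t, h_prev = [1.0, -2.0], [3.0, 1.0]
leaf_a = apply_node(x_t, h_prev, "add", "relu")
leaf_b = apply_node(x_t, h_prev, "elem_mult", "identity")
root = apply_node(leaf_a, leaf_b, "add", "tanh")
```

Choosing one entry from each menu per node is what turns a sampled output sequence into a concrete recurrent cell.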

17. The method of claim 15, wherein the method comprises:
determining values that define how a memory state of the recurrent cell is injected into the recurrent cell.

18. The method of claim 1, wherein the method comprises:
determining, in accordance with the adjusted controller, a final architecture of the child neural network.

19. The method of claim 18, wherein the method comprises:
performing the particular neural network task on received network inputs by processing the received network inputs using a child neural network having the final architecture.

20. A system comprising: one or more computers; and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the method of any one of claims 1 to 19.

21. One or more computer-readable storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the method of any one of claims 1 to 19.