
TWI900601B - Computer-implemented method, system, and non-transitory computer readable storage medium for neural architecture scaling for hardware accelerators - Google Patents

Computer-implemented method, system, and non-transitory computer readable storage medium for neural architecture scaling for hardware accelerators

Info

Publication number
TWI900601B
TWI900601B
Authority
TW
Taiwan
Prior art keywords
neural network
architecture
scaling
scaled
target
Prior art date
Application number
TW110124428A
Other languages
Chinese (zh)
Other versions
TW202230221A (en)
Inventor
予寧 李
盛 李
譚明星
若鳴 龐
立群 程
國 V 樂
諾曼 保羅 約皮
Original Assignee
Google LLC
Priority date
Filing date
Publication date
Priority claimed from US 17/175,029 (published as US 2022/0230048 A1)
Application filed by Google LLC
Publication of TW202230221A
Application granted
Publication of TWI900601B

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/09: Supervised learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)
  • Advance Control (AREA)

Abstract

Methods, systems, and apparatus, including computer-readable media, for scaling neural network architectures on hardware accelerators. A method includes receiving training data and information specifying target computing resources, and performing, using the training data, a neural architecture search over a search space to identify an architecture for a base neural network. A plurality of scaling parameter values for scaling the base neural network can be identified, which can include repeatedly selecting a plurality of candidate scaling parameter values and determining a measure of performance for the base neural network scaled according to the plurality of candidate scaling parameter values, in accordance with a plurality of second objectives including a latency objective. An architecture for a scaled neural network can be determined using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

Description

Computer-implemented method, system, and non-transitory computer-readable storage medium for neural architecture scaling for hardware accelerators

A neural network is a machine learning model that includes one or more layers of nonlinear operations to predict an output for a received input. In addition to an input layer and an output layer, some neural networks include one or more hidden layers. The output of each hidden layer can be fed as input to another hidden layer or to the output layer of the neural network. Each layer of the neural network can generate a respective output from a received input according to the values of one or more model parameters of that layer. The model parameters can be weights or biases determined through a training algorithm so that the neural network produces accurate outputs.

A system implemented according to aspects of the present invention can reduce the latency of a neural network architecture by searching candidate neural network architectures jointly according to each candidate's computational requirement (e.g., FLOPS), operational intensity, and execution efficiency. Computational requirement, operational intensity, and execution efficiency together are found to be the root cause of a neural network's latency, including inference latency, on target computing resources, rather than computational requirement alone, as described herein. Aspects of the invention provide techniques for performing neural architecture search and scaling, such as latency-aware compound scaling, and for augmenting the space from which candidate neural networks are searched based on this observed relationship between latency and computation, operational intensity, and execution efficiency.

In addition, the system can perform compound scaling to scale multiple parameters of a neural network uniformly and according to multiple objectives, which can result in improved performance of the scaled neural network over approaches that consider only a single objective or that search for a network's scaling parameters separately. Latency-aware compound scaling can be used to quickly build a series of neural network architectures, scaled according to different values from an initial scaled neural network architecture, that are suitable for different use cases.

According to an aspect of the present invention, a computer-implemented method for determining an architecture for a neural network includes: receiving, by one or more processors, training data corresponding to a neural network task and information specifying target computing resources; performing, by the one or more processors and using the training data, a neural architecture search over a search space according to a plurality of first objectives to identify an architecture for a base neural network; and identifying, by the one or more processors, a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resources and a plurality of scaling parameters of the base neural network. The identifying can include repeatedly performing the following: selecting a plurality of candidate scaling parameter values, and determining a measure of performance for the base neural network scaled according to the plurality of candidate scaling parameter values, where the measure of performance is determined according to a plurality of second objectives including a latency objective. The method can include generating, by the one or more processors, an architecture for a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

The foregoing and other implementations can each optionally include one or more of the following features, alone or in combination.

The plurality of first objectives used in performing the neural architecture search can be the same as the plurality of second objectives used in identifying the plurality of scaling parameter values.

The plurality of first objectives and the plurality of second objectives can include an accuracy objective corresponding to the accuracy of the output of the base neural network.

The measure of performance can correspond at least in part to a measure of the latency between the base neural network receiving an input and generating an output, when the base neural network is scaled according to the plurality of candidate scaling parameter values and deployed on the target computing resources.

The latency objective can correspond to a minimum latency between the base neural network receiving an input and generating an output when the base neural network is deployed on the target computing resources.

The search space can include candidate neural network layers, each configured to perform one or more respective operations. The search space can include candidate neural network layers that include different respective activation functions.

The architecture of the base neural network can include a plurality of components, each component having a respective plurality of neural network layers. The search space can include a plurality of candidate components of candidate neural network layers, including: a first component of candidate network layers that includes a first activation function, and a second component of candidate network layers that includes a second activation function different from the first activation function.

The information specifying the target computing resources can specify one or more hardware accelerators, and the method can further include executing the scaled neural network on the one or more hardware accelerators to perform the neural network task.

The target computing resources can include first target computing resources, the plurality of scaling parameter values can be a plurality of first scaling parameter values, and the method can further include: receiving, by the one or more processors, information specifying second target computing resources different from the first target computing resources; and identifying, according to the information specifying the second target computing resources, a plurality of second scaling parameter values for scaling the base neural network, where the plurality of second scaling parameter values is different from the plurality of first scaling parameter values.

The plurality of scaling parameter values can be a plurality of first scaling parameter values, and the method can further include generating a scaled neural network architecture from the base neural network architecture scaled using a plurality of second scaling parameter values, where the second scaling parameter values are generated from the plurality of first scaling parameter values and one or more compound coefficients that uniformly modify the value of each of the first scaling parameter values.

The base neural network can be a convolutional neural network, and the plurality of scaling parameters can include one or more of a depth of the base neural network, a width of the base neural network, and an input resolution of the base neural network.

According to another aspect, a method for determining an architecture for a neural network includes: receiving, by one or more processors, information specifying target computing resources; receiving, by the one or more processors, data specifying an architecture for a base neural network; and identifying, by the one or more processors, a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resources and a plurality of scaling parameters of the base neural network. The identifying can include repeatedly performing the following: selecting a plurality of candidate scaling parameter values, and determining a measure of performance for the base neural network scaled according to the plurality of candidate scaling parameter values, where the measure of performance is determined according to a plurality of objectives including a latency objective. The method includes generating, by the one or more processors, an architecture for a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

The plurality of objectives can be a plurality of second objectives, and receiving the data specifying the architecture of the base neural network can include: receiving, by the one or more processors, training data corresponding to a neural network task; and performing, by the one or more processors and using the training data, a neural architecture search over a search space according to a plurality of first objectives to identify the architecture of the base neural network.

Other implementations include computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.

101: base neural network architecture
103: series
104A to N: scaled neural network architectures
107A to N: candidate networks
108: coefficient search space
109: scaled neural network architecture
115: data center
116: hardware accelerators
200: process
210: block
220: block
230: block
240: block
250: block
300: process
310: block
320: block
330: block
340: block
400: neural architecture search latency-aware compound scaling (NAS-LACS) system
401: training data
402: target computing resource data
405: neural architecture search (NAS) engine
407: base neural network architecture/data
409: scaled neural network architecture
410: performance measurement engine
415: latency-aware compound scaling (LACS) engine
500: environment
512: client computing device
513: processor
514: memory
515: server computing device
516: processor
517: memory
518: instructions
519: data
521: instructions
523: data
524: user input
526: user output
530: storage device
550: data center
551A to N: hardware accelerators
560: network

FIG. 1 is a block diagram of a series of scaled neural network architectures for deployment in a data center housing the hardware accelerators on which the deployed neural networks will execute.

FIG. 2 is a flow chart of an example process for generating scaled neural network architectures for execution on target computing resources.

FIG. 3 is a flow chart of an example process for latency-aware compound scaling of a base neural network architecture.

FIG. 4 is a block diagram of a neural architecture search latency-aware compound scaling (NAS-LACS) system according to aspects of the present invention.

FIG. 5 is a block diagram of an example environment for implementing the NAS-LACS system.

Cross-Reference to Related Applications

This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Patent Application No. 63/137,926, filed January 15, 2021, the disclosure of which is incorporated herein by reference.

Overview

The techniques described in this specification generally relate to scaling neural networks for execution on different target computing resources, such as different hardware accelerators. A neural network can be scaled according to multiple performance objectives, which can include separate objectives for minimizing processing time (referred to herein as latency) and maximizing the accuracy of the neural network when it is scaled for execution on the target computing resources.

In general, a neural architecture search (NAS) system can be deployed to select a neural network architecture from a given search space of candidate architectures according to one or more objectives. One common objective is the accuracy of the neural network: typically, a system implementing a NAS technique will favor networks that yield higher accuracy after training over networks with lower accuracy. After a NAS selects a base neural network, the base neural network can be scaled according to one or more scaling parameters. Scaling can include searching for one or more scaling parameter values for scaling the base neural network, for example by searching for the scaling parameters in a numeric coefficient search space. Scaling can include increasing or decreasing the number of layers a neural network has, or the size of each layer, to efficiently use the computing and/or memory resources available for deploying the network.

A common belief related to neural architecture search and scaling is that the computational requirement of processing an input through a neural network (for example, measured in floating-point operations per second (FLOPS)) is proportional to the latency between sending an input to the network and receiving an output. In other words, a neural network with a low computational requirement (low FLOPS) is believed to produce outputs faster than a network with a higher computational requirement (high FLOPS), because fewer operations are performed overall. Accordingly, many NAS systems select neural networks with low computational requirements. However, because other characteristics of a neural network, such as its operational intensity, parallelism, and execution efficiency, can affect the network's total latency, it has been determined that the relationship between computational requirement and latency is not proportional.

The techniques described herein provide latency-aware compound scaling (LACS) and augmentation of the search space of candidate neural networks from which a network is selected. In the context of augmenting a search space, "hardware-accelerator-friendly" operations and architectures can be included in the search space, where these additions can yield higher operational intensity, execution efficiency, and parallelism suited to deployment on various types of hardware accelerators. These operations can include space-to-depth operations, space-to-batch operations, fused convolution structures, and component-wise searched activation functions.

Latency-aware compound scaling of a neural network can improve the scaling of the network relative to conventional approaches that do not optimize for latency. Instead, LACS can be used to identify scaling parameter values for a scaled neural network that is accurate and operates with low latency on the target computing resources.

The techniques further provide multi-objective scaling that shares the objectives used to search for a neural architecture with NAS or similar techniques. A scaled neural network can be identified according to the same objectives used in searching for the base neural network. Consequently, the scaled network can be optimized for performance at both stages, base architecture search and scaling, rather than treating each stage as a task with separate objectives.

LACS can be integrated with existing NAS systems, at least because the same objectives for both searching and scaling can be used to produce an end-to-end system for determining scaled neural network architectures. Moreover, a series of scaled neural network architectures can be identified faster than with search-without-scaling approaches, in which neural network architectures are searched but not scaled for deployment on the target computing resources.

The techniques described herein can provide improved neural networks compared with networks identified through conventional search and scaling approaches that do not use LACS. In addition, a series of neural networks with different trade-offs among objectives such as model accuracy and inference latency can be generated quickly for application to a variety of use cases. Further, the techniques can provide faster identification of a neural network for performing a particular task, while the identified network executes with accuracy improved over networks identified using other approaches. This is at least because a network identified through searching and scaling as described herein can account for characteristics that affect latency, such as operational intensity and execution efficiency, rather than only the network's computational requirement. In this way, the identified neural network can perform inference faster without sacrificing accuracy.

The techniques can further provide a generally applicable framework for quickly migrating existing neural networks to improved computing resource environments. For example, LACS and NAS as described herein can be applied when execution of an existing neural network selected for a data center with particular hardware is migrated to a data center using different hardware. In this regard, a series of neural networks can be quickly identified to perform the tasks of the existing network and deployed on the hardware of the new data center. This application can be especially useful in rapidly evolving domains that require state-of-the-art hardware for efficient execution, such as networks performing tasks in computer vision or other image processing tasks.

FIG. 1 is a block diagram of a series 103 of scaled neural network architectures 104A to N deployed in a data center 115 housing hardware accelerators 116 on which the deployed neural networks will execute. The hardware accelerators 116 can be any type of processor, such as a CPU, GPU, FPGA, or ASIC, such as a TPU. According to aspects of the present invention, the series 103 of scaled neural network architectures can be generated from a base neural network architecture 101.

The architecture of a neural network refers to the characteristics defining the network. For example, the architecture can include the characteristics of the network's different neural network layers, how the layers process input, how the layers interact with one another, and so on. For example, the architecture of a convolutional neural network (ConvNet) can define a discrete convolution layer that receives input image data, followed by a pooling layer, followed by a fully connected layer that generates an output according to a neural network task, for example classifying the contents of the input image data. The architecture of a neural network can also define the types of operations performed within each layer. For example, the architecture of a ConvNet can define that ReLU activation functions are used in the network's fully connected layers.

The base neural network architecture 101 can be identified according to a set of objectives, using NAS, from a search space of candidate neural network architectures. As described in more detail herein, the search space of candidate architectures can be augmented to include different network components, operations, and layers, from which a base network satisfying the objectives can be identified.

The set of objectives used to identify the base neural network architecture 101 can also be applied in identifying the scaling parameter values for each of the neural networks 104A to N in the series 103. The base neural network architecture 101 and the scaled neural network architectures 104A to N can be characterized by several parameters, which are scaled to different degrees in the scaled architectures 104A to N. In FIG. 1, the neural networks 101 and 104A are shown with three scaling parameters: D indicates the number of layers in the neural network; W indicates the width, or number of neurons, of a neural network layer; and R indicates the size of the input processed by the neural network at a given layer.
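As a minimal illustration of these three scaling parameters, the Python sketch below represents an architecture's depth, width, and input resolution and applies per-parameter scaling coefficients; the class and helper names are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass
import math

@dataclass
class ScalingParams:
    """Illustrative container for the scaling parameters shown in FIG. 1."""
    depth: int       # D: number of layers in the network
    width: int       # W: neurons (or channels) per layer
    resolution: int  # R: size of the input processed at a given layer

def scale(base: ScalingParams, d: float, w: float, r: float) -> ScalingParams:
    """Scale each parameter of a base architecture by its own coefficient."""
    return ScalingParams(
        depth=max(1, math.ceil(base.depth * d)),
        width=max(1, math.ceil(base.width * w)),
        resolution=max(1, math.ceil(base.resolution * r)),
    )

base = ScalingParams(depth=18, width=64, resolution=224)
print(scale(base, d=1.2, w=1.1, r=1.15))
```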

As described in more detail herein, a system configured to perform LACS can search a coefficient search space 108 to identify sets of scaling parameter values. Each scaling parameter value is a coefficient in the coefficient search space, which can be, for example, the set of positive real numbers. Each candidate network 107A to N is scaled from the base neural network 101 according to candidate coefficient values identified as part of a search of the coefficient search space 108. The system can apply any of a variety of search techniques to identify candidate coefficients, such as a Pareto-frontier search or a grid search. For each candidate network 107A to N, the system can evaluate a measure of the candidate network's performance in executing a neural network task. The performance measure can be based on multiple objectives, including a latency objective measuring the delay between the candidate network receiving an input and generating a corresponding output as part of performing the neural network task.
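A simple realization of this search, here a grid search over candidate coefficient triples, might look like the sketch below. The `measure_performance` callback is a hypothetical stand-in for training a scaled candidate and scoring it against the multiple objectives, and `scale` is the helper from the previous sketch.

```python
import itertools

def search_scaling_coefficients(base, candidate_values, measure_performance):
    """Score each candidate (d, w, r) coefficient triple and return the
    triple with the best multi-objective performance measure."""
    best_coeffs, best_score = None, float("-inf")
    for d, w, r in itertools.product(candidate_values, repeat=3):
        candidate = scale(base, d, w, r)          # scale() as sketched above
        score = measure_performance(candidate)    # e.g., accuracy-latency reward
        if score > best_score:
            best_coeffs, best_score = (d, w, r), score
    return best_coeffs, best_score
```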

After performing the search for scaling parameter values over the coefficient search space 108, the system can obtain a scaled neural network architecture 109. The scaled neural network architecture 109 is scaled from the base neural network 101 with the scaling parameter values that resulted in the highest performance measure among the candidate networks 107A to N identified during the search of the coefficient search space.

From the scaled neural network architecture 109, the system can generate the series 103 of scaled neural network architectures 104A to N. The series 103 can be generated by scaling the scaled architecture 109 according to different values. Each scaling parameter value of the scaled architecture 109 can be scaled uniformly to generate the different scaled architectures in the series 103; for example, each scaling parameter value can be increased by a factor of two. The scaled architecture 109 can be scaled by different values, or "compound coefficients," applied uniformly to each of its scaling parameter values. In some implementations, the scaled architecture 109 is scaled in other ways, for example by scaling each scaling parameter value separately to generate a scaled architecture in the series 103.
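One plausible form of such a compound coefficient, borrowed from the compound-scaling literature rather than quoted from the patent, raises the searched per-parameter coefficients to a single exponent phi, so that one number uniformly grows or shrinks the whole architecture. The sketch reuses `scale` and `base` from the earlier sketch.

```python
def compound_scale(base, alpha, beta, gamma, phi):
    """Scale depth, width, and resolution together with one compound exponent.

    alpha, beta, gamma are per-parameter coefficients found during the
    coefficient search; phi selects a member of the series 103.
    """
    return scale(base, d=alpha ** phi, w=beta ** phi, r=gamma ** phi)

# Larger phi yields larger (typically more accurate, slower) family members.
family = [compound_scale(base, 1.2, 1.1, 1.15, phi) for phi in range(4)]
```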

By scaling the scaled neural network architecture 109 according to different values, different neural network architectures can be quickly generated to perform a task under a variety of use cases. Different use cases can be specified as different trade-offs among the multiple objectives used to identify the scaled architecture 109. For example, one scaled architecture can be identified that meets a higher accuracy threshold at the cost of higher latency during execution. Another scaled architecture can be identified that meets a lower accuracy threshold but executes on the hardware accelerators 116 with lower latency. Yet another scaled architecture can be identified that balances the trade-off between accuracy and latency on the hardware accelerators 116.

As an example, to perform a computer vision task such as object recognition, a neural network architecture may need to generate output in real time or near-real time, as part of an application that continuously receives video or image data and is tasked with identifying objects of a particular class in the received data. For this example task, the accuracy requirement may be relaxed, so a scaled neural network architecture scaled with an appropriate trade-off of lower latency for lower accuracy can be deployed to perform the task.

As another example, a neural network architecture can be tasked with classifying each object in a scene received from image or video data. In this example, if latency in performing the task is not considered as important as performing it accurately, a scaled neural network with higher accuracy at the cost of latency can be deployed. In other examples, where no particular trade-off is identified or desired for the neural network task, a scaled architecture balancing the trade-offs among accuracy, latency, and other objectives can be deployed.

The scaled neural network architecture 104N is scaled using scaling parameter values different from those used to scale the architecture 109 into the scaled neural network 104A, and can represent a different use case, for example one in which accuracy is preferred over inference latency.

The LACS and NAS techniques described herein can generate the series 103 for the hardware accelerators 116, and can receive additional training data and information specifying multiple different computing resources, such as different types of hardware accelerators. In addition to generating the series 103 for the hardware accelerators 116, the system can also search for a base neural network architecture and generate a series of scaled architectures for the different hardware accelerators. For example, given a GPU and a TPU, the system can generate separate model families optimized for the accuracy-latency trade-off on the GPU and the TPU, respectively. In some implementations, the system can generate multiple scaled families from the same base neural network architecture.

Example Methods

FIG. 2 is a flow chart of an example process 200 for generating scaled neural network architectures for execution on target computing resources. The example process 200 can be performed on a system of one or more processors in one or more locations. For example, a neural architecture search latency-aware compound scaling (NAS-LACS) system, as described herein, can perform the process 200.

As shown in block 210, the system receives training data corresponding to a neural network task. A neural network task is a machine learning task that can be performed by a neural network. The scaled neural network can be configured to receive any type of data input to generate output for performing a neural network task. As examples, the output can be any kind of score, classification, or regression output based on the input. Correspondingly, the neural network task can be a scoring, classification, and/or regression task for predicting some output given some input. These tasks can correspond to a variety of applications processing images, video, text, speech, or other types of data.

The received training data can be in any form suitable for training a neural network according to any of a variety of learning techniques, including supervised, unsupervised, and semi-supervised learning techniques. For example, the training data can include multiple training examples that can be received as input by a neural network. The training examples can be labeled with a known output corresponding to the output intended to be produced by a network properly trained to perform a particular neural network task. For example, if the neural network task is a classification task, a training example can be an image labeled with one or more classes classifying the objects depicted in the image.

As shown in block 220, the system receives information specifying target computing resources. The target computing resource data can specify characteristics of computing resources on which a neural network can be at least partially deployed. The computing resources can be housed in one or more data centers or other physical locations hosting any of a variety of types of hardware devices. Example types of hardware include central processing units (CPUs), graphics processing units (GPUs), edge or mobile computing devices, field-programmable gate arrays (FPGAs), and various types of application-specific integrated circuits (ASICs).

Some devices can be configured for hardware acceleration, which can include devices configured to perform particular types of operations efficiently. These hardware accelerators, which can include, for example, GPUs and tensor processing units (TPUs), can implement special features for hardware acceleration. Example features for hardware acceleration include configurations for performing operations commonly associated with machine learning model execution, such as matrix multiplication. As examples, these special features can include the matrix multiply-and-accumulate units available in different types of GPUs, and the matrix multiply units available in TPUs.

The target computing resource data can include data for one or more target sets of computing resources. A target set of computing resources refers to a collection of computing devices on which a neural network is intended to be deployed. The information specifying a target set can indicate the types and quantities of hardware accelerators or other computing devices in the set. A target set can include devices of the same or different types. For example, a target set of computing resources can define the hardware characteristics and quantity of a particular type of hardware accelerator, including its processing capability, throughput, and memory capacity. As described herein, the system can generate a series of scaled neural network architectures for each device specified in a target set of computing resources.

In addition, the target computing resource data can specify different target sets of computing resources, for example reflecting different potential configurations of the computing resources housed in a data center. From the training data and the target computing resource data, the system can generate a series of neural network architectures, each generated from a base neural network identified by the system.

As shown in block 230, the system can perform, using the training data, a neural architecture search over a search space to identify an architecture for a base neural network. The system can use any of a variety of NAS techniques, such as techniques based on reinforcement learning, evolutionary search, or differentiable search. In some implementations, the system can directly receive data specifying the architecture of a base neural network, for example without receiving training data and performing NAS as described herein.

A search space refers to candidate neural networks, or portions of candidate neural networks, that can potentially be selected as part of a base neural network architecture. A portion of a candidate architecture can refer to a component of the neural network. The architecture of a neural network can be defined in terms of multiple components, each component including one or more neural network layers. The characteristics of the neural network layers can be defined in the architecture at the component level, meaning that the architecture can define particular operations performed in a component such that each neural network layer in the component implements the same operations defined for the component. A component can also be defined in the architecture by its number of layers.

As part of performing NAS, the system can repeatedly identify candidate neural networks, obtain performance measures corresponding to the multiple objectives, and evaluate the candidates according to their respective performance measures. As part of obtaining performance measures, such as measures of a candidate's accuracy and latency, the system can train the candidate network using the received training data. Once the candidate is trained, the system can evaluate the candidate architecture to determine its performance measures, and compare those measures against a current best candidate.

The system can repeat this search procedure, selecting a candidate neural network, training the network, and comparing its performance measures, until a stopping criterion is reached. The stopping criterion can be a minimum predetermined performance threshold met by a current candidate network. Additionally or alternatively, the stopping criterion can be a maximum number of search iterations, or a maximum amount of time allotted for performing the search. The stopping criterion can also be a condition under which the network's performance converges, for example when the performance of a subsequent iteration differs from the performance of the previous iteration by less than a threshold.
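A sketch of this outer search loop, with the stopping criteria listed above, could look like the following; `propose_candidate`, `train`, and `evaluate` are hypothetical stand-ins for the candidate selector, the training routine, and the multi-objective evaluation.

```python
import time

def architecture_search(propose_candidate, train, evaluate,
                        max_iters=500, max_seconds=3600.0,
                        min_score=None, converge_eps=1e-4):
    """Repeat propose -> train -> evaluate until a stopping criterion fires."""
    best, best_score, prev_score = None, float("-inf"), None
    start = time.monotonic()
    for _ in range(max_iters):                      # max iteration count
        if time.monotonic() - start > max_seconds:  # max allotted search time
            break
        candidate = propose_candidate(best)
        train(candidate)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        if min_score is not None and best_score >= min_score:
            break                                   # performance threshold met
        if prev_score is not None and abs(score - prev_score) < converge_eps:
            break                                   # performance converged
        prev_score = score
    return best, best_score
```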

In the context of optimizing different performance measures of a neural network, for example accuracy and latency, the stopping criterion can specify threshold ranges predetermined as "optimal." For example, a threshold range for optimal latency can be a range around a theoretical or measured minimum latency achievable by the target computing resources. The theoretical or measured minimum latency can be based on physical characteristics of the computing resources, such as the minimum amount of time needed for components of the resources to physically read and process incoming data. In some implementations, latency is held to an absolute minimum, for example as physically close to zero delay as possible, rather than being based on a target latency measured or computed from the target computing resources.

The system can be configured to use a machine learning model or other technique to select the next candidate neural network architecture, where the selection can be based at least in part on learned characteristics of different candidates that are more likely to perform well against the objectives of a particular neural network task.

In some examples, the system can use a multi-objective reward for identifying the base neural network architecture, as follows:

$$\text{reward}(m) = \text{ACCURACY}(m) \times \left(\frac{\text{LATENCY}(m)}{\text{Target}}\right)^{w} \tag{1}$$

ACCURACY(m) is the performance measure of the accuracy of a candidate neural network m, and LATENCY(m) is the latency of the network in producing an output on the target computing resources. As described in more detail herein, accuracy and latency can also be objectives used by the system as part of scaling the base neural network architecture according to the characteristics of the target computing resources. Target is a target latency value, for example measured in milliseconds, and can be predetermined. The value w is a tunable parameter for weighting the impact of network latency on the overall performance of the candidate network. Different values of w can be learned or tuned for different target sets of computing resources. As an example, w can be set to -0.09, reflecting an overall larger factor suited to computing resources such as TPUs and GPUs, which are less sensitive to latency variation than, for example, mobile platforms, where w can be set to a smaller value, such as -0.07.
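Read directly, the reward in (1) is a one-line function; the sketch below assumes accuracy in [0, 1] and latency measured in milliseconds.

```python
def multi_objective_reward(accuracy, latency_ms, target_ms, w=-0.09):
    """Reward = ACCURACY(m) * (LATENCY(m) / Target) ** w, per equation (1).

    With w < 0, exceeding the latency target shrinks the reward, while
    beating the target increases it. w = -0.09 is the datacenter-accelerator
    setting mentioned above; -0.07 is the mobile-platform setting.
    """
    return accuracy * (latency_ms / target_ms) ** w

# A candidate that is 20% over its 10 ms latency target is penalized slightly.
print(multi_objective_reward(accuracy=0.82, latency_ms=12.0, target_ms=10.0))
```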

To measure the accuracy of a candidate neural network, the system can train the candidate on a training set to perform the neural network task. The system can partition the training data into a training set and a validation set, for example according to an 80/20 split. For example, the system can apply a supervised learning technique to compute the error between the output generated by the candidate network and the ground-truth label of a training example processed by the network. The system can use any of a variety of loss or error functions appropriate for the type of task the network is being trained for, such as cross-entropy loss for classification tasks or mean squared error for regression tasks. For example, the gradients of the error with respect to the different weights of the candidate network can be computed using the backpropagation algorithm, and the weights of the network can be updated. The system can be configured to train the candidate network until a stopping criterion is met, such as a number of training iterations, a maximum time period, convergence, or a minimum accuracy threshold being reached.
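As a concrete illustration of this accuracy-measurement step, a minimal supervised training loop with an 80/20 split might look like the following PyTorch sketch; the framework choice and hyperparameters are assumptions, since the patent does not prescribe them.

```python
import torch
from torch.utils.data import DataLoader, random_split

def train_and_validate(model, dataset, epochs=5, lr=1e-3):
    """Train with cross-entropy (classification) and report validation accuracy."""
    n_train = int(0.8 * len(dataset))  # 80/20 train/validation split
    train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
    train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=64)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()  # backpropagation
            opt.step()
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in val_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total  # validation accuracy, the ACCURACY(m) measure
```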

Among other performance measures, the system can generate measures of a candidate architecture's accuracy and latency on the target computing resources, including (i) the operational intensity of the candidate base network when deployed on the target computing resources, and/or (ii) the execution efficiency of the candidate base network on the target computing resources. In some implementations, the performance measure of a candidate base network is based at least in part on its operational intensity and/or execution efficiency, in addition to accuracy and latency.

Latency, operational intensity, and execution efficiency can be defined as follows:

$$\text{LATENCY} = \frac{W}{C} = \frac{W}{E \times C_{ideal}}, \qquad C_{ideal} = \min(C_{max},\; b \times I), \qquad I = \frac{W}{Q}, \qquad R = \frac{C_{max}}{b} \tag{3}$$

In (3), LATENCY is defined as W/C, where W is the amount of computation (for example, in FLOPS) required to execute the measured neural network architecture, and C is the computation rate (for example, in FLOPS/sec) achieved by the target computing resources when processing inputs through the architecture. E is the execution efficiency of the candidate network as executed on the target computing resources, defined as E = C/C_ideal (so that, rearranging, C = E x C_ideal). C_ideal is the ideal computation rate achievable on the target computing resources for executing the measured architecture. C_ideal is defined in terms of the operational intensity I, the memory bandwidth b of the target computing resources, and C_max, the peak computation rate of the target computing resources (for example, the peak computation rate of a GPU).

C_max and b are constants corresponding to hardware characteristics of the target computing resources, namely their peak computation rate and memory bandwidth, respectively. The operational intensity I is a measure of the amount of data processed by the computing resources on which a neural network is deployed: the amount of computation W required to execute the network, divided by the memory traffic Q incurred by the computing resources during execution (I = W/Q, as shown in (3)).

The ideal computation rate for executing the network on the target computing resources is I x b when I < R, and C_max otherwise, where R = C_max/b is the ridge point: the minimum operational intensity the architecture needs in order to reach the peak computation rate on the target computing resources.
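Equation (3) translates directly into a small roofline-style latency estimator; the accelerator numbers in the usage example are placeholders, not measurements of any particular device.

```python
def estimate_latency(W, Q, E, c_max, b):
    """Roofline latency estimate per equation (3).

    W: computation required (FLOPS); Q: memory traffic (bytes);
    E: execution efficiency in (0, 1]; c_max: peak compute rate (FLOPS/sec);
    b: memory bandwidth (bytes/sec).
    """
    I = W / Q                         # operational intensity
    R = c_max / b                     # ridge point
    c_ideal = b * I if I < R else c_max
    return W / (E * c_ideal)          # latency, in seconds

# Hypothetical accelerator: 100 TFLOPS/sec peak, 1 TB/sec memory bandwidth.
print(estimate_latency(W=4e9, Q=2e7, E=0.6, c_max=1e14, b=1e12))
```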

Taken together, (3) and the accompanying description show that the inference latency of a measured neural network depends on its operational intensity I, its computation W, and its execution efficiency E, rather than on the computation W alone. In the context of hardware accelerators, in particular hardware accelerators such as the TPUs and GPUs commonly deployed in data centers, this relationship can be applied to augment a candidate neural network search space with "accelerator-friendly operations" during NAS, and to improve how a base neural network is selected and subsequently scaled by the system.

Rather than searching solely for networks with reduced computation, the system can search for a base neural network architecture that improves the latency of the final network by jointly searching for candidate architectures with improved operational intensity, execution efficiency, and computational requirement. The system can be configured to operate in this manner to reduce the total latency of the final base neural network architecture.

In addition, the search space of candidate architectures from which the system selects the base neural network architecture can be augmented to broaden the range of available candidates that are more likely to execute accurately and with reduced inference latency on the target computing resources, particularly where the target computing resources are data center hardware accelerators.

Augmenting the search space as described can increase the number of candidate architectures better suited for deployment on data center accelerators, which can lead to identifying a base neural network architecture that might not have been a candidate in a search space not augmented according to aspects of the present invention. In instances where the target computing resources specify hardware accelerators such as GPUs and TPUs, the search space can be augmented with candidate architectures, or portions of architectures, such as components or operations that promote operational intensity, parallelism, and/or execution efficiency.

In one example augmentation, the search space can be augmented to include neural network architecture components with layers implementing one of a variety of types of activation functions. In the case of TPUs and GPUs, activation functions such as ReLU or swish have been found to generally have low operational intensity and, instead, to be typically memory-bound on these types of hardware accelerators. Because the execution of the activation functions in a neural network is often limited by the total amount of memory available on the target computing resources, executing these functions can have a large negative performance impact on end-to-end network inference speed.

One exemplary augmentation of the search space with respect to activation functions is to introduce into the search space activation functions fused with their associated discrete convolutions. Because activation functions are typically element-wise operations that run on the hardware accelerator units configured for vector operations, their execution can proceed in parallel with the discrete convolutions, which are matrix-based operations typically performed on the matrix units of a hardware accelerator. These fused activation-convolution operations can be selected by the system as candidate neural network components as part of the search for a base neural network architecture as described herein. Any of a variety of different activation functions can be used, including swish, ReLU (rectified linear unit), sigmoid, tanh, and softmax.
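The following Python sketch illustrates how such fused activation-convolution candidates might be enumerated for the search space. It is a minimal illustration, assuming a 1x1 convolution expressed as a channel matrix multiply; the names and the candidate set are illustrative, not taken from the disclosure:

    import numpy as np

    # Candidate activations that could be fused with a convolution when
    # augmenting the search space (illustrative set).
    ACTIVATIONS = {
        "relu":    lambda x: np.maximum(x, 0.0),
        "swish":   lambda x: x / (1.0 + np.exp(-x)),   # x * sigmoid(x)
        "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
        "tanh":    np.tanh,
    }

    def fused_conv1x1(x, weights, activation="swish"):
        # x: input tensor of shape (H, W, C_in)
        # weights: 1x1 convolution kernel of shape (C_in, C_out)
        y = x @ weights                    # matrix-unit work (1x1 conv)
        return ACTIVATIONS[activation](y)  # vector-unit work, fused

    # One search-space component per activation type:
    candidate_ops = [
        (f"conv1x1_{name}", lambda x, w, a=name: fused_conv1x1(x, w, a))
        for name in ACTIVATIONS
    ]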

Different components including fused activation-convolution layers can be added to the search space and can vary according to the type of activation function employed. For example, one activation-convolution component can include a ReLU activation function, while another component can include a swish activation function. It has been found that different hardware accelerators can execute different activation functions more efficiently, so augmenting the search space with fused activation-convolutions over multiple types of activation functions can further improve the identification of a base neural network architecture best suited to performing the neural network task at hand.

In addition to the components with various different activation functions described herein, the search space can also be augmented with other fused convolution structures, to further enrich the search space with convolutions of different shapes, types, and sizes. The different convolution structures can be components added as part of a candidate neural network architecture, and can include expansion layers of 1x1 convolutions, depthwise convolutions, projection layers of 1x1 convolutions, and other operations such as activation functions, batch normalization functions, and/or skip connections.

As described herein, identifying the root causes of the non-proportional relationship between operation requirements and latency also confirms the impact of parallelism on hardware accelerators. Parallelism can be critical for neural networks implemented on hardware accelerators such as GPUs and TPUs, because these hardware accelerators can require large degrees of parallelism to achieve high performance. For example, a convolution operation of a neural network layer needs depth, batch, and spatial dimensions of sufficient size to provide enough parallelism to achieve high execution efficiency E on the matrix units of the hardware accelerator. As described with reference to (3), execution efficiency forms part of the full picture of the factors affecting network latency at inference.
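A toy model of this effect is sketched below, assuming a square matrix unit that pads each operand dimension up to the next tile multiple and performs no useful work on the padding; the 128x128 tile size and the padding rule are illustrative assumptions rather than a description of any particular accelerator:

    import math

    def matrix_unit_utilization(m, n, tile=128):
        # Fraction of a tile-by-tile matrix unit kept busy by an m x n
        # operand, under a simple pad-to-tile execution model.
        padded_m = math.ceil(m / tile) * tile
        padded_n = math.ceil(n / tile) * tile
        return (m * n) / (padded_m * padded_n)

    # A convolution whose flattened operand is 130 x 8 wastes most of a
    # 128 x 128 matrix unit; a 256 x 256 operand uses it fully.
    print(matrix_unit_utilization(130, 8))    # ~0.03
    print(matrix_unit_utilization(256, 256))  # 1.0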

Accordingly, the NAS search space can be augmented in other ways to include operations that can exploit the parallelism available on a hardware accelerator. The search space can include one or more operations for fusing depthwise convolutions with adjacent 1x1 convolutions, as well as operations for reshaping the input to a neural network. For example, the input to a candidate neural network can be a tensor. A tensor is a data structure that can represent values at different ranks. For example, a first-order tensor can be a vector, a second-order tensor can be a matrix, a third-order tensor can be a three-dimensional matrix, and so on. Fusing depthwise convolutions can be beneficial because depthwise operations generally have a lower operational intensity, and fusing them with adjacent convolutions can increase the operational intensity closer to the maximum capability of a hardware accelerator.

The search space can also include operations that reshape an input tensor by changing different dimensions of the tensor. For example, if a tensor has depth, width, and height dimensions, one or more operations in the search space that can form part of a candidate neural network architecture can be configured to change one or more of the depth, width, and resolution dimensions of the tensor. In one example, the search space can be augmented with one or more space-to-depth convolutions (such as 2x2 convolutions) that reshape an input tensor by increasing its depth while decreasing its other dimensions. In some implementations, one or more operations using stride-n nxn convolutions are included, where n represents a positive integer the system can use to reshape a tensor input. For example, if a tensor input to a candidate neural network has dimensions H x W x C, the one or more added operations can reshape the tensor to dimensions of H/n x W/n x (n^2 * C).
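A minimal sketch of such a space-to-depth reshape, assuming NumPy and the H x W x C layout above (shown here with n = 2), is:

    import numpy as np

    def space_to_depth(x, n=2):
        # Reshape an H x W x C tensor to (H/n) x (W/n) x (n*n*C) by
        # moving each n x n spatial block into the depth dimension,
        # the same reshaping a stride-n nxn convolution can realize.
        H, W, C = x.shape
        assert H % n == 0 and W % n == 0, "spatial dims must divide by n"
        x = x.reshape(H // n, n, W // n, n, C)
        x = x.transpose(0, 2, 1, 3, 4)      # group each n x n block
        return x.reshape(H // n, W // n, n * n * C)

    x = np.random.rand(224, 224, 3)
    print(space_to_depth(x, 2).shape)       # (112, 112, 12)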

Implementing space-to-depth convolutions as described above can have at least two advantages. First, convolutions are associated with relatively high operational intensity and execution efficiency, so more convolution options can enrich the overall search space when searching for a candidate neural network. High operational intensity and execution efficiency are also favorable for implementation on hardware accelerators such as TPUs and GPUs. Second, the stride-n nxn convolutions can be trained as part of the candidate neural network, contributing to the capacity of the neural network.

The search space can also include operations that reshape an input tensor by moving its elements to different locations in memory on the target computing resource. Additionally or alternatively, the operations can copy the elements to different locations in memory.

In some implementations, the system is configured to receive a base neural network for scaling directly, without performing NAS or any other search to identify the base neural network. In some implementations, multiple devices individually perform at least part of process 200, for example by identifying the base neural network on one device and scaling the base neural network on another device as described herein.

The system can identify a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resources and a plurality of scaling parameters, as shown in block 240 of FIG. 2. The system can use latency-aware compound scaling (LACS), as presently described, using the accuracy and latency of a candidate scaled neural network as the objectives of the search for scaling parameter values for the base neural network.

Generally speaking, scaling techniques are applied in combination with NAS to identify a neural network scaled for deployment on the target computing resources. Model scaling can be used together with NAS to more efficiently search for a family of neural networks supporting a variety of different use cases. Under a scaling approach, various different techniques can be used to search for values of the different scaling parameters, such as the depth, width, and resolution of a neural network. Scaling can be performed by searching for the value of each scaling parameter separately, or by searching for a coherent set of values for adjusting multiple scaling parameters together. The former is sometimes referred to as simple scaling, and the latter is sometimes referred to as compound scaling.

Scaling techniques that use accuracy as the sole objective can result in neural networks that are not scaled to properly account for the performance and speed impact of the scaled networks when deployed on specialized hardware, such as data center accelerators. As described in more detail with reference to FIG. 3, LACS can use both accuracy and latency objectives, which can be shared as the same objectives used to identify the base neural network architecture.

FIG. 3 illustrates an exemplary process 300 for latency-aware compound scaling of a base neural network architecture. The exemplary process 300 can be performed on a system or device of one or more processors in one or more locations. For example, process 300 can be performed on a NAS-LACS system as described herein.

As shown in block 310, the system selects a plurality of candidate scaling parameter values. The scaling parameter values are values for the different scaling parameters of a neural network. As described herein, depth, width, and input resolution can be scaling parameters, at least because those parameters of a neural network can be made larger or smaller. As part of selecting the scaling parameter values, the system can select a scaling coefficient tuple from a coefficient search space. The coefficient search space includes candidate scaling coefficient tuples from which the scaling parameter values of a neural network can be determined. The number of scaling coefficients in each tuple can depend on the number of scaling parameters. For example, if the scaling parameters are depth, width, and resolution, the coefficient search space will include candidate tuples of the form (α, β, γ), each having three scaling coefficients. The coefficients in each tuple can be numeric values, for example integers or real values.

In a compound scaling approach, the scaling coefficients for the various scaling parameters are searched together. The system can apply any of a variety of different search techniques for identifying a coefficient tuple, such as a Pareto frontier search or a grid search. As the system searches for a coefficient tuple, it can search the tuple according to the same objectives used to identify the base neural network architecture described herein with reference to FIG. 2. In addition, the multiple objectives can include both accuracy and latency, which can be expressed as:

ACCURACY(m_scaled) x [LATENCY(m_scaled) / Target]^w    (2)

where Target is the target inference latency on the target computing resources and w is a factor weighting the latency objective. Although the objective for the NAS performed to identify the base neural architecture (shown herein at (1)) is the same as the objective shown at (2), a key difference in computing the performance metric under (2) is that the candidate scaled neural network m_scaled, rather than the base neural network m, is evaluated. The overall goal of the system can be to identify scaling parameter values at a Pareto equilibrium of each of the multiple objectives, for example accuracy and latency.
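A possible reading of objective (2) as code, assuming the multiplicative accuracy-latency form above, is sketched below; the exponent w = -0.07 is a commonly used setting in latency-aware search, not a value taken from this disclosure:

    def scaled_reward(accuracy, latency, target_latency, w=-0.07):
        # Latency-aware objective for a candidate scaled network in the
        # ACCURACY x (LATENCY / Target)^w form of (2). With w < 0,
        # candidates slower than the target latency are penalized.
        return accuracy * (latency / target_latency) ** w

    # A slightly less accurate but much faster candidate can win:
    print(scaled_reward(0.80, latency=9.0, target_latency=10.0))
    print(scaled_reward(0.81, latency=20.0, target_latency=10.0))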

In other words, the system determines a performance metric of the base neural network as scaled according to the plurality of candidate scaling parameter values, as shown in block 320. The system determines the performance metric from different performance measures reflecting the performance of the candidate scaled neural network scaled from the candidate scaling parameter values.

The base neural network architecture is scaled in its scaling parameters according to a scaling coefficient tuple searched from the coefficient search space. ACCURACY(m_scaled) can be a measure of the accuracy of the candidate scaled neural network on a validation set of examples obtained from the received training data. LATENCY(m_scaled) can be a measure of the time between receiving an input and generating a corresponding output on the candidate scaled neural network when deployed on the target computing resources. Generally, the system aims to maximize the accuracy of a candidate neural network architecture while minimizing its latency.

Consistent with the description of (3), the system can directly or indirectly obtain performance measures of the operational intensity, operation requirements, and execution efficiency of the candidate scaled neural network. This is because LATENCY is a function of these three other potential performance measures. Therefore, in addition to the accuracy and latency of the scaled neural network architecture when deployed on the target computing resources, the system can also be configured to search for candidate scaling parameter values that optimize one or more of operational intensity, operation requirements, and execution efficiency.

As part of determining the performance metric, the system can further train or tune the candidate scaled neural network architecture using the received training data. According to block 330, the system can determine the performance metric of the trained and scaled neural network, and determine whether the performance metric satisfies a performance threshold. The performance metric and the performance threshold can each be a composite of multiple performance measures and performance thresholds, respectively. For example, the system can determine a single performance metric from measures of both the accuracy and the inference latency of the scaled neural network, or the system can determine separate performance measures for the different objectives and compare each measure with a corresponding performance threshold.

If the performance metric satisfies the performance threshold, process 300 ends. Otherwise the process continues, and according to block 310 the system selects a new plurality of candidate scaling parameter values. For example, the system can select a new scaling coefficient tuple from the coefficient search space based at least in part on previously selected candidate tuples and their corresponding performance metrics. In some implementations, the system searches for multiple coefficient tuples and performs a finer-grained search around each of the candidate tuples according to the multiple objectives.

The system can implement any of a variety of different techniques for iteratively searching the coefficient candidate space, for example using grid search, reinforcement learning, evolutionary search, and the like. The system can continue searching for scaling parameter values until a stopping criterion is reached, such as convergence or a number of iterations, as previously described.
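One way such an iterative search could look, sketched here as a plain grid search over (α, β, γ) tuples with an iteration cap as the stopping criterion, is shown below. The evaluate callable and the α·β²·γ² budget filter are illustrative assumptions (the filter is the usual compound-scaling FLOPs constraint), not the exact rules of the disclosure:

    import itertools

    def lacs_grid_search(evaluate, alphas, betas, gammas, max_iters=1000):
        # evaluate(a, b, g) is assumed to scale the base network by the
        # tuple, measure it on the target accelerator, and return the
        # latency-aware reward, e.g. scaled_reward(...) above.
        best, best_reward = None, float("-inf")
        candidates = itertools.product(alphas, betas, gammas)
        for i, (a, b, g) in enumerate(candidates):
            if i >= max_iters:
                break                    # stopping criterion: iteration cap
            if not 1.9 <= a * b**2 * g**2 <= 2.1:
                continue                 # keep tuples near the budget
            reward = evaluate(a, b, g)
            if reward > best_reward:
                best, best_reward = (a, b, g), reward
        return best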

In some implementations, the system can be tuned according to one or more controller parameter values, which can be tuned manually, learned through a machine learning technique, or a combination of both. The controller parameters can affect the relative influence of each objective on the overall performance metric of a candidate tuple. In some examples, particular values, or relationships between values, in a candidate tuple can be favored or disfavored based on learned characteristics of desirable scaling coefficients reflected at least in part in the controller parameter values.

According to block 340, the system generates one or more groups of scaling parameter values from the selected candidate scaling parameter values according to one or more objective trade-offs. The objective trade-offs represent different thresholds for the respective objectives, for example accuracy and latency, and can be satisfied by different scaled neural networks. For example, one objective trade-off can have a higher network accuracy threshold but a lower inference latency threshold (that is, a more accurate network with higher latency). As another example, an objective trade-off can have a lower network accuracy threshold but a higher inference latency threshold (that is, a less accurate network with lower latency). As yet another example, an objective trade-off can balance accuracy and latency performance.

For each objective trade-off, the system can identify a respective group of scaling parameter values that the system can use to scale the base neural network architecture to satisfy the objective trade-off. In other words, the system can repeat the selecting as shown in block 310, the determining of the performance metric as shown in block 320, and the determining of whether the performance metric satisfies the performance threshold as shown in block 330, where the differences between the performance thresholds are defined by the objective trade-offs. In some implementations, according to block 330, rather than searching for tuples for the base neural network architecture itself, the system can search for candidate scaling coefficient tuples for the base neural network architecture as scaled according to the selected candidate scaling parameter values whose performance metric initially satisfied the multiple objectives.

In some implementations, after selecting the plurality of scaling parameter values that satisfy the performance metric, the system can be configured to generate a family of neural networks by scaling the initially scaled neural network again. In other words, from a scaling coefficient tuple (α, β, γ), the system can scale the tuple, for example by a common factor or "compound coefficient", to obtain tuples (α′, β′, γ′) and (α″, β″, γ″) for scaling the base neural network by other factors. In this way, a family of models can be generated quickly, and the family can include different neural networks suited to a variety of use cases.

For example, different neural networks in a family can be scaled according to one or more compound coefficients. Given a compound coefficient φ and a scaling coefficient tuple (α, β, γ), the scaling parameters of a scaled neural network in the family can be defined by:

d = α^φ, w = β^φ, r = γ^φ

where d, w, and r are the scaling parameter values for the depth, width, and resolution of the neural network, respectively. In the case of the first scaled neural network architecture, φ can be 1. Generally, the compound coefficient φ can represent the latency budget available for network scaling, while α, β, and γ control how that latency budget is allocated to the different scaling parameter values, respectively.
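A short sketch of deriving a model family from one searched tuple under these definitions follows; the example tuple values are illustrative only:

    def family_from_tuple(alpha, beta, gamma, phis=(1, 2, 3)):
        # Derive the scaling parameter values of each family member from
        # one searched tuple and a list of compound coefficients phi,
        # following d = alpha**phi, w = beta**phi, r = gamma**phi.
        family = []
        for phi in phis:
            family.append({
                "depth":      alpha ** phi,   # multiplier on layer count
                "width":      beta ** phi,    # multiplier on channels
                "resolution": gamma ** phi,   # multiplier on input size
            })
        return family

    # An EfficientNet-style tuple, shown purely for illustration:
    for member in family_from_tuple(1.2, 1.1, 1.15):
        print(member)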

Returning to FIG. 2, the NAS-LACS system can generate one or more architectures of scaled neural networks using the architecture of the base neural network scaled according to the plurality of scaling parameter values, as shown in block 250. The scaled neural network architectures can be part of a family of neural networks generated from the base neural network architecture and different scaling parameter values.

If the information specifying the target computing resources includes one or more sets of multiple target computing resources, for example multiple different types of hardware accelerators, the system can repeat process 200 and process 300 for each hardware accelerator to generate a respective scaled neural network architecture corresponding to each target set. For each hardware accelerator, the system can generate a family of scaled neural network architectures according to different objective trade-offs between latency and accuracy, or between latency, accuracy, and other objectives (in particular, including operational intensity and execution efficiency, described herein with reference to (3)).

In some implementations, the system can generate multiple families of scaled neural network architectures from the same base neural network architecture. This approach can be helpful in situations in which different target devices share similar hardware characteristics, and can lead to faster identification of a corresponding scaled family for each device, at least because the search for a base neural network architecture is performed only once.

Exemplary Systems:

FIG. 4 is a block diagram of a neural architecture search with latency-aware compound scaling (NAS-LACS) system 400 according to aspects of the disclosure. The system 400 is configured to receive training data 401 for executing a neural network and target computing resource data 402 specifying target computing resources. The system 400 can be configured to implement the techniques for generating a family of scaled neural network architectures as described herein with reference to FIGS. 1-3.

The system 400 can be configured to receive input data through a user interface. For example, the system 400 can receive the data as part of a call to an application programming interface (API) exposing the system 400. The system 400 can be implemented on one or more computing devices, as described herein with reference to FIG. 5. For example, input to the system 400 can be provided through a storage medium (including remote storage connected to the one or more computing devices over a network) or as input through a user interface on a client computing device coupled to the system 400.

The system 400 can be configured to output the scaled neural network architectures 409, such as a family of scaled neural network architectures. The scaled neural network architectures 409 can be sent as output, for example for display on a user display, optionally visualized according to the shapes and sizes of the various neural network layers as defined in the architectures. In some implementations, the system 400 can be configured to provide the scaled neural network architectures 409 as a set of computer-readable instructions, such as one or more computer programs, that can be executed by the target computing resources to implement the scaled neural network architectures 409.

A computer program can be written in any type of programming language and according to any programming paradigm, for example declarative, procedural, assembly, object-oriented, data-oriented, functional, or imperative. A computer program can be written to perform one or more different functions and to operate within a computing environment, for example on a physical device, on a virtual machine, or across multiple devices. A computer program can also implement the functionality described in this specification, for example as performed by a system, engine, module, or model.

In some implementations, the system 400 is configured to forward the data for the scaled neural network architectures 409 to one or more other devices configured to convert the architectures into an executable program written in a computer programming language, optionally as part of a framework for generating machine learning models. The system 400 can also be configured to send data corresponding to the scaled neural network architectures 409 to a storage device for storage and later retrieval.

The system 400 can include a NAS engine 405. The NAS engine 405 and the other components of the system 400 can be implemented as one or more computer programs, specially configured electronic circuits, or any combination of the foregoing. The NAS engine 405 can be configured to receive the training data 401 and the target computing resource data 402, and to generate a base neural network architecture 407 that can be sent to a LACS engine 415. The NAS engine 405 can implement any of the various techniques for neural architecture search described herein with reference to FIGS. 1-3. The system can be configured according to aspects of the disclosure to perform NAS with multiple objectives, including the inference latency and accuracy of a candidate neural network when executed on the target computing resources. As part of determining the performance measures the NAS engine 405 can use to search for the base neural network architecture, the system 400 can include a performance measurement engine 410.

The performance measurement engine 410 can be configured to receive an architecture for a candidate base neural network and to generate performance measures according to the objectives for the NAS performed by the NAS engine 405. The performance measures can provide an overall performance metric for the candidate neural network according to the multiple objectives. To determine the accuracy of the candidate base neural network, the performance measurement engine 410 can execute the candidate base neural network on a validation set of training examples, obtained for example by holding out some of the training data 401.

To measure latency, the performance measurement engine 410 can communicate with computing resources corresponding to the target computing resources specified by the data 402. For example, if the target computing resource data 402 specifies a TPU as a target resource, the performance measurement engine 410 can send the candidate base neural network for execution on a corresponding TPU. The TPU can be housed, for example, in a data center that communicates with one or more processors implementing the system 400 over a network, as described in more detail with reference to FIG. 5.

The performance measurement engine 410 can receive latency information indicating the delay between the target computing resource receiving an input and generating an output. The latency information can be measured directly in the field at the target computing resource and sent to the performance measurement engine 410, or it can be measured by the performance measurement engine 410 itself. If the performance measurement engine 410 measures the latency, the engine 410 can be configured to compensate for delays not attributable to processing by the candidate base neural network, for example the network latency of communicating to and from the target computing resource. As another example, the performance measurement engine 410 can estimate the latency of processing an input through the candidate base neural network based on previous measurements of the target computing resource and the hardware characteristics of the target computing resource.
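A minimal wall-clock measurement sketch, assuming a run_inference callable that executes the network on the target accelerator and blocks until the output is ready, is shown below; warmup runs are discarded so one-time costs (such as compilation or cache fills) do not inflate the measurement, and network transfer time would still need to be subtracted separately when measuring remotely:

    import time

    def measure_latency_ms(run_inference, sample, warmup=10, trials=100):
        # Discard warmup runs, then average the wall-clock time of the
        # remaining trials and report milliseconds per inference.
        for _ in range(warmup):
            run_inference(sample)
        start = time.perf_counter()
        for _ in range(trials):
            run_inference(sample)
        return (time.perf_counter() - start) * 1000.0 / trials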

The performance measurement engine 410 can generate performance measures for other characteristics of the candidate neural network architecture, such as its operational intensity and its execution efficiency. As described herein with reference to FIGS. 1-3, inference latency can be determined as a joint function of operation requirements (FLOPs), execution efficiency, and operational intensity, and in some implementations the system 400 directly or indirectly searches for and scales a neural network based on these additional characteristics.

Once the performance measures are generated, the performance measurement engine 410 can send the measures to the NAS engine 405, which can in turn iterate with a new search for a new candidate base neural network architecture until the stopping criteria described herein with reference to FIG. 2 are reached.

In some examples, the NAS engine 405 is tuned according to one or more controller parameters for adjusting how the NAS engine 405 selects the next candidate base neural network architecture. The controller parameters can be tuned manually according to the desired characteristics of a neural network for a particular neural network task. In some examples, the controller parameters can be learned through any of a variety of machine learning techniques, and the NAS engine 405 can implement one or more machine learning models trained to select a base neural network architecture according to multiple objectives such as latency and accuracy. For example, the NAS engine 405 can implement a recurrent neural network trained to use the characteristics of previous candidate base neural networks and the multiple objectives to predict a candidate base network more likely to satisfy the objectives. The recurrent neural network can be trained using performance measures and training data labeled to indicate the final base neural architecture selected given a set of training data and target computing resource data associated with a neural network task.

The LACS engine 415 can be configured to perform latency-aware compound scaling as described according to aspects of the disclosure. The LACS engine 415 is configured to receive data 407 specifying a base neural network architecture from the NAS engine 405. Similar to the NAS engine 405, the LACS engine 415 can communicate with the performance measurement engine 410 to obtain performance measures for a candidate scaled neural network architecture. The LACS engine 415 can maintain a search space of different scaling coefficient tuples in memory, and can also be configured to further scale a final scaled architecture to quickly obtain a family of scaled neural network architectures, as described herein with reference to FIGS. 1-3. In some implementations, the LACS engine 415 is configured to perform other forms of scaling (for example, simple scaling) while still using the multiple objectives (including latency) used by the NAS engine 405.

FIG. 5 is a block diagram of an exemplary environment 500 for implementing the NAS-LACS system 400. The system 400 can be implemented on one or more devices having one or more processors in one or more locations, such as in a server computing device 515. A client computing device 512 and the server computing device 515 can be communicatively coupled to one or more storage devices 530 over a network 560. The storage device(s) 530 can be a combination of volatile and non-volatile memory and can be at the same or different physical locations as the computing devices 512, 515. For example, the storage device(s) 530 can include any type of non-transitory computer-readable medium capable of storing information, such as a hard drive, solid state drive, tape drive, optical storage, memory card, ROM, RAM, DVD, CD-ROM, or write-capable and read-only memories.

The server computing device 515 can include one or more processors 513 and memory 514. The memory 514 can store information accessible by the processor(s) 513, including instructions 521 that can be executed by the processor(s) 513. The memory 514 can also include data 523 that can be retrieved, manipulated, or stored by the processor(s) 513. The memory 514 can be a type of non-transitory computer-readable medium capable of storing information accessible by the processor(s) 513, such as volatile and non-volatile memory. The processor(s) 513 can include one or more central processing units (CPUs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and/or application-specific integrated circuits (ASICs), such as tensor processing units (TPUs).

The instructions 521 can include one or more instructions that, when executed by the processor(s) 513, cause the one or more processors to perform the actions defined by the instructions. The instructions 521 can be stored in object code format for direct processing by the processor(s) 513, or in other formats, including interpretable scripts or collections of independent source code modules that are interpreted on demand or compiled in advance. The instructions 521 can include instructions for implementing the system 400 consistent with aspects of this disclosure. The system 400 can be executed using the processor(s) 513 and/or using other processors located remotely from the server computing device 515.

The data 523 can be retrieved, stored, or modified by the processor(s) 513 in accordance with the instructions 521. The data 523 can be stored in computer registers, or in a relational or non-relational database as a table having a plurality of different fields and records, or as JSON, YAML, proto, or XML documents. The data 523 can also be formatted in a computer-readable format, such as, but not limited to, binary values, ASCII, or Unicode. Moreover, the data 523 can include information sufficient to identify relevant information, such as numbers, descriptive text, proprietary codes, pointers, references to data stored in other memories (including other network locations), or information used by a function to calculate the relevant data.

The client computing device 512 can also be configured similarly to the server computing device 515, with one or more processors 516, memory 517, instructions 518, and data 519. The client computing device 512 can also include a user output 526 and a user input 524. The user input 524 can include any appropriate mechanism or technique for receiving input from a user, such as a keyboard, mouse, mechanical actuators, soft actuators, touchscreen, microphone, and sensors.

The server computing device 515 can be configured to transmit data to the client computing device 512, and the client computing device 512 can be configured to display at least a portion of the received data on a display implemented as part of the user output 526. The user output 526 can also be used to display an interface between the client computing device 512 and the server computing device 515. Alternatively or additionally, the user output 526 can include one or more speakers, transducers, or other audio outputs, a haptic interface, or other tactile feedback that provides non-visual and non-audible information to the platform user of the client computing device 512.

Although FIG. 5 illustrates the processors 513, 516 and the memories 514, 517 as being within the computing devices 515, 512, the components described in this specification (including the processors 513, 516 and the memories 514, 517) can include multiple processors and memories that can operate in different physical locations and not within the same computing device. For example, some of the instructions 521, 518 and the data 523, 519 can be stored on a removable SD card, and others within a read-only computer chip. Some or all of the instructions and data can be stored in a location physically remote from, yet still accessible by, the processors 513, 516. Similarly, the processors 513, 516 can include a collection of processors that can perform concurrent and/or sequential operations. The computing devices 515, 512 can each include one or more internal clocks providing timing information, which can be used for time measurement of operations and programs run by the computing devices 515, 512.

The server computing device 515 can be connected over the network 560 to a data center 550 housing hardware accelerators 551A-N. The data center 550 can be one of multiple data centers or other facilities in which various types of computing devices, such as hardware accelerators, are located. The computing resources housed in the data center 550 can be specified as part of the target computing resources for deploying scaled neural network architectures, as described herein.

The server computing device 515 can be configured to receive requests from the client computing device 512 to process data on computing resources in the data center 550. For example, the environment 500 can be part of a computing platform configured to provide a variety of services to users through various user interfaces and/or APIs exposing the platform services. One or more of the services can be a machine learning framework or a set of tools for generating neural networks or other machine learning models according to a specified task and training data. The client computing device 512 can receive and transmit data specifying the target computing resources to be allocated for executing a neural network trained to perform a particular neural network task. According to aspects of the disclosure described herein with reference to FIGS. 1-4, the NAS-LACS system 400 can receive the data specifying the target computing resources and the training data, and in response generate a family of scaled neural network architectures for deployment on the target computing resources.

As other examples of potential services provided by a platform implementing the environment 500, the server computing device 515 can maintain a variety of families of scaled neural network architectures according to the different potential target computing resources available at the data center 550. For example, the server computing device 515 can maintain different families for deploying neural networks on the various types of TPUs and/or GPUs housed in the data center 550 or otherwise available for processing.

The devices 512, 515 and the data center 550 can be capable of direct and indirect communication over the network 560. For example, using a network socket, the client computing device 512 can connect to a service operating in the data center 550 through an Internet protocol. The devices 515, 512 can set up listening sockets that can accept an initiating connection for sending and receiving information. The network 560 itself can include various configurations and protocols, including the Internet, the World Wide Web, intranets, virtual private networks, wide area networks, local area networks, and private networks using communication protocols proprietary to one or more companies. The network 560 can support a variety of short- and long-range connections. The short- and long-range connections can be made over different bandwidths, such as 2.402 GHz to 2.480 GHz (commonly associated with the Bluetooth® standard) and 2.4 GHz and 5 GHz (commonly associated with the Wi-Fi® communication protocol), or with a variety of communication standards, such as the LTE® standard for wireless broadband communication. The network 560, additionally or alternatively, can also support wired connections between the devices 512, 515 and the data center 550, including over various types of Ethernet connections.

Although a single server computing device 515, client computing device 512, and data center 550 are shown in FIG. 5, it is understood that aspects of the disclosure can be implemented according to a variety of different configurations and quantities of computing devices, including in paradigms for sequential or parallel processing, or over a distributed network of multiple devices. In some implementations, aspects of the disclosure can be performed on a single device connected to hardware accelerators configured for processing neural networks, and any combination thereof.

Exemplary Use Cases:

As described herein, aspects of the disclosure provide for generating an architecture of a neural network scaled from a base neural network according to a multi-objective approach. The following are examples of neural network tasks.

As an example, the input to the neural network can be in the form of images or videos. As part of processing a given input, for example as part of a computer vision task, a neural network can be configured to extract, identify, and generate features. A neural network trained to perform this type of neural network task can be trained to generate an output classification from a set of different potential classifications. Additionally or alternatively, the neural network can be trained to output a score corresponding to an estimated probability that an identified object in the image or video belongs to a particular class.

As another example, the input to the neural network can be data files corresponding to a particular format, for example HTML files, word processing documents, or formatted metadata obtained from other types of data, such as metadata for image files. In this context, a neural network task can be to classify, score, or otherwise predict some characteristic of the received input. For example, a neural network can be trained to predict the probability that the received input includes text related to a particular subject. Also as part of performing a particular task, a neural network can be trained to generate text predictions, for example as part of a tool for auto-completing text in a document as the document is being composed. A neural network can also be trained to predict a translation of text in an input document into a target language, for example while a message is being composed.

Other types of input documents can be data related to the characteristics of a network of interconnected devices. These input documents can include activity logs, as well as records of the access privileges of different computing devices for accessing different sources of potentially sensitive data. A neural network can be trained to process these and other types of documents to predict ongoing and future security breaches of the network. For example, the neural network can be trained to predict intrusion into the network by a malicious actor.

As another example, the input to a neural network can be audio input, including streaming audio, pre-recorded audio, and audio that is part of a video or other source or medium. A neural network task in the audio context can include speech recognition, including isolating speech from other identified audio sources and/or enhancing the characteristics of the identified speech so it is easier to hear. A neural network can be trained to predict an accurate translation of input speech into a target language, for example in real time as part of a translation tool.

In addition to data inputs, including the various types of data described herein, a neural network can also be trained to process features corresponding to a given input. A feature is a value, for example numerical or categorical, that relates to some characteristic of the input. For example, in the context of an image, a feature of the image can relate to the RGB values of each pixel in the image. A neural network task in the image/video context can be to classify the contents of an image or video, for example for the presence of different people, places, or things. The neural network can be trained to extract and select relevant features for processing to generate an output for a given input, and can also be trained to generate new features based on learned relationships between various characteristics of the input data.

Aspects of this disclosure can be implemented in digital circuits, in computer-readable storage media, as one or more computer programs, or as a combination of one or more of the foregoing. The computer-readable storage media can be non-transitory, for example as one or more instructions executable by one or more processors and stored on a tangible storage device.

In this specification, the phrase "configured to" is used in different contexts related to computer systems, hardware, or part of a computer program. When a system is said to be configured to perform one or more operations, this means that the system has appropriate software, firmware, and/or hardware installed on it that, when in operation, causes the system to perform the one or more operations. When some hardware is said to be configured to perform one or more operations, this means that the hardware includes one or more circuits that, when in operation, receive input and generate output according to the input and corresponding to the one or more operations. When a computer program is said to be configured to perform one or more operations, this means that the computer program includes one or more program instructions that, when executed by one or more computers, cause the one or more computers to perform the one or more operations.

Although the operations shown in the drawings and recited in the claims are shown in a particular order, it is understood that the operations can be performed in a different order than shown, and that some operations can be omitted, performed more than once, and/or performed in parallel with other operations. Further, the separation of different system components configured to perform different operations should not be understood as requiring the components to be separated. The components, modules, programs, and engines described can be integrated together as a single system or can be part of multiple systems.

Unless otherwise stated, the foregoing alternative examples are not mutually exclusive, but can be implemented in various combinations to achieve unique advantages. As these and other variations and combinations of the features discussed above can be utilized without departing from the subject matter defined by the claims, the foregoing description of the embodiments should be taken by way of illustration rather than by way of limitation of the subject matter defined by the claims. In addition, the provision of the examples described herein, as well as clauses phrased as "such as," "including," and the like, should not be interpreted as limiting the subject matter of the claims to the specific examples; rather, the examples are intended to illustrate only one of many possible aspects. Further, the same reference numbers in different drawings can identify the same or similar elements.

101: Base neural network architecture
103: Family
104A-N: Scaled neural network architectures
107A-N: Candidate networks
108: Coefficient search space
109: Scaled neural network architecture
115: Data center
116: Hardware accelerator

Claims (20)

1. A computer-implemented method for determining an architecture of a neural network, comprising: receiving, by one or more processors, information specifying target computing resources for deploying the neural network to perform a neural network task; receiving, by the one or more processors, data specifying an architecture of a base neural network; identifying, by the one or more processors, a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resources and a plurality of scaling parameters of the base neural network, wherein scaling the base neural network comprises adjusting at least one of a number of neural network layers or a size of one or more neural network layers of the base neural network, and wherein the identifying comprises repeatedly performing: selecting a plurality of candidate scaling parameter values; determining a performance metric of the base neural network scaled according to the plurality of candidate scaling parameter values, wherein the performance metric is determined according to a plurality of objectives comprising an accuracy objective and a latency objective when the neural network is operated using the target computing resources; and comparing the performance metric with a performance threshold to determine whether the performance metric satisfies the performance threshold, wherein the performance threshold comprises an accuracy threshold and a latency threshold specific to the neural network task; generating, by the one or more processors, an architecture of a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values, in response to the performance metric satisfying the performance threshold; and deploying the scaled neural network on the target computing resources to perform the neural network task.

2. The method of claim 1, wherein receiving the data specifying the architecture of the base neural network comprises: receiving, by the one or more processors, training data corresponding to the neural network task; and performing, by the one or more processors, a neural architecture search over a search space using the training data and according to a plurality of objectives to identify the architecture of the base neural network.

3. The method of claim 2, wherein the search space comprises candidate neural network layers, each candidate neural network layer comprising a different respective activation function and configured to perform one or more respective operations.

4. The method of claim 3, wherein the architecture of the base neural network comprises a plurality of components, each component having a respective plurality of neural network layers.
5. The method of claim 2, wherein the plurality of objectives used to perform the neural architecture search include the accuracy objective and the latency objective.

6. The method of claim 1, wherein the accuracy objective corresponds to a minimum accuracy of the output of the base neural network when deployed on the target computing resources.

7. The method of claim 1, wherein, when the base neural network is scaled according to the plurality of candidate scaling parameter values and deployed on the target computing resources, the performance metric corresponds at least in part to a latency measure between the base neural network receiving an input and generating an output.

8. The method of claim 1, wherein, when the base neural network is deployed on the target computing resources, the latency objective corresponds to a minimum latency between the base neural network receiving an input and generating an output.

9. The method of claim 1, wherein: the information specifying the target computing resources specifies one or more hardware accelerators for performing the neural network task; and deploying the scaled neural network on the target computing resources further comprises executing the scaled neural network on the one or more hardware accelerators to perform the neural network task.

10. The method of claim 1, wherein the performance threshold is a composite of a performance threshold for accuracy and a performance threshold for latency.

11. The method of claim 1, wherein generating the scaled neural network architecture further comprises generating the scaled neural network architecture using a plurality of second scaling parameter values, the second scaling parameter values being generated based on the plurality of scaling parameter values modified by one or more compound coefficients.

12. The method of claim 1, wherein the base neural network is a convolutional neural network and the plurality of scaling parameters include one or more of a depth of the base neural network, a width of the base neural network, or an input resolution of the base neural network.
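Again purely as an illustration: claims 11 and 12 describe modifying base depth, width, and input-resolution scaling parameters by compound coefficients, in the spirit of compound model scaling. A minimal sketch follows, assuming EfficientNet-style per-dimension coefficients; the alpha/beta/gamma defaults are illustrative assumptions, not values taken from this patent.

```python
import math

def compound_scale(base_depth: int, base_width: int, base_resolution: int,
                   phi: float, alpha: float = 1.2, beta: float = 1.1,
                   gamma: float = 1.15) -> tuple:
    """Modify the base scaling parameters (depth, width, input resolution)
    by compound coefficients: each dimension grows by its coefficient
    raised to the power of a single compound exponent phi."""
    return (
        math.ceil(base_depth * alpha ** phi),       # number of layers
        math.ceil(base_width * beta ** phi),        # channels per layer
        math.ceil(base_resolution * gamma ** phi),  # input resolution
    )

# Example: scaling up a base CNN by one compound step.
print(compound_scale(base_depth=18, base_width=64,
                     base_resolution=224, phi=1.0))  # (22, 71, 258)
```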
13. A system for neural architecture scaling, comprising:
one or more processors; and
one or more storage devices coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations for determining an architecture of a neural network, the operations comprising:
receiving information specifying target computing resources for deploying the neural network to perform a neural network task;
receiving data specifying an architecture of a base neural network;
identifying a plurality of scaling parameter values for scaling the base neural network based on the information specifying the target computing resources and a plurality of scaling parameters of the base neural network, wherein scaling the base neural network comprises adjusting at least one of a number of neural network layers or a size of one or more neural network layers of the base neural network, and wherein the identifying comprises repeatedly performing the following steps:
selecting a plurality of candidate scaling parameter values;
determining a performance metric of the base neural network scaled according to the plurality of candidate scaling parameter values, wherein the performance metric is determined according to a plurality of objectives, including an accuracy objective and a latency objective, when the neural network is operated using the target computing resources; and
comparing the performance metric with a performance threshold to determine whether the performance metric meets the performance threshold, wherein the performance threshold includes an accuracy threshold and a latency threshold specific to the neural network task;
generating, in response to the performance metric meeting the performance threshold, an architecture of a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values; and
deploying the scaled neural network on the target computing resources to perform the neural network task.

14. The system of claim 13, wherein receiving the data specifying the architecture of the base neural network comprises: receiving training data corresponding to the neural network task; and performing a neural architecture search over a search space, using the training data and according to a plurality of objectives, to identify the architecture of the base neural network.

15. The system of claim 14, wherein the search space includes candidate neural network layers, each candidate neural network layer including a different respective activation function and being configured to perform one or more respective operations.
16. The system of claim 14, wherein the plurality of objectives used to perform the neural architecture search include the accuracy objective and the latency objective.

17. The system of claim 13, wherein the accuracy objective corresponds to a minimum accuracy of the output of the base neural network when deployed on the target computing resources.

18. The system of claim 13, wherein, when the base neural network is scaled according to the plurality of candidate scaling parameter values and deployed on the target computing resources, the performance metric corresponds at least in part to a latency measure between the base neural network receiving an input and generating an output.

19. The system of claim 13, wherein, when the base neural network is deployed on the target computing resources, the latency objective corresponds to a minimum latency between the base neural network receiving an input and generating an output.

20. A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations for determining an architecture of a neural network, the operations comprising:
receiving information specifying target computing resources for deploying the neural network to perform a neural network task;
receiving, by the one or more processors, data specifying an architecture of a base neural network;
identifying a plurality of scaling parameter values for scaling the base neural network based on the information specifying the target computing resources and a plurality of scaling parameters of the base neural network, wherein scaling the base neural network comprises adjusting at least one of a number of neural network layers or a size of one or more neural network layers of the base neural network, and wherein the identifying comprises repeatedly performing the following steps:
selecting a plurality of candidate scaling parameter values;
determining a performance metric of the base neural network scaled according to the plurality of candidate scaling parameter values, wherein the performance metric is determined according to a plurality of objectives, including an accuracy objective and a latency objective, when the neural network is operated using the target computing resources; and
comparing the performance metric with a performance threshold to determine whether the performance metric meets the performance threshold, wherein the performance threshold includes an accuracy threshold and a latency threshold specific to the neural network task;
generating, in response to the performance metric meeting the performance threshold, an architecture of a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values; and
deploying the scaled neural network on the target computing resources to perform the neural network task.
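One last illustrative note: claims 1, 10, 13, and 20 treat the performance threshold as a composite of an accuracy threshold and a latency threshold. A common way to fold two such objectives into a single scalar is the weighted-product form used in hardware-aware architecture search; the sketch below assumes that form and an arbitrary exponent, neither of which is mandated by the claims.

```python
def composite_metric(accuracy: float, latency_ms: float,
                     latency_target_ms: float, w: float = -0.07) -> float:
    """Weighted-product combination of the accuracy and latency objectives:
    accuracy is rewarded, and latency beyond the target is penalized (w < 0)."""
    return accuracy * (latency_ms / latency_target_ms) ** w

# A candidate meeting the latency target keeps its accuracy score;
# one running twice as slow is discounted.
print(composite_metric(0.80, 10.0, latency_target_ms=10.0))  # 0.8
print(composite_metric(0.80, 20.0, latency_target_ms=10.0))  # ~0.762
```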
TW110124428A 2021-01-15 2021-07-02 Computer-implemented method, system, and non-transitory computer readable storage medium for neural architecture scaling for hardware accelerators TWI900601B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163137926P 2021-01-15 2021-01-15
US63/137,926 2021-01-15
US17/175,029 US20220230048A1 (en) 2021-01-15 2021-02-12 Neural Architecture Scaling For Hardware Accelerators
US17/175,029 2021-02-12

Publications (2)

Publication Number Publication Date
TW202230221A TW202230221A (en) 2022-08-01
TWI900601B true TWI900601B (en) 2025-10-11

Family

ID=77448062

Family Applications (1)

Application Number Title Priority Date Filing Date
TW110124428A TWI900601B (en) 2021-01-15 2021-07-02 Computer-implemented method, system, and non-transitory computer readable storage medium for neural architecture scaling for hardware accelerators

Country Status (5)

Country Link
EP (1) EP4217928A1 (en)
JP (1) JP7579972B2 (en)
CN (1) CN116261734A (en)
TW (1) TWI900601B (en)
WO (1) WO2022154829A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11836595B1 (en) * 2022-07-29 2023-12-05 Lemon Inc. Neural architecture search system using training based on a weight-related metric
TWI819880B (en) * 2022-11-03 2023-10-21 財團法人工業技術研究院 Hardware-aware zero-cost neural network architecture search system and network potential evaluation method thereof
CN119416837A (en) * 2024-10-15 2025-02-11 西安电子科技大学 Multi-dimensional compound scaling method for convolutional neural networks

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201901534A (en) * 2017-05-19 2019-01-01 美商谷歌有限責任公司 Scheduled neural network processing
US20190354837A1 (en) * 2018-05-18 2019-11-21 Baidu Usa Llc Resource-efficient neural architects
WO2020102421A1 (en) * 2018-11-13 2020-05-22 The Board Of Trustees Of The University Of Illinois Integrated memory system for high performance bayesian and classical inference of neural networks

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3451237A4 (en) * 2016-04-28 2019-04-17 Sony Corporation INFORMATION PROCESSING DEVICE AND INFORMATION PROCESSING METHOD
WO2019216404A1 (en) * 2018-05-10 2019-11-14 パナソニックIpマネジメント株式会社 Neural network construction device, information processing device, neural network construction method, and program
US11531861B2 (en) 2018-11-06 2022-12-20 Google Llc Neural architecture search with factorized hierarchical search space
WO2020154536A1 (en) 2019-01-23 2020-07-30 Google Llc Compound model scaling for neural networks
CN111382868B (en) * 2020-02-21 2024-06-18 华为技术有限公司 Neural network structure search method and neural network structure search device
CN112101525A (en) * 2020-09-08 2020-12-18 南方科技大学 A method, apparatus and system for designing neural network through NAS

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201901534A (en) * 2017-05-19 2019-01-01 美商谷歌有限責任公司 Scheduled neural network processing
TW201937416A (en) * 2017-05-19 2019-09-16 美商谷歌有限責任公司 Scheduling neural network processing
US20190354837A1 (en) * 2018-05-18 2019-11-21 Baidu Usa Llc Resource-efficient neural architects
WO2020102421A1 (en) * 2018-11-13 2020-05-22 The Board Of Trustees Of The University Of Illinois Integrated memory system for high performance bayesian and classical inference of neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Online publication: Jaeseong Lee et al., "S3NAS: Fast NPU-aware Neural Architecture Search Methodology," arXiv.org, 2020-09-04, arXiv preprint arXiv:2009.02009
Online publication: Yibo Hu et al., "TF-NAS: Rethinking Three Search Freedoms of Latency-Constrained Differentiable Neural Architecture Search," ECCV 2020, 2020-08-12, arXiv preprint arXiv:2008.05314 *
Online publication: Han Cai et al., "Once-for-All: Train One Network and Specialize It for Efficient Deployment," ICLR 2020, 2020-04-29, arXiv preprint arXiv:1908.09791 *

Also Published As

Publication number Publication date
EP4217928A1 (en) 2023-08-02
TW202230221A (en) 2022-08-01
CN116261734A (en) 2023-06-13
WO2022154829A1 (en) 2022-07-21
JP7579972B2 (en) 2024-11-08
JP2023552048A (en) 2023-12-14

Similar Documents

Publication Publication Date Title
US20220230048A1 (en) Neural Architecture Scaling For Hardware Accelerators
US11386256B2 (en) Systems and methods for determining a configuration for a microarchitecture
KR102141324B1 (en) Fast computation of convolutional neural networks
US20200265301A1 (en) Incremental training of machine learning tools
TWI900601B (en) Computer-implemented method, system, and non-transitory computer readable storage medium for neural architecture scaling for hardware accelerators
JP7645896B2 (en) Hardware-optimized Neural Architecture Search
US20240370693A1 (en) Full-stack hardware accelerator search
AU2019451945B2 (en) Dynamic image resolution assessment
KR102561799B1 (en) Method and system for predicting latency of deep learning model in device
JP7668905B2 (en) Hardware-aware progressive training of machine learning models
US20240119266A1 (en) Method for Constructing AI Integrated Model, and AI Integrated Model Inference Method and Apparatus
WO2022252694A1 (en) Neural network optimization method and apparatus
EP4460784A1 (en) Hybrid and hierarchical multi-trial and oneshot neural architecture search on datacenter machine learning accelerators
US11048852B1 (en) System, method and computer program product for automatic generation of sizing constraints by reusing existing electronic designs
US20240037373A1 (en) OneShot Neural Architecture and Hardware Architecture Search
KR20220032861A (en) Neural architecture search method and attaratus considering performance in hardware
CN111310823A (en) Object classification method, device and electronic system
KR20200024433A (en) Method and system for utilizing thin sub networks for anytime prediction
JP2023138928A (en) Method and device for generating neural network
CN116977814A (en) Image processing method, device, computer equipment and storage medium
US12536163B2 (en) Foundational machine learning model for learned database tasks
US20250307635A1 (en) Training method and application method of neural network model, training apparatus and application apparatus of neural network model, storage medium, and computer program product
US20250363417A1 (en) Computer architecture for predicting energy consumption of machine learning inference
CN118648001A (en) Hybrid and Hierarchical Multi-Trial and One-Shot Neural Architecture Search on Datacenter Machine Learning Accelerators
CN121167227A (en) AI chip-based quantitative evaluation methods, devices, equipment, media, and procedures