
CN113535349B - Data batch processing method, device and storage medium - Google Patents

Data batch processing method, device and storage medium

Info

Publication number
CN113535349B
CN113535349B
Authority
CN
China
Prior art keywords
target
index
data
processing
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110011581.9A
Other languages
Chinese (zh)
Other versions
CN113535349A (en)
Inventor
汪申鹏
丁丹迪
姚达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110011581.9A
Publication of CN113535349A
Application granted
Publication of CN113535349B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to face recognition in artificial intelligence, and in particular to a data batch processing method, apparatus, and storage medium. The method comprises: obtaining a target task and determining an instruction stream corresponding to the target task; determining an index complete set corresponding to the target task; dividing the index complete set into a plurality of data index sets and distributing the data index sets to corresponding processing channels, where the number of elements in the data index set distributed to each processing channel is larger than the cache quantity corresponding to the cache unit; and executing, through the processing channels in parallel and based on the data index sets respectively distributed to them, at least one round of the target operation corresponding to the instruction stream until an operation result corresponding to the target task is obtained. The method can improve the processing efficiency of data batch processing.

Description

Data batch processing method, device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for batch processing of data, and a storage medium.
Background
Machine learning models can improve their accuracy through a large number of computations, and in practical applications rich online services can be deployed through machine learning models. A machine learning model typically includes a number of memory-access-intensive operators, such as unary operators, binary operators, and reduction operators.
In conventional schemes, memory-access-intensive operators are typically computed in a pipelined fashion; for a unary operator, for example, the data to be processed is usually handled sequentially. Because different pieces of data to be processed depend on one another during processing (for example, processing the second piece of data may require the processing result of the first), each subsequent processing step must wait for the previous one to complete before it can execute, which easily causes data processing stalls. The overall time required to process a memory-access-intensive operator is therefore very large, and batch processing of large amounts of data is inefficient.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data batch processing method, apparatus, computer device, and storage medium capable of improving the efficiency of data batch processing.
A method of batch processing of data, the method comprising:
acquiring a target task and determining an instruction stream corresponding to the target task;
Determining an index complete set corresponding to the target task, wherein each index element in the index complete set is used for pointing to each content element in a matrix to be processed;
dividing the index complete set into a plurality of data index sets, and respectively distributing the data index sets to corresponding processing channels, wherein the number of elements of the data index sets distributed by each processing channel is larger than the number of caches corresponding to the cache units;
Based on the data index sets respectively and correspondingly allocated to the processing channels, respectively executing at least one round of target operation corresponding to the instruction stream through the processing channels in parallel until an operation result corresponding to the target task is obtained;
When each processing channel executes the target operation of the current round, the corresponding content element is searched for in the cache unit according to the current index element in the corresponding data index set to serve as an operation object, and when the corresponding content element is not found in the cache unit, the cache unit is triggered to acquire a cache-quantity of content elements according to the current index element and perform an overlay update.
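The steps above are claim language rather than code, but the overall flow can be sketched as a minimal Python illustration. This is not the patented implementation; all names are hypothetical, and the cache unit is modelled as a small dictionary that is overwritten on a miss:

```python
# Hypothetical sketch of the claimed flow: the index complete set is split
# into per-channel data index sets, each larger than the cache quantity,
# so every channel is forced to miss and refill the cache as it proceeds.

def split_index_corpus(index_corpus, num_channels, cache_size):
    """Divide the full index set so each channel's share exceeds cache_size."""
    per_channel = max(cache_size + 1, len(index_corpus) // num_channels)
    return [index_corpus[i:i + per_channel]
            for i in range(0, len(index_corpus), per_channel)]

class CacheUnit:
    def __init__(self, matrix, cache_size):
        self.matrix = matrix        # flattened "matrix to be processed"
        self.cache_size = cache_size
        self.cache = {}             # index element -> content element
        self.misses = 0

    def lookup(self, index):
        if index not in self.cache:
            # Cache miss: overlay-update with cache_size contiguous elements
            # starting at the current index element.
            self.misses += 1
            end = min(index + self.cache_size, len(self.matrix))
            self.cache = {i: self.matrix[i] for i in range(index, end)}
        return self.cache[index]

def run_channel(index_set, cache, target_op):
    """One processing channel: fetch each content element and apply the op."""
    return [target_op(cache.lookup(i)) for i in index_set]
```

Because each data index set deliberately exceeds the cache quantity, misses are unavoidable; the patent's point is that when several channels run in parallel, those misses can be made to occur in the same time phase and overlap.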
In one embodiment, the target operation of each round includes a set of operations performed in multiple cycles, each set of operations including a first target operation and a second target operation;
When executing the target operation of the current round, each processing channel searches the corresponding content element from the cache unit as an operation object according to the current index element in the corresponding data index set, and the processing method comprises the following steps:
When each processing channel executes a first target operation in the current cycle of the current round, searching a corresponding content element from the cache unit according to a corresponding current index element to serve as an operation object of the first target operation;
And when each processing channel executes the second target operation in the current cycle of the current round, acquiring intermediate operation data obtained by executing the second target operation in the previous cycle, and taking the intermediate operation data and the content element searched through the first target operation as operation objects together.
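As a hedged illustration of this two-step cycle (hypothetical names; a running reduction stands in for the second target operation, which in the patent consumes the previous cycle's intermediate data):

```python
# Hypothetical sketch: each round runs several cycles of a two-operation
# group. The first target operation fetches a content element; the second
# combines it with the intermediate data produced in the previous cycle.

def run_round(index_set, lookup, combine, initial):
    intermediate = initial
    for current_index in index_set:
        element = lookup(current_index)                # first target operation
        intermediate = combine(intermediate, element)  # second target operation
    return intermediate
```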
In one embodiment, when executing the target operation of the current round, each processing channel searches the corresponding content element from the cache unit as an operation object according to the current index element in the corresponding data index set, and includes:
determining the maximum concurrency quantity supported by the local resource;
when each processing channel executes the target operation of the current round, searching the content elements with the largest concurrency number from the cache unit according to the current index elements in the corresponding data index set, and taking each searched content element as an operation object in parallel to execute the target operation in parallel.
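A rough sketch of this embodiment, using a thread pool to stand in for the hardware's maximum concurrency (hypothetical; the patent does not prescribe threads):

```python
# Hypothetical sketch: per round, a channel looks up at most max_concurrency
# content elements and then applies the target operation to them in parallel.

from concurrent.futures import ThreadPoolExecutor

def run_round_concurrent(index_set, lookup, target_op, max_concurrency):
    results = []
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        for start in range(0, len(index_set), max_concurrency):
            batch = index_set[start:start + max_concurrency]
            elements = [lookup(i) for i in batch]          # batched lookups
            results.extend(pool.map(target_op, elements))  # parallel target op
    return results
```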
In one embodiment, the method further comprises:
acquiring a face image to be detected, and inputting the face image to be detected into the face detection model;
Acquiring a target task through the face detection model, triggering an instruction stream corresponding to the target task, and executing the instruction stream to obtain an operation result corresponding to the target task;
and determining a face detection result according to the operation result.
In one embodiment, the executing, according to the address table and each index element in each index subset, at least one round of target operation corresponding to the instruction stream until an operation sub-result corresponding to the target sub-task is obtained includes:
For each data index set, determining a to-be-processed index subset in the corresponding data index set, and determining a current to-be-processed index subset from the to-be-processed index subsets;
Executing at least one round of target operation corresponding to the instruction stream according to an address table corresponding to a corresponding data index set and each index element in the current index subset to be processed, and updating the current index subset to be processed into a completed index subset after completing the target operation;
and entering a flow for processing the next index subset to be processed, and returning to the step of determining the index subset to be processed in the corresponding data index set to continue to be executed until the target operation corresponding to the instruction stream is executed based on all the index subsets in each data index set, so as to obtain an operation sub-result corresponding to the target sub-task.
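The scheduling described above, with index subsets moving from "to be processed" to "completed", can be sketched as follows (hypothetical names):

```python
# Hypothetical sketch of the subset scheduling loop: index subsets move
# from a pending pool to a completed pool until none remain, and the
# per-subset results are collected as the operation sub-result.

def process_data_index_set(index_subsets, run_subset):
    pending = list(index_subsets)   # index subsets to be processed
    completed = []
    sub_results = []
    while pending:
        current = pending.pop(0)    # current index subset to be processed
        sub_results.append(run_subset(current))
        completed.append(current)   # mark the subset as completed
    return sub_results
```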
A data batch processing apparatus, the apparatus comprising:
the system comprises an index complete set acquisition module, a processing module and a processing module, wherein the index complete set acquisition module is used for acquiring a target task, determining an instruction stream corresponding to the target task, determining an index complete set corresponding to the target task, and pointing to each content element in a matrix to be processed by each index element in the index complete set;
the data index set acquisition module is used for dividing the index complete set into a plurality of data index sets and respectively distributing the data index sets to corresponding processing channels, wherein the number of elements of the data index set distributed by each processing channel is larger than the number of caches corresponding to the cache units;
and a target operation execution module, configured to execute at least one round of the target operation corresponding to the instruction stream through each processing channel in parallel, based on the data index set allocated to each processing channel, until an operation result corresponding to the target task is obtained, where each processing channel, when executing the target operation of the current round, searches for the corresponding content element in the cache unit according to the current index element in its corresponding data index set to serve as an operation object, and, when the corresponding content element is not found in the cache unit, triggers the cache unit to acquire a cache-quantity of content elements according to the current index element and perform an overlay update.
In one embodiment, the data index set obtaining module is further configured to determine the number of channels of the processing channel and the number of caches corresponding to the cache units, divide the index corpus according to the number of channels and the number of caches to obtain a plurality of data index sets, and allocate each data index set to a corresponding processing channel respectively.
In one embodiment, the target operation execution module is further configured to determine, for each of the plurality of processing channels, at least one to-be-processed data index set corresponding to the corresponding processing channel, determine, for each processing channel, a current to-be-processed data index set from the corresponding to-be-processed data index set, respectively, execute, based on the current to-be-processed data index set corresponding to each processing channel, at least one round of target operation corresponding to the instruction stream through each processing channel in parallel, and update the current to-be-processed data index set to a completed data index set after completing the target operation, enter a process of processing a next to-be-processed data index set, and return to each processing channel of the plurality of processing channels, determine, for each processing channel, the to-be-processed data index set corresponding to the corresponding processing channel, and continue execution until, based on each data index set in the index full set, execute the target operation corresponding to the instruction stream for at least one round.
In one embodiment, the index corpus acquisition module is used for determining an instruction stream corresponding to the target task, wherein the instruction stream comprises more than one operation instruction with determined triggering sequence, the more than one operation instruction comprises a first instruction and a second instruction, the first instruction is triggered before the second instruction, and the target operation corresponding to the instruction stream comprises a first target operation corresponding to the first instruction and a second target operation corresponding to the second instruction.
In one embodiment, the target operation of each round comprises a group of operations executed cyclically a plurality of times, each group of operations comprising a first target operation and a second target operation; the target operation execution module further comprises a cycle execution module, configured to, when each processing channel executes the first target operation in the current cycle of the current round, search for the corresponding content element in the cache unit according to the corresponding current index element to serve as the operation object of the first target operation, and, when each processing channel executes the second target operation in the current cycle of the current round, acquire the intermediate operation data obtained by executing the second target operation in the previous cycle and take the intermediate operation data, together with the content element found through the first target operation, as operation objects.
In one embodiment, the target operation execution module further comprises a concurrency processing module, wherein the concurrency processing module is used for determining the maximum concurrency quantity supported by the local resource, and each processing channel searches the content elements with the maximum concurrency quantity from the cache unit according to the current index element in the corresponding data index set when executing the target operation of the current round, and takes each searched content element as an operation object in parallel to execute the target operation in parallel.
In one embodiment, the data batch processing device is further configured to trigger, when no corresponding content element is found in the cache unit, the local kernel to acquire content elements with continuity and corresponding to the number of caches from the local memory according to the current index element, and trigger to perform overlay update on the stored content in the cache unit based on the acquired content elements, where the content elements in the cache unit are used for executing subsequent target operations.
In one embodiment, the data batch processing device is further configured to determine the number of page tables of an address table corresponding to each data index set, divide the data index set according to the number of page tables to obtain at least one index subset, each target page table in the address table corresponds to at least one matrix row in the index subset, the target page table is configured to store address information of index elements having continuity in the corresponding matrix row, and execute at least one round of target operation corresponding to the instruction stream in parallel through each processing channel according to the address table corresponding to each processing channel and the index subset in the data index set allocated corresponding to each processing channel until an operation sub result corresponding to the target sub-task is obtained, and synthesize each operation sub-result to obtain an operation result corresponding to the target task.
In one embodiment, the data batch processing device is further configured to, when each processing channel performs a target operation of a current round, search corresponding target address information according to a target page table corresponding to a current matrix row where a current index element is located, obtain a corresponding content element based on the target address information as an operation object, and trigger the target page table to perform coverage update of the address information according to the current index element when the corresponding content element is not searched in the target page table.
In one embodiment, the data batch processing device is deployed with an inference engine for execution and is applied to a face detection model, wherein the target task is one of the memory-access-intensive tasks generated by the face detection model during face detection.
In one embodiment, the data batch processing device is further used for acquiring a face image to be detected, inputting the face image to be detected into the face detection model, acquiring a target task through the face detection model, triggering an instruction stream corresponding to the target task and executing the instruction stream to obtain an operation result corresponding to the target task, and determining a face detection result according to the operation result.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring a target task and determining an instruction stream corresponding to the target task;
Determining an index complete set corresponding to the target task, wherein each index element in the index complete set is used for pointing to each content element in a matrix to be processed;
dividing the index complete set into a plurality of data index sets, and respectively distributing the data index sets to corresponding processing channels, wherein the number of elements of the data index sets distributed by each processing channel is larger than the number of caches corresponding to the cache units;
Based on the data index sets respectively and correspondingly allocated to the processing channels, respectively executing at least one round of target operation corresponding to the instruction stream through the processing channels in parallel until an operation result corresponding to the target task is obtained;
When each processing channel executes the target operation of the current round, the corresponding content element is searched for in the cache unit according to the current index element in the corresponding data index set to serve as an operation object, and when the corresponding content element is not found in the cache unit, the cache unit is triggered to acquire a cache-quantity of content elements according to the current index element and perform an overlay update.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring a target task and determining an instruction stream corresponding to the target task;
Determining an index complete set corresponding to the target task, wherein each index element in the index complete set is used for pointing to each content element in a matrix to be processed;
dividing the index complete set into a plurality of data index sets, and respectively distributing the data index sets to corresponding processing channels, wherein the number of elements of the data index sets distributed by each processing channel is larger than the number of caches corresponding to the cache units;
Based on the data index sets respectively and correspondingly allocated to the processing channels, respectively executing at least one round of target operation corresponding to the instruction stream through the processing channels in parallel until an operation result corresponding to the target task is obtained;
When each processing channel executes the target operation of the current round, the corresponding content element is searched for in the cache unit according to the current index element in the corresponding data index set to serve as an operation object, and when the corresponding content element is not found in the cache unit, the cache unit is triggered to acquire a cache-quantity of content elements according to the current index element and perform an overlay update.
According to the data batch processing method, apparatus, computer device, and storage medium, the instruction stream and the index complete set corresponding to the target task can be determined by acquiring the target task. Once the index complete set is determined, it can be divided into a plurality of data index sets, and each data index set can be distributed to a corresponding processing channel. With each data index set assigned to a processing channel, at least one round of the target operation can be executed in parallel by each processing channel based on its assigned data index set until an operation result corresponding to the target task is obtained. Because the index complete set is divided based on the number of channels and the cache quantity, the number of elements in each divided data index set can be made larger than the number of caches corresponding to the cache unit. As a result, when multiple processing channels read data from the cache unit in parallel based on their corresponding index elements, cache miss events can occur simultaneously; since multi-channel memory allows simultaneous memory reads, cache miss events that would otherwise occur in several different time phases are concentrated to occur at the same time. This hides the access latency of part of the cache miss events and thereby improves the processing efficiency of data batch processing.
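The latency-hiding argument in this paragraph can be illustrated with a deliberately simplified timing model (hypothetical numbers; real memory systems overlap misses only partially):

```python
# Hypothetical timing model: if each of C channels misses once per cache
# refill, serial execution pays C stalls per refill, whereas misses that
# land in the same time phase overlap and cost roughly one stall per refill.

def serial_stall_time(num_channels, refills, miss_latency):
    return num_channels * refills * miss_latency

def overlapped_stall_time(num_channels, refills, miss_latency):
    # Misses from all channels occur simultaneously, so their memory
    # reads proceed in parallel over the multi-channel memory.
    return refills * miss_latency
```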
A method of batch processing of data, the method comprising:
acquiring a target subtask and determining an instruction stream corresponding to the target subtask;
Determining a data index set corresponding to the target subtask and the number of page tables of address tables in local resources, wherein the data index set is used for searching content elements in a matrix to be processed;
dividing the data index set according to the number of the page tables to obtain at least one index subset, wherein each target page table in the address table corresponds to at least one matrix row in the index subset respectively and is used for storing address information of index elements with continuity in the corresponding matrix row;
Executing at least one round of target operation corresponding to the instruction stream according to the address table and each index element in each index subset until an operation sub-result corresponding to the target sub-task is obtained;
When the target operation of the current round is executed, the corresponding target address information is looked up in the target page table corresponding to the current matrix row in which the current index element is located, and the corresponding content element is obtained based on the target address information to serve as an operation object; when the corresponding content element is not found in the target page table, the target page table is triggered to perform an overlay update of the address information according to the current index element.
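A minimal Python sketch of this page-table mechanism (hypothetical; addresses are modelled as (row, column) pairs and each page table caches one matrix row's address information):

```python
# Hypothetical sketch: each target page table holds address information for
# consecutive index elements of one matrix row; a lookup miss triggers an
# overlay update of that page table from the current index element's row.

class AddressTable:
    def __init__(self, matrix_rows, num_page_tables):
        self.matrix_rows = matrix_rows          # the matrix to be processed
        self.num_page_tables = num_page_tables
        self.page_tables = [dict() for _ in range(num_page_tables)]
        self.overlay_updates = 0

    def lookup(self, row, col):
        slot = row % self.num_page_tables
        if (row, col) not in self.page_tables[slot]:
            # Miss: overlay-update this page table with the address info
            # of all consecutive index elements in the current matrix row.
            self.overlay_updates += 1
            self.page_tables[slot] = {
                (row, c): (row, c) for c in range(len(self.matrix_rows[row]))
            }
        r, c = self.page_tables[slot][(row, col)]  # target address information
        return self.matrix_rows[r][c]              # fetch the content element
```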
A data batch processing apparatus, the apparatus comprising:
the subtask determining module is used for acquiring a target subtask and determining an instruction stream corresponding to the target subtask;
The data dividing module is used for determining a data index set corresponding to the target subtask and the number of page tables of address tables in local resources, wherein the data index set is used for searching content elements in a matrix to be processed, dividing the data index set according to the number of the page tables to obtain at least one index subset, and each target page table in the address tables is respectively corresponding to at least one matrix row in the index subset and is used for storing address information of index elements with continuity in the corresponding matrix row;
And the sub-result determining module is used for executing at least one round of target operation corresponding to the instruction stream according to the address table and each index element in each index subset until an operation sub-result corresponding to the target sub-task is obtained, wherein when the target operation of the current round is executed, corresponding target address information is searched according to a target page table corresponding to the current matrix row where the current index element is located, the corresponding content element is obtained based on the target address information and serves as an operation object, and when the corresponding content element is not searched in the target page table, the target page table is triggered to carry out coverage update of the address information according to the current index element.
In one embodiment, the sub-result determining module is further configured to determine, for each data index set, the index subsets to be processed in the corresponding data index set, and determine a current index subset to be processed from among them; execute at least one round of the target operation corresponding to the instruction stream according to the address table corresponding to the data index set and each index element in the current index subset to be processed; update the current index subset to be processed to a completed index subset after completing the target operation; enter the process of handling the next index subset to be processed; and return to the step of determining the index subsets to be processed in the corresponding data index set and continue execution until the target operation corresponding to the instruction stream has been executed for at least one round based on all index subsets in each data index set, so as to obtain an operation sub-result corresponding to the target subtask.
In one embodiment, the sub-result determining module is further configured to determine a maximum concurrency number supported by the local resource, group the index subsets according to the maximum concurrency number to obtain at least one concurrency subset, and execute a round of target operation corresponding to a round of instruction stream based on each concurrency subset in turn.
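Grouping an index subset by the maximum concurrency can be sketched as follows (hypothetical names):

```python
# Hypothetical sketch: split an index subset into concurrent subsets of at
# most max_concurrency elements; each concurrent subset then drives one
# round of the target operation.

def group_by_concurrency(index_subset, max_concurrency):
    return [index_subset[i:i + max_concurrency]
            for i in range(0, len(index_subset), max_concurrency)]
```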
In one embodiment, the target operation of each round includes a set of operations performed in a number of cycles corresponding to the number of page tables, and the sub-result determination module is further configured to, for each concurrent subset, cycle the set of operations in the number of page tables based on respective matrix rows in the concurrent subset during execution of the target operation of each round.
In one embodiment, the data batch processing device is deployed with an inference engine for execution and is applied to a face detection model, wherein the target subtask is a task for processing part of the data in the matrix to be processed within a target task, and the target task is one of the memory-access-intensive tasks generated when the face detection model performs face detection.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring a target subtask and determining an instruction stream corresponding to the target subtask;
Determining a data index set corresponding to the target subtask and the number of page tables of address tables in local resources, wherein the data index set is used for searching content elements in a matrix to be processed;
dividing the data index set according to the number of the page tables to obtain at least one index subset, wherein each target page table in the address table corresponds to at least one matrix row in the index subset respectively and is used for storing address information of index elements with continuity in the corresponding matrix row;
Executing at least one round of target operation corresponding to the instruction stream according to the address table and each index element in each index subset until an operation sub-result corresponding to the target sub-task is obtained;
When the target operation of the current round is executed, the corresponding target address information is looked up in the target page table corresponding to the current matrix row in which the current index element is located, and the corresponding content element is obtained based on the target address information to serve as an operation object; when the corresponding content element is not found in the target page table, the target page table is triggered to perform an overlay update of the address information according to the current index element.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring a target subtask and determining an instruction stream corresponding to the target subtask;
Determining a data index set corresponding to the target subtask and the number of page tables of address tables in local resources, wherein the data index set is used for searching content elements in a matrix to be processed;
dividing the data index set according to the number of the page tables to obtain at least one index subset, wherein each target page table in the address table corresponds to at least one matrix row in the index subset respectively and is used for storing address information of index elements with continuity in the corresponding matrix row;
Executing at least one round of target operation corresponding to the instruction stream according to the address table and each index element in each index subset until an operation sub-result corresponding to the target sub-task is obtained;
When the target operation of the current round is executed, the corresponding target address information is looked up in the target page table corresponding to the current matrix row in which the current index element is located, and the corresponding content element is obtained based on the target address information to serve as an operation object; when the corresponding content element is not found in the target page table, the target page table is triggered to perform an overlay update of the address information according to the current index element.
According to the data batch processing method, apparatus, computer device, and storage medium above, the instruction stream and the data index set corresponding to the target subtask can be determined by acquiring the target subtask. Once the data index set is determined, it may be partitioned based on the number of page tables of the address table to obtain at least one index subset. With the index subsets determined, at least one round of the target operation corresponding to the instruction stream may be performed based on the address table and the index elements in each index subset. Because the data index set is divided based on the number of page tables, each index subset obtained by the division can contain that number of matrix rows. Since each target page table in the address table corresponds to at least one matrix row in the index subset and stores the address information of consecutive index elements in the corresponding matrix row, when the target operation is executed based on the address table and the current index subset, the address information of all index elements in the current index subset can be stored completely in the same address table. Therefore, after the original address table, which does not yet contain the address information, has been fully updated once, the address information of all index elements in the current index subset can be looked up in the address table, and the page tables in the address table no longer need to be overlay-updated. This reduces the number of overlay updates of the address table, saves the clock cycles consumed by those updates, and improves the processing efficiency of data batch processing.
Drawings
FIG. 1 is an application environment diagram of a data batch processing method in one embodiment;
FIG. 2A is a flow diagram of a normal pipeline in one embodiment;
FIG. 2B is a flow diagram of an exception pipeline in one embodiment;
FIG. 3 is a flow diagram of a method of data batch processing in one embodiment;
FIG. 4 is a flow diagram of multi-channel parallel processing in one embodiment;
FIG. 5 is a flow chart of face detection in one embodiment;
FIG. 6 is a flow chart of a method of data batch processing in another embodiment;
FIG. 7 is a flow chart of a method of data batch processing in one embodiment;
FIG. 8 is a flow chart of a method for data batch processing in another embodiment;
FIG. 9 is an application scenario diagram of a data batch processing method in another embodiment;
FIG. 10 is a block diagram showing a structure of a data batch processing apparatus in one embodiment;
FIG. 11 is a block diagram showing a structure of a data batch processing apparatus in one embodiment;
FIG. 12 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Referring to FIG. 1, FIG. 1 is a schematic diagram of an alternative architecture of a data batch processing system according to an embodiment of the present application. As shown in FIG. 1, to support a data batch processing application, in a data batch processing system 100, a user terminal 102 communicates with a computer device 104 via a network. The user terminals 102 include a first user terminal 102-1 and a second user terminal 102-2. A user may initiate a service request through the first user terminal 102-1; for example, upon receiving an image to be detected, the first user terminal 102-1 may display a detection control 102-12 on the graphical interface 102-11 to initiate a face detection request. The computer device 104 generates more than one computing task based on the service request, wherein the more than one computing tasks include the target task. The computer device 104 may obtain the target task, determine the instruction stream corresponding to the target task, and execute at least one round of the target operation corresponding to the instruction stream in parallel based on a plurality of processing channels until the operation result corresponding to the target task is obtained. The computer device 104 may determine the service processing result corresponding to the service request based on each operation result and feed the service processing result back to the second user terminal 102-2, for example, feed a face detection result back to the second user terminal 102-2, so that the second user terminal 102-2 may display the face detection result on the graphical interface 102-21. The first user terminal 102-1 and the second user terminal 102-2 may be the same terminal or different terminals.
The user terminal 102 may be, but not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices, and the computer device 104 may be a terminal or a server, where the server may be implemented by a stand-alone server or a server cluster formed by a plurality of servers.
It should be noted that the computer device 104 may be deployed with machine learning models having different functions, and different kinds of online services may be deployed through the machine learning models having different functions. For example, the computer device may provide a face detection service, a face registration service, an identity recognition service, or the like through a machine learning model having a face detection function. The computer device may also provide an image classification service, a target object recognition service, a monitoring analysis service, or the like through a machine learning model having an image processing function (such as a semantic segmentation model or an image classification model). Different machine learning models may, when providing services, involve executing memory-access-intensive operators, such as unary operators, binary operators, InstNorm operators (instance normalization operators), and reduction operators. When the target task is to execute a memory-access-intensive operator, a large amount of data to be processed can be operated on in batch by the data batch processing method provided by the embodiments of the present application, so that the processing efficiency is greatly improved. Of course, the target task may be another computing task that needs batch processing and may depend on the specific application scenario, which is not limited in the embodiments of the present application.
It should be further noted that the data batch processing method of the embodiments of the present application mainly addresses the case where a machine learning model implemented by artificial intelligence technology needs to perform more than one computing task during an operation, including the target tasks described in the embodiments of the present application. Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is thus the study of the design principles and implementation methods of various intelligent machines, enabling the machines to perceive, reason, and make decisions.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
It can be appreciated that the data batch processing method in the embodiments of the present application relates to machine learning technology in artificial intelligence. Machine learning (ML) is a multi-domain interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and how it reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
It should be noted that the terms "first," "second," and the like as used herein do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The singular forms "a," "an," or "the" and similar terms do not denote a limitation of quantity, but rather denote the presence of at least one, unless the context clearly dictates otherwise. The numbers of "plural" or "multiple" etc. mentioned in the embodiments of the present application each refer to the number of "at least two", for example, "plural" means "at least two", and "multiple" means "at least two".
For a better understanding of the present solution, the conventional solution is further described below. For a memory-access-intensive unary operator computed sequentially, the normal workflow of the CPU pipeline is shown in FIG. 2A. A group of operations including a data loading operation (Load) and a summation operation (Add) is taken as one cycle. In the first cycle, after the first data to be processed is loaded, the summation processing can be performed on it to obtain the intermediate operand of the first cycle, and once that intermediate operand is obtained, the second cycle can be performed. This continues until the intermediate operand of the last cycle is obtained. As shown in FIG. 2A, assuming that each instruction stage consumes one clock cycle, one Load and one Add operation are regarded as one group of operations, and n groups of operations need to be performed in a loop, the first group of operations takes 7 clock cycles, and each subsequent group ends 3 clock cycles later than the previous one. The total time taken to loop over n groups of operations is therefore 7 + (n - 1) × 3 clock cycles, and when the number of loops is sufficiently high, the average time taken to perform each group of operations approaches 3 clock cycles.
When the data loading operation is executed and the data to be loaded is not found in the cache unit, the data to be loaded needs to be read from the memory, so blocking occurs and pipeline efficiency drops. As shown in FIG. 2B, when the data to be loaded corresponding to Load1 does not exist in the cache unit, the clock cycles required to execute the MEM stage (reading data from memory) increase from 1 clock cycle to 4 clock cycles. Because the next operation depends on the current one — for example, the execution of Add1 depends on the data obtained by executing Load1 — the next operation must wait for the current operation to finish, which ultimately reduces pipeline efficiency. FIG. 2A illustrates a flow diagram of a normal pipeline in one embodiment. FIG. 2B illustrates a flow diagram of an exception pipeline in one embodiment.
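As a rough sanity check on the timing analysis above, the cycle counts of the normal pipeline of FIG. 2A can be sketched as below. This is a minimal model assuming each stage costs one clock cycle; the function names are illustrative, not part of the patent:

```c
#include <assert.h>

/* Total clock cycles to loop n groups of one Load plus one Add on the
 * normal pipeline of FIG. 2A: the first group takes 7 cycles, and each
 * later group ends 3 cycles after the previous one. */
long normal_pipeline_cycles(long n_groups) {
    if (n_groups <= 0) return 0;
    return 7 + (n_groups - 1) * 3;
}

/* Average cycles per group; approaches 3 as the loop count grows. */
double normal_avg_cycles(long n_groups) {
    return (double)normal_pipeline_cycles(n_groups) / (double)n_groups;
}
```

For n = 1 this gives 7 cycles, and for large n the average per group tends to 3 cycles, matching the analysis in the text.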
In one embodiment, as shown in fig. 3, a data batch processing method is provided, and the method is applied to the computer device in fig. 1 for illustration, and includes the following steps:
Step S302, a target task is acquired, and an instruction stream corresponding to the target task is determined.
The target task is a computing task to be executed, and may specifically be the execution of a memory-access-intensive operator, such as a reduction task or an InstNorm task. An instruction stream is a sequence of instructions that a computer program needs to execute. The instruction stream proposed in the embodiments of the present application comprises more than one target instruction with a determined trigger order. After a target instruction is triggered, the computer device executes the target operation corresponding to that target instruction. Different kinds of target instructions may correspond to different target operations. For example, when the target task is an addition reduction task, the corresponding instruction stream includes a data load instruction and a summation instruction, and the target operations corresponding to the target instructions may specifically include a data loading operation and a summation operation.
It will be understood that each target instruction in the instruction stream is an instruction related to the target task, and the target task is carried out by triggering the execution of the different kinds of target instructions. For example, when the target task is an addition reduction task, the target instructions in the corresponding instruction stream may specifically be a data load instruction and a summation instruction. When the target task is a maximum-value task, the corresponding target instructions may specifically include a shuffle instruction and a maximum instruction. When the target task is a variance-calculation task, the corresponding target instructions may specifically include a shuffle instruction, a multiplication instruction, a summation instruction, a division instruction, and the like. It will be appreciated that the target instructions included in the instruction stream are related to the specific target task, and the embodiments of the present application are not limited in this regard.
In particular, after determining the target task, the computer device may determine the instruction stream required for executing the target task. The instruction stream may include a single target instruction looped multiple times, or more than one target instruction looped multiple times with a determined trigger order. A determined trigger order means that the firing order of each of the more than one target instructions is fixed. For example, when the instruction stream includes a first instruction and a second instruction, the first instruction must be issued before the second instruction, i.e., the second instruction can be issued only after the first instruction has been issued.
In one embodiment, dependencies exist between different kinds of target instructions. For example, when the target task is an addition reduction task, the corresponding instruction stream includes a data load instruction (Load instruction) and a summation instruction (Add instruction), and the operation object of the Add instruction includes the data obtained by executing the Load instruction, so the two instructions have a dependency relationship. That is, the computer device can issue the corresponding Add instruction based on the target operand only after the Load instruction has been issued and executed to completion to obtain the target operand.
And step S304, determining an index complete set corresponding to the target task, wherein each index element in the index complete set is used for pointing to each content element in the matrix to be processed.
The index complete set refers to a set formed by combining at least one index element, and each index element in the index complete set is used for pointing to a content element in a matrix to be processed. The matrix to be processed refers to a set formed by the content elements on which the target task has not yet been performed. For example, when those content elements are the data "0", "1", "2", and "3", the corresponding matrix to be processed may be a one-dimensional array [0,1,2,3] comprising one matrix row, so that when the identifier (array name) of the one-dimensional array is src, src[i] (i is 0, 1, 2, or 3) serves as the index complete set: src[0] in the index complete set points to the value 0 in the one-dimensional array, src[1] points to the value 1, and so on. In a specific embodiment, the matrix to be processed may be a one-dimensional array comprising only one matrix row, a two-dimensional array comprising multiple rows and columns of data, or a three-dimensional array having a spatial structure. The present embodiment is not limited herein.
Specifically, when determining the target task, the computer device may determine the matrix to be processed corresponding to the target task, determine the identifier of the matrix to be processed and the number of rows and the number of columns it contains, and determine the index complete set corresponding to the target task according to the number of rows, the number of columns, and the identifier. The identifier is information that uniquely identifies a matrix to be processed. For example, when the matrix to be processed is determined to be a two-dimensional array containing 10 rows and 8 columns, and the identifier of the two-dimensional array is src, the computer device may determine that the corresponding index complete set is src[i] (0 ≤ i ≤ 79, i is an integer).
In one embodiment, a developer may input a matrix to be processed through the index complete set in advance, so that the computer device may store the matrix to be processed in memory ready for subsequent processing. For example, a developer may input the matrix to be processed with the code int src[4] = {0, 1, 2, 3}, where src is the identifier, {0, 1, 2, 3} are the content elements in the matrix to be processed, and src[i] (0 ≤ i ≤ 3, i is an integer) is the index complete set.
In one embodiment, the developer may directly specify the identifier, the number of rows, and the number of columns, such that the computer device may determine the corresponding full set of indices based on the specified identifier, number of rows, and number of columns.
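Under the array conventions used in the examples above (identifier src, row and column counts supplied by the developer), the relationship between index elements and content elements can be sketched as follows. This is an illustrative sketch, not the patented implementation:

```c
#include <assert.h>

/* The matrix to be processed from the example: identifier src,
 * content elements {0, 1, 2, 3} in a single matrix row. */
static const int src[4] = {0, 1, 2, 3};

/* An index element i of the index complete set points to the content
 * element src[i]. */
int content_element(int i) {
    return src[i];
}

/* Number of index elements for a matrix with the given row and column
 * counts: valid indices run from 0 to rows * cols - 1, hence 0..79 for
 * the 10-row, 8-column example in the text. */
int index_count(int rows, int cols) {
    return rows * cols;
}
```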
Step S306, dividing the index complete set into a plurality of data index sets, and respectively distributing the data index sets to corresponding processing channels, wherein the number of elements of the data index set distributed by each processing channel is larger than the number of caches corresponding to the cache units.
A processing channel refers to a memory channel. The computer device may be provided with a plurality of memory controllers, each of which can work independently and in parallel, each controlling one processing channel to address, read data, and process data, so that the memory bandwidth is multiplied by the number of channels and the data processing speed increases correspondingly with the number of channels. A cache unit, also called a cache block, is a unit in a cache. At least one cache unit may be included in one cache, and each cache unit can store the cache number of data items. The cache number may be determined from the size of the storage space of a cache unit and the size of the space occupied by one data item to be stored; for example, when one cache block is 16 B and one data item occupies 4 B, one cache unit may store 4 data items.
Specifically, the computer device may divide the index complete set according to the number of caches corresponding to the cache unit, obtain a plurality of data index sets, and allocate each data index set to a corresponding processing channel. It is easy to understand that, until the operation result corresponding to the target task is obtained, when the number of divided data index sets is less than or equal to the number of processing channels, each processing channel may be allocated to obtain one or zero data index sets, and when the number of divided data index sets is greater than the number of processing channels, each processing channel may be allocated to obtain one or more data index sets.
In one embodiment, the computer device may divide the data index set by the number of processing channels to obtain the data index set by the number of processing channels. The element quantity of each data index set is larger than the cache quantity corresponding to the cache unit. For example, when the matrix to be processed includes 80 content elements and has two processing channels, the computer device may use index elements corresponding to the first 40 content elements as the data index set 1, use index elements corresponding to the last 40 content elements as the data index set 2, allocate the data index set 1 to the processing channel 1, and allocate the data index set 2 to the processing channel 2. Namely, the designated processing channel 1 extracts the first 40 content elements in the matrix to be processed from the buffer unit and processes the first 40 content elements extracted, and the designated processing channel 2 extracts the last 40 content elements in the matrix to be processed from the buffer unit and processes the last 40 content elements extracted.
In one embodiment, dividing the index corpus into a plurality of data index sets and respectively distributing each data index set to a corresponding processing channel comprises determining the channel number of the processing channel and the buffer number corresponding to the buffer unit, dividing the index corpus according to the channel number and the buffer number to obtain a plurality of data index sets, and respectively distributing each data index set to the corresponding processing channel.
Specifically, the computer device determines the number of channels of the processing channels and the number of caches corresponding to the cache units, and divides the index complete set according to the number of channels and the number of caches to obtain a plurality of data index sets. The number of elements contained in each data index set is greater than the number of caches corresponding to the cache units. For example, when the matrix to be processed contains 80 content elements and the number of caches corresponding to the cache units is 4, the computer device may use the index elements corresponding to every 10 content elements as one data index set; for example, data index set 1 is src[i] (0 ≤ i ≤ 9, i is an integer), data index set 2 is src[i] (10 ≤ i ≤ 19, i is an integer)...
Further, the computer device allocates each data index set to a corresponding processing channel, so that the processing channel can search for a corresponding content element from the cache unit according to the index element in the data index set.
In one embodiment, the computer device may sequentially assign the data index sets to each processing channel, or may randomly assign the data index sets to each processing channel, e.g., when there are 4 data index sets, the computer device may assign the data index sets 1, 3 to the processing channel 1, the data index sets 2, 4 to the processing channel 2, or may randomly assign the data index sets 3,4 to the processing channel 1, and the data index sets 1,2 to the processing channel 2. The present embodiment is not limited herein.
In the above embodiment, the index corpus is divided based on the number of channels and the number of caches, so that the number of elements of the divided data index corpus may be greater than the number of caches corresponding to the cache units, so that when a plurality of processing channels read data from the cache units in parallel based on the respective corresponding index elements, a cache miss event (CACHE MISS) may occur simultaneously, thereby improving overall data batch processing efficiency.
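The division and allocation steps above can be sketched as below. The sizing rule (each set strictly larger than the cache number) and the round-robin channel assignment are assumptions drawn from the examples in the text, and all names are illustrative:

```c
#include <assert.h>

/* Number of data index sets when the index complete set of `total`
 * elements is split into contiguous sets of `set_size` elements
 * (ceiling division; the last set may be smaller). */
int num_index_sets(int total, int set_size) {
    return (total + set_size - 1) / set_size;
}

/* First index element belonging to data index set k (0-based). */
int set_start(int k, int set_size) {
    return k * set_size;
}

/* Sequential (round-robin) assignment of set k to one of n_channels
 * processing channels, matching the example where sets 1 and 3 go to
 * channel 1 and sets 2 and 4 go to channel 2. */
int channel_of(int k, int n_channels) {
    return k % n_channels;
}

/* The sizing constraint from the text: each set must hold more
 * elements than the cache number of a cache unit. */
int set_size_valid(int set_size, int cache_number) {
    return set_size > cache_number;
}
```

With 80 content elements, sets of 10 elements, and a cache number of 4, this yields 8 valid sets distributed alternately over two channels, as in the running example.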
Step S308, based on the data index sets respectively allocated to the processing channels, at least one round of the target operation corresponding to the instruction stream is executed in parallel through the processing channels until the operation result corresponding to the target task is obtained. When each processing channel executes the target operation of the current round, the corresponding content element is looked up in the cache unit as the operation object according to the current index element in the corresponding data index set, and when the corresponding content element is not found in the cache unit, the cache unit is triggered to acquire the cache number of content elements according to the current index element for overlay updating.
Specifically, each processing channel executes at least one round of target operation corresponding to the instruction stream in parallel based on the allocated data index set, that is, each processing channel determines a corresponding current data index set to be processed respectively, and executes at least one round of target operation in parallel based on the corresponding current data index set to be processed respectively. For example, channel 1 processes data index set 1, while channel 2 processes data index set 2. And traversing index elements contained in the current data index set for each of the plurality of processing channels, and executing corresponding target operations based on instructions in the instruction stream. Wherein the instruction stream includes at least one round of instruction stream based on which a round of target operations can be performed. A round of target operations may include a set of operations with multiple cycles.
For example, when a set of operations includes a data loading operation and a summing operation, for each of a plurality of processing channels, an index element included in a current data index set is traversed, and by a target operation, a content element pointed to by the currently traversed index element is processed based on the currently traversed index element. More specifically, when a set of operations of the current cycle is executed, the processing channel determines a current index element in the current data index set, performs a data loading operation, extracts a corresponding target content element from the cache based on the current index element, and performs a summing operation, and adds the extracted target content element to an intermediate operand obtained by performing the summing operation in the previous cycle to obtain the intermediate operand obtained by performing the summing operation in the current cycle. And circulating until each index element contained in the current data index set is traversed.
It is noted that when each processing channel executes the target operation of the current round, it looks up the corresponding content element in the cache unit according to the current index element in its corresponding data index set to serve as the operation object, and when the corresponding content element is not found in the cache, the cache unit is triggered to acquire the cache number of content elements according to the current index element for overlay updating. That is, multiple processing channels may simultaneously fail to find their corresponding content elements in the cache units, so that multiple cache units are triggered at the same time to acquire the cache number of content elements according to the respective current index elements for overlay updating.
In one embodiment, steps S306 and S308 may be alternately performed until an operation result corresponding to the target task is obtained. Specifically, for each of the plurality of processing channels, an unprocessed data index set may be pre-assigned, so that each processing channel may process the assigned data index set, and after the processing is completed, update the assigned data index set to a completed data index set. When the next round of data index set allocation is performed, for each processing channel in the plurality of processing channels, the computer device may select one data index set from all unprocessed data index sets to allocate to the corresponding processing channel until, based on each data index set in the index full set, a target operation corresponding to the instruction stream for at least one round is performed.
For example, when the matrix to be processed is a matrix containing 80 content elements, the computer device may use the index elements corresponding to every 10 content elements as one data index set; for example, data index set 1 is src[i] (0 ≤ i ≤ 9, i is an integer), data index set 2 is src[i] (10 ≤ i ≤ 19, i is an integer)... Further, the computer device may allocate data index set 1 to processing channel 1 and data index set 2 to processing channel 2, so that processing channel 1 may extract content elements of the matrix to be processed from the cache unit based on the index elements in data index set 1 and process the extracted content elements accordingly; similarly, processing channel 2 may extract content elements from the cache unit based on the index elements in data index set 2 and process them accordingly. Further, when entering the next round of data index set allocation, the computer device may allocate data index set 3 to processing channel 1, data index set 4 to processing channel 2, and so on, until a target operation corresponding to at least one round of the instruction stream has been performed based on every data index set in the index complete set.
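The per-channel Load/Add loop described above can be sketched as follows for an addition reduction. This is a hedged illustration: in the patent the operations are issued by the instruction stream, and the names here are invented for clarity:

```c
#include <assert.h>

/* Demo matrix to be processed (content elements). */
static const int demo_src[8] = {1, 2, 3, 4, 5, 6, 7, 8};

/* One processing channel traversing its data index set [start, end):
 * each round performs a Load (fetch the content element pointed to by
 * the current index element) and an Add (accumulate it into the
 * intermediate operand carried over from the previous round). */
long channel_reduce(const int *mat, int start, int end) {
    long intermediate = 0;
    for (int i = start; i < end; i++) {
        int element = mat[i];    /* Load */
        intermediate += element; /* Add  */
    }
    return intermediate;
}
```

Two channels working on disjoint index sets, e.g. [0, 4) and [4, 8), produce operation sub-results whose sum equals the full reduction over the matrix.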
In one embodiment, referring to FIG. 4, assume that there are 4 processing channels and that 4 content elements far apart in the matrix to be processed are loaded simultaneously, that is, the 4 processing channels process their corresponding data index sets in parallel. Since the number of elements of the data index set allocated to each processing channel is greater than the number of caches corresponding to a cache unit, and a cache unit stores the cache number of consecutive content elements, the 4 simultaneously loaded content elements are not in the same cache unit, so that 4 cache miss events occur simultaneously. A cache miss event means that when the corresponding content element is not found in the cache unit, reading the corresponding content element from the memory is triggered, and the cache unit is triggered to acquire the cache number of content elements according to the current index element for overlay updating.
For example, suppose there are 4 cache units and no content elements are stored in them in the initial state. When processing channel 1 executes the MEM stage of Load1 in the 4th clock cycle, the computer device determines that none of the cache units holds the first target content element to be loaded, which is pointed to by the current index element in data index set 1. The computer device may then read the first target content element from the memory based on the current index element and store the cache number of content elements contiguous with the first target content element into cache unit 1. When processing channel 2 executes the MEM stage of Load2 in the 5th clock cycle, since the number of index elements between data index set 1 and data index set 2 is greater than the number of caches of a cache unit, it can be determined that none of the cache units holds the second target content element to be loaded, which is pointed to by the current index element in data index set 2; at this time, the computer device reads the second target content element from the memory and stores the cache number of content elements contiguous with the second target content element into cache unit 2. This iterates until cache miss events occur in parallel on all 4 processing channels.
Because the multi-channel memory allows simultaneous memory reads, cache miss events that would otherwise occur in several different time phases can be made to occur at the same time, thereby hiding the memory-access latency of part of the cache miss events. When n Load and Add operations need to be executed in a loop over 4 processing channels, every 4 Load operations and 4 Add operations can be regarded as one group of operations. The first group of operations takes 20 clock cycles, and each subsequent group starts only 16 clock cycles after the previous one, so the total time is about 20 + (n/8 - 1) × 16 clock cycles. When the loop count is high enough, each operation therefore takes about 16/8 = 2 clock cycles on average, which is less than the 3 clock cycles in fig. 2A, so the overall efficiency of the data batch processing is improved. FIG. 4 illustrates a flow diagram of multi-channel parallel processing in one embodiment.
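The timing estimate above can be checked with a small model. The cycle counts are the illustrative figures from the example (20 cycles for the first group, 16 cycles between subsequent groups, 8 operations per group); the function names are ours:

```python
def total_cycles(n, first_group=20, group_interval=16, ops_per_group=8):
    """Estimate total cycles for n Load+Add operations over 4 channels.

    The first group of 8 operations (4 Loads + 4 Adds) takes `first_group`
    cycles; each later group starts `group_interval` cycles after the
    previous one, so most of its latency is hidden behind the overlap.
    """
    groups = n // ops_per_group
    return first_group + (groups - 1) * group_interval

def average_cycles_per_op(n):
    return total_cycles(n) / n

# With a large enough loop count the average approaches 16/8 = 2 cycles
# per operation, better than the 3 cycles of the serial scheme in fig. 2A.
print(total_cycles(8))               # 20 (single group)
print(average_cycles_per_op(8000))   # ~2.0005
```

Because each group after the first contributes only 16 cycles for 8 operations, the per-operation cost converges to 2 cycles as n grows.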
In one embodiment, triggering the cache unit to acquire the cache quantity of content elements according to the current index element for overlay update when the corresponding content element is not found in the cache unit includes: when the corresponding content element is not found in the cache unit, triggering the local kernel to acquire, from the local memory according to the current index element, consecutive content elements of the corresponding cache quantity, and triggering an overlay update of the stored content in the cache unit based on the acquired content elements, where the content elements written into the cache unit by the overlay are used for executing subsequent target operations.
Specifically, when the computer device does not find the corresponding content element in any cache unit, the computer kernel is triggered to find, according to the current index element, the content element pointed to by the current index element in the local memory, acquire, with that content element as the starting point, consecutive content elements of the cache quantity, and perform an overlay update on one of the cache units. The content elements stored in the cache unit are used for executing subsequent target operations.
In this embodiment, the content elements are cached in the cache unit, so that the content elements can be directly read from the cache unit later, thereby improving the processing efficiency of data batch processing.
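A minimal sketch of the lookup and overlay-update behaviour of a single cache unit. The cache quantity, the stand-in memory, and all names here are illustrative assumptions, not the patented implementation:

```python
class CacheUnit:
    """One cache unit holding a run of consecutive content elements."""
    def __init__(self, capacity):
        self.capacity = capacity   # the "cache quantity"
        self.start = None          # memory index of the first cached element
        self.data = []

    def lookup(self, index):
        """Return the cached content element, or None on a cache miss."""
        if self.start is not None and self.start <= index < self.start + len(self.data):
            return self.data[index - self.start]
        return None

    def overlay_update(self, memory, index):
        """On a miss: fetch `capacity` consecutive elements starting at the
        current index element and overwrite the unit's previous contents."""
        self.start = index
        self.data = memory[index:index + self.capacity]
        return self.data[0]

memory = list(range(100, 200))     # stand-in for the local memory
unit = CacheUnit(capacity=8)
assert unit.lookup(5) is None      # cache miss triggers an overlay update
assert unit.overlay_update(memory, 5) == 105
assert unit.lookup(9) == 109       # now served directly from the cache unit
```

After the overlay update, nearby consecutive elements are served from the unit without touching memory, which is the efficiency gain this embodiment describes.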
In the above data batch processing method, by acquiring the target task, the instruction stream and the index full set corresponding to the target task can be determined. By determining the index full set, the index full set can be divided to obtain a plurality of data index sets, and each data index set is allocated to a corresponding processing channel. By allocating each data index set to a corresponding processing channel, each processing channel can execute at least one round of the target operation in parallel based on the allocated data index set until the operation result corresponding to the target task is obtained. Because the index full set is divided based on the number of channels and the cache quantity, the number of elements in each divided data index set can be greater than the cache quantity corresponding to a cache unit, so that when the plurality of processing channels read data from the cache units in parallel based on the corresponding index elements, cache miss events (cache miss) occur simultaneously. Since the multi-channel memory allows simultaneous memory reads, cache miss events that would otherwise occur in several different time phases can be made to occur at the same time, thereby hiding the memory-access latency of part of the cache miss events and improving the processing efficiency of the data batch processing.
In one embodiment, executing at least one round of the target operation corresponding to the instruction stream in parallel through the processing channels based on the data index sets respectively allocated to the processing channels includes: determining, for each of the plurality of processing channels, at least one to-be-processed data index set allocated to the corresponding processing channel; determining, for each processing channel, a current to-be-processed data index set from the corresponding to-be-processed data index sets; executing, through the processing channels in parallel, at least one round of the target operation corresponding to the instruction stream based on the current to-be-processed data index set of each processing channel; after the target operation is completed, updating the current to-be-processed data index set to a completed data index set and entering the process of handling the next to-be-processed data index set; and returning to the step of determining the to-be-processed data index set corresponding to each processing channel, and continuing execution until at least one round of the target operation corresponding to the instruction stream has been executed based on each data index set in the index full set.
Specifically, for each of the plurality of processing channels, the allocated at least one to-be-processed data index set is determined, and a current to-be-processed data index set is determined from it, so that each processing channel can execute in parallel at least one round of the target operation corresponding to the instruction stream based on its current to-be-processed data index set. For each of the plurality of processing channels, when the processing channel completes the target operation, the current to-be-processed data index set is updated to a completed data index set, and the next to-be-processed data index set is determined from the allocated to-be-processed data index sets and processed, until at least one round of the target operation corresponding to the instruction stream has been executed based on each data index set in the index full set.
It is easy to understand that a single processing channel may process its data index sets in the order in which they were allocated, or may process the allocated to-be-processed data index sets in a random order. For example, when processing channel 1 is designated to process data index set 1, data index set 2 and data index set 3, processing channel 1 may process them in sequence, or may randomly process data index set 3 first and then process data index set 1 and data index set 2. The present embodiment is not limited herein.
In this embodiment, the processing efficiency of batch processing of data can be improved by parallel processing of the data index sets in multiple channels.
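The per-channel scheduling described above can be sketched as follows, using a thread pool as a stand-in for the hardware processing channels; the summation target operation and all names are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def process_channel(assigned_sets, memory):
    """One processing channel: take each to-be-processed data index set in
    turn, execute the target operation on it, then treat it as completed."""
    results = []
    for index_set in assigned_sets:          # pending -> completed, one by one
        results.append(sum(memory[i] for i in index_set))
    return results

memory = list(range(32))                     # stand-in for the matrix to be processed
# Index full set divided into 4 data index sets, two sets per channel.
channel_assignments = [
    [[0, 1, 2, 3], [4, 5, 6, 7]],
    [[8, 9, 10, 11], [12, 13, 14, 15]],
]
with ThreadPoolExecutor(max_workers=2) as pool:
    per_channel = list(pool.map(lambda a: process_channel(a, memory),
                                channel_assignments))
total = sum(sum(r) for r in per_channel)     # operation result for the target task
```

Each channel works through its own queue of index sets independently, which is what lets the cache misses of different channels line up in time.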
In one embodiment, the instruction stream includes more than one operation instruction with a determined trigger sequence. The more than one operation instruction includes a first instruction and a second instruction, where the first instruction is triggered before the second instruction, and the target operation corresponding to the instruction stream includes a first target operation corresponding to the first instruction and a second target operation corresponding to the second instruction.
Specifically, the instruction stream may include a first instruction and a second instruction, the first instruction being triggered before the second instruction. The computer device may correspondingly execute a first target operation based on the first instruction and a second target operation based on the second instruction. In one embodiment, the target task includes a reduction task, the corresponding first instruction is a data load instruction (Load instruction), and the second instruction is a summation instruction (Add instruction). In each loop of emitting the instruction stream, the computer device first emits a Load instruction and executes the corresponding data loading operation to obtain the corresponding target content element. The corresponding target content element is then used as the operand of the Add instruction when the Add instruction is emitted and executed.
In the above embodiment, the instruction stream includes a first instruction and a second instruction that are emitted with a dependency relationship, that is, the first instruction is triggered before the second instruction, and the operands of the second instruction include the target content element obtained by executing the first instruction. In this way, the instructions in the instruction stream jointly cooperate, through the dependency relationship between them, to accomplish the execution of the target task.
In one embodiment, the target operation of each round includes a group of operations executed cyclically a plurality of times, each group of operations including a first target operation and a second target operation. Searching, by each processing channel when executing the target operation of the current round, the corresponding content element from the cache unit as the operand according to the current index element in the corresponding data index set includes: when each processing channel executes the first target operation in the group of operations of the current loop, searching the corresponding content element from the cache unit as the operand of the first target operation according to the corresponding current index element; and when each processing channel executes the second target operation in the group of operations of the current loop, acquiring the intermediate operation data obtained by executing the second target operation in the previous loop, and using the intermediate operation data together with the content element found through the first target operation as operands.
Wherein the target operation of each round comprises a group of operations which are circularly executed for a plurality of times, and each group of operations comprises a first target operation and a second target operation.
Specifically, each processing channel may perform a set of operations for multiple cycles based on the index elements in the data index set. For each of the plurality of processing channels, when a first target operation in a set of operations of the current loop is executed, a corresponding content element can be searched from the cache unit based on the current index element, and the searched content element is used as an operation object of the first target operation. When the second target operation in the set of operations of the current loop is executed, intermediate operation data obtained by executing the second target operation in the previous loop can be acquired, and the intermediate operation data and the content elements searched through the first target operation are used as operation objects together. For example, when the current index element is src [5], the content element pointed to by the current index element is 8 (src [5] =8), the first target operation is a data loading operation, and when the second target operation is a summation operation, the processing channel may search the cache for the corresponding content element 8 based on src [5] when executing the first target operation in the set of operations of the current loop, and when executing the second target operation in the set of operations of the current loop, may sum the intermediate operation data obtained by executing the second target operation in the previous loop with the content element 8 to obtain the intermediate operation data obtained by executing the second target operation in the current loop. And iterating in this way until at least one round of target operation corresponding to the instruction stream is executed.
In this embodiment, the first target operation and the second target operation are alternately executed, so that the whole target operation can jointly assist in implementing the execution of the target task.
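The alternation of the first and second target operations, including the src[5] = 8 example above, can be sketched as follows. The cache unit is represented by a plain dictionary and the function name is ours:

```python
def run_groups(index_elements, cache, initial=0):
    """Loop one group of (first target op, second target op) per index
    element, carrying the intermediate operation data between loops."""
    intermediate = initial
    for current_index in index_elements:
        # First target operation: look up the content element in the cache.
        element = cache[current_index]
        # Second target operation: sum it with the previous intermediate data.
        intermediate = intermediate + element
    return intermediate

cache = {5: 8, 6: 2, 7: 5}       # index element -> cached content element
print(run_groups([5], cache))     # first loop: 0 + 8 = 8
print(run_groups([5, 6, 7], cache))
```

The intermediate value produced by each second target operation becomes an operand of the next one, which is the dependency chain the embodiment describes.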
In one embodiment, searching, by each processing channel when executing the target operation of the current round, the corresponding content element from the cache unit as the operand according to the current index element in the corresponding data index set includes: determining the maximum concurrency quantity supported by the local resources; when executing the target operation of the current round, searching the maximum concurrency quantity of content elements from the cache unit according to the current index element in the corresponding data index set; and using the found content elements as operands in parallel so as to execute the target operation in parallel.
The maximum concurrency quantity supported by the local resources refers to the number of pieces of to-be-processed data that a single processing channel can process simultaneously. In one embodiment, multiple pieces of to-be-processed data may be processed simultaneously based on SIMD. SIMD (Single Instruction, Multiple Data) is a technique that employs one controller to control multiple processing units, performing the same operation on each item of a set of data (also called a "data vector") simultaneously, so as to achieve spatial parallelism.
Specifically, for each of the multiple processing channels, determining the maximum concurrency number supported by the local resource, determining the content element pointed by the current index element from the cache unit, acquiring the content element with the maximum concurrency number by taking the content element pointed by the current index element as a starting point according to the maximum concurrency number, and taking the searched content elements as operation objects in parallel to execute the target operation in parallel.
In one embodiment, when the processing channel obtains the current index element, it may determine the maximum concurrency quantity of index elements according to the maximum concurrency quantity supported by the local resources and the current index element, and search the corresponding content elements from the cache unit according to those index elements. For example, when the maximum concurrency quantity supported by the local resources is 8 and the current index element is src[0], the processing channel determines src[0] to src[7] as the index elements and searches the content elements corresponding to each of src[0] to src[7] from the cache unit.
In the above embodiment, by searching a plurality of content elements according to the maximum concurrency number supported by the local resource and using the searched plurality of content elements as operation objects in parallel to execute the target operation in parallel, the execution efficiency of the target operation can be greatly improved, so that the processing efficiency of data batch processing is improved.
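A scalar emulation of the SIMD widening described above (8 lanes, matching the example; this is a sketch of the access pattern, not actual vector instructions):

```python
def simd_load(cache, current_index, max_concurrency=8):
    """From the current index element, take max_concurrency consecutive
    index positions and look up all their content elements at once."""
    return [cache[i] for i in range(current_index, current_index + max_concurrency)]

def simd_reduce(cache, start, count, max_concurrency=8):
    """Reduce `count` elements, max_concurrency lanes per loop iteration."""
    acc = 0
    for base in range(start, start + count, max_concurrency):
        lane_values = simd_load(cache, base, max_concurrency)  # src[base]..src[base+7]
        acc += sum(lane_values)    # all lanes used as operands in parallel
    return acc

cache = {i: i * 2 for i in range(16)}   # stand-in cache unit contents
print(simd_load(cache, 0))               # 8 content elements in one "load"
print(simd_reduce(cache, 0, 16))
```

Each loop iteration now consumes 8 index elements instead of 1, which is where the stated efficiency gain comes from.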
In one embodiment, the target task includes at least one target subtask, and each target subtask corresponds to one of the plurality of data index sets. The data batch processing method further includes: determining the number of page tables of the address table corresponding to each data index set; and dividing each data index set according to the number of page tables to obtain at least one index subset, where each target page table in the address table corresponds to at least one matrix row in the index subset, and the target page table is used for storing the address information of the index elements having continuity in the corresponding matrix row. Executing, based on the data index sets respectively corresponding to the processing channels, at least one round of the target operation corresponding to the instruction stream in parallel through the processing channels until the operation result corresponding to the target task is obtained includes: executing, according to the address tables respectively corresponding to the processing channels and the index subsets in the data index sets respectively corresponding to the processing channels, at least one round of the target operation corresponding to the instruction stream in parallel through the processing channels until the operation sub-results corresponding to the target subtasks are obtained.
The address table is also called a TLB (Translation Lookaside Buffer), which is a storage space for caching the translation relationship between virtual addresses and physical addresses, and may specifically include at least one page table, where each page table is used to store the translation relationship between a plurality of consecutive virtual addresses and the corresponding physical addresses. Each target page table in the address table may correspond to at least one matrix row in the index subset, the page table being used for storing the address information of the index elements having continuity in the corresponding matrix row. One matrix row in the index subset points to one matrix row in the matrix to be processed.
The target tasks include at least one target subtask, and each target subtask corresponds to one of a plurality of data index sets.
Specifically, different processing channels may correspond to different address tables, that is, each processing channel may correspond to one TLB. The computer device may therefore determine the number of page tables included in the address table corresponding to each processing channel, and accordingly determine the number of page tables of the address table corresponding to the data index set allocated to each processing channel. For example, when data index set 1 is allocated to processing channel 1 and address table 1 corresponding to processing channel 1 includes 4 page tables, the computer device determines that the number of page tables of the address table corresponding to data index set 1 is 4. Further, the computer device divides the corresponding data index sets according to the number of page tables, taking the index elements corresponding to every page-table-number of rows of content elements in the matrix to be processed as one index subset.
For example, continuing the above example, when the matrix to be processed is a two-dimensional array having 100 rows and 8 columns, and data index set 1 points to the content elements of the first 8 rows of the matrix to be processed, the computer device divides data index set 1 according to the 4 page tables of address table 1 to obtain 2 index subsets, where the first index subset points to the content elements of the first 4 rows of the matrix to be processed and the second index subset points to the content elements of rows 5 to 8. It will be readily appreciated that an index subset may include at least one matrix row, such that each of its matrix rows points to one matrix row of the matrix to be processed; for example, the first matrix row in the index subset points to the first matrix row of the matrix to be processed. For convenience of description, the matrix rows in the index subset are referred to as index matrix rows, the matrix rows in the matrix to be processed as to-be-processed matrix rows, and the matrix rows in the concurrency subset as concurrency matrix rows.
Further, for each of the plurality of processing channels, performing at least one round of target operation corresponding to the instruction stream based on the index subset and the corresponding address table in turn until an operation sub-result corresponding to the target sub-task is obtained. Wherein, each processing channel can execute the target operation in parallel. When the operation sub-results corresponding to the target sub-tasks are obtained, the computer equipment synthesizes the operation sub-results to obtain the operation results corresponding to the target tasks.
In one embodiment, for each of the plurality of processing channels, the allocated data index sets may be processed in sequence, and for each of the plurality of data index sets, the index subsets included in the current data index set may be processed in sequence, that is, at least one round of the target operation corresponding to the instruction stream is executed sequentially based on each index subset and the corresponding address table.
In one embodiment, a computer device obtains a target task and determines an instruction stream corresponding to the target task. The computer equipment determines an index complete set corresponding to the target task, divides the index complete set into a plurality of data index sets and distributes the data index sets to corresponding processing channels respectively. For each of the plurality of processing channels, determining the number of page tables assigned to determine address tables corresponding to the respective data index sets, and dividing the data index sets according to the number of page tables to obtain at least one index subset. And executing at least one round of target operation corresponding to the instruction stream in parallel through each processing channel according to the address table respectively corresponding to each processing channel and the index subset in the data index set respectively corresponding to each processing channel until an operation sub-result corresponding to the target sub-task is obtained, and integrating each operation sub-result to obtain an operation result corresponding to the target task.
In the above embodiment, the data index set is divided based on the number of page tables, so that each index subset obtained by the division may include a page-table-number of matrix rows. Because each target page table in the address table corresponds to at least one matrix row in the index subset, and the address information of the index elements having continuity in the corresponding matrix row is stored in the target page table, when the target operation is executed based on the address table and the current index subset, the address information of all index elements in the current index subset can be stored entirely in the same address table. As a result, after a single full overlay update of the original address table that does not yet include the address information, the address information of all index elements in the current index subset can be looked up from the address table without further updates of the page tables in the address table. This reduces the number of overlay updates of the address table, saves the clock cycles consumed by overlay updates, and improves the processing efficiency of the data batch processing.
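The division by page-table count can be sketched as row chunking. The 4-page-table address table and 8-row data index set match the worked example; the helper name is ours:

```python
def split_into_index_subsets(index_rows, page_table_count):
    """Divide a data index set (a list of index matrix rows) so that each
    index subset spans exactly page_table_count rows: one target page table
    per index matrix row."""
    return [index_rows[i:i + page_table_count]
            for i in range(0, len(index_rows), page_table_count)]

# Data index set 1 points at the first 8 rows of the matrix to be processed
# (8 columns per row, indices numbered row-major).
rows = [[r * 8 + c for c in range(8)] for r in range(8)]
subsets = split_into_index_subsets(rows, page_table_count=4)
print(len(subsets))        # 2 index subsets
print(subsets[1][0][0])    # subset 2 starts at matrix row 5 (index 32)
```

Because a subset never spans more rows than there are page tables, one full overlay update of the address table covers every index element in the subset.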
In one embodiment, when each processing channel executes the target operation of the current round, the corresponding target address information is searched according to the target page table corresponding to the current matrix row where the current index element is located, and the corresponding content element is acquired based on the target address information as the operand; when the corresponding address information is not found in the target page table, the target page table is triggered to perform an overlay update of the address information according to the current index element.
Specifically, when each processing channel executes the target operation of the current round, the computer device may determine, according to the current index element, the virtual address at which the pointed-to content element is located, find the corresponding physical address, that is, the corresponding target address information, from the target page table corresponding to the current matrix row where the current index element is located, and find the corresponding content element from the cache unit based on the target address information. When the computer device does not find the corresponding address information in the target page table, an address-information miss event (TLB miss) is triggered, which triggers the computer kernel to find, from the memory, the target address information corresponding to the content element pointed to by the current index element. The computer kernel determines the storage quantity of target addresses that each page table in the address table can store, acquires a plurality of pieces of consecutive target address information starting from the found target address information, and performs an overlay update on one page table in the address table with the acquired storage quantity of consecutive target address information.
In the above embodiment, by caching the address information into the address table, the target address information can be directly read from the address table, thereby improving the processing efficiency of the data batch processing.
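A minimal model of the page-table lookup and the overlay update on a TLB miss. The entry count, round-robin victim choice, and stand-in page walk are all illustrative assumptions:

```python
class AddressTable:
    """A tiny TLB: each page table maps a run of consecutive virtual
    addresses to physical addresses; a miss overlay-updates one page table."""
    def __init__(self, page_tables, entries_per_table):
        self.tables = [None] * page_tables   # each entry: dict of v -> p
        self.entries = entries_per_table     # the "storage quantity"
        self.next_victim = 0                 # which page table to overwrite next
        self.miss_count = 0

    def translate(self, vaddr, page_walk):
        for table in self.tables:
            if table is not None and vaddr in table:
                return table[vaddr]          # TLB hit
        # TLB miss: fetch `entries` consecutive translations and overlay-
        # update one page table with them.
        self.miss_count += 1
        start = (vaddr // self.entries) * self.entries
        mapping = {v: page_walk(v) for v in range(start, start + self.entries)}
        self.tables[self.next_victim] = mapping
        self.next_victim = (self.next_victim + 1) % len(self.tables)
        return mapping[vaddr]

tlb = AddressTable(page_tables=4, entries_per_table=8)
walk = lambda v: v + 1000                    # stand-in for the real page walk
print(tlb.translate(3, walk))                # miss: fills one page table
print(tlb.translate(5, walk))                # hit: same run of addresses
```

Because one miss pulls in a whole run of consecutive translations, the subsequent index elements of the same matrix row translate without further misses.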
In one embodiment, the data batch processing method is executed by an inference engine and is applied to a face detection model, where the target task is one of the memory-access intensive tasks among the tasks generated when the face detection model performs face detection.
Specifically, the data batch processing method according to the embodiments of the present application is executed by an inference engine deployed on a computer device, and the data batch processing method is applied to a face detection model. When the face detection model performs service processing, a plurality of computing tasks need to be executed, one of which is a memory-access intensive task, that is, a memory-access intensive operator is executed.
For example, a computer device may provide online services by deploying a face detection model, and the addition reduction is an important operator in the face detection model. If this addition reduction operator runs too long on the GPU, the overall reasoning task of the face detection model becomes inefficient. Through the data batch processing method provided by the embodiments of the present application, the execution of memory-access intensive tasks can be realized efficiently, and the computing efficiency of the addition reduction operator is improved, thereby improving the service processing efficiency of face detection, increasing the online response speed and reducing the service delay.
In a specific application scenario, the face detection model is loaded through the deployed inference engine, so that various online services can be provided, such as a face detection service, a face recognition service and a face monitoring service. A user may trigger a service request corresponding to a certain service through a user terminal, such as a face retrieval request corresponding to the face recognition service, and the computer device performs the corresponding service processing and feedback through the face detection model. During the service processing of the face detection model, there are scenarios in which memory-access intensive operators need to be processed; for example, the face detection model needs to perform addition reduction processing on the feature vectors extracted during intermediate processing, and at this time the addition reduction operation can be performed on the feature vectors through the data batch processing method provided by the embodiments of the present application, so that the subsequent computing tasks can continue. The inference engine refers to the module in the application system used for completing the inference function.
In the above embodiment, the computing tasks in the face detection model, including the memory-access intensive tasks, may be processed in parallel by the inference engine. By processing the memory-access intensive tasks in parallel, cache miss events that would otherwise occur in several different time phases are made to occur at the same time, so that the execution of the memory-access intensive tasks is realized efficiently, the service processing efficiency of the face detection model is improved, the online response speed when providing online services through the face detection model is increased, and the service delay is reduced.
In one embodiment, the data batch processing method further includes: acquiring a face image to be detected; inputting the face image to be detected into the face detection model; acquiring the target task through the face detection model, triggering the instruction stream corresponding to the target task and executing the instruction stream to obtain the operation result corresponding to the target task; and determining the face detection result according to the operation result.
Specifically, the computer device may acquire more than one frame of face images to be detected, and input each face image to be detected into the face detection model. In the process of performing face detection, the face detection model generates various computing tasks, and the memory-access intensive task among these computing tasks is the target task. When the target task is obtained, the face detection model triggers the instruction stream corresponding to the target task and executes the instruction stream to obtain the operation result corresponding to the target task, and determines the corresponding face detection result based on the operation result. For example, when the target task is an addition reduction task, the face detection model may, after obtaining the operation result of the addition reduction operator, normalize the operation result to obtain normalized face features, and obtain the face detection result based on the normalized face features.
In one embodiment, referring to fig. 5, when obtaining the face image to be detected, the computer device may load the face detection model through the deployed inference engine, and infer the face image to be detected based on the face detection model, so as to obtain an operation result of the target task. The face detection model determines a corresponding face detection result 502 based on the operation result of the target task, and displays the face detection result 502 correspondingly. Figure 5 illustrates a flow diagram of face detection in one embodiment.
In one embodiment, as shown in fig. 6, a data batch processing method is provided, and the method is applied to the computer device in fig. 1 for illustration, and includes the following steps:
Step S602, a target subtask is acquired, and an instruction stream corresponding to the target subtask is determined.
Step S604, determining a data index set corresponding to the target subtask and the number of page tables of address tables in the local resource, where the data index set is used to find content elements in the matrix to be processed.
In particular, a processing task that processes the content elements pointed to by one data index set may be regarded as one target subtask, such that the target task includes at least one target subtask and each target subtask corresponds to one of the plurality of data index sets. Further, after determining the target subtask, the computer device may determine the number of page tables of the address table in the local resources, and determine the instruction stream required for executing the target subtask, where the instruction stream may include one target instruction looped a plurality of times, or may include more than one target instruction looped a plurality of times with a determined trigger sequence.
Step S606, dividing the data index set according to the number of the page tables to obtain at least one index subset, wherein each target page table in the address table corresponds to at least one matrix row in the index subset, and the target page table is used for storing address information of index elements with continuity in the corresponding matrix row.
Specifically, the computer device may divide the data index set according to the number of page tables, taking the index elements corresponding to every page-table-count rows of content elements in the matrix to be processed as one index subset. For example, when the number of page tables is 4, the index elements pointing to the first 4 rows of content elements in the matrix to be processed may be divided into one index subset, and the index elements pointing to the 5th to 8th rows of content elements may be divided into another index subset. The index subset may comprise a plurality of index matrix rows, each index matrix row in the index subset pointing to a corresponding row of the matrix to be processed; for example, the first row of index elements in the index subset points to the first row of content elements in the matrix to be processed. Each target page table in the address table corresponds to at least one index matrix row in the index subset, and the target page table is used for storing address information of index elements with continuity in the corresponding index matrix row.
For example, when the address table includes 4 page tables, the index subset 1 includes 4 rows of index matrix rows, and the 4 rows of index matrix rows included in the index subset 1 are used for pointing to the content elements of the first row to the fourth row in the matrix to be processed, the page table 1 may be used for storing the address information of the first row of content elements pointed to by the first row of index matrix rows in the index subset 1, that is, the address information of the index elements with continuity included in the first row of index matrix rows in the index subset 1, the page table 2 may be used for storing the address information of the index elements with continuity included in the second row of index matrix rows in the index subset 1, and so on until the address information corresponding to all the index elements in the index subset 1 is stored in the address table.
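A minimal Python sketch of this row-wise split follows; the list-of-rows representation and the function name are assumptions for illustration only:

```python
def split_index_set(index_rows, page_table_count):
    """Split the matrix rows of a data index set into index subsets of at
    most `page_table_count` rows each, so that each row in a subset can be
    assigned its own page table in the address table."""
    return [index_rows[i:i + page_table_count]
            for i in range(0, len(index_rows), page_table_count)]
```

With 8 index matrix rows and 4 page tables this yields two subsets covering rows 1–4 and rows 5–8, matching the example above.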
Step S608, executing at least one round of target operation corresponding to the instruction stream according to the address table and each index element in each index subset until an operation sub-result corresponding to the target subtask is obtained. When the target operation of the current round is executed, the corresponding target address information is searched according to the target page table corresponding to the current matrix row where the current index element is located, and the corresponding content element is obtained based on the target address information as an operation object; when the corresponding content element is not found in the target page table, the target page table is triggered to carry out coverage update of the address information according to the current index element.
Specifically, the computer device performs at least one round of target operations corresponding to the instruction stream based on the index subset and the corresponding address table until an operation sub-result corresponding to the target subtask is obtained. Notably, when the target operation of the current round is executed, the corresponding target address information is searched from the corresponding target page table according to the current index element in the corresponding index subset, so as to acquire the corresponding content element based on the target address information; when the corresponding target address information is not found in the address table, the computer kernel is triggered to acquire the address information of a storage-quantity number of content elements according to the current index element and carry out a coverage update on the page table. Here, the storage quantity refers to the number of address information entries that the page table can store.
In the data batch processing method, the instruction stream and the data index set corresponding to the target subtask can be determined by acquiring the target subtask. By determining the data index set, the data index set may be partitioned based on the number of page tables of the address table, resulting in at least one index subset. By determining the index subsets, at least one round of target operation corresponding to the instruction stream may be performed based on the address table and the index elements in each index subset. Because the data index set is divided based on the number of page tables, each index subset obtained by division contains at most a page-table-count number of matrix rows. Because each target page table in the address table corresponds to at least one matrix row in the index subset and stores the address information of index elements with continuity in the corresponding matrix row, the address information of all index elements in the current index subset can be completely stored in the same address table when the target operation is executed based on the address table and the current index subset. As a result, after the original address table that does not yet contain the address information has been fully updated once, the address information of all index elements in the current index subset can be found in the address table, and the page tables in the address table no longer need coverage updates. This reduces the number of coverage updates of the address table, saves the clock cycles consumed by coverage updates, and improves the processing efficiency of batch processing of data.
In one embodiment, executing at least one round of target operation corresponding to the instruction stream according to the address table and the index elements in each index subset, until an operation sub-result corresponding to the target subtask is obtained, comprises: determining the index subsets to be processed in the corresponding data index set and determining a current index subset to be processed from among them; executing at least one round of target operation corresponding to the instruction stream according to the address table corresponding to the data index set and the index elements in the current index subset to be processed; updating the current index subset to a completed index subset after the target operation is completed; and entering a process for processing the next index subset to be processed, returning to the step of determining the index subsets to be processed in the corresponding data index set, and continuing until at least one round of target operation corresponding to the instruction stream has been executed based on all index subsets in the data index set, so as to obtain the operation sub-result corresponding to the target subtask.
Specifically, for each data index set, the computer equipment determines a subset of indexes to be processed in the corresponding data index set, and sequentially processes the subset of indexes to be processed until an operation sub-result corresponding to the target sub-task is obtained. For a better understanding of the present embodiment, the following description will take an example of executing one data index set. When a current to-be-processed data index set needs to be processed, the computer equipment determines a current to-be-processed index subset in the current to-be-processed data index set, and performs at least one round of target operation corresponding to the instruction stream based on the current to-be-processed index subset and a corresponding address table. When the current index subset to be processed is determined to be processed, the computer equipment updates the current index subset to be processed into the completed index subset, and enters a process of processing the next index subset to be processed until each index subset in the current data index set to be processed is processed, and an operation sub-result of a target sub-task corresponding to the current data index set to be processed is obtained.
For example, when the index subset 1 points to the content elements of the first to fourth rows in the matrix to be processed, the computer device may acquire the content elements of the first to fourth rows in the matrix to be processed based on the index subset 1 and the corresponding address table, and perform a target operation on the acquired content elements based on the instruction stream. When the processing of the index subset 1 is finished, the computer device acquires the index subset 2 pointing to the content elements of the fifth to eighth rows in the matrix to be processed, acquires the content elements of the fifth to eighth rows in the matrix to be processed based on the index subset 2 and the corresponding address table, and executes the target operation on the acquired content elements based on the instruction stream. And so on until each index subset in the currently pending data index set is processed.
In this embodiment, by traversing the index subsets, each index subset may be sequentially processed, so as to finally obtain an operation result of the target subtask.
In one embodiment, performing at least one round of target operations corresponding to an instruction stream according to an address table and index elements in each index subset includes determining a maximum number of concurrency supported by a local resource, grouping the index subsets according to the maximum number of concurrency to obtain at least one concurrency subset, and sequentially performing one round of target operations corresponding to one round of instruction stream based on each concurrency subset.
Specifically, the computer device determines a maximum concurrency number supported by the local resource, and groups index subsets by columns based on the maximum concurrency number, resulting in at least one concurrency subset. Wherein each concurrency matrix row in each concurrency subset contains no more than a maximum concurrency number of index elements. For example, when the maximum concurrency number is 4, the matrix to be processed is a two-dimensional array including 100 rows and 8 columns, the index subset 1 includes four index matrix rows, and each index matrix row includes 8 index elements, the computer device may divide the index subset 1 into two concurrency subsets according to the maximum concurrency number 4, where the first concurrency subset includes index elements of a first column index matrix column to a fourth column index matrix column in the index subset 1, and the second concurrency subset includes index elements of a fifth column index matrix column to an eighth column index matrix column in the index subset 1.
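The column-wise grouping just described can be sketched as follows; the helper name and nested-list layout are assumptions for illustration:

```python
def group_by_concurrency(index_subset, max_concurrency):
    """Group an index subset column-wise into concurrent subsets: each
    concurrent matrix row holds at most `max_concurrency` index elements."""
    n_cols = len(index_subset[0])
    return [[row[c:c + max_concurrency] for row in index_subset]
            for c in range(0, n_cols, max_concurrency)]
```

For a 4-row, 8-column index subset and a maximum concurrency of 4, this returns two concurrent subsets holding columns 1–4 and 5–8, as in the example above.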
Further, the computer device sequentially performs a round of target operations corresponding to a round of instruction stream based on each concurrent subset, e.g., the computer device preferentially performs a round of target operations based on concurrent subset 1 and then performs a round of target operations based on concurrent subset 2.
In this embodiment, the index subsets are divided by the maximum concurrency number to obtain concurrency subsets, so that one matrix row in the concurrency subsets can be executed in parallel, and thus, the processing efficiency of batch processing of data can be improved.
In one embodiment, the target operation of each round includes a set of operations executed cyclically, where the number of loops corresponds to the number of page tables. Sequentially executing one round of target operation corresponding to one round of the instruction stream based on each concurrent subset includes: for each concurrent subset, during the execution of each round's target operation, executing the set of operations a page-table-count number of times based on the matrix rows in the concurrent subset.
Specifically, one concurrent subset corresponds to one round of target operation, and each round of target operation includes a set of operations executed cyclically, where the number of loops corresponds to the number of page tables. Thus, when performing one round of target operation corresponding to one round of the instruction stream, the computer device may execute the set of operations multiple times based on each concurrent matrix row in the concurrent subset. For example, when performing one round of target operation based on concurrent subset 1, the computer device may first execute the set of operations in a first loop based on the first concurrent matrix row in concurrent subset 1, then execute the set of operations in a second loop based on the second concurrent matrix row in concurrent subset 1, and so on, until the set of operations has been executed a page-table-count number of times in the round.
As will be readily appreciated, since the computer device executes the concurrent subsets sequentially, and for each concurrent subset executes the set of operations cyclically a page-table-count number of times based on each matrix row in the concurrent subset, when executing the first concurrent subset in the index subset the computer device may store the address information corresponding to each index element of the index subset into the corresponding page table of the address table. Consequently, when subsequently processing the remaining concurrent subsets in the index subset other than the first, the computer device may directly find the corresponding target address information in the address table and find the corresponding content element based on that target address information, without fetching the target address information from memory. This greatly reduces the number of times the target address information is fetched from memory (that is, it reduces the number of TLB miss events) and improves the processing efficiency of batch processing of data.
For example, when the number of page tables is 4 and the current index subset includes two concurrent subsets, each including 4 concurrent matrix rows, the computer device may first process the first concurrent matrix row in concurrent subset 1. Because each target page table in the address table corresponds to at least one matrix row in the index subset, when the address information corresponding to the first concurrent matrix row in concurrent subset 1 does not exist in the address table, the computer device can store in page table 1 of the address table both the address information corresponding to the first concurrent matrix row in concurrent subset 1 and the address information corresponding to the first concurrent matrix row in concurrent subset 2. Likewise, when the computer device processes the second concurrent matrix row in concurrent subset 1 and the corresponding address information does not exist in the address table, it can store in page table 2 both the address information corresponding to the second concurrent matrix row in concurrent subset 1 and the address information corresponding to the second concurrent matrix row in concurrent subset 2. The process iterates in this way until concurrent subset 1 is processed.
When the concurrent subset 1 is processed, the address information corresponding to each index element in the index subset is stored in the corresponding page table of the address table, so that when the concurrent subset 2 is processed, the corresponding target address information can be directly searched from the address table without searching the corresponding target address information from the memory.
Similarly, when the number of concurrent matrix rows included in each concurrent subset is smaller than or equal to the number of cache units, and the number of elements each cache unit can store is larger than the maximum concurrency number, then when the computer device finishes processing the first concurrent subset in the index subset, the content elements pointed to by each index element in the index subset are stored in the corresponding cache units. Thus, when processing the concurrent subsets other than the first in the index subset, the computer device can directly find the corresponding content elements in the corresponding cache units without fetching them from memory, thereby reducing the probability of cache misses and improving the processing efficiency of data batch processing. That is, the cache is covered and updated only after the data in the cache has been searched once, which improves the utilization rate of the cache, reduces the number of times content elements are fetched from memory, and improves the processing efficiency of batch processing of data.
In this embodiment, the concurrent subsets are processed sequentially, and for each concurrent subset, during the execution of each round's target operation, the set of operations is executed cyclically a page-table-count number of times based on each matrix row in the concurrent subset. In this way, the page table is covered and updated only after the data in it has been searched once, which improves the utilization rate of the page table, reduces the number of times the target address information is fetched from memory, and improves the processing efficiency of data batch processing.
In one embodiment, the data batch processing method is executed by an inference engine and is applied to a face detection model, wherein the target subtask is a task for processing part of the data in the matrix to be processed within the target task, and the target task is an access-intensive task among the tasks generated when the face detection model performs face detection.
Specifically, the above data batch processing method is executed by an inference engine, and the above data batch processing method is applied to a face detection model. When the face detection model is used for carrying out business processing, a plurality of calculation tasks are required to be executed, one of the calculation tasks is a target task, the target task comprises at least one target subtask, and the target subtask is a task used for processing partial data in a matrix to be processed in the target task.
In the above embodiment, the inference engine may process the target subtasks in the face detection model in sequence according to the above data batch processing method, so that the execution of the access intensive task may be efficiently implemented, thereby improving the service processing efficiency of the face detection model, improving the on-line response speed when the on-line service is provided by the face detection model, and reducing the service delay.
For better understanding of the present embodiment, the following description compares the conventional scheme with the present scheme. In the conventional scheme, the addition reduction operation in the d1 dimension may be performed by the following code:
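The referenced code does not survive in the text; judging from the index expression src[j*d2+i] explained below, the conventional column-wise reduction might look like the following Python sketch (a reconstruction, not the patent's actual code):

```python
def reduce_d1_conventional(src, d1, d2):
    """Addition reduction over the d1 dimension: dst[i] sums column i.
    The inner loop strides d2 elements between consecutive accesses,
    so each access may touch a new cache line and a new page."""
    dst = [0] * d2
    for i in range(d2):          # one output element per column
        for j in range(d1):      # stride-d2 walk down the column
            dst[i] += src[j * d2 + i]
    return dst
```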
where d1 represents the number of rows of the matrix to be processed, d2 represents the number of columns of the matrix to be processed, src represents the identifier of the matrix to be processed, and src[j*d2+i] represents the index element. When executing the above code, the computer device may find the corresponding target address information from the address table based on src[j*d2+i], and find the corresponding content element from the cache unit based on the target address information; for example, when src[3] = 8, the computer device may determine from src[3] that the corresponding content element is 8.
When executing the code, the index elements obtained in two adjacent loops are d2 elements apart; for example, in the first loop, when i=0 and j=0, the corresponding index element is src[0], and in the second loop, when i=0 and j=1, the corresponding index element is src[d2], so src[0] and src[d2] are separated by a span of d2 elements. In general, when d2 exceeds the number of elements the cache unit can hold (exceeds the size of a cache block), a cache miss event occurs each time the content element corresponding to src[j*d2+i] is searched for; and when d2 spans the page table size, an address information miss event (TLB miss) occurs each time the target address information corresponding to src[j*d2+i] is searched for. Therefore, when the target address information and content elements for d1*d2 elements need to be searched, the total number of cache miss events is d1*d2 and the total number of TLB miss events is d1*d2. The memory access overhead of a TLB miss is greater than that of a cache miss and typically requires at least 5 memory accesses, so the total memory access count in the conventional scheme is about 6*d1*d2.
In the embodiment of the present application, still taking an addition reduction operator as an example, assuming that the address table (TLB) can store 4 page tables, and that the maximum concurrency supported by the local resource is 8 (SIMD processes 8 data elements at a time), the code may be updated as follows:
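The updated code is likewise absent from the text; under the stated assumptions (8-wide SIMD modeled here as an inner loop over 8 columns, and 4 page tables blocking the row loop), a Python sketch of the blocked reduction might be:

```python
def reduce_d1_blocked(src, d1, d2, simd=8, page_tables=4):
    """Blocked addition reduction over d1: process `simd` columns per
    round and `page_tables` consecutive rows per block, so repeated
    accesses hit rows whose page-table entries and cache lines are hot.
    src[j*d2+i : j*d2+i+simd] plays the role of one concurrent matrix row."""
    dst = [0] * d2
    for i in range(0, d2, simd):                  # one column block per round
        for j0 in range(0, d1, page_tables):      # one row block (index subset)
            for j in range(j0, min(j0 + page_tables, d1)):
                for k in range(i, min(i + simd, d2)):
                    dst[k] += src[j * d2 + k]     # partial result kept in dst
    return dst
```

The result equals the conventional column sums; only the traversal order changes, keeping accesses within the current row block and column block.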
Where d1 represents the number of rows of the matrix to be processed, d2 represents the number of columns of the matrix to be processed, src represents the identifier of the matrix to be processed, and src[j*d2+i : j*d2+i+7] represents one of the concurrent matrix rows in the concurrent subset. Because under this technical scheme the cache unit is covered and updated only after the content elements stored in it have been searched once, the cache unit can be fully utilized. Therefore, in the embodiment of the present application, the number of cache miss events generated by executing the code is d1*d2/m, that is, one cache miss event is generated for every m content elements searched, where m is the number of elements the cache unit can hold (the size of the cache unit).
Similarly, the page table is covered and updated only after the address information stored in it has been searched once, so the page table can be fully utilized. Assuming that one page table (typically 4 KB) can store the address information of P data elements (1K entries for float data), the total number of address information miss events (TLB miss) is d1*d2/P, that is, one TLB miss event is generated for every P content elements whose address information is searched.
In this embodiment of the application, since part of the result is temporarily stored in the dst array, a cache miss event is also generated when the content elements corresponding to dst[i : i+7] are stored to memory. Since m is the number of elements the cache unit can hold (the size of the cache unit) and d2 is the number of columns of the matrix to be processed, the total number of cache miss events corresponding to the dst array generated by executing the code is d2/m.
Given that a TLB miss requires 5 memory accesses and a cache miss requires 1 memory access, the total number of accesses in the embodiments of the present application is d1*d2/m + 5*(d1*d2/p) + d2/m. Because a page table can store far more data than a cache unit, the term 5*(d1*d2/p) can be neglected when the page table is covered and updated only after the address information stored in it has been searched once, so the total access count (d1*d2/m + d2/m) is far less than the 6*d1*d2 of the conventional scheme.
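Under the stated costs (5 accesses per TLB miss, 1 per cache miss), the two totals can be compared numerically; the concrete d1, d2, m and p values below are illustrative assumptions only:

```python
def access_counts(d1, d2, m, p):
    """Approximate total memory accesses: conventional scheme (6*d1*d2)
    versus the blocked scheme of this embodiment, where m is the cache
    unit's capacity in elements and p is the number of address entries
    one page table can hold."""
    conventional = 6 * d1 * d2
    blocked = d1 * d2 / m + 5 * (d1 * d2 / p) + d2 / m
    return conventional, blocked
```

For d1 = d2 = 1024, m = 16 and p = 1024 this gives roughly 6.3 million accesses for the conventional scheme against about 70 thousand for the blocked one.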
Because frequent coverage updates of the cache unit occur in the conventional scheme, a large number of repeated DRAM (Dynamic Random Access Memory) accesses are caused. The present embodiment effectively utilizes the data updated into the cache unit and into the TLB each time, thereby greatly reducing the number of memory accesses and remarkably improving the processing efficiency of data batch processing.
In one embodiment, as shown in fig. 7, a flow diagram of a data batch processing method in one specific embodiment is provided:
S702, acquiring a target task, determining an instruction stream corresponding to the target task, wherein the instruction stream comprises more than one operation instruction with determined triggering sequence, the more than one operation instruction comprises a first instruction and a second instruction, the first instruction is triggered before the second instruction, and the target operation corresponding to the instruction stream comprises a first target operation corresponding to the first instruction and a second target operation corresponding to the second instruction.
And S704, determining an index complete set corresponding to the target task, determining the channel number of the processing channels and the cache number corresponding to the cache unit, wherein each index element in the index complete set is used for pointing to each content element in the matrix to be processed.
S706, dividing the index complete set according to the number of channels and the number of caches to obtain a plurality of data index sets, and respectively distributing each data index set to a corresponding processing channel, wherein the number of elements of the data index set distributed by each processing channel is larger than the number of caches corresponding to the cache unit.
S708, for each processing channel in the plurality of processing channels, determining at least one to-be-processed data index set corresponding to the corresponding processing channel, determining the number of page tables of address tables corresponding to the data index sets respectively, and dividing the data index sets according to the number of page tables to obtain at least one index subset.
S710, for each processing channel, determining a current data index set to be processed from the corresponding data index set to be processed.
S712, based on the current data index set to be processed corresponding to each processing channel, executing at least one round of target operation corresponding to the instruction stream through each processing channel in parallel, and updating the current data index set to be completed data index set after completing the target operation, wherein for each index subset in the current data index set to be processed, executing at least one round of target operation corresponding to the instruction stream through each processing channel in parallel according to the address table corresponding to each processing channel and the index subset allocated corresponding to each processing channel respectively until obtaining an operation sub-result corresponding to the target sub-task, and synthesizing each operation sub-result to obtain an operation result corresponding to the target task.
S714, a process of processing the next data index set to be processed is entered, and the step of determining at least one data index set to be processed corresponding to the corresponding processing channel is returned for each processing channel in the plurality of processing channels, and the step is continued until the target operation corresponding to at least one round of instruction stream is executed based on each data index set in the index total set until an operation result corresponding to the target task is obtained.
And S716, when each processing channel executes the first target operation in the current cycle of the current cycle, searching the corresponding content element from the cache unit according to the corresponding current index element to serve as an operation object of the first target operation. And when each processing channel executes the second target operation in the current cycle of the current round, acquiring intermediate operation data obtained by executing the second target operation in the previous cycle, and taking the intermediate operation data and the content elements searched through the first target operation as operation objects.
And S718, when the corresponding content elements are not found in the cache unit, triggering the local kernel to acquire the content elements which have continuity and correspond to the cache quantity from the local memory according to the current index element, and triggering the overlay update of the stored content in the cache unit based on the acquired content elements, wherein the content elements stored in the cache unit in an overlay manner are used for executing subsequent target operation.
According to the data batch processing method, the index complete set is divided based on the number of channels and the number of caches, so that the number of elements of each divided data index set can be larger than the capacity of the cache unit. When a plurality of processing channels read data from the cache units in parallel based on the corresponding index elements, cache miss events can occur simultaneously, and the multi-channel memory allows these memory reads to proceed at the same time. Cache miss events that would otherwise occur in a plurality of different time stages are thus concentrated to occur at the same time, hiding the access delay of a portion of the cache miss events and further improving the processing efficiency of data batch processing.
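Steps S702–S718 above hinge on giving each processing channel a data index set larger than the cache unit; a minimal sketch of that division (the function name and contiguous-chunk policy are assumptions):

```python
def assign_to_channels(index_all, n_channels, cache_count):
    """Divide the index complete set into one contiguous data index set per
    processing channel, each larger than the cache unit's capacity so the
    channels' cache misses can overlap in time."""
    per_channel = -(-len(index_all) // n_channels)  # ceiling division
    if per_channel <= cache_count:
        raise ValueError("per-channel index set must exceed cache capacity")
    return [index_all[i:i + per_channel]
            for i in range(0, len(index_all), per_channel)]
```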
In one embodiment, as shown in fig. 8, a flow diagram of a data batch processing method in another specific embodiment is provided:
S802, acquiring a target subtask, and determining an instruction stream corresponding to the target subtask.
S804, determining a data index set corresponding to the target subtask and the number of page tables of address tables in the local resource, wherein the data index set is used for searching content elements in the matrix to be processed.
S806, dividing the data index set according to the number of the page tables to obtain at least one index subset, wherein each target page table in the address table corresponds to at least one matrix row in the index subset, and the target page table is used for storing the address information of index elements with continuity in the corresponding matrix row.
S808, for each data index set, determining a subset of indexes to be processed in the corresponding data index set, and determining a current subset of indexes to be processed from the subset of indexes to be processed.
S810, determining the maximum concurrency quantity supported by the local resources, grouping the current index subsets to be processed according to the maximum concurrency quantity to obtain at least one concurrency subset, and sequentially executing a round of target operation on the concurrency subsets in the current index subsets to be processed.
S812, for each concurrent subset, executing the target operation of each round based on each matrix row in the concurrent subset, where the target operation of each round consists of a set of operations executed cyclically, the number of loops corresponding to the number of page tables, and updating the current index subset to be processed to a completed index subset after the target operation is completed.
S814, entering a process of processing the next index subset to be processed, and returning to the step of determining the index subsets to be processed in the corresponding data index set to continue execution, until at least one round of the target operation corresponding to the instruction stream has been executed based on all the index subsets in each data index set, so as to obtain an operation sub-result corresponding to the target subtask.
When the target operation of the current round is executed, corresponding target address information is searched according to the target page table corresponding to the current matrix row in which the current index element is located, and the corresponding content element is acquired based on the target address information to serve as an operation object; when the corresponding content element is not found through the target page table, the target page table is triggered to perform an overlay update of the address information according to the current index element.
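A minimal sketch of the page-table behaviour described in this flow may help. It is an illustration under assumptions, not the claimed implementation: `TargetPageTable`, its capacity, and the address-resolution callback are invented names. It shows a target page table holding address information for a contiguous run of index elements and performing an overlay update (a full refill starting at the current index element) on a miss.

```python
# Illustrative sketch -- names are assumptions, not the patented design.
# Each target page table maps a contiguous run of index elements in one
# matrix row to their address information; a miss triggers an overlay
# update that refills the table starting at the current index element.
class TargetPageTable:
    def __init__(self, capacity, resolve_address):
        self.capacity = capacity                # contiguous entries held at once
        self.resolve_address = resolve_address  # address-computation callback
        self.entries = {}                       # index element -> address info

    def lookup(self, index_element):
        if index_element not in self.entries:
            # Miss: overwrite all stored address info with a contiguous
            # run starting at the current index element.
            self.entries = {i: self.resolve_address(i)
                            for i in range(index_element,
                                           index_element + self.capacity)}
        return self.entries[index_element]

table = TargetPageTable(capacity=4, resolve_address=lambda i: 0x1000 + 8 * i)
addr = table.lookup(10)   # miss: refills entries for elements 10..13
hit = table.lookup(12)    # hit: served from the refilled table
```

On the first lookup the table misses and refills itself with addresses for index elements 10 through 13; the second lookup then hits without invoking the resolver again.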
The present application further provides an application scenario that applies the above data batch processing method. Specifically, the data batch processing method is applied in this scenario as follows:
When a face image to be detected is obtained, the computer device can input the image to be detected into the face detection model, so that the face detection model generates a corresponding target task and executes at least one round of the target operation based on the corresponding instruction stream to obtain an operation result of the target task. Further, the computer device performs subsequent calculation based on the operation result of the target task, thereby obtaining a face detection result.
The present application further provides another application scenario that applies the above data batch processing method. Specifically, the data batch processing method is applied in this scenario as follows:
When the image to be detected is obtained, the computer device can load a corresponding machine learning model through the inference engine based on the above manner, and the machine learning model can recognize the image to be detected, for example, referring to fig. 9, a face can be recognized to obtain a face recognition result 902, a cup on a desktop can be recognized to obtain a cup recognition result 904, or a face contour can be recognized to obtain a face contour recognition result 906 and the like. FIG. 9 illustrates an application scenario diagram of a data batching method in one embodiment.
It can be understood that the above application scenarios are only used to illustrate the scheme of the present application, and the data batch processing method in the present application can also be applied to other scenarios without limitation. For example, license plate images acquired by an application program are recognized to determine license plate numbers. As another example, audio recognition is performed on media data, so that subsequent processing, such as a simulated conversation, is performed based on the successfully recognized audio information.
It should be understood that, although the steps in the flowcharts of figs. 3, 6, and 7-8 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited to this order, and the steps may be executed in other orders. Moreover, at least some of the steps in figs. 3, 6, and 7-8 may include sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed in sequence, and may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 10, a data batch processing apparatus 1000 is provided, which may employ software modules or hardware modules, or a combination of both, as part of a computer device, and specifically includes an index corpus acquisition module 1002, a data index set obtaining module 1004, and a target operation execution module 1006, where:
the index corpus acquisition module 1002 is configured to acquire a target task, determine an instruction stream corresponding to the target task, and determine an index corpus corresponding to the target task, where the index elements in the index corpus point to the content elements in a matrix to be processed.
The data index set obtaining module 1004 is configured to divide the index complete set into a plurality of data index sets, and allocate each data index set to a corresponding processing channel, where the number of elements of the data index set allocated to each processing channel is greater than the number of caches corresponding to the cache unit.
And a target operation execution module 1006, configured to execute in parallel, through each processing channel, at least one round of the target operation corresponding to the instruction stream based on the data index sets respectively allocated to the processing channels, until an operation result corresponding to the target task is obtained. When executing the target operation of the current round, each processing channel searches the cache unit for the corresponding content element according to the current index element in the corresponding data index set to serve as an operation object; when the corresponding content element is not found in the cache unit, the cache unit is triggered to acquire the cache number of content elements according to the current index element to perform an overlay update.
In one embodiment, the data index set obtaining module 1004 is further configured to determine the number of channels of the processing channel and the number of caches corresponding to the cache units, divide the index complete set according to the number of channels and the number of caches to obtain a plurality of data index sets, and allocate each data index set to a corresponding processing channel respectively.
In one embodiment, the target operation execution module 1006 is further configured to: determine, for each processing channel in the plurality of processing channels, at least one to-be-processed data index set correspondingly allocated to the corresponding processing channel; determine, for each processing channel, a current to-be-processed data index set from the corresponding to-be-processed data index sets; execute in parallel, through each processing channel, at least one round of the target operation corresponding to the instruction stream based on the current to-be-processed data index set of each processing channel, and update the current to-be-processed data index set to a completed data index set after the target operation is completed; and enter a process of processing the next to-be-processed data index set, returning to the step of determining the to-be-processed data index sets correspondingly allocated to each processing channel, until at least one round of the target operation corresponding to the instruction stream has been executed based on each data index set in the index complete set.
In one embodiment, the index corpus acquisition module 1002 is configured to determine an instruction stream corresponding to a target task, where the instruction stream includes more than one operation instruction determined by a trigger sequence, where the more than one operation instruction includes a first instruction and a second instruction, where the first instruction triggers before the second instruction, and where the target operation corresponding to the instruction stream includes a first target operation corresponding to the first instruction and a second target operation corresponding to the second instruction.
In one embodiment, the target operation of each round includes a group of operations that are executed in multiple cycles, each group of operations including a first target operation and a second target operation. The target operation execution module 1006 further includes a cycle execution module 1061 configured to: when each processing channel performs the first target operation in the current cycle of the current round, search the cache unit for the corresponding content element according to the corresponding current index element as the operation object of the first target operation; and when each processing channel performs the second target operation in the current cycle of the current round, obtain intermediate operation data produced by performing the second target operation in the previous cycle, and use the intermediate operation data together with the content element found by the first target operation as the operation objects.
In one embodiment, the target operation execution module 1006 further includes a concurrency processing module 1062, configured to determine the maximum concurrency quantity supported by the local resource, where each processing channel, when executing the target operation of the current round, searches the cache unit for up to the maximum concurrency quantity of content elements according to the current index elements in the corresponding data index set, and uses the found content elements as operation objects in parallel to execute the target operation in parallel.
In one embodiment, the data batch processing device 1000 is further configured to, when the corresponding content element is not found in the cache unit, trigger the local kernel to acquire the cache number of content elements having continuity from the local memory according to the current index element, and trigger an overlay update of the stored content in the cache unit based on the acquired content elements, where the content elements stored by overlay are used for executing subsequent target operations.
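The overlay update described in this embodiment can be sketched as follows. This is a hedged illustration with invented names (`CacheUnit`, `read`), not the patented cache hardware: on a miss, a contiguous run of `cache_size` content elements is fetched from a list standing in for local memory and overwrites the cache unit's stored content.

```python
# Hedged sketch -- invented names, not the patented cache unit. On a
# miss, a contiguous run of cache_size content elements is fetched from
# "local memory" (a plain list here) and overwrites the cached content.
class CacheUnit:
    def __init__(self, memory, cache_size):
        self.memory = memory
        self.cache_size = cache_size
        self.base = None        # first cached index; None until first fill
        self.lines = []

    def read(self, index):
        if self.base is None or not (self.base <= index < self.base + self.cache_size):
            # Overlay update: replace all stored content with a
            # contiguous run starting at the requested element.
            self.base = index
            self.lines = self.memory[index:index + self.cache_size]
        return self.lines[index - self.base]

memory = [v * v for v in range(64)]   # stand-in for the matrix in local memory
cache = CacheUnit(memory, cache_size=8)
first = cache.read(5)    # miss: fills content elements 5..12
second = cache.read(9)   # hit: still inside the cached run
```

Reads inside the cached run are served without touching memory again; a read outside it discards the run and refills from the new index element, matching the overwrite-on-miss behaviour described above.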
In one embodiment, the data batch processing device 1000 is further configured to: determine the number of page tables of the address table corresponding to each data index set; divide each data index set according to the number of page tables to obtain at least one index subset, where each target page table in the address table corresponds to at least one matrix row in the index subset and is used to store address information of index elements having continuity in the corresponding matrix row; execute in parallel, through each processing channel, at least one round of the target operation corresponding to the instruction stream according to the address table corresponding to each processing channel and the index subsets in the data index set correspondingly allocated to each processing channel, until an operation sub-result corresponding to each target subtask is obtained; and synthesize the operation sub-results to obtain an operation result corresponding to the target task.
In one embodiment, the data batch processing device 1000 is further configured to, when each processing channel performs the target operation of the current round, search corresponding target address information according to a target page table corresponding to the current matrix row where the current index element is located, obtain the corresponding content element based on the target address information as an operation object, and trigger the target page table to perform the coverage update of the address information according to the current index element when the corresponding content element is not searched in the target page table.
In one embodiment, the data batch processing device 1000 is deployed with an inference engine for execution, and the data batch processing device 1000 is applied to a face detection model, wherein the target task is one of the memory-access intensive tasks generated by the face detection model when face detection is performed.
In one embodiment, the data batch processing device 1000 is further configured to: obtain a face image to be detected; input the face image to be detected into a face detection model; obtain a target task through the face detection model, and trigger execution of the instruction stream corresponding to the target task to obtain an operation result corresponding to the target task; and determine a face detection result according to the operation result.
In one embodiment, as shown in FIG. 11, a data batch processing apparatus 1100 is provided, which may employ software modules or hardware modules, or a combination of both, as part of a computer device, and specifically includes a subtask determination module 1102, a data partitioning module 1104, and a subtresult determination module 1106, wherein:
the subtask determining module 1102 is configured to obtain a target subtask, and determine an instruction stream corresponding to the target subtask.
The data dividing module 1104 is configured to determine a data index set corresponding to the target subtasks and a number of page tables of an address table in the local resource, where the data index set is used to find content elements in the matrix to be processed, divide the data index set according to the number of page tables to obtain at least one index subset, and each target page table in the address table corresponds to at least one matrix row in the index subset, and the target page table is used to store address information of index elements having continuity in the corresponding matrix row.
The sub-result determining module 1106 is configured to execute at least one round of the target operation corresponding to the instruction stream according to the address table and each index element in each index subset, until an operation sub-result corresponding to the target subtask is obtained. When the target operation of the current round is executed, corresponding target address information is searched according to the target page table corresponding to the current matrix row in which the current index element is located, so that the corresponding content element is acquired based on the target address information to serve as an operation object; when the corresponding content element is not found through the target page table, the target page table is triggered to perform an overlay update of the address information according to the current index element.
In one embodiment, the sub-result determining module 1106 is further configured to: for each data index set, determine the index subsets to be processed in the corresponding data index set, and determine a current index subset to be processed from the index subsets to be processed; execute at least one round of the target operation corresponding to the instruction stream according to the address table corresponding to the data index set and each index element in the current index subset to be processed, and update the current index subset to be processed to a completed index subset after the target operation is completed; and enter a process of processing the next index subset to be processed, returning to the step of determining the index subsets to be processed in the corresponding data index set, until at least one round of the target operation corresponding to the instruction stream has been executed based on all the index subsets in each data index set, thereby obtaining an operation sub-result corresponding to the target subtask.
In one embodiment, the sub-result determining module 1106 is further configured to determine the maximum concurrency quantity supported by the local resource, group the index subsets according to the maximum concurrency quantity to obtain at least one concurrent subset, and sequentially execute a round of the target operation corresponding to the instruction stream based on each concurrent subset.
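The grouping described in this embodiment can be sketched as below. This is an assumed illustration (the helper names and the accumulate-style operation are invented), not the module's actual code: an index subset is cut into concurrent subsets no larger than the maximum concurrency quantity, and one round of the target operation runs group by group, with the inner cycle count tied to the number of page tables.

```python
# Hedged sketch -- helper names and the example operation are invented.
def group_by_concurrency(index_subset, max_concurrency):
    """Cut an index subset into groups no larger than max_concurrency."""
    return [index_subset[i:i + max_concurrency]
            for i in range(0, len(index_subset), max_concurrency)]

def run_round(concurrent_subsets, num_page_tables, apply_op):
    """Run one round: cycle the operation group num_page_tables times per subset."""
    results = []
    for subset in concurrent_subsets:
        acc = 0
        for _ in range(num_page_tables):   # cycle count follows the page-table count
            for index_element in subset:
                acc = apply_op(acc, index_element)
        results.append(acc)
    return results

groups = group_by_concurrency(list(range(10)), max_concurrency=4)
# groups == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With a maximum concurrency of 4, ten index elements yield three concurrent subsets, the last one partial; `run_round` then processes them one group at a time, as the sequential execution in the embodiment describes.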
In one embodiment, the target operation of each round includes a group of operations executed in multiple cycles, the number of cycles corresponding to the number of page tables, and the sub-result determining module 1106 is further configured to, for each concurrent subset, execute the group of operations cyclically based on each matrix row in the concurrent subset during execution of each round of the target operation.
In one embodiment, the data batch processing device 1100 is deployed with an inference engine for execution, and the data batch processing device 1100 is applied to a face detection model, where the target subtask is a task for processing part of the data in the matrix to be processed within a target task, and the target task is one of the memory-access-intensive tasks generated when the face detection model performs face detection.
For specific limitations of the data batch processing apparatus, reference may be made to the above limitations of the data batch processing method, which are not repeated here. Each of the above modules in the data batch processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware in, or independent of, a processor in the computer device, or may be stored as software in a memory in the computer device, so that the processor can invoke and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 12. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data batch processing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data batch processing method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 12 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. The volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction among the combinations of these technical features, they should all be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the concept of the application, and these all fall within the protection scope of the application. Accordingly, the protection scope of the present application shall be subject to the appended claims.

Claims (27)

1. A method for batch processing of data, the method comprising:
acquiring a target task and determining an instruction stream corresponding to the target task;
Determining an index complete set corresponding to the target task, wherein each index element in the index complete set is used for pointing to each content element in a matrix to be processed;
Dividing the index complete set into a plurality of data index sets, and respectively distributing the data index sets to corresponding processing channels, wherein the number of elements of the data index set distributed to each processing channel is larger than the number of caches corresponding to the cache units, the processing channels are memory channels controlled by a memory controller and used for addressing, reading data and processing data, and each processing channel is provided with the corresponding cache unit;
Based on the data index sets respectively and correspondingly allocated to the processing channels, respectively executing at least one round of target operation corresponding to the instruction stream through the processing channels in parallel until an operation result corresponding to the target task is obtained;
When executing the target operation of the current round, each processing channel searches the cache unit for the corresponding content element according to the current index element in the corresponding data index set to serve as an operation object, and when the corresponding content element is not found in the cache unit, triggers the cache unit to acquire the cache number of content elements according to the current index element to perform an overlay update.
2. The method of claim 1, wherein dividing the full set of indices into a plurality of data index sets and assigning each of the data index sets to a corresponding processing channel, respectively, comprises:
Determining the number of channels of a processing channel and the number of caches corresponding to the cache units;
Dividing the index complete set according to the channel number and the cache number to obtain a plurality of data index sets, and respectively distributing the data index sets to the corresponding processing channels.
3. The method of claim 1, wherein the performing, in parallel, by each processing channel, the target operation of at least one round corresponding to the instruction stream, respectively, based on the respectively assigned data index set for each processing channel, comprises:
For each processing channel in the plurality of processing channels, determining at least one to-be-processed data index set correspondingly allocated to the corresponding processing channel;
for each processing channel, determining a current data index set to be processed from the corresponding data index set to be processed respectively;
Based on the current data index set to be processed corresponding to each processing channel, respectively executing at least one round of target operation corresponding to the instruction stream through each processing channel in parallel, and updating the current data index set to be processed into a completed data index set after completing the target operation;
And entering a flow for processing the next to-be-processed data index set, and returning to the step of, for each processing channel in the plurality of processing channels, determining at least one to-be-processed data index set correspondingly allocated to the corresponding processing channel, and continuing to execute until at least one round of the target operation corresponding to the instruction stream has been executed based on each data index set in the index complete set.
4. The method of claim 1, wherein the instruction stream includes more than one operation instruction determined by a trigger sequence, wherein the more than one operation instruction includes a first instruction and a second instruction, wherein the first instruction triggers before the second instruction, and wherein the target operation corresponding to the instruction stream includes a first target operation corresponding to the first instruction and a second target operation corresponding to the second instruction.
5. The method according to claim 1, wherein when no corresponding content element is found in the cache unit, triggering the cache unit to obtain, according to the current index element, a cache number of content elements for overlay update includes:
When the corresponding content elements are not found from the cache unit, triggering the local kernel to acquire continuous content elements with corresponding cache quantity from the local memory according to the current index element, and triggering the overlay update of the stored content in the cache unit based on the acquired content elements, wherein the content elements stored in the cache unit in an overlay manner are used for executing subsequent target operation.
6. The method of claim 1, wherein the target task comprises at least one target subtask, and each target subtask corresponds to one of the plurality of data index sets, the method further comprising:
determining the number of page tables of address tables corresponding to the data index sets respectively;
dividing the data index set according to the number of the page tables to obtain at least one index subset, wherein each target page table in the address table corresponds to at least one matrix row in the index subset respectively and is used for storing address information of index elements with continuity in the corresponding matrix row;
the step of executing at least one round of target operation corresponding to the instruction stream through each processing channel in parallel based on the data index sets respectively allocated correspondingly to each processing channel until an operation result corresponding to the target task is obtained, comprises the following steps:
According to the address table respectively corresponding to each processing channel and the index subset in the data index set respectively corresponding to each processing channel, executing at least one round of target operation corresponding to the instruction stream in parallel through each processing channel until an operation sub-result corresponding to the target sub-task is obtained;
and synthesizing the operation sub-results to obtain an operation result corresponding to the target task.
7. The method of claim 6, wherein each processing channel searches corresponding target address information according to a target page table corresponding to a current matrix row in which a current index element is located when performing a target operation of a current round, so as to acquire a corresponding content element as an operation object based on the target address information, and when the corresponding content element is not searched in the target page table, the target page table is triggered to perform coverage update of the address information according to the current index element.
8. The method according to any one of claims 1 to 7, wherein the method is performed by an inference engine and the method is applied to a face detection model, wherein the target task is one of the memory intensive tasks generated by the face detection model when face detection is performed.
9. A method for batch processing of data, the method comprising:
acquiring a target subtask and determining an instruction stream corresponding to the target subtask;
Determining a data index set corresponding to the target subtask and the number of page tables of address tables in local resources, wherein the data index set is used for searching content elements in a matrix to be processed;
dividing the data index set according to the number of the page tables to obtain at least one index subset, wherein each target page table in the address table corresponds to at least one matrix row in the index subset respectively and is used for storing address information of index elements with continuity in the corresponding matrix row;
determining the maximum concurrency quantity supported by the local resources, grouping the index subsets according to the maximum concurrency quantity to obtain at least one concurrent subset, and sequentially executing a round of the target operation corresponding to the instruction stream based on each concurrent subset, until an operation sub-result corresponding to the target subtask is obtained;
When the target operation of the current round is executed, the corresponding target address information is searched according to the target page table corresponding to the current matrix row where the current index element is located, the corresponding content element is obtained from the cache unit based on the target address information to serve as an operation object, and when the corresponding content element is not searched from the cache unit based on the target address information in the target page table, the target page table is triggered to carry out coverage update of the address information according to the current index element, so that the cache unit is updated.
10. The method of claim 9, wherein grouping the index subsets according to the maximum number of concurrency results in at least one concurrency subset, comprising:
And grouping the index subsets according to columns based on the maximum concurrency quantity to obtain at least one concurrency subset, wherein each concurrency matrix row in each concurrency subset contains index elements not exceeding the maximum concurrency quantity.
11. The method of claim 9, wherein the target operation for each round comprises a set of operations performed in a number of cycles, the number of cycles corresponding to the number of page tables;
The sequentially executing a round of target operation corresponding to a round of instruction stream based on each concurrent subset comprises:
For each concurrent subset, during execution of the target operation for each round, the set of operations is executed in a number of cycles based on the respective matrix rows in the concurrent subset.
12. The method according to any one of claims 9 to 11, wherein the method is performed by an inference engine and is applied to a face detection model, the target subtask is a task for processing part of the data in the matrix to be processed within a target task, and the target task is one of the memory-intensive tasks generated by the face detection model when performing face detection.
13. A data batch processing apparatus, the apparatus comprising:
an index complete set acquisition module, configured to acquire a target task, determine an instruction stream corresponding to the target task, and determine an index complete set corresponding to the target task, wherein each index element in the index complete set points to a content element in a matrix to be processed;
a data index set acquisition module, configured to divide the index complete set into a plurality of data index sets and allocate the data index sets to corresponding processing channels respectively, wherein the number of elements in the data index set allocated to each processing channel is greater than the cache quantity corresponding to the cache unit, the processing channels are memory channels controlled by a memory controller and used for addressing, reading data and processing data, and each processing channel is provided with a corresponding cache unit; and
a target operation execution module, configured to execute in parallel, through each processing channel based on the data index set allocated to that processing channel, at least one round of target operation corresponding to the instruction stream, until an operation result corresponding to the target task is obtained, wherein each processing channel, when executing the target operation of the current round, searches the cache unit for the corresponding content element as an operation object according to the current index element in its data index set, and, when no corresponding content element is found in the cache unit, triggers the cache unit to acquire the cache quantity of content elements according to the current index element for an overlay update.
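The interplay of the three modules of claim 13 could be sketched end to end as follows. This is an illustrative single-threaded model (the real channels run in parallel), and every name in it is hypothetical:

```python
# The index complete set is split across processing channels; each channel
# owns a small cache window and, on a miss, overlay-updates the whole
# window with the next cache_size content elements.
def split_index_corpus(index_corpus, num_channels):
    # round-robin allocation of index elements to channels (one of many
    # possible division schemes)
    return [index_corpus[c::num_channels] for c in range(num_channels)]

class Channel:
    def __init__(self, memory, cache_size):
        self.memory = memory
        self.cache_size = cache_size
        self.base = None       # base address of the cached window
        self.cache = []

    def fetch(self, index):
        if self.base is None or not (self.base <= index < self.base + self.cache_size):
            self.base = index  # miss: overlay-update the cache window
            self.cache = self.memory[index:index + self.cache_size]
        return self.cache[index - self.base]

memory = [x * x for x in range(32)]          # the matrix, flattened
channels = [Channel(memory, cache_size=4) for _ in range(2)]
index_sets = split_index_corpus(list(range(8)), 2)
results = [[ch.fetch(i) for i in idx]
           for ch, idx in zip(channels, index_sets)]
```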
14. The apparatus of claim 13, wherein the data index set acquisition module is further configured to determine the number of processing channels and the cache quantity corresponding to the cache unit, divide the index complete set according to the number of channels and the cache quantity to obtain the plurality of data index sets, and allocate each data index set to a corresponding processing channel.
15. The apparatus of claim 13, wherein the target operation execution module is further configured to: determine, for each of the plurality of processing channels, at least one pending data index set corresponding to that processing channel; determine, for each processing channel, a current pending data index set from its pending data index sets; execute in parallel, through each processing channel based on its current pending data index set, one round of target operation corresponding to the instruction stream; after the target operation is completed, update the current pending data index set to a completed data index set and proceed to process the next pending data index set; and return to the step of determining, for each processing channel, the at least one pending data index set corresponding to that processing channel, continuing until the target operation corresponding to the instruction stream has been executed for at least one round based on every data index set in the index complete set.
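The per-channel scheduling loop of claim 15 amounts to draining a queue of pending index sets; a minimal sketch (queue discipline and names assumed, not specified by the patent):

```python
from collections import deque

# Each channel repeatedly takes its current pending data index set, runs
# the target operation on it, and marks it completed, until the queue of
# pending sets for that channel is drained.
def drain_channel(pending_sets, target_op):
    pending = deque(pending_sets)
    completed = []
    while pending:
        current = pending.popleft()   # current pending data index set
        target_op(current)            # one round of the target operation
        completed.append(current)     # update to a completed data index set
    return completed

order = []
done = drain_channel([[0, 1], [2, 3], [4]], target_op=order.append)
```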
16. The apparatus of claim 13, wherein the instruction stream comprises more than one operation instruction ordered by trigger sequence, the more than one operation instruction comprises a first instruction and a second instruction, the first instruction is triggered before the second instruction, and the target operation corresponding to the instruction stream comprises a first target operation corresponding to the first instruction and a second target operation corresponding to the second instruction.
17. The apparatus of claim 13, wherein the data batch processing apparatus is further configured to: when no corresponding content element is found in the cache unit, trigger the local kernel to obtain, from the local memory according to the current index element, contiguous content elements of the cache quantity, and trigger an overlay update of the stored content in the cache unit based on the obtained content elements, wherein the content elements in the cache unit are used for subsequent target operations.
18. The apparatus of claim 13, wherein the target task comprises at least one target subtask, and each target subtask corresponds to one data index set of the plurality of data index sets; and
the data batch processing apparatus is further configured to: determine the number of page tables of the address table corresponding to each data index set; divide each data index set according to the number of page tables to obtain at least one index subset, wherein each target page table in the address table corresponds to at least one matrix row in the index subset and is used for storing address information of contiguous index elements in the corresponding matrix row; execute in parallel, through each processing channel according to the address table corresponding to that channel and the index subsets in the data index set allocated to that channel, at least one round of target operation corresponding to the instruction stream, until an operation sub-result corresponding to each target subtask is obtained; and integrate the operation sub-results to obtain the operation result corresponding to the target task.
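The division step of claim 18 can be pictured as chunking the rows of a data index set so that each chunk has at most one row per page table; a sketch under that assumed layout (all names illustrative):

```python
# Partition a data index set (a list of matrix rows of index elements)
# into index subsets of at most num_page_tables rows, so each target page
# table serves one matrix row per subset.
def divide_by_page_tables(data_index_set, num_page_tables):
    return [data_index_set[i:i + num_page_tables]
            for i in range(0, len(data_index_set), num_page_tables)]

rows = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
subsets = divide_by_page_tables(rows, 2)   # 2 page tables -> subsets of <= 2 rows
```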
19. The apparatus of claim 18, wherein each processing channel, when executing the target operation of the current round, searches for corresponding target address information according to the target page table corresponding to the current matrix row in which the current index element is located, so as to obtain the corresponding content element based on the target address information as an operation object, and, when no corresponding content element is found via the target page table, triggers the target page table to perform an overlay update of the address information according to the current index element.
20. The apparatus according to any one of claims 13 to 19, wherein the apparatus is implemented by an inference engine and is applied to a face detection model, and the target task is one of the memory-intensive tasks generated by the face detection model when performing face detection.
21. A data batch processing apparatus, the apparatus comprising:
a subtask determining module, configured to acquire a target subtask and determine an instruction stream corresponding to the target subtask;
a data dividing module, configured to determine a data index set corresponding to the target subtask and the number of page tables of an address table in local resources, wherein the data index set is used for looking up content elements in a matrix to be processed, and to divide the data index set according to the number of page tables to obtain at least one index subset, wherein each target page table in the address table corresponds to at least one matrix row in the index subset and is used for storing address information of contiguous index elements in the corresponding matrix row; and
a sub-result determining module, configured to determine the maximum concurrency quantity supported by the local resources, group the index subsets according to the maximum concurrency quantity to obtain at least one concurrency subset, and sequentially execute, based on each concurrency subset, one round of target operation corresponding to one round of the instruction stream, until an operation sub-result corresponding to the target subtask is obtained, wherein, when the target operation of the current round is executed, corresponding target address information is searched for according to the target page table corresponding to the current matrix row in which the current index element is located, the corresponding content element is obtained from the cache unit based on the target address information to serve as an operation object, and, when no corresponding content element is found in the cache unit based on the target address information in the target page table, the target page table is triggered to perform an overlay update of the address information according to the current index element, so as to update the cache unit.
22. The apparatus of claim 21, wherein the sub-result determining module is further configured to group the index subsets by columns based on the maximum concurrency quantity to obtain at least one concurrency subset, wherein each concurrency matrix row in each concurrency subset contains no more than the maximum concurrency quantity of index elements.
23. The apparatus of claim 21, wherein the target operation of each round comprises a set of operations executed cyclically, the number of cycles corresponding to the number of page tables; and
the sub-result determining module is further configured to, for each concurrency subset, during execution of the target operation of each round, cyclically execute the set of operations, for the number of cycles, based on the respective matrix rows in the concurrency subset.
24. The apparatus according to any one of claims 21 to 23, wherein the apparatus is implemented by an inference engine and is applied to a face detection model, the target subtask is a task for processing part of the data in the matrix to be processed within a target task, and the target task is one of the memory-intensive tasks generated by the face detection model when performing face detection.
25. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 12.
26. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 12.
27. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 12.
CN202110011581.9A 2021-01-06 2021-01-06 Data batch processing method, device and storage medium Active CN113535349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110011581.9A CN113535349B (en) 2021-01-06 2021-01-06 Data batch processing method, device and storage medium


Publications (2)

Publication Number Publication Date
CN113535349A CN113535349A (en) 2021-10-22
CN113535349B true CN113535349B (en) 2025-08-08

Family

ID=78094335


Country Status (1)

Country Link
CN (1) CN113535349B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116560817B (en) * 2023-05-29 2024-05-07 北京百度网讯科技有限公司 Task execution method, device, electronic device and storage medium

Citations (1)

Publication number Priority date Publication date Assignee Title
CN111488177A (en) * 2020-04-14 2020-08-04 腾讯科技(深圳)有限公司 Data processing method, data processing device, computer equipment and storage medium

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
US6745368B1 (en) * 1999-06-11 2004-06-01 Liberate Technologies Methods, apparatus, and systems for storing, retrieving and playing multimedia data
US7904905B2 (en) * 2003-11-14 2011-03-08 Stmicroelectronics, Inc. System and method for efficiently executing single program multiple data (SPMD) programs
US7689806B2 (en) * 2006-07-14 2010-03-30 Q Method and system to indicate an exception-triggering page within a microprocessor
US9595074B2 (en) * 2011-09-16 2017-03-14 Imagination Technologies Limited Multistage collector for outputs in multiprocessor systems
US8635406B2 (en) * 2012-03-08 2014-01-21 Arm Limited Data processing apparatus and method for providing target address information for branch instructions
CN109814927B (en) * 2018-12-19 2021-01-29 成都海光集成电路设计有限公司 Machine learning reasoning coprocessor
CN111897579B (en) * 2020-08-18 2024-01-30 腾讯科技(深圳)有限公司 Image data processing method, device, computer equipment and storage medium

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN111488177A (en) * 2020-04-14 2020-08-04 腾讯科技(深圳)有限公司 Data processing method, data processing device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
US20110057937A1 (en) Method and system for blocking data on a gpu
Gong et al. Save: Sparsity-aware vector engine for accelerating dnn training and inference on cpus
CN111488177A (en) Data processing method, data processing device, computer equipment and storage medium
EP3380993B1 (en) Systems and methods for robust large-scale machine learning
US20210304010A1 (en) Neural network training under memory restraint
Boyer et al. Dense dynamic programming on multi GPU
US12321849B1 (en) Performing hardware operator fusion
CN118520210B (en) Data processing method, processor, electronic device and storage medium
US11610102B1 (en) Time-based memory allocation for neural network inference
CN114761920A (en) Hardware accelerator with reconfigurable instruction set
US20240403621A1 (en) Processing sequential inputs using neural network accelerators
Liu Parallel and scalable sparse basic linear algebra subprograms
Jeong et al. REACT: Scalable and high-performance regular expression pattern matching accelerator for in-storage processing
CN113032621A (en) Data sampling method and device, computer equipment and storage medium
CN113535349B (en) Data batch processing method, device and storage medium
US11841792B1 (en) Instructions with multiple memory access modes
Pan et al. G-slide: A gpu-based sub-linear deep learning engine via lsh sparsification
US11188302B1 (en) Top value computation on an integrated circuit device
Peng et al. Adaptive runtime exploiting sparsity in tensor of deep learning neural network on heterogeneous systems
US20240004954A1 (en) Computer-implemented accumulation method for sparse matrix multiplication applications
US11782706B1 (en) Reconfigurable neural network processing based on subgraph recognition
US12333415B2 (en) Neural network accelerators
CN112100446B (en) Search method, readable storage medium, and electronic device
Lopresti et al. GPU permutation index: good trade-off between efficiency and results quality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant