Data processing method and device
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a data processing method and apparatus.
Background
Iterative program invocation refers to a situation in which one program calls another program during execution. For example, in ray tracing, a single rendering pipeline (pipeline) may contain multiple shader programs (shader programs) that invoke one another iteratively.
Currently, in ray tracing, iterative program invocation may be implemented in two ways: based on ray intersection test results, or based on states. Specifically, in the scheme based on ray intersection test results, the subroutine to be called by a parent program is determined according to the intersection test result corresponding to the parent program, for example, whether the ray intersects an object. In the state-based scheme, the subroutine to be called by a parent program is determined according to the execution state of the parent program. In both schemes, the parent program and the subroutine it calls are executed by the same thread.
However, if different threads in the same thread group need to execute different subroutines, the different subroutines must be executed separately over multiple instruction cycles. As a result, in each instruction cycle the thread group contains threads that do not execute the subroutine, that is, idle threads, which wastes resources and lowers data processing efficiency. In addition, in each instruction cycle, the threads in a thread group need to read the instructions of the subroutine from the instruction buffer. Because the instruction buffer is limited in size, it can hold the instructions of only some of the subroutines. Therefore, when the subroutine to be executed changes frequently, the instruction buffer may not contain the instructions of that subroutine, and they must be read from main memory, such as double data rate (DDR) memory; that is, the instruction hit rate decreases. Moreover, because reading instructions from main memory takes a long time, the corresponding threads remain idle while the instructions are being read, which also wastes resources and lowers efficiency.
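To make the waste described above concrete, the following minimal sketch counts idle thread slots when the threads of one thread group need different subroutines and the thread group executes one kind of subroutine per instruction cycle. The function name `idle_slots_same_thread`, the subroutine names, and the group size of 4 are illustrative assumptions and do not come from the application.

```python
from collections import Counter

def idle_slots_same_thread(subroutine_per_thread):
    """Count idle thread slots for one thread group (illustrative model).

    subroutine_per_thread[t] is the subroutine that thread t must run because
    its own parent program called it. The group executes one kind of
    subroutine per instruction cycle, so threads needing another subroutine
    are idle in that cycle.
    """
    group_size = len(subroutine_per_thread)
    counts = Counter(subroutine_per_thread)
    # One instruction cycle per distinct subroutine.
    return sum(group_size - busy for busy in counts.values())

# Hypothetical thread group of 4 threads calling 3 different subroutines.
calls = ["hit", "miss", "hit", "any_hit"]
idle = idle_slots_same_thread(calls)
total_slots = len(set(calls)) * len(calls)
print(f"{idle} of {total_slots} thread slots are idle")
```

In this model, three instruction cycles are needed for four subroutine executions, so most of the available thread slots go unused.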
Disclosure of Invention
Embodiments of the present application provide a data processing method and apparatus that can address the problem of resource waste and thereby improve data processing efficiency.
In order to achieve the above purpose, the present application adopts the following technical scheme:
In a first aspect, a data processing method is provided. The data processing method includes: determining I kinds of subroutines called by N parent programs, and acquiring the execution count $L_i$ of the i-th kind of subroutine among the I kinds of subroutines, where N > I, N and I are positive integers greater than 1, and $L_i$ is a positive integer; and determining, according to the execution counts of the kinds of subroutines, J instruction cycles and $K_j$ thread group sets in the j-th of the J instruction cycles, where J and $K_j$ are positive integers. One thread group set includes one or more thread groups, the thread group sets in the J instruction cycles each execute a different kind of subroutine, and in each thread group set the number of threads that do not execute the subroutine is less than the number of threads in one thread group.
Based on the data processing method provided in the first aspect, the number of instruction cycles for executing the subroutines, and the number of thread group sets in each instruction cycle, can be determined according to the subroutines called by the plurality of parent programs and the execution count of each kind of subroutine, so that the same subroutine is executed in the same instruction cycle. This reduces the number of required instruction cycles and/or thread groups as far as possible and improves resource utilization, thereby improving data processing efficiency.
In addition, because the same subroutine is executed in the same instruction cycle, threads in different thread groups within one instruction cycle, and threads in the same thread group across different instruction cycles, do not frequently switch the instructions they read. This improves the hit rate of instruction fetches, and with it resource utilization and data processing efficiency.
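The effect on the instruction hit rate can be illustrated with a small simulation. This is only a sketch under stated assumptions: the instruction buffer is modeled as holding the instructions of a fixed number of subroutines and as evicting the least recently used one, a replacement policy the application does not specify. The function name `count_hits` and the fetch sequences are hypothetical.

```python
from collections import OrderedDict

def count_hits(fetch_sequence, buffer_capacity):
    """Count instruction buffer hits for a sequence of subroutine fetches.

    Assumption: the buffer holds the instructions of at most
    `buffer_capacity` subroutines and evicts the least recently used one.
    """
    buffer = OrderedDict()
    hits = 0
    for subroutine in fetch_sequence:
        if subroutine in buffer:
            hits += 1
            buffer.move_to_end(subroutine)  # mark as most recently used
        else:
            if len(buffer) >= buffer_capacity:
                buffer.popitem(last=False)  # evict least recently used
            buffer[subroutine] = True       # fetched from main memory
    return hits

# The same nine fetches, first switching subroutine on every fetch,
# then grouped so that each subroutine is fetched consecutively.
interleaved = ["A", "B", "C"] * 3
grouped = ["A"] * 3 + ["B"] * 3 + ["C"] * 3

hits_interleaved = count_hits(interleaved, buffer_capacity=2)
hits_grouped = count_hits(grouped, buffer_capacity=2)
```

With a buffer that holds two subroutines, the interleaved sequence misses on every fetch, whereas the grouped sequence misses only on the first fetch of each subroutine.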
In one possible design, determining J instruction cycles, and $K_j$ thread group sets in the j-th instruction cycle, according to the execution count of each kind of subroutine may include: determining, according to the execution count, the thread group set corresponding to the i-th kind of subroutine and the instruction cycle corresponding to that thread group set. Because the thread group set and the instruction cycle corresponding to each kind of subroutine are determined according to its execution count, the number of thread groups executing the same subroutine can be reduced, and/or the number of instruction cycles required by all subroutines can be reduced, which further improves resource utilization and data processing efficiency.
Optionally, the number of thread groups in the thread group set corresponding to the i-th kind of subroutine, the number of threads in one thread group, and the execution count of the i-th kind of subroutine satisfy one or more conditions, including: $(W_i - 1) \cdot m \le L_i \le W_i \cdot m$, where $W_i$ is the number of thread groups in the thread group set corresponding to the i-th kind of subroutine, and m is the number of threads in one thread group. In this way, each kind of subroutine is executed using as few thread groups as possible, which further improves resource utilization and data processing efficiency.
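As a worked illustration of the inequality above, the following sketch picks the number of thread groups as the ceiling of the execution count divided by the thread group size and checks that the result satisfies the inequality. Choosing the ceiling is an assumption made here (when the execution count is an exact multiple of m, the inequality as written also admits one more thread group); the function name `thread_groups_needed` and the sample values are hypothetical.

```python
def thread_groups_needed(executions, threads_per_group):
    """Return the smallest W with (W - 1) * m <= L <= W * m.

    executions is L_i and threads_per_group is m; both must be positive.
    """
    if executions < 1 or threads_per_group < 1:
        raise ValueError("executions and threads_per_group must be positive")
    # Integer ceiling division, avoiding floating-point rounding.
    return (executions + threads_per_group - 1) // threads_per_group

m = 4  # assumed number of threads in one thread group
results = {}
for L in (1, 4, 5, 10):
    W = thread_groups_needed(L, m)
    idle_threads = W * m - L
    results[L] = (W, idle_threads)
    print(f"L={L}: W={W}, idle threads={idle_threads}")
```

With this choice, the number of threads that do not execute the subroutine in a thread group set is always less than m, which matches the constraint stated for each thread group set.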
Optionally, determining, according to the execution count, the thread group set corresponding to the i-th kind of subroutine and the instruction cycle corresponding to that thread group set may include: determining a first order of the I kinds of subroutines according to the execution counts, where the first order indicates the order in which the thread group sets corresponding to the kinds of subroutines are acquired; and determining, according to the first order, the thread group set corresponding to the i-th kind of subroutine and the instruction cycle corresponding to that thread group set. In this way, the thread group sets of subroutines earlier in the order are determined first, which further improves resource utilization and data processing efficiency.
Further, the first order indicates that the thread group set corresponding to a subroutine with a larger execution count is acquired preferentially. Because the thread group set of a subroutine with a large execution count is determined first, a subroutine that occupies many threads can be executed in an earlier instruction cycle, which further improves resource utilization and data processing efficiency.
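A first order of this kind can be sketched as a sort by descending execution count. The function name `first_order`, the subroutine names, and the tie-breaking rule (alphabetical by name) are assumptions for illustration; the application does not state how ties are resolved.

```python
def first_order(execution_counts):
    """Order subroutine names so that larger execution counts come first.

    execution_counts maps a subroutine name to its execution count L_i.
    Ties are broken alphabetically (an assumption made for determinism).
    """
    return sorted(execution_counts, key=lambda name: (-execution_counts[name], name))

# Hypothetical execution counts for three kinds of subroutines.
counts = {"any_hit": 3, "closest_hit": 7, "miss": 3}
order = first_order(counts)
print(order)
```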
In one possible design, the total number of thread groups in all thread group sets of the j-th instruction cycle is greater than or equal to the total number of thread groups in all thread group sets of the (j+1)-th instruction cycle. In this way, the total number of threads executing subroutines stays the same or decreases from one instruction cycle to the next, which avoids the situation in which idle thread groups cannot be released, further improving resource utilization and data processing efficiency.
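One way to obtain instruction cycles whose thread group totals never increase is sketched below. It is not presented as the method of the application: it assumes a hypothetical limit `max_groups_per_cycle` on the thread groups available in one instruction cycle, packs the thread group sets with a first-fit-decreasing heuristic, and then orders the cycles by descending total. All names and values are illustrative.

```python
def plan_cycles(groups_per_subroutine, max_groups_per_cycle):
    """Assign each subroutine's thread group set to an instruction cycle.

    groups_per_subroutine maps a subroutine name to W_i, the number of
    thread groups in its thread group set. max_groups_per_cycle is an
    assumed per-cycle limit. Returns a list of cycles; each cycle is a list
    of (name, W_i) pairs, and each subroutine appears in exactly one cycle.
    """
    cycles = []
    # Place the largest thread group sets first (first-fit decreasing).
    ordered = sorted(groups_per_subroutine.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, width in ordered:
        if width > max_groups_per_cycle:
            raise ValueError(f"{name} needs more thread groups than one cycle holds")
        for cycle in cycles:
            if sum(w for _, w in cycle) + width <= max_groups_per_cycle:
                cycle.append((name, width))
                break
        else:
            cycles.append([(name, width)])
    # Order the cycles so the total number of thread groups never increases.
    cycles.sort(key=lambda cycle: -sum(w for _, w in cycle))
    return cycles

plan = plan_cycles({"A": 3, "B": 3, "C": 2, "D": 2, "E": 1}, max_groups_per_cycle=4)
totals = [sum(w for _, w in cycle) for cycle in plan]
print(plan, totals)
```

Because every subroutine is placed in exactly one thread group set, no two thread group sets across the planned cycles execute the same kind of subroutine.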
In one possible design, the execution count may be determined according to the number of parent programs that call the i-th kind of subroutine. Determining the execution count in this way makes it more accurate, so that the thread group set of each kind of subroutine can be determined accurately and the number of idle threads in each instruction cycle reduced, further improving resource utilization and data processing efficiency.
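Counting executions by the number of calling parent programs can be sketched as a simple tally. The sketch assumes, for illustration only, that each parent program calls exactly one subroutine; the function name `execution_counts` and the subroutine names are hypothetical.

```python
from collections import Counter

def execution_counts(called_by_parent):
    """Tally how many parent programs call each kind of subroutine.

    called_by_parent[n] is the subroutine called by the n-th parent program
    (assumption: one call per parent program).
    """
    return Counter(called_by_parent)

# N = 6 parent programs calling I = 3 kinds of subroutines.
calls = ["closest_hit", "miss", "closest_hit", "any_hit", "miss", "closest_hit"]
counts = execution_counts(calls)
N, I = len(calls), len(counts)
print(dict(counts), N, I)
```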
In a second aspect, a data processing apparatus is provided. The data processing apparatus includes a determining module and an acquiring module. The determining module is configured to determine I kinds of subroutines called by N parent programs, where N > I, and N and I are positive integers greater than 1. The acquiring module is configured to acquire the execution count $L_i$ of the i-th kind of subroutine among the I kinds of subroutines, where $L_i$ is a positive integer. The determining module is further configured to determine, according to the execution counts of the kinds of subroutines, J instruction cycles and $K_j$ thread group sets in the j-th of the J instruction cycles, where J and $K_j$ are positive integers. One thread group set includes one or more thread groups, the thread group sets in the J instruction cycles each execute a different kind of subroutine, and in each thread group set the number of threads that do not execute the subroutine is less than the number of threads in one thread group.
In one possible design, the determining module is further configured to determine, according to the execution count, the thread group set corresponding to the i-th kind of subroutine and the instruction cycle corresponding to that thread group set.
Optionally, the number of thread groups in the thread group set corresponding to the i-th kind of subroutine, the number of threads in one thread group, and the execution count of the i-th kind of subroutine satisfy one or more conditions, including: $(W_i - 1) \cdot m \le L_i \le W_i \cdot m$, where $W_i$ is the number of thread groups in the thread group set corresponding to the i-th kind of subroutine, and m is the number of threads in one thread group.
Optionally, the determining module is further configured to determine a first order of the I kinds of subroutines according to the execution counts, and to determine, according to the first order, the thread group set corresponding to the i-th kind of subroutine and the instruction cycle corresponding to that thread group set. The first order indicates the order in which the thread group sets corresponding to the kinds of subroutines are acquired.
Further, the first order indicates that the thread group set corresponding to a subroutine with a larger execution count is acquired preferentially.
In one possible design, the total number of thread groups in all thread group sets of the j-th instruction cycle is greater than or equal to the total number of thread groups in all thread group sets of the (j+1)-th instruction cycle.
In one possible design, the execution count may be determined according to the number of parent programs that call the i-th kind of subroutine.
It should be noted that, alternatively, the determining module and the acquiring module may be integrated into one module, such as a processing module. The processing module is used for realizing the operation of each functional module.
Optionally, the data processing apparatus according to the second aspect may further include a storage module, such as an instruction cache, that stores a program or instructions. When the processing module executes the program or instructions, the data processing apparatus is enabled to perform the data processing method according to the first aspect.
Optionally, the data processing apparatus of the second aspect may further include an input/output port. Wherein the input/output port is used for realizing the input and/or output functions of the instructions and/or data of the data processing device according to the second aspect.
Further, the input/output port may also be coupled with a transceiver. The transceiver is used for information exchange between the data processing apparatus and other data processing apparatuses, and the processor executes program instructions to perform the data processing method according to the first aspect. The data processing apparatus according to the second aspect may be a computing device, such as a mobile phone, a tablet (Pad), a personal computer, or a server; may be a chip (system) or another part or component, such as a processor, that can be disposed in the computing device; or may be a cluster of computing devices that includes the computing device, such as a server cluster, which is not limited in this application.
In addition, the technical effects of the data processing apparatus according to the second aspect may refer to the technical effects of the data processing method according to the first aspect, which are not described herein.
In a third aspect, a data processing apparatus is provided. The data processing apparatus is configured to perform the data processing method described in the implementation manner in the first aspect.
In this application, the data processing apparatus according to the third aspect may be the computing device according to the first aspect, or a chip (system) or other part or component that may be provided in the computing device, or an apparatus that includes the computing device.
It should be understood that the data processing apparatus according to the third aspect includes a corresponding module, unit, or means (means) for implementing the data processing method according to the first aspect, where the module, unit, or means may be implemented by hardware, software, or implemented by hardware executing corresponding software. The hardware or software comprises one or more modules or units for performing the functions involved in the data processing methods described above.
In addition, the technical effects of the data processing apparatus according to the third aspect may refer to the technical effects of the data processing method according to the first aspect, which are not described herein.
In a fourth aspect, a data processing apparatus is provided. The data processing apparatus includes: a processor for performing the data processing method according to a possible implementation manner of the first aspect.
In a possible implementation manner, the data processing apparatus according to the fourth aspect may further include a transceiver. The transceiver may be a transceiver circuit or an interface circuit. The transceiver may be used by the data processing apparatus according to the fourth aspect to communicate with other data processing apparatuses.
In a possible embodiment, the data processing device according to the fourth aspect may further comprise a memory. The memory may be integral with the processor or may be separate. The memory may be used for storing computer programs and/or data involved in the data processing method described in the first aspect.
In this application, the data processing apparatus according to the fourth aspect may be the computing device according to the first aspect, or a chip (system) or other part or component that may be disposed in the computing device, or an apparatus including the computing device.
In addition, the technical effects of the data processing apparatus according to the fourth aspect may refer to the technical effects of the data processing method according to the implementation manner of the first aspect, which are not described herein.
In a fifth aspect, a data processing apparatus is provided. The data processing apparatus includes: a processor coupled to the memory, the processor configured to execute a computer program stored in the memory, to cause the data processing apparatus to perform the data processing method according to an implementation of the first aspect.
In a possible implementation manner, the data processing apparatus according to the fifth aspect may further include a transceiver. The transceiver may be a transceiver circuit or an interface circuit. The transceiver may be used by the data processing apparatus according to the fifth aspect to communicate with other data processing apparatuses.
In this application, the data processing apparatus according to the fifth aspect may be the computing device of the first aspect, or a chip (system) or other part or component that may be disposed in the computing device, or an apparatus that includes the computing device.
In addition, the technical effects of the data processing apparatus according to the fifth aspect may refer to the technical effects of the data processing method according to the implementation manner of the first aspect, which are not described herein.
In a sixth aspect, there is provided a data processing apparatus comprising: a processor and a memory; the memory is configured to store a computer program which, when executed by the processor, causes the data processing apparatus to perform the data processing method according to an implementation of the first aspect.
In a possible implementation manner, the data processing apparatus according to the sixth aspect may further include a transceiver. The transceiver may be a transceiver circuit or an interface circuit. The transceiver may be used by the data processing apparatus according to the sixth aspect to communicate with other data processing apparatuses.
In this application, the data processing apparatus according to the sixth aspect may be the computing device of the first aspect, or a chip (system) or other part or component that may be disposed in the computing device, or an apparatus including the computing device.
In addition, the technical effects of the data processing apparatus described in the sixth aspect may refer to the technical effects of the data processing method described in the implementation manner of the first aspect, which are not described herein.
In a seventh aspect, there is provided a data processing apparatus comprising: a processor; the processor is configured to execute the data processing method according to the implementation of the first aspect after being coupled to the memory and reading the computer program in the memory.
In a possible implementation manner, the data processing apparatus according to the seventh aspect may further include a transceiver. The transceiver may be a transceiver circuit or an interface circuit. The transceiver may be used by the data processing apparatus according to the seventh aspect to communicate with other data processing apparatuses.
In this application, the data processing apparatus according to the seventh aspect may be the computing device of the first aspect, or a chip (system) or other part or component that may be provided in the computing device, or an apparatus that includes the computing device.
In addition, the technical effects of the data processing apparatus according to the seventh aspect may refer to the technical effects of the data processing method according to the implementation manner of the first aspect, which are not described herein.
In addition, the technical effects of the data processing apparatus according to the fifth aspect to the seventh aspect may refer to the technical effects of the data processing method according to the first aspect, and are not described herein again.
In an eighth aspect, a processor is provided. Wherein the processor is configured to perform the data processing method according to an implementation manner of the first aspect.
In a ninth aspect, a data processing system is provided. The data processing system includes one or more computing devices.
In a tenth aspect, there is provided a computer readable storage medium comprising: computer programs or instructions; the computer program or instructions, when run on a computer, cause the computer to perform the data processing method described by an implementation of the first aspect.
In an eleventh aspect, a computer program product is provided, comprising a computer program or instructions which, when run on a computer, cause the computer to perform the data processing method according to an implementation of the first aspect.
Drawings
FIG. 1 is a schematic structural diagram of a terminal device provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a data processing method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram showing a relationship between a parent program and a child program according to an embodiment of the present disclosure;
FIG. 4 is a second schematic diagram of a relationship between a parent program and a child program according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a relationship between a thread group set and a subroutine according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating a relationship between a thread group set and a thread group according to an embodiment of the present disclosure;
FIG. 7 is a second schematic diagram of a relationship between a thread group set and a subroutine according to an embodiment of the present disclosure;
FIG. 8 is a second schematic diagram of a relationship between a thread group set and a thread group according to an embodiment of the present disclosure;
FIG. 9 is a third schematic diagram of a relationship between a parent program and a child program according to an embodiment of the present disclosure;
FIG. 10 is a third schematic diagram of a relationship between a thread group set and a subroutine according to an embodiment of the present disclosure;
FIG. 11 is a fourth schematic diagram of a relationship between a thread group set and a subroutine according to an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 13 is a second schematic structural diagram of the data processing apparatus according to an embodiment of the present application.
Detailed Description
Technical terms related to the embodiments of the present application are described below.
1. Ray tracing (ray tracing): a method for displaying an object. Tracing proceeds in the direction opposite to that of the rays reaching the viewpoint: for each pixel on the screen, the object surface point that intersects the line of sight through that pixel is found, and tracing then continues to find all light sources that affect the light intensity at that surface point, so that the exact light intensity at the surface point can be calculated.
2. Shader program (shader program): an editable program for implementing image rendering.
3. Subroutine: a subroutine, such as a shader program, is code consisting of one or more statement blocks. It is responsible for a particular task, is relatively independent of other code, and may be called by other programs.
4. Parent program: a set of code that implements a certain function and can call subroutines during execution.
5. Thread (thread): the smallest unit of a graphics processor (graphics processing unit, GPU) that can be scheduled for operation. A thread can be used to execute programs, such as a parent program and the subroutines called by the parent program.
The technical solutions in the present application will be described below with reference to the accompanying drawings.
The technical scheme of the embodiment of the application can be applied to the computing equipment. The computing device may be a terminal device, such as a client (client), a mobile phone (mobile phone), a tablet (Pad), a server, or a computing device cluster including the computing device, such as a server cluster, which is not limited in this application.
The present application will present various aspects, embodiments, or features about a system that may include multiple devices, components, modules, etc. It is to be understood and appreciated that the various systems may include additional devices, components, modules, etc. and/or may not include all of the devices, components, modules etc. discussed in connection with the figures. Furthermore, combinations of these schemes may also be used.
In addition, in the embodiments of the present application, words such as "exemplary" and "for example" are used to indicate an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "example" is intended to present concepts in a concrete fashion.
In the embodiments of the present application, "information", "signal", "message", "channel", and "signaling" may sometimes be used interchangeably; it should be noted that their intended meanings are consistent when the differences are not emphasized. Likewise, "of", "relevant", and "corresponding" may sometimes be used interchangeably, and their intended meanings are consistent when the differences are not emphasized.
The device structure described in the embodiments of the present application is intended to describe the technical solutions of the embodiments more clearly and does not constitute a limitation on them. A person of ordinary skill in the art will appreciate that, as devices evolve and new service scenarios emerge, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
For the sake of understanding the embodiments of the present application, a terminal device shown in fig. 1 is first taken as an example to describe in detail a device suitable for the embodiments of the present application.
FIG. 1 is a schematic structural diagram of a terminal device 100 to which the data processing method provided in the embodiments of the present application is applicable. As shown in FIG. 1, the terminal device 100 includes a processor 110, an internal memory 121, an external memory interface 122, an antenna A, a mobile communication module 131, an antenna B, a wireless communication module 132, an audio module 140, a speaker 140A, a receiver 140B, a microphone 140C, an earphone interface 140D, a display screen 151, a subscriber identification module (subscriber identification module, SIM) card interface 152, a camera 153, keys 154, a sensor module 160, a universal serial bus (universal serial bus, USB) interface 170, a charge management module 180, a power management module 181, and a battery 182. In other embodiments, the terminal device 100 may also include a motor, an indicator, and the like.
Wherein the processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (application processor, AP), a modem, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. It should be noted that the different processing units may be separate devices, or may be integrated in one or more separate processors, or may be integrated in the same device with other modules in the terminal device. For example, the modem may be a processing unit independent of the processor 110, may be integrated with other processing units (e.g., AP, ISP, GPU, etc.) in the same device, or may integrate some or all of the functions with the mobile communication module 131 in the same device. Further, taking the controller as an example, the controller may be a processing unit independent of the processor 110, may be integrated with other processing units (for example, a video codec, a digital signal processor, etc.) in the same device, or may integrate part or all of the functions with the mobile communication module 131 in the same device.
The internal memory 121 may be used to store data and/or at least one computer program comprising instructions. In particular, the internal memory 121 may include a program storage area and a data storage area. The program storage area may store at least one computer program. The computer program may include an application program (such as a gallery or contacts), an operating system (such as the Android operating system or the iOS operating system), or another program; for example, the computer program may include a program for performing static detection. The data storage area may store at least one of data created during use of the terminal device, data received from other devices (e.g., other terminal devices, network devices, servers, or external memories), or data stored before delivery. For example, the data stored in the internal memory 121 may be at least one of an application program, source code of the application program, an image, a file, or information such as an identifier.
In some embodiments, the internal memory 121 may include high-speed random access memory and/or nonvolatile memory. For example, the internal memory 121 includes one or more disk storage devices, flash memory devices (flash), or universal flash memory (universal flash storage, UFS), or the like.
The processor 110 may cause the terminal device to implement one or more functions by invoking one or more computer programs and/or data stored in the internal memory 121, to meet the needs of the user. For example, the processor 110 may cause the terminal device to execute the data processing method provided in the embodiments of the present application by calling the instructions and data stored in the internal memory 121.
The external memory interface 122 may be used to connect an external memory card (e.g., a Micro SD card) to expand the storage capacity of the terminal device. The external memory interface 122 may also be used to connect hard disks, optical disks, USB flash drives, and the like. The external memory card communicates with the processor 110 through the external memory interface 122 to implement data storage functions. For example, files such as images, music, videos, application installation packages, and application source code are stored in the external memory card. In some embodiments, a buffer may further be disposed in the processor 110 to store instructions and/or data that the processor 110 needs to reuse. If the processor 110 needs to use the instructions or data again, it can fetch them directly from the buffer, which helps avoid repeated accesses, reduces the waiting time of the processor 110, and thus helps improve system efficiency. For example, the buffer may be implemented by a cache memory.
Alternatively, the processor 110 may obtain data to be processed or instructions to be executed from the internal memory 121 or from an external memory through the external memory interface 122. After the processor 110 obtains the data to be processed or the instruction to be executed, the data to be processed or the instruction to be executed may be stored in the buffer.
Antenna A and antenna B are used for transmitting and receiving electromagnetic wave signals. Each antenna in the terminal device may be used to cover one or more communication bands. Different antennas may also be multiplexed to improve antenna utilization. For example, antenna A may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antennas may be used in conjunction with a tuning switch.
The mobile communication module 131 may be configured to implement communication between the terminal device and a network device according to a mobile communication technology (e.g., 2G, 3G, 4G, or 5G) supported by the terminal device. By way of example, the mobile communication technology supported by the terminal device may include at least one of global system for mobile communications (global system for mobile communication, GSM), general packet radio service (general packet radio service, GPRS), code division multiple access (code division multiple access, CDMA), wideband code division multiple access (wideband code division multiple access, WCDMA), time division-synchronous code division multiple access (time division-synchronous code division multiple access, TD-SCDMA), long term evolution (long term evolution, LTE), new radio (new radio, NR), or the like. For example, if the terminal device supports GSM, then after the terminal device accesses the network through a cell provided by a base transceiver station (base transceiver station, BTS) in the GSM communication system, the terminal device can communicate with the BTS through the mobile communication module 131 when the network signal strength of the accessed cell is not lower than a decision threshold, that is, when the terminal device is camped on the network. For example, the mobile communication module 131 may amplify a modulated signal and then transmit it to the network device through antenna A; the mobile communication module 131 may also receive a signal transmitted by the network device through antenna A, amplify it, and send it to the modem, which demodulates the received signal into a low-frequency baseband signal and then performs other corresponding processing.
In some embodiments, the mobile communication module 131 may include filters, switches, power amplifiers, low noise amplifiers (low noise amplifier, LNA), and the like.
The wireless communication module 132 may provide wireless communication solutions applied on the terminal device, including wireless local area network (WLAN) (e.g., a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like. The GNSS may include at least one of a global positioning system (GPS), a global navigation satellite system (GLONASS), a BeiDou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or a satellite based augmentation system (SBAS), among others. By way of example, the wireless communication module 132 may be one or more devices that integrate at least one communication processing module. The wireless communication module 132 may communicate with a corresponding device through antenna B according to a wireless communication technology (e.g., Wi-Fi, Bluetooth, FM, or NFC) that it supports.
The terminal device may implement audio functions, such as music playing and recording, through the audio module 140, the speaker 140A, the receiver 140B, the microphone 140C, the headphone interface 140D, the AP, and the like.
The terminal device may implement a display function through the GPU, the display screen 151, the AP, and the like. The display screen 151 may be used to display images, video, and the like. The display screen 151 may include a display panel. The display panel may employ a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the terminal device may include 1 or N display screens 151, N being a positive integer greater than 1.
The keys 154 may include a power on key, a volume key, etc. The key 154 may be a mechanical key, a virtual button, a virtual option, or the like. The terminal device may receive key inputs, generating key signal inputs related to user settings of the terminal device and function control. For example, the terminal device may, in response to selecting a virtual option for indicating approval to participate in the "user experience improvement plan," count or gather some information about the use of the terminal device by the user to enable more personalized services to be provided to the user, thereby enhancing the user experience.
The sensor module 160 may include one or more sensors. For example, the sensor module 160 includes an acceleration sensor 160A, a touch sensor 160B, a fingerprint sensor 160C, and the like. In some embodiments, the sensor module 160 may also include pressure sensors, gyroscopic sensors, environmental sensors, distance sensors, proximity sensors, bone conduction sensors, and the like.
The acceleration sensor (ACC sensor) 160A may collect the magnitude of acceleration of the terminal device in all directions (typically three axes). The magnitude and direction of gravity can be detected when the terminal device is stationary. In addition, the acceleration sensor 160A may also be used to identify the posture of the terminal device, and may be applied to applications such as landscape/portrait screen switching and pedometers. In some embodiments, the acceleration sensor 160A may be coupled to the processor 110 through a micro controller unit (MCU), thereby helping to reduce the power consumption of the terminal device. For example, the acceleration sensor 160A may be connected to the AP and the modem via the MCU. In some embodiments, the MCU may be a general-purpose intelligent sensor hub (sensor hub).
The touch sensor 160B may also be referred to as a "touch panel". The touch sensor 160B may be disposed on the display screen 151, and the touch sensor 160B and the display screen 151 form a touch screen, which is also referred to as a "touch screen". The touch sensor 160B is used to detect a touch operation acting thereon or thereabout. The touch sensor 160B may communicate the detected touch operation to the AP to determine the touch event type. Then, the terminal device provides visual output related to the touch operation through the display screen 151 according to the determined touch event type. In other embodiments, the touch sensor 160B may also be disposed on a surface of the terminal device, different from the location of the display 151.
The fingerprint sensor 160C is used to collect a fingerprint. The terminal equipment can utilize the collected fingerprint characteristics to realize fingerprint unlocking, access application locks, fingerprint photographing, fingerprint incoming call answering and the like.
In other embodiments, the processor 110 may also include one or more interfaces. For example, the interface may be a SIM card interface 152. As another example, the interface may be a USB interface 170. As yet another example, the interface may be an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, or the like. It will be appreciated that the processor 110 in the embodiment of the present application may be connected to different modules of the terminal device through the interfaces, so that the terminal device can implement different functions, such as searching the web or taking a photograph. It should be noted that the connection manner of the interfaces in the terminal device is not limited in the embodiment of the present application.
The SIM card interface 152 is used to connect a SIM card. The SIM card may be inserted into or withdrawn from the SIM card interface 152 to come into contact with or be separated from the terminal device. The terminal device may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 152 may support Nano SIM cards, Micro SIM cards, and the like. Multiple cards may be inserted into the same SIM card interface 152 simultaneously. The types of the multiple cards may be the same or different. The SIM card interface 152 may also be compatible with different types of SIM cards. In some embodiments, the SIM card interface 152 may also be compatible with external memory cards. The terminal device implements functions such as calls and data communication through the SIM card. In some embodiments, the terminal device may also employ an eSIM, that is, an embedded SIM card. The eSIM card can be embedded in the terminal device and cannot be separated from the terminal device.
The USB interface 170 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 170 may be used to connect a charger to charge the terminal device, or to transfer data between the terminal device and a peripheral device. It can also be used to connect a headset and play audio through the headset. The interface may also be used to connect other electronic devices, such as AR devices.
It should be understood that the connection relationship between the modules illustrated in the embodiment of the present application is only illustrative, and does not limit the structure of the terminal device. In other embodiments of the present application, the terminal device may also use interfacing manners different from those in the foregoing embodiments, or a combination of multiple interfacing manners.
The charge management module 180 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger. The power management module 181 is used to connect the battery 182, the charge management module 180 and the processor 110. The power management module 181 receives input from the battery 182 and/or the charge management module 180 to power the processor 110 and the like. In some embodiments, the power management module 181 may also be used to monitor parameters such as battery capacity, battery cycle times, battery state of health (leakage, impedance), etc.
It should be understood that the structure of the terminal device shown in fig. 1 is only one example. The computing device of embodiments of the present application may have more or fewer components than shown in the figures, may combine two or more components, or may have a different configuration of components. The various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
It should be noted that, the data processing method provided in the embodiment of the present application may be applied to the terminal device shown in fig. 1, such as a mobile phone, a tablet, a computer, etc., and specific implementation may refer to the following method embodiments, which are not described herein again.
It should be noted that the solution in the embodiments of the present application may also be applied to other computing devices, and the corresponding names may also be replaced by names of corresponding functions in other computing devices. It should be understood that fig. 1 is a simplified schematic diagram that is merely illustrated for ease of understanding, and that other modules, and/or other components, may also be included in the terminal device, which are not depicted in fig. 1.
It is to be understood that the solution in the embodiments of the present application may be applied to a computing device, such as a mobile phone, a tablet (PAD), a personal computer, or a server; to a chip (system) or another part or component that may be disposed in the computing device, such as a processor; or to a cluster of computing devices including the computing device, such as a server cluster. This is not limited in this application.
The data processing method provided in the embodiment of the present application will be specifically described below with reference to fig. 2 to 13.
Fig. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application. The data processing method may be applied to the terminal device shown in fig. 1.
As shown in fig. 2, the data processing method includes the steps of:
S201: Determine I types of subroutines called by N parent programs.
Where N > I, and N and I are positive integers greater than 1.
Illustratively, a parent program is a set of code that performs a function and that calls other programs when executed. Each parent program is executed by one thread, and the N parent programs are parent programs of the same type, each executed by one of N threads. Illustratively, parent programs of the same type may be programs corresponding to the same set of code.
Illustratively, a subroutine is a program that is called by a parent program. Subroutines of the same type have the same code, while subroutines of the same type that are called by different parent programs process different data.
For example, in ray tracing, the type of subroutine called by each parent program may be determined based on the intersection test results corresponding to the parent program. The type of the subroutine called by the parent program may also be determined based on other data corresponding to each parent program.
FIG. 3 is a diagram illustrating the data flow of a parent program and a child program. As shown in fig. 3, the same parent program may call different types of subroutines, and data may be transferred between the parent program and the subroutine called by the parent program through data packets, that is, the parent program and the subroutine called by the parent program may read and write data of the same data packet. Illustratively, the data packet may include data stored within a particular memory region.
FIG. 4 is a diagram illustrating a data flow between a parent program and a child program. As shown in FIG. 4, in ray tracing, a parent program may be a program for determining the rays produced by a light source, such as a ray generation parent shader program. The child programs that the parent program may call include child shader program A and child shader program B, and the data packet corresponding to the ray generation parent shader program is a ray data packet. The ray generation parent shader program may determine which child program to call based on the result of the intersection test between the corresponding ray and the objects in the image region corresponding to the parent program. If the ray intersects an object in the image region corresponding to the parent program, child shader program A can be called; if the ray does not intersect any object in the image region corresponding to the parent program, child shader program B can be called. Illustratively, the result of the intersection test between the ray and the object may be obtained after the intersection test instruction (traceRay) is executed. The intersection test may be performed by the parent program or by a program other than the parent program.
It should be noted that, in the embodiment of the present application, a thread may be a one-dimensional thread, for example, a thread identified as 0, 1, 2, or 3. A thread in this embodiment may alternatively be a two-dimensional or multi-dimensional thread, for example, a thread identified by an array such as (0, 0), (1, 0), (0, 1), or (1, 1).
S202: Obtain the number of executions $L_i$ of the i-th type of subroutine among the I types of subroutines.
Where $L_i$ is a positive integer.
Illustratively, the number of executions is the number of times each type of subroutine is to be executed.
Illustratively, $N = \sum_{i=1}^{I} L_i$, that is, the number of parent programs is equal to the sum of the numbers of executions of all types of subroutines.
Optionally, the number of executions may be determined according to the number of parent programs that call the i-th type of subroutine.
As shown in Table 1, the threads corresponding to thread group 0 are thread 0-thread 3, the threads corresponding to thread group 1 are thread 4-thread 7, and the threads corresponding to thread group 2 are thread 8-thread 11. If the parent program of 8 threads including thread 0 to thread 4, thread 6, thread 8 and thread 10 calls the child program 0, the number of execution times of the child program 0 is 8. If the parent program of 4 threads including thread 5, thread 7, thread 9 and thread 11 calls the subroutine 1, the number of executions of the subroutine 1 is 4.
TABLE 1
In this way, the number of executions of the i-th type of subroutine is determined according to the number of parent programs that call it, so the number of executions obtained is more accurate. The thread group set of each type of subroutine can then be determined accurately, which reduces the number of idle threads in each instruction cycle and further improves resource utilization and data processing efficiency.
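As a minimal illustrative sketch (not part of the embodiment), the counting described for Table 1 can be expressed as follows. The list `calls` and the function name `count_executions` are hypothetical; the values are reconstructed from the text above, where the list index is the thread identifier and the value is the subroutine called by that thread's parent program.

```python
from collections import Counter


def count_executions(calls):
    """Return {subroutine id: number of parent programs that call it}."""
    return dict(Counter(calls))


# Hypothetical input mirroring the Table 1 example: threads 0-4, 6, 8 and 10
# call subroutine 0; threads 5, 7, 9 and 11 call subroutine 1.
calls = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]
executions = count_executions(calls)

# The counts add up to the number of parent programs (N = sum of L_i).
total = sum(executions.values())
print(executions, total)
```

Under these assumptions, subroutine 0 is counted 8 times and subroutine 1 is counted 4 times, and the total equals the 12 parent programs.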
S203: Determine, according to the number of executions of each type of subroutine, J instruction cycles and $K_j$ thread group sets in the j-th instruction cycle.
Where J and $K_j$ are positive integers; one thread group set includes one or more thread groups; the types of subroutines executed by all thread group sets in the J instruction cycles are different from one another; and in each thread group set, the number of threads that do not execute a subroutine is smaller than the number of threads in one thread group.
Illustratively, an instruction cycle may be the length of time required to complete the execution of a subroutine.
Illustratively, a thread group set is the collection of all thread groups used to execute one type of subroutine in an instruction cycle.
It should be noted that one thread group set may include one thread group or a plurality of thread groups.
Illustratively, $\sum_{j=1}^{J} K_j = I$, that is, the total number of thread group sets in all instruction cycles is the same as the number of subroutine types.
Illustratively, all thread group sets in the J instruction cycles execute different types of subroutines, that is, each thread group set in each instruction cycle executes one type of subroutine. For example, among a plurality of instruction cycles, there are 3 instruction cycles: instruction cycle T1-instruction cycle T3. Instruction cycle T1 includes thread group set 0 and thread group set 1, instruction cycle T2 includes thread group set 2 and thread group set 3, and instruction cycle T3 includes thread group set 4. Thread group set 0, thread group set 1, thread group set 2, thread group set 3, and thread group set 4 each execute a different subroutine. FIG. 5 is a schematic diagram illustrating a correspondence between thread group sets and subroutines. As shown in fig. 5, thread group set 0 executes subroutine 0, thread group set 1 executes subroutine 1, thread group set 2 executes subroutine 2, thread group set 3 executes subroutine 3, and thread group set 4 executes subroutine 4.
Illustratively, the number of threads in each set of thread groups that do not execute a subroutine is less than the number of threads of one thread group, that is, there is at least one thread for executing a subroutine per thread group in each instruction cycle.
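The constraints stated for S203 can be checked mechanically. The following is a hedged sketch only: the function name `check_schedule`, the schedule layout (a list of instruction cycles, each a list of `(subroutine_id, number_of_thread_groups)` pairs), and the sample numbers are all assumptions made for illustration.

```python
def check_schedule(schedule, executions, threads_per_group):
    """Check an assumed schedule against the constraints described for S203.

    schedule: list of instruction cycles; each cycle is a list of
              (subroutine_id, n_thread_groups) pairs, one per thread group set.
    executions: {subroutine_id: number of executions}.
    """
    sets = [s for cycle in schedule for s in cycle]
    ids = [sub_id for sub_id, _ in sets]
    # Every thread group set runs a different subroutine type, and the total
    # number of sets over all cycles equals the number of subroutine types.
    if len(set(ids)) != len(ids) or set(ids) != set(executions):
        return False
    for sub_id, n_groups in sets:
        idle = n_groups * threads_per_group - executions[sub_id]
        # Idle threads in a set must number fewer than one whole thread group.
        if not 0 <= idle < threads_per_group:
            return False
    return True


# Illustrative numbers: 4 threads per thread group, five subroutine types.
executions = {0: 7, 1: 5, 2: 3, 3: 4, 4: 1}
good = [[(0, 2), (2, 1)], [(1, 2), (3, 1)], [(4, 1)]]
bad = [[(0, 3), (2, 1)], [(1, 2), (3, 1)], [(4, 1)]]  # set 0 wastes a whole group
print(check_schedule(good, executions, 4), check_schedule(bad, executions, 4))
```

In this sketch the second schedule is rejected because the set for subroutine 0 would leave 5 threads idle, which is not smaller than the 4 threads of one thread group.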
In one possible design, determining J instruction cycles and $K_j$ thread group sets in the j-th instruction cycle according to the number of executions of each type of subroutine may include: determining, according to the number of executions, the thread group set corresponding to the i-th type of subroutine and the instruction cycle in which that thread group set is located.
There may be a plurality of determined instruction cycles. Still taking subroutine 0-subroutine 4 of FIG. 5 as an example, the thread groups executing the parent program include thread group W0-thread group W11. The number of thread groups in the thread group set corresponding to subroutine 0 is 2, the number of thread groups in the thread group set corresponding to subroutine 1 is 3, the number of thread groups in the thread group set corresponding to subroutine 2 is 2, the number of thread groups in the thread group set corresponding to subroutine 3 is 2, and the number of thread groups in the thread group set corresponding to subroutine 4 is 3. FIG. 6 is a diagram illustrating a relationship between thread group sets and thread groups. As shown in fig. 6, if five thread groups are used to execute subroutines in instruction cycle T1, thread group set 0 including thread group W0 and thread group W1, and thread group set 1 including thread group W2-thread group W4, may be determined in instruction cycle T1. Thread group set 2 including thread group W0 and thread group W1, and thread group set 3 including thread group W2 and thread group W3, are determined in instruction cycle T2. Thread group set 4 including thread group W0-thread group W2 is determined in instruction cycle T3.
FIG. 7 is a diagram illustrating a relationship between a thread group set and a subroutine. As shown in fig. 7, the thread group set 0 in fig. 6 may execute the subroutine 0, the thread group set 1 may execute the subroutine 1, the thread group set 2 may execute the subroutine 2, the thread group set 3 may execute the subroutine 3, and the thread group set 4 may execute the subroutine 4.
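One way to arrive at the layout of FIG. 6 is to walk through the subroutines in order and open a new instruction cycle whenever the next thread group set no longer fits. The sketch below illustrates this under two assumptions that the text does not state as rules: at most `max_groups_per_cycle` thread groups run per instruction cycle, and thread groups are numbered from W0 within each cycle. The function name `assign_in_order` is hypothetical.

```python
def assign_in_order(groups_needed, max_groups_per_cycle):
    """Place thread group sets into instruction cycles in subroutine order.

    groups_needed[i] is the number of thread groups in the set of subroutine i.
    Returns a list of cycles; each cycle maps subroutine id -> list of
    thread group indices (0 stands for W0, 1 for W1, and so on).
    """
    cycles, current, used = [], {}, 0
    for sub_id, need in enumerate(groups_needed):
        if need > max_groups_per_cycle:
            raise ValueError("a thread group set must fit in one cycle")
        if used + need > max_groups_per_cycle:
            # The set does not fit: close this cycle and start the next one.
            cycles.append(current)
            current, used = {}, 0
        current[sub_id] = list(range(used, used + need))
        used += need
    if current:
        cycles.append(current)
    return cycles


sizes = [2, 3, 2, 2, 3]                     # sets of subroutine 0 to subroutine 4
three_cycles = assign_in_order(sizes, 5)    # layout comparable to FIG. 6
one_cycle = assign_in_order(sizes, 12)      # all twelve thread groups in T1
print(three_cycles)
print(one_cycle)
```

With a limit of five thread groups the sketch yields three instruction cycles; with all twelve thread groups available it places every set in a single instruction cycle.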
It will be appreciated that in this embodiment, the number of instruction cycles determined may be 1, in other words, all subroutines may be executed in the same instruction cycle. Still taking subroutine 0-subroutine 4 of fig. 5 as an example, subroutine 0-subroutine 4 may be executed in instruction cycle T1. The following is described with reference to fig. 8.
FIG. 8 is a second diagram illustrating a relationship between thread group sets and thread groups. As shown in fig. 8, thread group set 0 corresponding to subroutine 0 may include thread group W0 and thread group W1; thread group set 1 corresponding to subroutine 1 may include thread group W2-thread group W4; thread group set 2 corresponding to subroutine 2 may include thread group W5 and thread group W6; thread group set 3 corresponding to subroutine 3 may include thread group W7 and thread group W8; and thread group set 4 corresponding to subroutine 4 may include thread group W9-thread group W11.
It should be noted that, the examples of fig. 5 to 8 are only used to help understand the present embodiment, the numbers of the thread groups corresponding to the respective subroutines may be discontinuous, and the thread groups included in the thread group set corresponding to the respective subroutines may be other thread groups. For example, the thread group set corresponding to subroutine 1 may include thread group W9-thread group W11, or the thread group set corresponding to subroutine 1 may include thread group W0, thread group W3, thread group W6, and so on. The number of thread groups in the thread group set is not limited in this embodiment. When multiple instruction cycles are included, the set of thread groups corresponding to each subroutine may be in any cycle.
Optionally, determining, according to the number of executions, the thread group set corresponding to the i-th type of subroutine and the instruction cycle in which that thread group set is located may include step 1 and step 2.
Step 1: Determine a first order of the I types of subroutines according to the numbers of executions.
The first order is used to indicate the order in which the thread group sets corresponding to the subroutines are acquired.
In this embodiment, before step 1, the data processing method may further include: acquiring a second order of each type of subroutine. Specifically, the second order may be acquired according to attribute information corresponding to the parent programs that call each type of subroutine. The attribute information may include a material identifier, which may be integer data.
Taking the subroutines in Table 2 as an example, the parent programs of thread 0-thread 2, thread 4, thread 6-thread 8, and thread 10 call subroutine 0; sorting these threads according to the attribute information of each thread yields the second order of subroutine 0 called by their parent programs. The parent programs of thread 3, thread 5, thread 9, and thread 11 call subroutine 1; sorting these threads according to the attribute information of each thread yields the second order of thread 3, thread 5, thread 9, and thread 11.
TABLE 2
For example, the attribute information includes material 1, material 2, material 3, and material 4, whose priorities from high to low are: material 4, material 3, material 2, and material 1. If the materials corresponding to thread 3, thread 5, thread 9, and thread 11 are material 1, material 2, material 3, and material 4 respectively, the sorted result of the subroutines corresponding to these threads is: thread 11, thread 9, thread 5, and thread 3. The second order of subroutine 0 is obtained in a similar manner to that of subroutine 1, and details are not repeated here.
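The priority-based sorting in this example can be sketched as follows. The dictionary names and the function `second_order` are hypothetical, and the rank table simply encodes the priority stated above (attribute 4 highest, attribute 1 lowest).

```python
def second_order(threads, attribute_of_thread, priority_rank):
    """Sort threads that call the same subroutine by attribute priority.

    priority_rank maps an attribute identifier to its rank (0 = highest).
    """
    return sorted(threads, key=lambda t: priority_rank[attribute_of_thread[t]])


# Threads whose parent programs call subroutine 1, with their attribute ids.
attribute_of_thread = {3: 1, 5: 2, 9: 3, 11: 4}
# Priority from high to low: attribute 4, 3, 2, 1.
priority_rank = {4: 0, 3: 1, 2: 2, 1: 3}

order_sub1 = second_order([3, 5, 9, 11], attribute_of_thread, priority_rank)
print(order_sub1)
```

Under these assumptions the result is thread 11, thread 9, thread 5, thread 3, matching the order given in the example.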
Step 2: Determine, according to the first order, the thread group set corresponding to the i-th type of subroutine and the instruction cycle in which that thread group set is located.
Still taking the threads and subroutines in Table 2 as an example, in thread group 0-thread group 2, the types of subroutines called in sequence by the parent programs of thread 0-thread 11 are: subroutine 0, subroutine 0, subroutine 0, subroutine 1, subroutine 0, subroutine 1, subroutine 0, subroutine 0, subroutine 0, subroutine 1, subroutine 0, and subroutine 1. The number of executions of subroutine 0 is 8 and the number of executions of subroutine 1 is 4, so the first order can be determined to be eight instances of subroutine 0 followed by four instances of subroutine 1. The thread group set corresponding to subroutine 0 includes thread group 0 and thread group 1, and the thread group set corresponding to subroutine 1 includes thread group 2.
It should be noted that, when a thread executes a subroutine, the thread may acquire the data for executing the subroutine, such as a data packet or, in ray tracing, a ray data packet, according to the thread in which the parent program calling the subroutine is located. The subroutine actually executed by each thread may be the subroutine called by the parent program in any of the threads in the thread groups. For example, a subroutine executed by one thread may be the subroutine called by the parent program in that thread, or may be a subroutine called by the parent program in another thread. In other words, when a thread executes a subroutine, the subroutine may be executed based on data corresponding to the thread itself, or based on data corresponding to another thread. Specifically, the data may be obtained according to the identifier of the thread in which the parent program that calls the subroutine is located.
The following continues with Table 2 to describe how each thread obtains the data it needs to execute a subroutine. For example, in ray tracing, the threads in a thread group may obtain ray data packets according to the subroutines that the threads need to execute. Thread 0 may obtain the ray data packet corresponding to thread 0, thread 1 obtains the ray data packet corresponding to thread 1, thread 2 obtains the ray data packet corresponding to thread 2, thread 3 obtains the ray data packet corresponding to thread 4, thread 4 obtains the ray data packet corresponding to thread 6, thread 5 obtains the ray data packet corresponding to thread 7, thread 6 obtains the ray data packet corresponding to thread 8, thread 7 obtains the ray data packet corresponding to thread 10, thread 8 obtains the ray data packet corresponding to thread 3, thread 9 obtains the ray data packet corresponding to thread 5, thread 10 obtains the ray data packet corresponding to thread 9, and thread 11 obtains the ray data packet corresponding to thread 11.
In this embodiment, the data is acquired according to the identifier of the thread, which avoids sorting the data required for executing the subroutines and thereby improves efficiency.
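The thread-to-data mapping listed above is what a stable sort of thread identifiers by subroutine type produces: only the identifiers are reordered, and each executing thread then reads the data packet of the identifier assigned to it. The following sketch is illustrative only; `source_threads` and the `calls` list are hypothetical names, with the values taken from the Table 2 description.

```python
def source_threads(calls, first_order):
    """For each executing thread, return the id of the thread whose data it reads.

    calls[t] is the subroutine called by the parent program of thread t.
    first_order lists subroutine ids in the order their sets are acquired.
    Python's sorted() is stable, so threads that call the same subroutine
    keep their original relative order.
    """
    rank = {sub_id: pos for pos, sub_id in enumerate(first_order)}
    return sorted(range(len(calls)), key=lambda t: rank[calls[t]])


# Table 2 example: threads 3, 5, 9 and 11 call subroutine 1, the rest call 0.
calls = [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1]
sources = source_threads(calls, first_order=[0, 1])

# Executing thread k reads the packet that belongs to thread sources[k];
# the packets themselves are never moved or sorted.
for k, src in enumerate(sources):
    print(f"thread {k} reads the packet of thread {src}")
```

Threads 0 to 7 end up executing subroutine 0 and threads 8 to 11 execute subroutine 1, while every packet stays where its parent program wrote it.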
In this way, the thread group set corresponding to a subroutine that is earlier in the first order can be determined first, which further improves resource utilization.
It should be noted that the first order may be obtained by a sorting module. The sorting module may be a software functional module (such as a shader program in ray tracing), a hardware structure, or a module combining software and hardware. Still taking ray tracing as an example to describe the operation of the sorting module, as shown in fig. 4, when obtaining the first order, the sorting module obtains the subroutine identifiers, or the subroutine identifiers and the attribute information, from the data buffer and derives the first order from them. After obtaining the first order, the sorting module stores it into the data buffer.
It should be noted that the sorting module may obtain the subroutine identifier from the parent program, where the subroutine identifier is information identifying the type of the subroutine, and then obtain the first order according to the subroutine identifiers. The sorting module may also be used to create a new task. This is described below with reference to fig. 9.
FIG. 9 is a diagram illustrating a relationship between a parent program and a child program. As shown in fig. 9, in ray tracing, after the ray generation shader program obtains the subroutine identifier, the subroutine identifier is transmitted to the sorting module, and the sorting module counts the execution times of each subroutine, and creates a new task according to the execution times of each subroutine.
Further, the first order may indicate to preferentially acquire the thread group set corresponding to the subroutine with the greater number of execution times.
Still taking the subroutines and threads in Table 2 as an example, as shown in Table 3, if at most 2 thread groups are used to execute subroutines in each instruction cycle, the thread group set corresponding to subroutine 0 may be obtained as thread group 0 and thread group 1 in instruction cycle T1, and the thread group set corresponding to subroutine 1 may be obtained as thread group 0 in instruction cycle T2. It will be readily appreciated that, in instruction cycle T1, the threads in thread group 2 do not execute a subroutine, so the resources occupied by the threads in thread group 2 may be released. In instruction cycle T2, the threads in thread group 1 do not execute a subroutine, so the resources occupied by the threads in thread group 1 may be released.
TABLE 3
Therefore, the thread group set corresponding to the subprogram with the large execution times is preferentially determined, the subprogram with the large occupied number of threads can be executed in an earlier instruction period, and the resources occupied by the idle thread group are released, so that the resource utilization rate is further improved.
Alternatively, the first order may further indicate to preferentially acquire the thread group set corresponding to the subroutine with the smaller execution number.
The implementation of preferentially acquiring the thread group set corresponding to the subroutine with fewer executions is similar to that of preferentially acquiring the thread group set corresponding to the subroutine with more executions, and details are not repeated here.
Thus, according to the execution times of each subprogram, the thread group set and the instruction period corresponding to the subprogram are determined, so that the number of thread groups for executing the same subprogram can be reduced, and/or the number of instruction periods required by all subprograms can be reduced, thereby further improving the resource utilization rate and the data processing efficiency.
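Both variants of the first order (more executions first, or fewer executions first) amount to sorting the subroutine types by their execution counts. A minimal sketch, assuming the counts are held in a dictionary; the function name `first_order` and the second set of counts are illustrative only.

```python
def first_order(executions, most_first=True):
    """Order subroutine ids by number of executions.

    most_first=True  -> the subroutine with more executions comes first.
    most_first=False -> the subroutine with fewer executions comes first.
    """
    return sorted(executions, key=executions.get, reverse=most_first)


table2_counts = {0: 8, 1: 4}                     # from the Table 2 example
sample_counts = {0: 7, 1: 5, 2: 3, 3: 4, 4: 1}   # illustrative values

print(first_order(table2_counts))                      # more executions first
print(first_order(sample_counts))                      # more executions first
print(first_order(sample_counts, most_first=False))    # fewer executions first
```

Which direction is preferable depends on the goal: placing the largest sets first lets idle thread groups be released earlier, as described above, while the opposite order is equally simple to obtain.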
Alternatively, determining, according to the number of executions, the thread group set corresponding to the i-th type of subroutine and the instruction cycle in which that thread group set is located may include: determining the thread group set corresponding to the i-th type of subroutine and the instruction cycle of that thread group set according to the respective numbers of executions of the I types of subroutines.
The following describes how to determine the thread group set corresponding to the i-th type of subroutine and the instruction cycle of that thread group set according to the respective numbers of executions of the I types of subroutines.
For example, the number of threads in one thread group is 4, and the parent program is executed by threads in 5 thread groups, where the number of times of execution of the subroutine 0 is 7, the number of times of execution of the subroutine 1 is 5, the number of times of execution of the subroutine 2 is 3, the number of times of execution of the subroutine 3 is 4, and the number of times of execution of the subroutine 4 is 1. It may be obtained that the set of thread groups corresponding to the subroutine 0 includes two thread groups, the set of thread groups corresponding to the subroutine 1 includes two thread groups, the set of thread groups corresponding to the subroutine 2 includes one thread group, the set of thread groups corresponding to the subroutine 3 includes one thread group, and the set of thread groups corresponding to the subroutine 4 includes one thread group. How the instruction cycles of the set of thread groups corresponding to the subroutine of fig. 9 are determined is described below in connection with fig. 10.
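The thread group counts in this example follow from a ceiling division of each execution count by the thread group size. The sketch below reproduces that arithmetic; the function names are hypothetical.

```python
import math


def groups_needed(executions, threads_per_group):
    """Thread groups in each set: ceil(executions / threads per group)."""
    return {s: math.ceil(n / threads_per_group) for s, n in executions.items()}


def idle_threads(executions, threads_per_group):
    """Threads in each set that do not execute the subroutine."""
    need = groups_needed(executions, threads_per_group)
    return {s: need[s] * threads_per_group - n for s, n in executions.items()}


executions = {0: 7, 1: 5, 2: 3, 3: 4, 4: 1}
need = groups_needed(executions, threads_per_group=4)
idle = idle_threads(executions, threads_per_group=4)
print(need)
print(idle)
```

Every set leaves fewer than 4 threads idle, which is consistent with the requirement that the idle threads in a thread group set number fewer than the threads of one thread group.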
FIG. 10 is a third diagram illustrating the correspondence between subroutines and thread groups. As shown in FIG. 10, a total of 3 thread groups, thread group W0 to thread group W2, are used to execute the subroutines in each instruction cycle. In instruction cycle T1, the thread group set obtained for subroutine 0 includes thread group W0 and thread group W1. Thread group W2 then remains in instruction cycle T1, so it can be used as the thread group set of a subroutine whose number of executions does not exceed 4, such as subroutine 2. In instruction cycle T2, the thread group set obtained for subroutine 1 includes thread group W0 and thread group W1, and the remaining thread group W2 may be used as the thread group set corresponding to subroutine 3. In instruction cycle T3, the thread group set corresponding to subroutine 4 may be any one of thread group W0 to thread group W2. Thus, only 3 thread groups are needed in instruction cycle T1 and in instruction cycle T2, so the resources occupied by the remaining thread groups can be released. In instruction cycle T3, only 1 thread group is needed, so the resources occupied by the remaining thread groups can likewise be released. In this way, thread groups that are not needed can release their resources as early as instruction cycle T1, thereby improving the resource utilization.
As shown in fig. 11, in each instruction cycle, 7 thread groups may be used to execute the subroutine, and the corresponding thread group sets of the subroutine 0-4 may all be in the instruction cycle T1. Specifically, the thread group set corresponding to the subroutine 0 may include a thread group W0 and a thread group W1, the thread group set corresponding to the subroutine 1 may include a thread group W3 and a thread group W4, the thread group set corresponding to the subroutine 2 may include a thread group W2, the thread group set corresponding to the subroutine 3 may include a thread group W5, and the thread group set corresponding to the subroutine 4 may include a thread group W6. Thus, after instruction cycle T1, all subroutines are executed, and therefore, the resources occupied by all thread groups are freed.
It will be appreciated that in this embodiment, the number of thread groups for executing the subroutine may be the same or different per instruction cycle.
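The placement described for fig. 10 and fig. 11 can be sketched as a greedy packing. This is a minimal illustration under stated assumptions, not the method of the embodiment itself: the function name `schedule`, the placement order (more executions first), and the rule of using the earliest instruction cycle with enough free thread groups are all hypothetical choices. Which single-group subroutine fills a leftover thread group is a free choice, so the sketch may pair subroutines differently from the figures while using the same number of instruction cycles.

```python
from math import ceil

def schedule(exec_counts, threads_per_group, groups_per_cycle):
    """Greedy sketch: put each subroutine's thread group set into the
    earliest instruction cycle that still has enough free thread groups."""
    # Thread groups needed by each subroutine (execution count rounded up).
    need = {s: ceil(n / threads_per_group) for s, n in exec_counts.items()}
    # Assumed ordering: subroutines with more executions are placed first.
    order = sorted(need, key=lambda s: exec_counts[s], reverse=True)
    cycles = []  # each entry: {"free": int, "sets": {subroutine: [group names]}}
    for s in order:
        if need[s] > groups_per_cycle:
            raise ValueError(f"subroutine {s} does not fit in one cycle")
        for cyc in cycles:
            if cyc["free"] >= need[s]:
                break
        else:
            # No existing cycle has room, so open a new instruction cycle.
            cyc = {"free": groups_per_cycle, "sets": {}}
            cycles.append(cyc)
        first = groups_per_cycle - cyc["free"]
        cyc["sets"][s] = [f"W{g}" for g in range(first, first + need[s])]
        cyc["free"] -= need[s]
    return [cyc["sets"] for cyc in cycles]

# Execution counts of subroutines 0 to 4 from the example, 4 threads per group.
counts = {0: 7, 1: 5, 2: 3, 3: 4, 4: 1}
for label, capacity in (("3 thread groups per cycle", 3),
                        ("7 thread groups per cycle", 7)):
    print(label)
    for t, sets in enumerate(schedule(counts, 4, capacity), start=1):
        print(f"  T{t}: {sets}")
```

With 3 thread groups per cycle the sketch needs three instruction cycles, and with 7 thread groups per cycle it needs one, matching the two cases discussed above.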
It should be noted that, in the embodiment of the present application, each subroutine identifier, or the subroutine identifier and the attribute information may be stored in one data buffer. As shown in fig. 4, in ray tracing, a ray generation shader program obtains a subroutine identifier, or a subroutine identifier and attribute information, and stores it in a data buffer. The first order is obtained from a subroutine identifier, or a subroutine identifier and attribute information, in the data buffer. The second order is obtained from the attribute information.
Optionally, the number of thread groups in the thread group set corresponding to the i-th kind of subroutine, the number of threads of one thread group, and the number of executions corresponding to the i-th kind of subroutine may satisfy one or more of the following conditions:
$$W_i = \lceil L_i / m \rceil$$
or,
$$(W_i - 1) \cdot m \le L_i \le W_i \cdot m$$
where $W_i$ is the number of thread groups in the thread group set corresponding to the i-th kind of subroutine and is a positive integer, $L_i$ is the number of executions of the i-th kind of subroutine, and $m$ is the number of threads of one thread group.
For the first condition, the number of thread groups included in the thread group set corresponding to each kind of subroutine is obtained by dividing the number of executions of the subroutine by the number of threads in one thread group and rounding up. For example, if the number of threads in a thread group is 4 and the number of executions of the i-th kind of subroutine is 9, the number of thread groups $W_i$ in the corresponding thread group set is:
$$W_i = \lceil 9 / 4 \rceil = 3$$
For the second condition, the number of threads needed by each kind of subroutine, that is, its number of executions, lies between the total number of threads of n thread groups and the total number of threads of n+1 thread groups, where $n = W_i - 1$. Still taking the case in which the number of threads in a thread group is 4 and the number of executions of the i-th kind of subroutine is 9, the number of thread groups in the thread group set corresponding to the i-th kind of subroutine is calculated from:
$$(W_i - 1) \cdot 4 \le 9,$$
$$9 \le W_i \cdot 4,$$
which gives the number of thread groups in the thread group set corresponding to the i-th kind of subroutine as $W_i = 3$.
In this way, the same kind of subroutine is executed using as few thread groups as possible, and thus the resource utilization can be further improved.
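The rounding-up rule can be checked with a short sketch. The function name `thread_groups_needed` is hypothetical; it computes the fewest thread groups that cover a given number of executions and confirms that the result also satisfies the two-sided inequality.

```python
def thread_groups_needed(executions, threads_per_group):
    """Return the execution count divided by the thread group size, rounded up."""
    if executions < 1 or threads_per_group < 1:
        raise ValueError("both arguments must be positive integers")
    w = -(-executions // threads_per_group)  # integer ceiling division
    # The result also satisfies (W - 1) * m <= L <= W * m.
    assert (w - 1) * threads_per_group <= executions <= w * threads_per_group
    return w

print(thread_groups_needed(9, 4))  # 3
# Execution counts 7, 5, 3, 4, 1 with 4 threads per thread group.
print([thread_groups_needed(n, 4) for n in (7, 5, 3, 4, 1)])  # [2, 2, 1, 1, 1]
```

Because the count is rounded up, fewer than one thread group's worth of threads is left idle in each thread group set (3 idle threads in the 9-execution case).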
In one possible design, the total number of thread groups in all thread group sets of the j-th instruction cycle is greater than or equal to the total number of thread groups in all thread group sets of the (j+1)-th instruction cycle.
Still taking the subroutines and thread group sets of FIG. 6 as an example, in instruction cycle T1, the total number of thread groups in all thread group sets is 5; in instruction cycle T2, it is 4; in instruction cycle T3, it is 3.
Thus, the total amount of the thread groups for executing the subroutines in each instruction cycle is unchanged or gradually reduced, so that the condition that the whole thread groups are idle can be avoided, and the resource utilization rate is further improved.
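As a small illustration, the condition on consecutive instruction cycles can be expressed as a check over the per-cycle totals. The helper name `is_non_increasing` is hypothetical.

```python
def is_non_increasing(groups_per_cycle):
    """True if no instruction cycle uses more thread groups than the one before it."""
    return all(a >= b for a, b in zip(groups_per_cycle, groups_per_cycle[1:]))

print(is_non_increasing([5, 4, 3]))  # True: the totals for T1, T2, T3 above
print(is_non_increasing([3, 5, 1]))  # False: T2 would need more groups than T1
```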
The embodiments of the present application are further described below with reference to specific examples.
For example, the thread group executing a certain parallel computing task is thread group 0-thread group 2, the threads of thread group 0 include thread 0-thread 3, the threads of thread group 1 include thread 4-thread 7, and the threads of thread group 2 include thread 8-thread 11. Each thread executes a parent program, each parent program calls a sub-program, each parent program may call a sub-program 0, a sub-program 1 or a sub-program 2, and the parent programs corresponding to threads 0-11 are in turn parent programs 0-11.
As shown in Table 4 below, based on the ray intersection test results, each thread executes the subroutine called by its own parent program: thread 0, thread 2, thread 6, and thread 8 execute subroutine 0; thread 3, thread 5, thread 9, and thread 11 execute subroutine 1; and thread 1, thread 4, thread 7, and thread 10 execute subroutine 2. Thread group 0, thread group 1, and thread group 2 therefore each need to execute subroutine 0 to subroutine 2. It will be appreciated that, in each of instruction cycle T1 to instruction cycle T3, every thread group has threads that execute a subroutine and every thread group also has idle threads. Executing all the subroutines called by the parent programs in thread group 0 to thread group 2 requires a total of 3 instruction cycles.
TABLE 4
Taking the above thread group and subroutine in table 4 as an example, as shown in table 5, after the scheme in the embodiment of the present application is adopted, the thread of thread group 0 executes subroutine 0, the thread of thread group 1 executes subroutine 1, and the thread of thread group 2 executes subroutine 2. It will be appreciated that all subroutines can be executed in only one instruction cycle, and that each thread group can be released two instruction cycles in advance relative to the scheme of Table 4.
It will be appreciated that the subroutines described in Table 4 above may also be executed in two instruction cycles. As shown in Table 6, thread group 0 may execute subroutine 0 in instruction cycle T1, thread group 1 may execute subroutine 1 in instruction cycle T1, and thread group 0 may execute subroutine 2 in instruction cycle T2. Thus, the resources occupied by thread group 2 may be released during instruction cycle T1, and the resources occupied by thread group 1 may further be released during instruction cycle T2.
TABLE 5
It will be appreciated that the subroutines in Table 4 above may also be executed separately in 3 different instruction cycles. As shown in Table 7, the threads of thread group 0 execute subroutine 0 in instruction cycle T1, subroutine 1 in instruction cycle T2, and subroutine 2 in instruction cycle T3. Thus, the resources occupied by thread group 1 and thread group 2 may be released before instruction cycle T1 begins.
TABLE 6
TABLE 7
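The comparison between Table 4 and Table 5 can be reproduced with a short sketch. The list `calls` holds the subroutine called by the parent program of thread 0 to thread 11 as stated in the example; the function names `baseline_cycles` and `regroup` are hypothetical, and bucketing by subroutine is only one way the regrouping could be realised.

```python
from collections import defaultdict

THREADS_PER_GROUP = 4
# Subroutine called by the parent program of thread 0..11 (from the example).
calls = [0, 2, 0, 1, 2, 1, 0, 2, 0, 1, 2, 1]

def baseline_cycles(calls, m):
    """Cycles needed when every thread runs its own parent's subroutine:
    a thread group needs one cycle per distinct subroutine among its threads."""
    groups = [calls[i:i + m] for i in range(0, len(calls), m)]
    return max(len(set(g)) for g in groups)

def regroup(calls, m):
    """Bucket the calls by subroutine, then cut each bucket into thread groups
    of at most m threads, so that every thread group runs a single subroutine."""
    buckets = defaultdict(list)
    for parent, sub in enumerate(calls):
        buckets[sub].append(parent)
    return [(sub, parents[i:i + m])
            for sub, parents in sorted(buckets.items())
            for i in range(0, len(parents), m)]

print("cycles without regrouping:", baseline_cycles(calls, THREADS_PER_GROUP))
for group_id, (sub, parents) in enumerate(regroup(calls, THREADS_PER_GROUP)):
    print(f"thread group {group_id}: subroutine {sub} for parent programs {parents}")
```

After regrouping, three thread groups each run a single subroutine with no idle threads. They can all run in one instruction cycle, or be spread over two or three cycles on fewer thread groups, as in the variants above.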
As another example, thread group 0 includes threads 0-3 and thread group 1 includes threads 4-7. The parent program of thread 0 calls subroutine 1, the parent program of thread 1 calls subroutine 0, the parent program of thread 2 calls subroutine 2, the parent program of thread 3 calls subroutine 2, the parent program of thread 4 calls subroutine 2, the parent program of thread 5 calls subroutine 1, the parent program of thread 6 calls subroutine 0, and the parent program of thread 7 calls subroutine 1. The following is a description in conjunction with Table 8. As shown in Table 8, when the subroutine to be executed by a thread is determined from the state of the parent program, subroutine 0 is called when the state of the parent program is 0, subroutine 1 is called when the state of the parent program is 1, and subroutine 2 is called when the state of the parent program is 2. Since a thread group can execute only one subroutine at a time, completing the execution of the thread groups in Table 8 requires a total of three instruction cycles.
TABLE 8
Still taking the threads and subroutines in Table 8 as an example, the number of executions of subroutine 0 is 2, the number of executions of subroutine 1 is 3, and the number of executions of subroutine 2 is 3. If the same subroutine is assigned to the same thread group for execution, as shown in Table 9, subroutine 0 is executed in instruction cycle T1, subroutine 1 is executed in instruction cycle T2, and subroutine 2 is executed in instruction cycle T3. It will be appreciated that in Table 8 each thread group executes a subroutine during each instruction cycle, while in Table 9 only the threads of thread group 0 execute a subroutine during instruction cycle T1 to instruction cycle T3. During instruction cycle T1 to instruction cycle T3, the resources occupied by thread group 1 are released, so that the utilization rate of the resources is improved.
TABLE 9
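The execution counts for this second example can be tallied directly from the calls listed for thread 0 to thread 7. The sketch below is illustrative only; `calls` is transcribed from the example, while `groups_needed` and `plan` are hypothetical names for the derived values.

```python
from collections import Counter
from math import ceil

THREADS_PER_GROUP = 4
# Subroutine called by the parent program of thread 0..7 (from the example).
calls = [1, 0, 2, 2, 2, 1, 0, 1]

# One tally per kind of subroutine.
counts = Counter(calls)
# Each subroutine fits in a single thread group of 4 threads.
groups_needed = {s: ceil(n / THREADS_PER_GROUP) for s, n in counts.items()}
# Serial plan on one thread group: one subroutine per instruction cycle.
plan = {f"T{t}": s for t, s in enumerate(sorted(counts), start=1)}

print(dict(sorted(counts.items())))
print(groups_needed)
print(plan)
```

Since every subroutine needs only one thread group, a single thread group can run them one after another over three instruction cycles, which leaves the other thread group free to be released.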
In the embodiment of the present application, the thread for executing the subroutine may be a new thread. The newly started thread may be implemented by creating a new task through the ordering module as in fig. 9.
Illustratively, a newly started thread is a thread that executes only the subroutine and not the parent program. For example, if the threads executing the parent program are thread 0' to thread 7', the subroutine called by thread 0' to thread 3' is subroutine 0, and the subroutine called by thread 4' to thread 7' is subroutine 1, then thread 1, thread 2, thread 3, and thread 4 may be newly created to execute subroutine 0 and subroutine 1.
Specifically, the newly started thread may be determined according to the number of executions of each subroutine or the running duration of the counter corresponding to each subroutine. For example, when the number of executions of a subroutine is greater than a count threshold, or when the running duration of the counter corresponding to a subroutine exceeds a time threshold, a new task may be created according to the counting result of the counter corresponding to the subroutine, so as to start a new thread. In this way, system deadlock may be prevented.
Specifically, a table including various sub-program identifiers may be stored in the memory of the graphics processor, each sub-program corresponds to a counter, and when the parent program outputs a sub-program identifier to be called, the counter corresponding to the sub-program is incremented by 1, so that the execution times of each sub-program may be obtained.
Therefore, after the execution of the parent program is finished, the resources occupied by the threads of the parent program can be released, and the resource utilization rate is improved.
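The per-subroutine counters and the two launch triggers can be sketched as follows. This is a minimal software illustration, assuming a dictionary stands in for the table held in graphics processor memory; the class `SubroutineCounters`, its method names, and the abstract time unit are all hypothetical.

```python
class SubroutineCounters:
    """One counter per subroutine identifier, with two launch triggers."""

    def __init__(self, count_threshold, time_threshold):
        self.count_threshold = count_threshold
        self.time_threshold = time_threshold
        self.counts = {}   # subroutine id -> pending execution count
        self.started = {}  # subroutine id -> time at which the counter started

    def record_call(self, sub_id, now):
        """Called when a parent program outputs the identifier it wants to call."""
        if sub_id not in self.counts:
            self.counts[sub_id] = 0
            self.started[sub_id] = now
        self.counts[sub_id] += 1

    def ready(self, now):
        """Subroutines whose count or counter age exceeds its threshold."""
        return [s for s, n in self.counts.items()
                if n > self.count_threshold
                or now - self.started[s] > self.time_threshold]

    def launch(self, sub_id):
        """Hand the pending count to a new task and reset the counter."""
        self.started.pop(sub_id)
        return self.counts.pop(sub_id)

counters = SubroutineCounters(count_threshold=3, time_threshold=10)
for t, sub in enumerate([0, 1, 0, 0, 0]):
    counters.record_call(sub, now=t)
print(counters.ready(now=4))   # [0]: four calls exceed the count threshold
print(counters.launch(0))      # 4
print(counters.ready(now=12))  # [1]: its counter has been running too long
```

The time-based trigger means a subroutine with few calls is still launched eventually, which is one way to read the remark above about preventing deadlock.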
As shown in Table 10, thread group 0 includes threads 0-3 and thread group 1 includes threads 4-7.
Table 10
The parent program of thread 0 calls subroutine 0, the parent program of thread 1 calls subroutine 1, the parent program of thread 2 calls subroutine 0, the parent program of thread 3 calls subroutine 1, the parent program of thread 4 calls subroutine 0, the parent program of thread 5 calls subroutine 1, the parent program of thread 6 calls subroutine 0, and the parent program of thread 7 calls subroutine 1. There are 2 kinds of subroutines to be executed, and therefore the number of executions of each kind of subroutine can be acquired by one of two counters.
As shown in table 11, the number of times of execution of the subroutine 0 may be acquired by using the counter 0, the number of times of execution of the subroutine 1 may be acquired by using the counter 1, and the number of times of execution of the subroutine 0 is 4, and the number of times of execution of the subroutine 1 is 4.
TABLE 11
| Subroutine | 0 | 1 |
| Counter | 0 | 1 |
| Number of executions | 4 | 4 |
Based on the data processing method shown in fig. 2, the number of executions of each kind of subroutine may be determined from the plurality of kinds of subroutines called by the plurality of parent programs, and the number of instruction cycles for executing the subroutines and the number of thread group sets in each instruction cycle may then be determined from those numbers of executions. In this way, the same kind of subroutine can be executed in the same instruction cycle, the number of required instruction cycles and/or the number of required thread groups is reduced as far as possible, and the resource utilization, and therefore the data processing efficiency, is improved.
In addition, because the same kind of subroutine is executed in the same instruction cycle, threads in different thread groups within the same instruction cycle, and threads of the same thread group across different instruction cycles, do not have to switch frequently between the instructions they read. This improves the hit rate of the instruction fetch operation, and thus the resource utilization and the data processing efficiency.
The data processing method provided in the embodiment of the present application is described in detail above with reference to fig. 3 to 11. A data processing apparatus for performing the data processing method provided in the embodiment of the present application is described in detail below with reference to fig. 12 to 13.
Fig. 12 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 12, the data processing apparatus 1200 includes: a determination module 1201 and an acquisition module 1202.
For ease of illustration, fig. 12 shows only the main components of the data processing apparatus.
In some embodiments, the data processing apparatus 1200 may be adapted to the terminal device shown in fig. 1, to perform the data processing method shown in fig. 2.
The determining module is configured to determine I kinds of subroutines called by N parent programs, where N > I, and N and I are both positive integers greater than 1.
The acquisition module is configured to acquire the number of executions $L_i$ of the i-th kind of subroutine among the I kinds of subroutines, where $L_i$ is a positive integer. The determining module is further configured to determine, according to the numbers of executions of the various kinds of subroutines, J instruction cycles and $K_j$ thread group sets in the j-th instruction cycle of the J instruction cycles, where J and $K_j$ are positive integers, one thread group set includes one or more thread groups, the kinds of subroutines executed by all the thread group sets in the J instruction cycles are different from one another, and in each thread group set the number of threads that do not execute a subroutine is smaller than the number of threads of one thread group.
In one possible design, the determining module is further configured to determine, according to the number of executions, the thread group set corresponding to the i-th kind of subroutine and the instruction cycle corresponding to that thread group set.
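Optionally, the number of thread groups in the thread group set corresponding to the i-th kind of subroutine, the number of threads of one thread group, and the number of executions corresponding to the i-th kind of subroutine satisfy one or more of the following conditions:
$$W_i = \lceil L_i / m \rceil$$
or,
$$(W_i - 1) \cdot m \le L_i \le W_i \cdot m$$
where $W_i$ is the number of thread groups in the thread group set corresponding to the i-th kind of subroutine, $L_i$ is the number of executions of the i-th kind of subroutine, and $m$ is the number of threads of one thread group. Equivalently, with $n = W_i - 1$, where n is 0 or a positive integer, $L_i$ lies between the total number of threads of n thread groups and that of n+1 thread groups.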
Optionally, the determining module is further configured to determine a first order of the I kinds of subroutines according to the numbers of executions, and determine, according to the first order, the thread group sets corresponding to the I kinds of subroutines and the instruction cycles corresponding to those thread group sets. The first order is used for indicating the order in which the thread group set corresponding to each kind of subroutine is acquired.
Further, the first order indicates to preferentially acquire the thread group set corresponding to the subroutine with the more execution times.
In one possible design, the total number of thread groups in all thread group sets of the j-th instruction cycle is greater than or equal to the total number of thread groups in all thread group sets of the (j+1)-th instruction cycle.
In one possible design, the number of executions may be determined based on the number of parent programs invoking the i-th kind of subroutine.
The data processing apparatus 1200 may be a computing device, a chip (system) or other parts or components that may be disposed in the computing device, or an apparatus including the computing device, which is not limited in this application.
In addition, the technical effects of the data processing apparatus 1200 may refer to the technical effects of the data processing method described in fig. 2, and will not be described herein.
Alternatively, the determining module and the acquiring module may be integrated into one module, such as a processing module. The processing module is used for realizing the operation of each functional module.
Optionally, the data processing apparatus 1200 may further include an input/output port. Wherein the input/output ports are used to implement input and/or output functions of instructions and/or data of the data processing apparatus 1200.
Further, the input-output port may also be coupled with a transceiver. The transceiver is used for information interaction between the data processing device and other data processing devices, and the processor executes program instructions to perform the data processing method shown in fig. 2.
Optionally, the data processing apparatus 1200 may further include a storage module (not shown in fig. 12), such as an instruction cache, which stores a program or instructions that, when executed by the processing module, enable the data processing apparatus to perform the data processing method shown in fig. 2.
Fig. 13 is a second schematic structural diagram of the data processing apparatus according to the embodiment of the present application. The data processing apparatus may be a computing device, or may be a chip (system) or another part or component that may be provided in the computing device. As shown in fig. 13, a data processing apparatus 1300 may include a processor 1301.
Optionally, the data processing apparatus 1300 may further comprise a memory 1302 and/or a transceiver 1303. Processor 1301 is coupled to memory 1302 and transceiver 1303, which may be connected by a communication bus, for example.
The following describes each constituent element of the data processing apparatus 1300 in detail with reference to fig. 13:
Processor 1301 is the control center of data processing apparatus 1300, and may be one processor or a collective term for a plurality of processing elements. For example, processor 1301 is one or more central processing units (central processing unit, CPU), but may also be an application-specific integrated circuit (application specific integrated circuit, ASIC), or one or more integrated circuits configured to implement embodiments of the present application, such as: one or more microprocessors (micro processors), or one or more field programmable gate arrays (field programmable gate array, FPGAs).
Optionally, processor 1301 may perform various functions of data processing apparatus 1300 by running or executing software programs stored in memory 1302, and invoking data stored in memory 1302.
In a particular implementation, processor 1301 may include one or more CPUs, such as CPU0 and CPU1 shown in FIG. 13, as an embodiment.
Processor 1301 may also include one or more GPUs.
In a specific implementation, as an embodiment, the data processing apparatus 1300 may also include a plurality of processors, such as the processor 1301 and the processor 1304 shown in fig. 13. Each of these processors may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
The memory 1302 is configured to store a software program for executing the present application, and the processor 1301 controls the execution of the software program, and the specific implementation may refer to the above method embodiment, which is not described herein again.
Optionally, memory 1302 may be, but is not limited to, a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (random access memory, RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (electrically erasable programmable read-only memory, EEPROM), a compact disc read-only memory (compact disc read-only memory, CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 1302 may be integrated with the processor 1301 or may exist separately and be coupled to the processor 1301 through an interface circuit (not shown in fig. 13) of the data processing apparatus 1300, which is not specifically limited in this embodiment of the present application.
A transceiver 1303 for communication with other data processing apparatuses. For example, the data processing apparatus 1300 is a terminal device, and the transceiver 1303 may be used to communicate with a network device or another terminal device. For another example, the data processing apparatus 1300 is a network device, and the transceiver 1303 may be used to communicate with a terminal device or another network device.
Alternatively, transceiver 1303 may include a receiver and a transmitter (not separately shown in fig. 13). The receiver is used for realizing the receiving function, and the transmitter is used for realizing the transmitting function.
Alternatively, transceiver 1303 may be integrated with processor 1301, or may exist separately, and be coupled to processor 1301 through interface circuitry (not shown in fig. 13) of data processing apparatus 1300, as embodiments of the present application are not specifically limited.
It should be noted that the structure of the data processing apparatus 1300 shown in fig. 13 does not constitute a limitation on the data processing apparatus, and an actual data processing apparatus may include more or fewer components than those shown, may combine some components, or may have a different arrangement of components.
In addition, the technical effects of the data processing apparatus 1300 may refer to the technical effects of the data processing method described in the above method embodiments, and will not be described herein.
The embodiment of the application also provides a chip system, which comprises: a processor coupled to a memory for storing programs or instructions which, when executed by the processor, cause the system-on-a-chip to implement the method of any of the method embodiments described above.
Alternatively, the processor in the system-on-chip may be one or more. The processor may be implemented in hardware or in software. When implemented in hardware, the processor may be a logic circuit, an integrated circuit, or the like. When implemented in software, the processor may be a general purpose processor, implemented by reading software code stored in a memory.
Optionally, the memory in the system-on-chip may be one or more. The memory may be integral with the processor or separate from the processor, and is not limited in this application. For example, the memory may be a non-transitory memory, such as a ROM, which may be integrated on the same chip as the processor, or may be separately provided on different chips, and the type of memory and the manner of providing the memory and the processor are not specifically limited in this application.
The system-on-chip may be, for example, a field programmable gate array (field programmable gate array, FPGA), an application-specific integrated circuit (application specific integrated circuit, ASIC), a system on chip (SoC), a central processing unit (central processing unit, CPU), a network processor (network processor, NP), a digital signal processor (digital signal processor, DSP), a microcontroller unit (micro controller unit, MCU), a programmable logic device (programmable logic device, PLD), or another integrated chip.
The embodiment of the application provides a data processing system. The data processing system includes one or more of the computing devices described above.
It should be appreciated that the processor in embodiments of the present application may be a central processing unit (central processing unit, CPU), or may be another general-purpose processor, a digital signal processor (digital signal processor, DSP), an application-specific integrated circuit (application specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should also be appreciated that the memory in embodiments of the present application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be random access memory (random access memory, RAM), which acts as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware (e.g., circuitry), firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains one or more sets of available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
It should be understood that the term "and/or" merely describes an association relationship between associated objects and means that three relationships may exist. For example, A and/or B may mean: A alone, both A and B, or B alone, where A and B may each be singular or plural. In addition, the character "/" herein generally indicates that the associated objects are in an "or" relationship, but may also indicate an "and/or" relationship, which may be understood by referring to the context.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. For example, the division of the units is merely a logical function division, and other divisions may be used in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application essentially, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The foregoing describes merely specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.