
CN113934457B - A parallel computing optimization method based on single-core DSP - Google Patents

Info

Publication number
CN113934457B
Authority
CN
China
Prior art keywords
dsp
optimization method
program
parallel
parallel computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111094502.1A
Other languages
Chinese (zh)
Other versions
CN113934457A (en)
Inventor
苗志富
陈建新
周育逵
彭飞
刘超伟
刘波
韩朝君
张琳
王玉
陈玙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Control Engineering
Original Assignee
Beijing Institute of Control Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Control Engineering
Priority to CN202111094502.1A
Publication of CN113934457A
Application granted
Publication of CN113934457B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877 Cache access modes
    • G06F12/0884 Parallel mode, e.g. in parallel with main memory or CPU
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

A parallel computing optimization method based on a single-core DSP comprises the following steps: dividing the program to be run into key complex programs and non-key complex programs, where the key complex programs are fixed to run in the internal cache of the DSP and the non-key complex programs run in externally configured RAM; for loop statements in the program to be run, making the loop counter count down and performing no loop nesting inside the loop body; keeping the hardware resources used by each loop body within the number of arithmetic units and registers in the processor, so that the statements of each loop body can execute in parallel on multiple arithmetic units; and using single-instruction, multiple-data (SIMD) instructions, programmed in assembly, for data that undergo the same computation, thereby increasing the parallelism of data processing.

Description

Parallel computing optimization method based on single-core DSP
Technical Field
The invention relates to a parallel computing optimization method based on a single-core DSP, and belongs to the technical field of spaceborne computers.
Background
China's first Mars exploration mission achieved the goals of orbiting, landing, and roving on Mars in a single flight, and the Zhurong Mars rover, the mission's landing and roving vehicle, broke through key technologies such as safe autonomous roving on the Martian surface. During autonomous roving, the navigation control unit of the rover's GNC subsystem must run key complex algorithms such as navigation image processing, obstacle-avoidance image processing, autonomous path planning, visual odometry, and motion control, and it faces large image-data volumes and long algorithm run times. To meet the rover's requirements for autonomous, safe, and efficient roving, the complex algorithms on the navigation control unit must therefore be optimized for parallelism, so that they run more efficiently and the rover's environment sensing, autonomous planning, pose determination, and motion control functions are carried out efficiently.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the shortcomings of the prior art by providing a parallel computing optimization method based on a single-core DSP. The method improves memory-access parallelism, instruction-execution parallelism, and data-processing parallelism: segment-addressed compilation fixes the key complex algorithms to run in the DSP's internal cache, the loop and branch structures of the algorithms are improved, and the processor's single-instruction, multiple-data (SIMD) instructions are used. These parallel optimization measures raise the parallel execution efficiency of the Mars rover's key complex algorithms and ensure autonomous, safe, and efficient roving.
The object of the invention is achieved by the following technical solution:
A parallel computing optimization method based on a single-core DSP comprises the following steps:
Dividing the program to be run into key complex programs and non-key complex programs, where the key complex programs are fixed to run in the internal cache of the DSP and the non-key complex programs run in externally configured RAM (random access memory);
For loop statements in the program to be run, using a decrementing loop counter and performing no loop nesting inside the loop body; the hardware resources used by each loop body do not exceed the number of arithmetic units and the number of registers in the processor, so that the statements in each loop body can execute simultaneously on multiple arithmetic units;
Using single-instruction, multiple-data instructions to program, in assembly, the data that undergo the same computation, thereby improving the parallelism of data processing.
Preferably, addressed compilation is used for both the key complex programs and the non-key complex programs; it is accomplished by defining code sections and specifying the address space of each code section in the linker command file.
Preferably, no function calls or break jumps are made in the loop body.
Preferably, the internal cache access bit width of the DSP is much larger than the access bit width of the externally configured RAM.
Preferably, branch statements in the program to be run are rewritten as conditional-operation statements.
Preferably, the program to be run is an autonomous planning algorithm or a visual odometry algorithm.
A single-core DSP (digital signal processor), in which parallel computation is optimized using the parallel computing optimization method described above.
A processor comprising the single-core DSP described above and RAM disposed outside the DSP.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention improves memory-access parallelism: segment-addressed compilation fixes the key complex algorithms to run in the DSP's internal cache, which has a larger data-access bit width;
(2) The invention unrolls and optimizes the loop structures in the algorithms, improving instruction-execution parallelism;
(3) The invention improves the branch structure of the algorithms by replacing branch statements with conditional-operation statements, which makes full use of the processor's instruction pipeline and improves instruction-execution parallelism;
(4) The invention improves data-processing parallelism by implementing the algorithms with the processor's single-instruction, multiple-data instructions;
(5) The parallel optimization method of the invention improves the performance of the Mars rover's autonomous planning and visual odometry algorithms by a factor of 2.5.
Drawings
FIG. 1 is a schematic diagram of parallel computing based on a single core DSP.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
The parallel computing optimization method based on a single-core DSP (digital signal processor) first improves memory-access parallelism, so that multiple program instructions are fetched at once; it then improves instruction-execution parallelism, so that the fetched instructions execute in parallel; finally, it improves data-processing parallelism, so that a single instruction processes multiple data items at the same time. The specific implementation is shown in FIG. 1:
1) Step one: improve memory-access parallelism. Owing to the hierarchical memory structure of the DSP, the DSP's internal cache has a faster access rate and a larger data-access bit width than the RAM configured outside the processor. For example, in the DSP used by the Mars rover's navigation control unit, the internal cache access width is 256 bits, whereas the externally configured RAM access width is 32 bits. A single access to the internal cache therefore returns more data than a single access to the external RAM. To improve the parallel execution efficiency of the algorithms, the method compiles the different key complex algorithms to specified addresses, so that the key complex algorithm programs are fixed to run in the internal cache while the non-key complex algorithms run in the externally configured RAM.
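The effect of the wider access path can be illustrated with a small calculation. The sketch below is only an illustration: the helper name `accesses_needed` and the 32-byte block size used in the usage note are assumptions, and the model counts bus accesses only, ignoring latency, wait states, and burst transfers. The 256-bit and 32-bit widths are the figures given above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of accesses needed to fetch `bytes` bytes
 * over a path that delivers `bus_bits` bits per access (rounded up). */
size_t accesses_needed(size_t bytes, unsigned bus_bits)
{
    /* The width must be a whole number of bytes. */
    assert(bus_bits >= 8u && bus_bits % 8u == 0u);

    size_t bytes_per_access = bus_bits / 8u;
    return (bytes + bytes_per_access - 1u) / bytes_per_access;
}
```

Under this simplified model, fetching a 32-byte block takes `accesses_needed(32, 256)` = 1 access from the internal cache and `accesses_needed(32, 32)` = 8 accesses from the external RAM.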
2) Addressed compilation is accomplished by defining code sections and specifying the address space of each code section in the linker command file.
3) Defining a code section:
#pragma CODE_SECTION(key algorithm function name, ".KeyAlgorithmText")
4) Specifying the code section address spaces in the linker command file:
MEMORY
{
    InternalRAM : origin = internal cache start address, length = internal cache byte length
    ExternalRAM : origin = external memory start address, length = external memory byte length
}
SECTIONS
{
    .KeyAlgorithmText : > InternalRAM
    .NonKeyAlgorithmText : > ExternalRAM
}
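As a hedged illustration of the code-section definition, the C sketch below assigns one function to the `.KeyAlgorithmText` section. The function `sum_pixels` is a hypothetical stand-in for a key algorithm, and the guard macro assumes a TI-style toolchain that supports `#pragma CODE_SECTION`; with other compilers the pragma is skipped and the function lands in the default code section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a TI-style compiler. The pragma must precede the function's
 * declaration or definition; other toolchains skip it via the guard. */
#if defined(__TI_COMPILER_VERSION__)
#pragma CODE_SECTION(sum_pixels, ".KeyAlgorithmText")
#endif
uint32_t sum_pixels(const uint8_t *pixels, size_t count)
{
    assert(pixels != NULL || count == 0);

    uint32_t total = 0;
    /* Count-down loop over the pixel buffer. */
    for (size_t i = count; i != 0; --i) {
        total += pixels[i - 1];
    }
    return total;
}
```

The section name in the pragma has to match the name used in the SECTIONS directive exactly, including the leading dot; otherwise the linker would not map the function to the internal memory range.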
5) Step two: improve instruction-execution parallelism. Because the DSP contains multiple arithmetic functional units and multiple register banks, the many loop statements in the algorithms are unrolled and optimized so that the statements of a loop body can execute simultaneously on multiple arithmetic units. In addition, the many branch statements in the algorithms are rewritten as conditional-operation statements, which reduces program jumps, avoids interrupting the processor's instruction pipeline, and takes full advantage of the pipeline's parallelism.
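The rewriting of branch statements as conditional-operation statements might look like the following C sketch. The function names and the thresholding and clamping tasks are illustrative assumptions, not taken from the patent; whether the conditional form actually compiles to a predicated or select instruction depends on the compiler and target.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form: the if/else is typically compiled to conditional jumps. */
uint8_t threshold_branch(uint8_t pixel, uint8_t level)
{
    if (pixel > level) {
        return 255;
    } else {
        return 0;
    }
}

/* Conditional-operation form: a single expression that a compiler can map
 * to a select or predicated instruction without a jump. */
uint8_t threshold_select(uint8_t pixel, uint8_t level)
{
    return (pixel > level) ? 255 : 0;
}

/* Clamp a value to [lo, hi] using two conditional operations. */
int32_t clamp_select(int32_t value, int32_t lo, int32_t hi)
{
    assert(lo <= hi);
    value = (value < lo) ? lo : value;
    value = (value > hi) ? hi : value;
    return value;
}
```

Both threshold functions return the same result for every input; only the control-flow shape differs.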
6) The loop statements are unrolled and optimized as follows:
(1) The loop counter counts down;
(2) No function calls or break jumps are made inside the loop body;
(3) No loops are nested inside the loop body;
(4) Complex, large loop bodies are split into simple, small loop bodies, and the hardware resources used by each loop body do not exceed the number of arithmetic functional units and registers in the processor.
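The four rules above can be sketched in C as a before-and-after pair. All names (`scale`, `process_before`, `process_after`) and the arithmetic are hypothetical examples; how many functional units and registers a given loop body needs depends on the target DSP and is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

int32_t scale(int32_t x) { return 3 * x + 1; }

/* Before: one larger loop body with an up-counting index, a function call,
 * and two unrelated computations sharing the same iteration. */
void process_before(const int32_t *in, int32_t *scaled, int32_t *diff, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        scaled[i] = scale(in[i]);
        diff[i] = (i > 0) ? in[i] - in[i - 1] : 0;
    }
}

/* After: two small loops, each counting down to zero, with the call
 * replaced by its body and no break or nested loop. */
void process_after(const int32_t *in, int32_t *scaled, int32_t *diff, size_t n)
{
    assert(in != NULL && scaled != NULL && diff != NULL);

    /* Small loop 1: scaling only. */
    for (size_t i = n; i != 0; --i) {
        scaled[i - 1] = 3 * in[i - 1] + 1;
    }

    /* Small loop 2: neighbor differences; element 0 is set outside the loop. */
    if (n > 0) {
        diff[0] = 0;
    }
    for (size_t i = n; i > 1; --i) {
        diff[i - 1] = in[i - 1] - in[i - 2];
    }
}
```

Both versions produce identical `scaled` and `diff` arrays; the second form simply gives the compiler shorter, call-free loop bodies with a decrementing counter.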
7) Step three: improve data-processing parallelism. Image processing algorithms apply the same operation to large amounts of data, which matches the DSP's very long instruction word (VLIW) architecture and its support for single-instruction, multiple-data (SIMD) parallel instructions. The data-processing parts that limit algorithm efficiency are programmed in assembly using SIMD instructions, which increases the parallelism of data processing.
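The idea of one operation acting on several data items can be illustrated in portable C. The sketch below is not the assembly programming described above: it emulates a packed four-lane 8-bit addition inside a 32-bit word, and the names `add4_u8` and `add_pixels_u8` are hypothetical. On a real DSP the same effect would come from a single SIMD instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add four packed 8-bit lanes at once. Each lane wraps modulo 256 and no
 * carry crosses into the neighboring lane. */
uint32_t add4_u8(uint32_t a, uint32_t b)
{
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); /* low 7 bits per lane */
    uint32_t top  = (a ^ b) & 0x80808080u;                 /* top bit per lane */
    return low7 ^ top;
}

/* Add two pixel buffers four pixels per step; n must be a multiple of 4.
 * memcpy is used as a portable 4-byte load and store. */
void add_pixels_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n % 4u == 0u);
    for (size_t i = n; i != 0; i -= 4) {
        uint32_t wa, wb, wr;
        memcpy(&wa, a + i - 4, 4);
        memcpy(&wb, b + i - 4, 4);
        wr = add4_u8(wa, wb);
        memcpy(dst + i - 4, &wr, 4);
    }
}
```

Because every lane is handled independently, the result does not depend on byte order: adding the buffers {250, 1, 2, 3} and {10, 1, 1, 1} yields {4, 2, 3, 4}, with the first lane wrapping modulo 256.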
A single-core DSP (digital signal processor), in which parallel computation is optimized using the parallel computing optimization method described above.
A processor comprising the single-core DSP described above and RAM disposed outside the DSP.
In summary, as spacecraft become more autonomous and intelligent, the algorithms they use for on-orbit image processing and autonomous planning grow increasingly complex, and algorithm performance increasingly constrains autonomous, safe, and efficient spacecraft operation; the key complex algorithms must therefore be optimized for parallelism to improve system performance. By improving memory-access parallelism, instruction-execution parallelism, and data-processing parallelism, the parallel computing optimization method based on a single-core DSP effectively improves the parallel processing performance of key complex algorithms, meets the parallel processing needs of future spacecraft, and has broad application prospects.
Matters not described in detail in this specification are well known to those skilled in the art.
Although the invention has been described in terms of preferred embodiments, it is not limited to those embodiments. Any person skilled in the art may, without departing from the spirit and scope of the invention, make variations and modifications to the technical solution of the invention using the methods and technical content disclosed above. Any simple modification, equivalent variation, or adaptation of the above embodiments made according to the technical substance of the invention therefore falls within the scope of protection of the technical solution of the invention.

Claims (6)

1. A parallel computing optimization method based on a single-core DSP, characterized by comprising the following steps:
Dividing the program to be run into key complex programs and non-key complex programs, wherein the key complex programs are fixed to run in an internal cache of the DSP and the non-key complex programs run in externally configured RAM;
For loop statements in the program to be run, using a decrementing loop counter and performing no loop nesting inside the loop body; the hardware resources used by each loop body do not exceed the number of arithmetic units and the number of registers in the processor, so that the statements in each loop body can execute simultaneously on a plurality of arithmetic units;
Using single-instruction, multiple-data instructions to program, in assembly, the data that undergo the same computation, thereby improving the parallelism of data processing.
2. The parallel computing optimization method of claim 1, wherein no function calls or break jumps are made in the loop body.
3. The parallel computing optimization method according to claim 1 or 2, wherein the internal cache access bit width of the DSP is much larger than the access bit width of the externally configured RAM.
4. The parallel computing optimization method according to claim 1 or 2, wherein the program to be run is an autonomous planning algorithm or a visual odometry algorithm.
5. A single-core DSP, wherein parallel computation in the single-core DSP is optimized using the parallel computing optimization method of claim 1 or 2.
6. A processor comprising the single-core DSP of claim 5 and RAM disposed outside the DSP.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111094502.1A CN113934457B (en) 2021-09-17 2021-09-17 A parallel computing optimization method based on single-core DSP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111094502.1A CN113934457B (en) 2021-09-17 2021-09-17 A parallel computing optimization method based on single-core DSP

Publications (2)

Publication Number Publication Date
CN113934457A 2022-01-14
CN113934457B 2024-12-24

Family

ID=79276028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111094502.1A Active CN113934457B (en) 2021-09-17 2021-09-17 A parallel computing optimization method based on single-core DSP

Country Status (1)

Country Link
CN (1) CN113934457B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115718750B (en) * 2022-11-25 2025-09-02 天津津航计算技术研究所 A Cache-based Multi-core DSP Parallel Programming Optimization Method

Citations (2)

Publication number Priority date Publication date Assignee Title
CN103929381A (en) * 2013-01-10 2014-07-16 中国移动通信集团公司 A signal detection method and detection platform based on MIMO
CN107851004A (en) * 2015-08-17 2018-03-27 高通股份有限公司 For the register spilling management of general register (GPR)

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US20040006667A1 (en) * 2002-06-21 2004-01-08 Bik Aart J.C. Apparatus and method for implementing adjacent, non-unit stride memory access patterns utilizing SIMD instructions
CN1516009A (en) * 2003-01-08 2004-07-28 深圳市中兴通讯股份有限公司上海第二 Efficient Optimization Method of Speech Codec Based on Digital Signal Processor
US7676791B2 (en) * 2004-07-09 2010-03-09 Microsoft Corporation Implementation of concurrent programs in object-oriented languages
US10817291B2 (en) * 2019-03-30 2020-10-27 Intel Corporation Apparatuses, methods, and systems for swizzle operations in a configurable spatial accelerator

Also Published As

Publication number Publication date
CN113934457A (en) 2022-01-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant