Parallel computing optimization method based on single-core DSP
Technical Field
The invention relates to a parallel computing optimization method based on a single-core DSP, and belongs to the technical field of spaceborne computers.
Background
China's first Mars exploration mission achieved the goals of orbiting, landing and roving in a single flight, and the Zhurong Mars rover of the landing and roving assembly achieved breakthroughs in key technologies such as safe autonomous roving on the Martian surface. During autonomous roving, the navigation control unit of the rover's GNC subsystem must run key complex algorithms such as navigation image processing, obstacle avoidance image processing, autonomous path planning, visual range finding and motion control, and it faces a large volume of image data to process and long algorithm running times. Therefore, to meet the rover's requirements for autonomous, safe and efficient roving, the complex algorithms on the navigation control unit need to be optimized for parallel execution so that they run more efficiently and the rover's functions of environment sensing, autonomous planning, pose determination and motion control are realized efficiently.
Disclosure of Invention
The technical problem solved by the invention is to overcome the shortcomings of the prior art by providing a parallel computing optimization method based on a single-core DSP. The method improves memory access parallelism, instruction execution parallelism and data processing parallelism. It uses segmented addressing compilation to fix the key complex algorithms to run in the internal cache of the DSP, improves the loop and branch judgment structures of the algorithms, and uses the processor's single instruction multiple data (SIMD) instructions. These parallel optimization measures improve the parallel running efficiency of the Mars rover's key complex algorithms and ensure autonomous, safe and efficient roving.
The object of the invention is achieved by the following technical solution:
A parallel computing optimization method based on a single-core DSP comprises the following steps:
dividing the programs to be run into key complex programs and non-key complex programs, wherein the key complex programs are fixed to run in the internal cache of the DSP and the non-key complex programs run in externally configured RAM (random access memory);
unrolling and optimizing the loop statements in the programs to be run, wherein the hardware resources used by each loop body do not exceed the number of operation units and the number of registers in the processor, so that the statements in each loop body can be executed simultaneously in multiple operation units;
and programming, in assembly with single instruction multiple data (SIMD) instructions, the data that undergo the same computation, thereby improving data processing parallelism.
Preferably, segmented addressing compilation is applied to the key complex programs and the non-key complex programs, and is accomplished by defining code segments and specifying the address space of each code segment in the compiler link file.
Preferably, no function calls or break jumps are made inside a loop body.
Preferably, the internal cache access bit width of the DSP is much larger than the externally configured RAM access bit width.
Preferably, the branch judgment statements in the program to be run are changed into conditional operation statements.
Preferably, the program to be operated is an autonomous planning algorithm or a visual range finding algorithm.
A single-core DSP (digital signal processor) optimizes the parallel computation in the single-core DSP by adopting the parallel computation optimization method.
A processor includes the single-core DSP described above, and RAM disposed outside the DSP.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention improves memory access parallelism by using segmented addressing compilation to fix the key complex algorithms to run in the internal cache of the DSP, which has a larger data access bit width;
(2) The invention unrolls and optimizes the loop structures in the algorithms, improving instruction execution parallelism;
(3) The invention improves the branch judgment structure of the algorithms by replacing branch judgment statements with conditional operation statements, which fully exploits the instruction pipeline of the processor and improves instruction execution parallelism;
(4) The invention improves data processing parallelism by implementing the algorithms with the single instruction multiple data (SIMD) instructions of the processor;
(5) The parallel optimization method of the invention improves the performance of the Mars rover's autonomous planning and visual range finding algorithms by 2.5 times.
Drawings
FIG. 1 is a schematic diagram of parallel computing based on a single core DSP.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
A parallel computing optimization method based on a single-core DSP (digital signal processor) first improves memory access parallelism, so that multiple program instructions are fetched in a single access; then, with multiple instructions available, it improves instruction execution parallelism so that those instructions are executed in parallel; finally, it improves data processing parallelism, so that a single instruction processes multiple data items at once. The specific implementation process is shown in FIG. 1:
1) The first step is to improve memory access parallelism. In the hierarchical memory structure of the DSP processor, the internal cache of the DSP has a faster access rate and a larger data access bit width than the RAM configured outside the processor. For example, in the DSP used by the Mars rover navigation control unit, the access bit width of the internal cache is 256 bits, while the access bit width of the externally configured RAM is 32 bits. Compared with the externally configured RAM, a single access to the internal cache therefore returns more data. To improve the parallel running efficiency of the algorithms, the method applies segmented addressing compilation to the different algorithms, so that the key complex algorithm programs are fixed to run in the internal cache and the non-key complex algorithms run in the externally configured RAM.
2) Addressing compilation is accomplished by defining code segments and specifying the address space of each code segment in the compiler link file.
3) A code segment is defined as follows:
#pragma CODE_SECTION(key algorithm function name, ".KeyAlgorithmText")
4) The address space of each code segment is specified in the compiler link file as follows:
MEMORY
{
    InternalRAM : origin = internal cache start address, length = internal cache byte length
    ExternalRAM : origin = external memory start address, length = external memory byte length
}
SECTIONS
{
    .KeyAlgorithmText    : > InternalRAM
    .NonKeyAlgorithmText : > ExternalRAM
}
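As an illustration of how source code might be assigned to the two code segments named above, the following is a minimal C sketch. The function names plan_path_cost and log_status_code are hypothetical and do not come from the invention. The pragma is guarded by the __TI_COMPILER_VERSION__ macro on the assumption that a TI-style toolchain is used, so the file still compiles on a host compiler that does not know the pragma.

```c
#include <stdint.h>

/* Hypothetical key complex algorithm: its code is placed in the
 * segment that the link file maps to the internal cache. */
#if defined(__TI_COMPILER_VERSION__)
#pragma CODE_SECTION(plan_path_cost, ".KeyAlgorithmText")
#endif
int32_t plan_path_cost(const int32_t *cost, int32_t n)
{
    int32_t sum = 0;
    /* Decrementing counter; no calls or breaks inside the loop. */
    for (int32_t i = n; i > 0; i--) {
        sum += cost[i - 1];
    }
    return sum;
}

/* Hypothetical non-key routine: its code is placed in the segment
 * that the link file maps to the externally configured RAM. */
#if defined(__TI_COMPILER_VERSION__)
#pragma CODE_SECTION(log_status_code, ".NonKeyAlgorithmText")
#endif
int32_t log_status_code(int32_t status)
{
    return status & 0xFF;
}
```

With this arrangement, moving a function between internal and external memory only requires changing the segment name in its pragma; the function body is unchanged.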
5) The second step is to improve instruction execution parallelism. Because the DSP contains multiple operation functional units and multiple register groups, the large number of loop statements in the algorithms are unrolled and optimized so that the statements in a loop body can be executed simultaneously in multiple operation units. At the same time, the large number of branch judgment statements in the algorithms are changed into conditional operation statements, which reduces program jumps, avoids interrupting the instruction pipeline of the processor, and fully exploits the parallelism of the instruction pipeline.
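The change from a branch judgment statement to a conditional operation statement can be sketched in C as follows. This is only an illustration: the function names, the pixel-threshold task and the parameter names are hypothetical, and whether a given compiler actually emits a jump-free instruction sequence for the second form depends on the toolchain.

```c
#include <stdint.h>

/* Branch judgment form: the if statement may compile to a jump
 * that interrupts the instruction pipeline. */
int32_t count_above_branch(const uint8_t *px, int32_t n, uint8_t thr)
{
    int32_t count = 0;
    for (int32_t i = n; i > 0; i--) {
        if (px[i - 1] > thr) {
            count++;
        }
    }
    return count;
}

/* Conditional operation form: the comparison result is used as a
 * value, so the loop body contains no control-flow decision. */
int32_t count_above_cond(const uint8_t *px, int32_t n, uint8_t thr)
{
    int32_t count = 0;
    for (int32_t i = n; i > 0; i--) {
        count += (px[i - 1] > thr) ? 1 : 0;
    }
    return count;
}
```

Both functions return the same result for the same input; only the structure of the loop body differs.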
6) The loop statements are unrolled and optimized as follows:
(1) The loop counter counts down;
(2) No function calls or break jumps are made inside the loop body;
(3) Loops are not nested inside the loop body;
(4) A large, complex loop body is split into small, simple loop bodies, and the hardware resources used by each loop body do not exceed the number of operation functional units and registers in the processor.
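The rules above can be sketched in C as a loop unrolled by four. This is a minimal illustration under stated assumptions: the function name dot_unrolled4 and the dot-product task are hypothetical, the unroll factor of four is arbitrary rather than taken from the invention, and n is assumed to be a non-negative multiple of 4.

```c
#include <stdint.h>

/* Dot product unrolled by four.
 * Assumption: n is a non-negative multiple of 4; any other value
 * would read past the end of the arrays. */
int32_t dot_unrolled4(const int16_t *a, const int16_t *b, int32_t n)
{
    /* Four independent accumulators, so the four multiply-adds in
     * one iteration do not depend on each other and can be issued
     * to different operation units. */
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;

    /* Decrementing counter; no function calls, no break, and no
     * nested loop inside the body. */
    for (int32_t i = n; i > 0; i -= 4) {
        s0 += a[0] * b[0];
        s1 += a[1] * b[1];
        s2 += a[2] * b[2];
        s3 += a[3] * b[3];
        a += 4;
        b += 4;
    }
    return s0 + s1 + s2 + s3;
}
```

Keeping the body this small is what rule (4) aims at: the loop uses only a handful of registers, so it is more likely to fit within the operation units and registers that the processor provides.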
7) The third step is to improve data processing parallelism. Image processing algorithms apply the same operation to a large amount of data, which matches the very long instruction word (VLIW) architecture of the DSP processor and its support for single instruction multiple data (SIMD) parallel instructions. The data processing parts that affect algorithm efficiency are programmed in assembly with SIMD instructions, improving data processing parallelism.
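The idea of one instruction processing several data items can be illustrated in portable C. The sketch below is a stand-in for the processor's SIMD assembly instructions, not the invention's actual code: it packs four 8-bit pixels into one 32-bit word and adds all four lanes with ordinary 32-bit operations. The function name add_packed_pixels and the pixel-addition task are hypothetical, and each lane wraps modulo 256.

```c
#include <stdint.h>

/* Adds two images whose pixels are packed four per 32-bit word.
 * Each 8-bit lane is added independently (wrapping modulo 256),
 * so one pass through the loop body handles four pixels. */
void add_packed_pixels(uint32_t *dst, const uint32_t *a,
                       const uint32_t *b, int32_t words)
{
    for (int32_t i = words; i > 0; i--) {
        uint32_t x = a[i - 1];
        uint32_t y = b[i - 1];
        /* Add the low 7 bits of every lane; the result cannot
         * carry into the neighbouring lane. */
        uint32_t low = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu);
        /* Restore the top bit of every lane without a carry-out. */
        dst[i - 1] = low ^ ((x ^ y) & 0x80808080u);
    }
}
```

On a DSP with packed-data instructions, the masking steps would be replaced by a single SIMD add; the loop structure stays the same.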
A single-core DSP (digital signal processor) optimizes the parallel computation in the single-core DSP by adopting the parallel computation optimization method.
A processor includes the single-core DSP described above, and RAM disposed outside the DSP.
In summary, as spacecraft become more autonomous and intelligent, their functional algorithms for on-orbit image information processing and autonomous planning become increasingly complex, and the autonomous, safe and efficient operation of a spacecraft is increasingly constrained by algorithm performance. The key complex algorithms must therefore be optimized for parallel execution to improve system performance. By improving memory access parallelism, instruction execution parallelism and data processing parallelism, the parallel computing optimization method based on a single-core DSP effectively improves the parallel processing performance of key complex algorithms, meets the parallel processing requirements of future spacecraft, and has broad application prospects.
Matters not described in detail in this specification are well known to those skilled in the art.
Although the present invention has been described in terms of the preferred embodiments, it is not limited to those embodiments. Any person skilled in the art may, without departing from the spirit and scope of the present invention, make possible variations and modifications to the technical solution of the present invention using the methods and technical content disclosed above. Therefore, any simple modification, equivalent variation or adaptation of the above embodiments made according to the technical essence of the present invention falls within the scope of protection of the technical solution of the present invention.