RU2547618C2

RU2547618C2 - Method of setting up arithmetic accelerator for solving large systems of linear equations

Info

Publication number: RU2547618C2
Application number: RU2013123082/08A
Authority: RU
Inventors: Александр Борисович Самохин; Евгений Евгеньевич Тыртышников; Олег Валерьевич Михеев; Паулина Айкинсовна Габусу
Priority date: 2013-05-21
Filing date: 2013-05-21
Publication date: 2015-04-10
Also published as: RU2013123082A

Abstract

FIELD: physics, computer engineering.

SUBSTANCE: invention relates to computer engineering and can be used to design an arithmetic accelerator for solving large systems of linear equations. The method comprises steps of: accessing the shared memory unit of one or more tertiary or quaternary processors selected from an arbitrary set of different processors; detecting a free primary processor; dividing an intermediate result into groups; performing indexing and recording values of the intermediate result in each group in the shared memory unit; detecting a free tertiary processor and ranking the indices and, based on one of three successive indices selected from the set of indices, performing discrete fast Fourier transform; recording the transformation results in the shared memory unit; detecting a free quaternary processor; considering values of matrix elements for the first index successively; performing discrete fast Fourier transform for two other indices; multiplying term by term the obtained values on said two indices with Fourier transforms of a Toeplitz matrix for said indices; performing inverse discrete fast Fourier transform for said two indices; recording the transformation results in the local memory of the quaternary processor; performing inverse discrete fast Fourier transform for the first index; recording the result in the shared memory.

EFFECT: fewer arithmetic operations.

1 dwg

Description

The invention relates to computing, in particular to a method of organizing an arithmetic accelerator for solving large systems of linear equations, and can be used to create a new technology for engineering calculations, including volume integral equations, electrodynamics, hydroacoustics, and medical research.

A method of organizing a multiprocessor computer is known, which consists in executing computation threads in parallel by means of a distributed representation of the thread descriptor stored in virtual memory; performing the primary fetch of architectural instructions by means of thread monitors; forming graph-information dependencies of transactions, which are issued over the network to execution clusters; moving the active thread to a resident queue of threads awaiting transaction completion and selecting the next active thread; accepting transactions, rewriting their instructions, fetching and passing ready instructions to the cluster, correcting the graph, passing the transaction completion result to the monitor, and switching the thread with correction of the root of the thread descriptor representation (see, for example, RF patent 2312388, G06F 15/16, 2005).

This known method is difficult to develop.

The closest to the claimed method is a method of organizing an arithmetic accelerator for solving large systems of linear equations, which includes a shared memory accessible to primary and one or more secondary processors, receiving a plurality of coefficients associated with a set of linear equations into the shared memory and dividing them into blocks of coefficients using one or more primary processors, identifying an available secondary processor for processing a selected block of coefficients, and transferring the block from the shared memory to the local memory unit of the secondary processor, followed by processing to obtain an intermediate result and transferring it to the shared memory (see, for example, RF patent No. 7236998, G06F 7/00).

The main disadvantage of the known method is that the number of arithmetic operations grows quadratically with the number of unknowns.

The technical problem addressed by the claimed method, with the technical result indicated, is to create a method that reduces the growth in the number of arithmetic operations and makes it proportional to the number of unknowns.

The technical result is associated with the use of a method of fast multiplication of circulant matrices by a vector, based on the fast discrete Fourier transform.

The technical result is achieved by constructing a computing environment that performs the smallest number of operations and makes the number of arithmetic operations grow in proportion to the number of unknowns.

The technical problem is solved in that, in a method of organizing an arithmetic accelerator for solving large systems of linear equations, including a shared memory accessible to primary and one or more secondary processors, receiving a plurality of coefficients associated with a set of linear equations into the shared memory and dividing them into blocks of coefficients using one or more primary processors, identifying an available secondary processor for processing a selected block of coefficients, and transferring the block from the shared memory to the local memory unit of the secondary processor, followed by processing to obtain an intermediate result and transferring it to the shared memory, according to the invention: access to the shared memory unit is provided for one or more tertiary or quaternary processors selected from an arbitrary set of heterogeneous processors; a free primary processor is identified and the intermediate result is divided into groups; indexing is performed and the values of the intermediate result in each group are written to the shared memory unit; a free tertiary processor is identified, the indices are ranked, and a fast discrete Fourier transform is performed over one of three successive indices selected from the set of indices; the transform results are written to the shared memory unit; a free quaternary processor is identified; the values of the matrix elements are considered successively for the first index and fast discrete Fourier transforms are performed over the other two indices; the values obtained over these two indices are then multiplied term by term by the Fourier transforms of the Toeplitz matrix for these indices; an inverse fast discrete Fourier transform is then performed over these two indices; the transform results are written to the local memory of the quaternary processor; on completion of the process, an inverse fast discrete Fourier transform is performed over the first index and the result is written to the shared memory.

The drawing shows a general block diagram of the invention, showing processors 1, 2, 3, 4, indexing module 5, Fourier transform modules 6, 7, inverse Fourier transform module 8, shared memory unit 9, and local memory units 10, 11, 12, 13.

The diagram shows the sequence of actions reflected in the claims, namely:

The matrix of the system of linear equations obtained after discretizing the integral equations that describe engineering problems (including volume integral equations, electrodynamics, hydroacoustics, and medical research, for example the scattering of electromagnetic waves by three-dimensional dielectric structures) has block-Toeplitz form. The main computational cost of multiplying the matrix of the system of linear algebraic equations (SLAE) by a vector, which is the most expensive operation in iterative algorithms, lies in computing sums of the form:

$$W(p) = \sum_{x(q) \in Q} B(p - q)\, V(q), \quad x(p) \in Q, \qquad (1)$$

where Q is the region occupied by the scattering body, which lies inside the parallelepiped Π.

Computing W(p) at the nodal points x(p) ∈ Q requires ~N² arithmetic operations.

To reduce the number of arithmetic operations, we use the technique of fast multiplication of circulant matrices by a vector, based on the fast discrete Fourier transform.
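The technique named here can be illustrated in one dimension. The following is a minimal sketch, not the patented implementation: it assumes a circulant matrix given by its first column `c`, a power-of-two length, and a small textbook radix-2 FFT; the function names are illustrative only.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def circulant_matvec(c, v):
    """Multiply the circulant matrix with first column c by v using the FFT."""
    n = len(c)
    prod = [a * b for a, b in zip(fft(c), fft(v))]   # term-by-term product
    return [z / n for z in fft(prod, inverse=True)]  # inverse FFT with 1/n scaling

def circulant_matvec_direct(c, v):
    """Reference O(n^2) product: (C v)[i] = sum_j c[(i - j) mod n] * v[j]."""
    n = len(c)
    return [sum(c[(i - j) % n] * v[j] for j in range(n)) for i in range(n)]
```

The direct product costs on the order of n² operations, while the FFT route costs on the order of n log n, which is the source of the savings the description relies on.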

We set V(q) to zero at the points x(q) of the parallelepiped Π that do not belong to the region Q. Then (1) can be written in the form:
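The displayed formula (2) is not reproduced in this text. Assuming that the sums in (1) are discrete convolutions whose kernel depends only on the index difference (the block-Toeplitz structure noted above), and that the nodes of Π are indexed 0 ≤ q_i ≤ N_i-1 as in the steps given later, it would presumably read:

$$W(p_1, p_2, p_3) = \sum_{q_1=0}^{N_1-1} \sum_{q_2=0}^{N_2-1} \sum_{q_3=0}^{N_3-1} B(p_1 - q_1,\, p_2 - q_2,\, p_3 - q_3)\, V(q_1, q_2, q_3), \quad x(p) \in \Pi, \qquad (2)$$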

where the matrix function of discrete argument B(p) is defined for the values:

-(N₁-1) ≤ p₁ ≤ N₁-1; -(N₂-1) ≤ p₂ ≤ N₂-1; -(N₃-1) ≤ p₃ ≤ N₃-1.

For electromagnetic problems we have:

Here the function of discrete argument δ(k, n, m) is defined by the formula

where δ_kn is the Kronecker symbol.

Denote by Π₂ the parallelepiped with sides 2N₁h₁, 2N₂h₂, and 2N₃h₃. We extend the matrix function of discrete argument B(p₁, p₂, p₃) to all integer values of p₁, p₂, p₃ by taking it to be periodic in each variable with periods 2N₁, 2N₂, 2N₃, respectively. In doing so, we set B(p₁, p₂, p₃) to zero at the points (N₁, p₂, p₃), (p₁, N₂, p₃), (p₁, p₂, N₃), where p₁, p₂, p₃ are any integers. Next, we set the vector function of discrete argument V(p₁, p₂, p₃) to zero at all nodal points of Π₂ that do not belong to Π and extend it to all integer values of p₁, p₂, p₃ by taking it to be periodic in each variable with periods 2N₁, 2N₂, 2N₃, respectively. Then:
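The displayed formula (5) is not reproduced in this text. Under the periodic extensions just described, and assuming the sum is a discrete convolution in the index difference, it would presumably be the circular convolution over Π₂:

$$W(p_1, p_2, p_3) = \sum_{q_1=0}^{2N_1-1} \sum_{q_2=0}^{2N_2-1} \sum_{q_3=0}^{2N_3-1} B(p_1 - q_1,\, p_2 - q_2,\, p_3 - q_3)\, V(q_1, q_2, q_3). \qquad (5)$$

Because V vanishes outside Π, only the terms with 0 ≤ q_i ≤ N_i-1 contribute, and for those terms the differences p_i - q_i stay within the range on which B was originally defined.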

In view of the above, for x(p) ∈ Π the function W(p₁, p₂, p₃) from (5) coincides with the values W(p₁, p₂, p₃) from (2). Below, Π and Π₂ denote the integer parallelepipeds with N₁, N₂, N₃ and 2N₁, 2N₂, 2N₃ discrete arguments along each axis, respectively. Applying the discrete Fourier transform in each variable to both sides of (5), we obtain the following equality:
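The displayed formula (6) is not reproduced in this text. By the convolution theorem for the discrete Fourier transform, and using the component indices n, m = 1, 2, 3 that appear in the notation $\tilde{B}_{nm}$ and $\tilde{W}_n$ below, it would presumably take the following form, up to the normalization convention of the transform:

$$\tilde{W}_n(k_1, k_2, k_3) = \sum_{m=1}^{3} \tilde{B}_{nm}(k_1, k_2, k_3)\, \tilde{V}_m(k_1, k_2, k_3), \quad k \in \Pi_2, \quad n = 1, 2, 3. \qquad (6)$$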

From (3) and the periodicity of the functions it follows that the elements of the array $\tilde{B}_{nm}(k_1, k_2, k_3)$, k ∈ Π₂, satisfy the relations:

Thus, given that $\tilde{B}_{nm}(k_1, k_2, k_3) = \tilde{B}_{mn}(k_1, k_2, k_3)$, it is clear that the number of array elements that must be computed and stored in computer memory is given by the formula

When W(p), p ∈ Π, is computed, the main computational cost, not counting the evaluation of $\tilde{B}(k)$, k ∈ Π₂ (this array is computed once and then reused unchanged in every iteration), comes from the forward and inverse fast Fourier transforms of functions of discrete argument. In the forward Fourier transform, the function V(p) is nonzero only for p ∈ Π. In the inverse Fourier transform, on the other hand, the values of W(p) are needed only for p ∈ Π. In addition, the forward and inverse fast discrete Fourier transforms can be applied over the variables in any order. With this in mind, we first set out, step by step, a scheme for efficiently computing the vector function W(p), p ∈ Π.

Step 1. Rank N₁, N₂, N₃ and arrange them in descending order, i.e. N₁ ≥ N₂ ≥ N₃.

Step 2. Perform a fast discrete Fourier transform with respect to the integer variable p₁. This gives the array V_n(k₁, p₂, p₃), 0 ≤ k₁ ≤ 2N₁-1, 0 ≤ p₂ ≤ N₂-1, 0 ≤ p₃ ≤ N₃-1; n = 1, 2, 3. The total number of complex numbers in this array is 6N₁N₂N₃ (three components, each of size 2N₁ × N₂ × N₃).

Step 3. For each fixed k₁, 0 ≤ k₁ ≤ 2N₁-1, perform in turn a fast discrete Fourier transform with respect to the variables p₂, p₃. Then, for the same k₁, use formula (6) to obtain the values $\tilde{W}_n(k_1, k_2, k_3)$. Next, perform the inverse fast discrete Fourier transform with respect to the variables k₂, k₃. As a result, for each k₁, 0 ≤ k₁ ≤ 2N₁-1, we obtain the values W_n(k₁, p₂, p₃), 0 ≤ p₂ ≤ N₂-1, 0 ≤ p₃ ≤ N₃-1. This step requires an additional array holding 12N₂N₃ complex numbers (three components, each of size 2N₂ × 2N₃) for the Fourier transforms. Clearly this is, as a rule, substantially smaller than 6N₁N₂N₃.

Step 4. Perform the inverse fast discrete Fourier transform with respect to the variable k₁. This gives the required values W_n(p₁, p₂, p₃), 0 ≤ p₁ ≤ N₁-1, 0 ≤ p₂ ≤ N₂-1, 0 ≤ p₃ ≤ N₃-1.
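Steps 2 to 4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: it uses NumPy's `numpy.fft` routines, a scalar kernel `b` and scalar field `v` in place of the 3×3 matrix function B_nm and the three-component vector V_n, and it assumes the caller has already ordered the axes as in Step 1. The names `embed_kernel`, `fast_matvec`, and `direct_matvec` are hypothetical.

```python
import numpy as np

def embed_kernel(b):
    """Extend b (offsets -(N-1)..N-1 per axis) periodically to period 2N, zero at offset N."""
    N = [(s + 1) // 2 for s in b.shape]
    ext = np.zeros([2 * n for n in N], dtype=complex)
    idx = [np.arange(-(n - 1), n) % (2 * n) for n in N]  # offset p stored at index p mod 2N
    ext[np.ix_(*idx)] = b
    return ext

def fast_matvec(b, v):
    """Slab-by-slab FFT product; v has shape (N1, N2, N3), b has shape (2N1-1, 2N2-1, 2N3-1)."""
    N1, N2, N3 = v.shape                       # Step 1 assumed done: N1 >= N2 >= N3
    b_hat = np.fft.fftn(embed_kernel(b))       # computed once, reusable across iterations
    v1 = np.fft.fft(v, n=2 * N1, axis=0)       # Step 2: zero-padded FFT along p1 only
    w1 = np.empty((2 * N1, N2, N3), dtype=complex)
    for k1 in range(2 * N1):                   # Step 3: one k1 slab at a time
        slab = np.fft.fft2(v1[k1], s=(2 * N2, 2 * N3))  # zero-padded FFT in p2, p3
        slab *= b_hat[k1]                      # term-by-term product in Fourier space
        w1[k1] = np.fft.ifft2(slab)[:N2, :N3]  # keep only 0 <= p2 < N2, 0 <= p3 < N3
    return np.fft.ifft(w1, axis=0)[:N1]        # Step 4: inverse FFT in k1, keep p1 < N1

def direct_matvec(b, v):
    """Reference O(N^2) sum: W(p) = sum_q b(p - q) v(q)."""
    w = np.zeros(v.shape, dtype=complex)
    for p in np.ndindex(*v.shape):
        for q in np.ndindex(*v.shape):
            d = tuple(pi - qi + n - 1 for pi, qi, n in zip(p, q, v.shape))
            w[p] += b[d] * v[q]
    return w
```

In this sketch the only full-size intermediate arrays have shape 2N₁ × N₂ × N₃, and the doubly padded 2N₂ × 2N₃ buffer exists for one slab at a time, which mirrors the memory argument made in Steps 2 and 3.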

To implement the scheme described, memory is needed to store the array $\tilde{B}_{nm}(k_1, k_2, k_3)$, k ∈ Π₂, whose number of elements is given by formula (8), and the two arrays mentioned above. Thus, to implement the above efficient scheme for multiplying the SLAE matrix by a vector, the total number of complex numbers that must be held in computer memory at the same time is given by the formula

The number of arithmetic operations required to compute W_n(p), p ∈ Π, is estimated by the formula

where LOG(N) is defined as the sum of all divisors (prime factors) of the integer N, counted with multiplicity.
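The definition of LOG(N) can be made concrete with a short sketch. It assumes that "divisors counted with multiplicity" means the prime factors of N, which is the quantity that governs the cost of a mixed-radix FFT of length N; that reading is an assumption, and the function below is illustrative only.

```python
def LOG(n):
    """Sum of the prime factors of n, counted with multiplicity (assumed reading)."""
    total, p = 0, 2
    while p * p <= n:
        while n % p == 0:   # strip every power of the prime p
            total += p
            n //= p
        p += 1
    if n > 1:               # whatever remains is itself prime
        total += n
    return total

# 150 = 2 * 3 * 5 * 5 and 256 = 2 ** 8
values = {n: LOG(n) for n in (150, 256, 300, 512)}
```

Under this reading, LOG(2^m) = 2m, so for powers of two the quantity reduces to a multiple of the ordinary base-2 logarithm.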

Indeed, when the Fourier transform is carried out, the computations can be performed over the variables in any order. It is therefore clear that the ordering given above requires the smallest number of operations.

If the numbers N₁, N₂, N₃ are powers of 2, the well-known fast Fourier transform can be used. The number of arithmetic operations needed to compute the functions of discrete argument W(p), p ∈ Π, is then estimated by the formula

If (6) is used to multiply the SLAE matrix by a vector without applying the scheme described, the number of arithmetic operations and the required memory are several times larger.

Usually, when the fast discrete Fourier transform is used, values of N that are powers of 2 are chosen. However, when integral equations are discretized this can in many cases lead to significant additional computational cost, because the gaps between successive powers of 2 are large. Consider an example. Let N₁ = N₂ = N₃ = N₀, i.e. Π is a cube. Suppose that N₀ = 150 is sufficient to approximate the solution with the required accuracy. The nearest powers of two are 128 and 256. The value 128 does not meet the approximation requirement, so to use the standard FFT one must take N₀ = 256. Let T(N₀) be the number of arithmetic operations required to multiply the SLAE matrix by a vector as a function of N₀. Then we have

The memory required to store the SLAE matrix for N₀ = 256 is several times larger than for N₀ = 150. Hence using the FFT with N₀ = 150 is considerably more efficient than using the power-of-2 FFT.
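As a rough check of "several times", assume that the stored arrays scale with the number of grid nodes, N₀³, which is consistent with the array sizes quoted in the steps above but is an assumption here, since the memory formula itself is not reproduced in this text. The ratio of memory for the two grid sizes is then approximately

$$\left(\frac{256}{150}\right)^{3} \approx 4.97 .$$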

Note that with the FFT the number of arithmetic operations depends almost linearly on the dimension of the SLAE, as opposed to the quadratic dependence that arises when the matrix is multiplied by a vector without special fast algorithms. This is extremely important in the numerical solution of volume integral equations, which after discretization reduce to SLAEs of very large dimension (greater than 10⁶). Note also that the memory required depends almost linearly on the dimension of the matrix. This is very important, since without efficient discretization methods the required memory depends quadratically on the dimension.

On this basis, it is proposed to implement a method of organizing an arithmetic accelerator for solving large systems of linear equations.

Rank N₁, N₂, N₃ and arrange them in descending order, i.e. N₁ ≥ N₂ ≥ N₃.

Perform a fast discrete Fourier transform with respect to the integer variable p₁. This gives the array V_n(k₁, p₂, p₃), 0 ≤ k₁ ≤ 2N₁-1, 0 ≤ p₂ ≤ N₂-1, 0 ≤ p₃ ≤ N₃-1; n = 1, 2, 3. The total number of complex numbers in this array is 6N₁N₂N₃.

For each fixed k₁, 0 ≤ k₁ ≤ 2N₁-1, perform in turn a fast discrete Fourier transform with respect to the variables p₂, p₃. Then, for the same k₁, use formula (6) to obtain the values $\tilde{W}_n(k_1, k_2, k_3)$. Next, perform the inverse fast discrete Fourier transform with respect to the variables k₂, k₃. As a result, for each k₁, 0 ≤ k₁ ≤ 2N₁-1, we obtain the values W_n(k₁, p₂, p₃), 0 ≤ p₂ ≤ N₂-1, 0 ≤ p₃ ≤ N₃-1. This step requires an additional array holding 12N₂N₃ complex numbers for the Fourier transforms. Clearly this is, as a rule, substantially smaller than 6N₁N₂N₃.

Perform the inverse fast discrete Fourier transform with respect to the variable k₁. This gives the required values W_n(p₁, p₂, p₃), 0 ≤ p₁ ≤ N₁-1, 0 ≤ p₂ ≤ N₂-1, 0 ≤ p₃ ≤ N₃-1.

The proposed method is implemented as follows.

Access to the shared memory unit is provided for one or more tertiary or quaternary processors selected from an arbitrary set of heterogeneous processors. A free primary processor is then identified and the intermediate result is divided into groups; indexing is performed and the values of the intermediate result in each group are written to the shared memory unit. Next, a free tertiary processor is identified, the indices are ranked, and a fast discrete Fourier transform is performed over one of three successive indices selected from the set of indices; the transform results are written to the shared memory unit. A free quaternary processor is then identified; the values of the matrix elements are considered successively for the first index and fast discrete Fourier transforms are performed over the other two indices. The values obtained over these two indices are then multiplied term by term by the Fourier transforms of the Toeplitz matrix for these indices, after which an inverse fast discrete Fourier transform is performed over these two indices and the transform results are written to the local memory of the quaternary processor. On completion of the process, an inverse fast discrete Fourier transform is performed over the first index and the result is written to the shared memory.
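The dispatch pattern described here, in which whichever processor is free takes the next slab, keeps its result locally, and then records it in shared memory, can be sketched with a thread pool. This is only an analogy under stated assumptions: the pool stands in for the set of free quaternary processors, a Python list stands in for the shared memory unit, and `process_slabs` and `slab_fn` are hypothetical names, not part of the patent.

```python
from concurrent.futures import ThreadPoolExecutor

def process_slabs(shared_in, slab_fn, workers=4):
    """Apply slab_fn to each first-index slab on whichever worker is free."""
    shared_out = [None] * len(shared_in)   # stands in for the shared memory unit

    def task(k1):
        local = slab_fn(shared_in[k1])     # result held in the worker's local memory
        shared_out[k1] = local             # then recorded in shared memory

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(task, range(len(shared_in))))  # wait for all slabs, surface errors
    return shared_out

# Toy usage: the per-slab work here is a placeholder for the two-index transforms.
doubled = process_slabs([[1, 2], [3, 4], [5, 6]], lambda slab: [2 * x for x in slab])
```

Each slab is independent of the others for a fixed first index, which is what makes this kind of per-slab distribution possible.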

The technical result consists in reducing the growth in the number of arithmetic operations so that it is proportional to the number of unknowns, and is associated with the use of a method of fast multiplication of circulant matrices by a vector, based on the fast discrete Fourier transform.

Claims

A method of organizing an arithmetic accelerator for solving large systems of linear equations, including a shared memory accessible to primary and one or more secondary processors; receiving a plurality of coefficients associated with a set of linear equations into the shared memory and dividing them into blocks of coefficients using one or more primary processors; identifying an available secondary processor for processing a selected block of coefficients; transferring the block from the shared memory to the local memory unit of the secondary processor, followed by processing to obtain an intermediate result and transferring it to the shared memory; characterized in that access to the shared memory unit is provided for one or more tertiary or quaternary processors selected from an arbitrary set of heterogeneous processors; a free primary processor is identified and the intermediate result is divided into groups; indexing is performed and the values of the intermediate result in each group are written to the shared memory unit; a free tertiary processor is identified, the indices are ranked, and a fast discrete Fourier transform is performed over one of three successive indices selected from the set of indices; the transform results are written to the shared memory unit; a free quaternary processor is identified; the values of the matrix elements are considered successively for the first index and fast discrete Fourier transforms are performed over the other two indices; the values obtained over these two indices are then multiplied term by term by the Fourier transforms of the Toeplitz matrix for these indices; an inverse fast discrete Fourier transform is then performed over these two indices; the transform results are written to the local memory of the quaternary processor; on completion of the process, an inverse fast discrete Fourier transform is performed over the first index and the result is written to the shared memory.