RU2461055C1

RU2461055C1 - Cluster system with direct channel switching

Info

Publication number: RU2461055C1
Application number: RU2011128993/08A
Authority: RU
Inventors: Борис Николаевич Четверушкин (RU); Юрий Павлович Смольянов (RU); Алексей Оттович Лацис (RU); Георгий Сергеевич Елизаров (RU); Виктор Станиславович Горбунов (RU); Георгий Борисович Кульков (RU); Александр Георгиевич Титов (RU); Андрей Владимирович Патрикеев (RU); Виктор Викторович Парамонов (RU); Витилий Владимирович Мякушко (RU)
Original assignee: Федеральное государственное унитарное предприятие "Научно-исследовательский институт "Квант"; Институт прикладной математики им. М.В. Келдыша РАН
Priority date: 2011-07-13
Filing date: 2011-07-13
Publication date: 2012-09-10

Abstract

FIELD: information technology.

SUBSTANCE: the cluster system has master and slave computing modules, each with a PCIe bus on its motherboard, a switch, and interface boards with two ports each: one port operates in transparent port mode and the other in non-transparent port mode. The switch can form, in the PCIe address space of the master computing module, an address range that covers all access windows to the slave computing modules, and can form in each non-transparent port of an interface board a window addressed within that region of the master computing module, so that each slave computing module can access the memory of the other slave computing modules through the address space.

EFFECT: higher data transfer rate and lower latency of a single data write to "foreign" memory.

8 dwg

Description

The invention relates to computer engineering and concerns a communication environment characterized by direct switching of PCI Express buses without conversion of the communication protocol.

Clustering is a technology by which two or more servers function as one and appear to the user as a single computing resource. The main goal of clustering is to increase the performance and/or fault tolerance of the cluster system. Cluster systems are used by companies that need reliable, uninterrupted operation of business-critical servers and applications.

Computing clusters built on PCI Express direct channel switching technology can be used for computationally intensive problems in molecular pharmacology, nanoelectronics and the design of new-generation power systems, as well as for fundamental research in astrophysics, microbiology, solid-state physics, neuromathematics, Earth tomography and other fields. Clusters built on this communication environment can be expected to be efficient first of all in computations on unstructured and adaptive grids, that is, in problems where the data are scattered across the total memory of the system in small portions whose relative placement is irregular and/or changes during the computation. This is a promising, rapidly growing area of numerical methods, known to be exceptionally laborious to implement on traditional computing clusters.

A PCI Express direct-switching network was developed by the American company OneStopSystems (http://www.onestopsystems.com), but their design lacks a simple and effective way to scale the network without a concentrated central switch, a capability that is present in our implementation.

A self-assembling supercomputer is known that contains multiple racks, each carrying an array of connectors for connecting the racks' conductive power buses to the power buses of the boards, and an array of boards with integrated circuits mounted on them. The boards have fixtures for securing them to the racks, and the elements of the supercomputer are interconnected by fiber-optic and electrical buses. The system also includes a builder computer, one or more manipulators controlled by that computer, and libraries of cooled boards, fiber-optic buses and electrical cables in the form of sets of these elements held in magazines accessible to the manipulators (RU No. 2367125, H05K 7/00, H05K 7/20, G06F 15/18, G06F 1/16, G06F 1/20, published 10.09.2009).

The drawback of this solution is its insufficient speed.

A cluster system is known that comprises a master computing module, which is a computer with a PCIe bus on its motherboard, connected to a switch whose outputs are connected to slave computing modules, each with a PCIe bus on its motherboard (see the article "Cluster systems" published online on the IBS Platformix website at http://platformix.ru/rus/IT-Infrastructure/components-IT/computing/clusters-sys/index.wbp, retrieved May 2011). This system is taken as the prototype.

The drawback of this solution is the fairly high latency of data transfer over the network. In addition, the design of such a supercomputer is rather complex and expensive.

The present invention aims to achieve a technical result consisting in a higher data transfer rate and a lower latency of a single data write to "foreign" memory, at a level comparable to that of data transfer inside a computer.

This technical result is achieved in that the cluster system comprises a master computing module, which is a computer with a PCIe bus on its motherboard, connected through an interface board attached to the PCIe bus of the master computing module to a switch whose outputs are connected to slave computing modules, each having a PCIe bus on its motherboard, through interface boards each attached to the PCIe bus of the corresponding slave computing module. Each interface board has two ports, one capable of operating in transparent port mode and the other in non-transparent port mode. The master computing module is connected to the transparent port of one interface board that is linked to the switch, and the slave computing modules are connected to the non-transparent ports of the other interface boards that are linked to the switch. The switch is configured so that, when the slave computing modules are powered on, the PCIe bus and the slave computing modules attached to it are initialized, with an address space in RAM allocated for each of them in the non-transparent port of the corresponding interface board, and, when the master computing module is powered on, the PCIe bus and the master computing module attached to it are initialized, with an address space in RAM allocated for them in the transparent port of the corresponding interface board. The switch is further configured to form, in the PCIe address space of the master computing module, an address range covering all access windows to the slave computing modules, and to form in each non-transparent port of an interface board a window addressed within that address region of the master computing module, so that each slave computing module can access the memory of the other slave computing modules through the address space.
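The address arrangement described in the preceding paragraph can be illustrated with a minimal Python sketch. It is not the patented implementation: the constants `WINDOW_SIZE` and `MASTER_BASE`, the `Window` class and the helper functions are hypothetical names and values chosen only for illustration, since the source gives no concrete addresses. The sketch assumes that the slave access windows are laid out contiguously inside one range of the master's PCIe address space and that each slave's non-transparent port exposes a window mapped onto that whole range.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical values for illustration only; the source gives no addresses.
WINDOW_SIZE = 0x1000_0000      # memory exposed by each slave (256 MiB)
MASTER_BASE = 0x8_0000_0000    # start of the covering range in the master's PCIe space


@dataclass(frozen=True)
class Window:
    slave: int   # index of the slave whose memory this window exposes
    base: int    # first address of the window in the master's address space
    size: int

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


def build_master_range(num_slaves: int) -> list[Window]:
    """Lay the slave access windows side by side inside one contiguous range."""
    return [Window(i, MASTER_BASE + i * WINDOW_SIZE, WINDOW_SIZE)
            for i in range(num_slaves)]


def slave_to_master(nt_window_base: int, local_addr: int, num_slaves: int) -> int:
    """Translate an address in a slave's NT-port window into the master's range."""
    offset = local_addr - nt_window_base
    if not 0 <= offset < num_slaves * WINDOW_SIZE:
        raise ValueError("address outside the NT-port window")
    return MASTER_BASE + offset


def resolve(windows: list[Window], master_addr: int) -> tuple[int, int]:
    """Find which slave, and which offset in its memory, a master address selects."""
    for w in windows:
        if w.contains(master_addr):
            return w.slave, master_addr - w.base
    raise ValueError("address outside the covering range")
```

For example, with four slaves and a hypothetical NT-port window base of `0xC000_0000` on the sending slave, a write to `0xC000_0000 + 2 * WINDOW_SIZE + 0x40` passes through `slave_to_master` and `resolve` and selects offset `0x40` in the memory of slave 2.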

These features are essential and interrelated, forming a stable set of essential features sufficient to obtain the required technical result.

The present invention is illustrated by specific embodiments which, although not the only possible ones, clearly demonstrate that the required technical result can be achieved.

FIG. 1 - PCI Express topology;

FIG. 2 - multi-root configuration using an NT Bridge;

FIG. 3 - PCI Express switch with transparent and non-transparent ports;

FIG. 4 - general block diagram of the communication environment;

FIG. 5 - general block diagram of the switch;

FIG. 6 - block diagram of the interface board;

FIG. 7 - block diagram of the PCIe switch (PCIe Switch) of the interface board (IB), with transparent and non-transparent ports;

FIG. 8 - block diagram of the MVS-Express computing cluster.

The present invention concerns a cluster system built on PCI Express direct channel switching technology, which implements in hardware a large shared field of memory and external devices for all nodes of the cluster. Data transfer rates and latencies in such a network are comparable to those of data transfer inside a computer. The latency of a single data write to "foreign" memory drops from 3-6 microseconds (for modern clusters) to 200-400 nanoseconds. The data transfer rate lies between 600 MB/s and 1.5 GB/s, depending on the width of the cables used. The packet length at which the peak rate is reached shrinks from several kilobytes to tens of bytes. This creates the prerequisites for building a supercomputer whose architecture is close to that of original Western supercomputers with a shared memory field, but with the availability and low unit cost of supercomputers built on cluster technologies.

The communication environment of the cluster system is based on the PCI Express (PCIe) computer bus standard (FIG. 1) and has a tree structure. The root of the tree is a special node, Root Complex 1 (in PCIe bus terms), which we call the Master. The leaves of the tree are the devices attached to the bus; here these are the interface boards (IB) of all the other computing modules (CM), which we call Slaves, connected through the switch. The non-leaf vertices of the tree are switches, that is, devices able to receive a packet on some input, examine the destination address it carries and, depending on that address, forward the packet to one output or another.
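The address-based forwarding performed by the switches in this tree can be sketched as follows. This is a simplified Python model under stated assumptions, not the behavior of any particular switch chip: the `Node` class, the `route` function and the address ranges used with them are hypothetical, and each node is assumed to claim one contiguous address range that contains the ranges of its children.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    base: int                 # first address routed to this subtree
    limit: int                # one past the last address routed to this subtree
    children: list[Node] = field(default_factory=list)


def route(root: Node, addr: int) -> list[str]:
    """Follow a packet from the root down to the leaf that claims `addr`.

    At every non-leaf node (a switch) the destination address is compared
    with the range of each output, and the packet goes to the one that matches.
    """
    if not root.base <= addr < root.limit:
        raise LookupError(f"address {addr:#x} is outside the tree")
    path = [root.name]
    node = root
    while node.children:
        for child in node.children:
            if child.base <= addr < child.limit:
                node = child
                break
        else:
            raise LookupError(f"no output port claims address {addr:#x}")
        path.append(node.name)
    return path
```

With a root whose outputs lead to a leaf `ib0` (`0x1000`-`0x2000`) and to a switch covering `0x2000`-`0x4000` with leaves `ib1` and `ib2`, `route(root, 0x3004)` walks through the switch to `ib2`.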

To make this communication environment easier to examine, we first consider the PCIe bus, briefly describing its components and general terminology.

PCI Express (or PCIe, or PCI-E) is a computer bus that uses the software model of the PCI bus and a high-performance physical protocol based on serial data transfer. Unlike PCI, which transfers data over a shared bus, PCI Express is in general a packet network with a star topology: PCI Express devices communicate through a fabric formed by switches, and each device is connected directly to a switch by a point-to-point link.

The PCI Express (PCIe) bus standard distinguishes three types of devices: Root Complex 1, PCI Switch 2 and Endpoint 3 (end point or end device). Root Complex 1 is a separate processor subsystem that includes one or several PCIe ports, one or more CPUs together with RAM and a memory controller, and other internal interconnects and/or bridge functions. Simply put, the Root Complex is the device where three interfaces meet: the memory bus, PCI Express and the processor bus.

PCIe routing is based either on memory addresses or on unique device identifiers (IDs), depending on the transaction type. Each device (or function within a device) must therefore be identified by a unique number in the PCIe bus tree.
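As an illustration of such an identifier, the sketch below packs bus, device and function numbers into a single 16-bit ID. The 8/5/3-bit field layout is the conventional PCI one and is not stated in the source; the function names `pack_id` and `unpack_id` are hypothetical.

```python
def pack_id(bus: int, device: int, function: int) -> int:
    """Pack bus (8 bits), device (5 bits) and function (3 bits) into a 16-bit ID."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus, device or function number out of range")
    return (bus << 8) | (device << 3) | function


def unpack_id(device_id: int) -> tuple[int, int, int]:
    """Split a 16-bit ID back into (bus, device, function)."""
    return (device_id >> 8) & 0xFF, (device_id >> 3) & 0x1F, device_id & 0x7
```

Because every (bus, device, function) triple maps to a distinct value, the packed ID can serve as the unique routing key for ID-routed transactions.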

During system initialization the Root Complex performs enumeration to discover the buses present and the devices located on each bus, as well as the register and memory address space those devices require. The Root Complex assigns bus numbers to all PCIe buses and configures the bus numbers to be used by the PCIe switch. A PCIe switch behaves as if it were a set of multiple PCI-PCI Bridges 4 (FIG. 1, inset). The Root Complex allocates and configures memory and I/O port address space for each PCIe switch and end device. The PCIe topology is shown in FIG. 1.
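The bus-numbering part of this enumeration can be sketched as a depth-first walk. The Python model below is a simplification under stated assumptions: the `Bridge` class and `enumerate_bus` function are hypothetical, endpoints are represented only by name, and the primary/secondary/subordinate numbering follows the conventional PCI-PCI bridge scheme rather than anything spelled out in the source.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Bridge:
    name: str
    # Devices on the secondary bus: nested bridges, or endpoint names as strings.
    children: list = field(default_factory=list)
    primary: int = -1       # bus on the upstream side of the bridge
    secondary: int = -1     # bus directly behind the bridge
    subordinate: int = -1   # highest bus number reachable through the bridge


def enumerate_bus(bridge: Bridge, primary: int, next_bus: int) -> int:
    """Assign bus numbers depth-first and return the highest number used."""
    bridge.primary = primary
    bridge.secondary = next_bus
    last = next_bus
    for child in bridge.children:
        if isinstance(child, Bridge):
            last = enumerate_bus(child, bridge.secondary, last + 1)
    bridge.subordinate = last
    return last
```

A switch with one Upstream and two Downstream ports is modeled here as an upstream bridge whose secondary bus (the switch's internal virtual bus) carries two downstream bridges; enumerating it from bus 0 gives the upstream bridge the range 1-3 and the downstream bridges buses 2 and 3.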

A PCI Express device is attached by a bidirectional point-to-point serial connection called a lane. The connection between two PCI Express devices is called a link and consists of one (x1) or several (x2, x4, x8, x12, x16 and x32) bidirectional serial lanes. Every device must support an x1 link.
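How link width scales the raw data rate can be shown with a short calculation. The sketch assumes a signalling rate of 2.5 GT/s per lane with 8b/10b line encoding (PCIe 1.x figures); the source does not say which PCIe generation is used, and the result is a theoretical per-direction ceiling that ignores packet and protocol overhead. The function name `link_bandwidth_mb_s` is hypothetical.

```python
VALID_WIDTHS = (1, 2, 4, 8, 12, 16, 32)  # link widths listed in the text


def link_bandwidth_mb_s(lanes: int, rate_mt_s: int = 2500) -> int:
    """Raw one-direction link bandwidth in MB/s for a given number of lanes.

    Assumes 8b/10b encoding: of every 10 bits on the wire, 8 carry data.
    """
    if lanes not in VALID_WIDTHS:
        raise ValueError(f"unsupported link width x{lanes}")
    payload_mbit_s = rate_mt_s * 8 // 10   # data bits per second per lane, in Mbit/s
    return lanes * payload_mbit_s // 8     # convert bits to bytes
```

Under these assumptions an x1 link carries at most 250 MB/s and an x4 link 1000 MB/s in each direction; measured rates such as those quoted in this description are lower because of protocol overhead.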

A PCI-PCI Bridge (PPB) 4 is a device that attaches additional buses to the computer's local bus. If, for example, the local bus is PCIe, either another PCIe bus or, say, a "legacy" PCI bus can be attached to it. Bridges come in different types. In 1994 the PCI Special Interest Group (PCI-SIG) developed the basic specifications for PPB bridge architectures. Two main kinds of bridge are distinguished: "transparent" (or "standard") and "non-transparent" (or "embedded") bridges.

During auto-configuration a Transparent Bridge (TB) undergoes standard identification on the PCIe bus, after which it becomes transparent to the controlling processor; the Root Complex accordingly also initializes all devices located behind this PPB. This type of PPB contains no hardware resources, such as direct memory access (DMA) engines or mailbox registers, that would require a separate device driver, and it does not translate addresses from one PCIe bus to another. Thus a PCIe switch with one Upstream port and two Downstream ports consists of three transparent PPBs 4 and a virtual PCI bus (FIG. 1).

A Non-Transparent Bridge (NTB) is a bridge whose main purpose is to separate the regions, or zones, that are subject to scanning and auto-configuration by the local (primary) side of the computer. The non-transparent bridge is configured, address windows are set up, and address translation and address register maps are defined for local PCI devices in the embedded bridges on the primary (local) and secondary (backplane bus) sides. The configuration scope of the local processors is bounded by the embedded bridge, and no device discovery is performed on the PCI bus of the backplane. An NTB can thus solve the problem of using multiple processors (FIG. 2).

FIG. 2 shows Root Complex-1 and Root Complex-2 connected over the PCIe bus through NTB port 5. The task of NTB port 5 here is to isolate the address space Address Domain-1 (item 6) from Address Domain-2 (item 7). During bus auto-configuration, Root Complex-1 therefore initializes its local PCI devices up to the NTB, that is, it sees the NTB as an end device (Endpoint, or Internal Endpoint). Root Complex-2, for its part, likewise initializes its local PCI devices up to the NTB and also sees the NTB as an end device (Endpoint, or External Endpoint). FIG. 3 gives a more detailed diagram of a PCIe Bridge with three ports: the upper (Upstream) port is configured as a TB port, the lower left (Downstream) port as a TB port, and the lower right (Downstream) port as an NTB port.

Non-transparent bridges also resolve addressing conflicts by translating addresses. With reference to FIG. 2, consider an example in which Root Complex-1 writes data over the PCIe bus from its local RAM into the RAM of Root Complex-2. The main obstacle is that the two have different address spaces. PCIe is a packet network with a star topology. The header of a transmitted packet must therefore carry the Requester ID (which comprises the bus, device and function identifiers), the traffic type identifier (Traffic Class), the address, routing information, and a value giving the length of this packet or of the packet with which the requested device must respond. Because the NTB is the only point of contact between the address spaces, a packet sent from Root Complex-1 to Root Complex-2 must carry in its header an address that corresponds to the Internal Endpoint (FIG. 3). The NTB then translates the address itself and passes the packet into the address space of Root Complex-2, where it is delivered to its destination.
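The translation step performed by the NTB in this example can be sketched as a base-and-offset mapping. The Python model below is an assumption-laden illustration, not the register-level behavior of a real bridge: the `NtbWindow` class, the `write_through_ntb` helper and all addresses used with them are hypothetical, and remote memory is modeled as a plain dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NtbWindow:
    local_base: int    # where the window appears in the sender's address domain
    size: int          # window length in bytes
    remote_base: int   # where the window lands in the receiver's address domain

    def translate(self, addr: int) -> int:
        """Map an address in the sender's domain to the receiver's domain."""
        offset = addr - self.local_base
        if not 0 <= offset < self.size:
            raise ValueError("address outside the NTB window")
        return self.remote_base + offset


def write_through_ntb(window: NtbWindow, remote_memory: dict, addr: int, value: int) -> int:
    """Model a single write that crosses the bridge; returns the translated address."""
    target = window.translate(addr)
    remote_memory[target] = value
    return target
```

The sender only ever uses addresses from its own domain that fall inside the window; the offset within the window is preserved, and only the base changes when the packet enters the other domain.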

A more detailed description of the NT Bridge can be found at the following links:

1. http://www.google.ru/url?sa=t&source=web&cd=1&ved=0CB0QFjAA&url=http%3A%2F%2Fwww.plxtech.com%2Fpdf%2Ftechnical%2Fexpresslane%2FNTB_Brief_April-05.pdf&rct=j&q=Non-Transporent%20Bridging&ei=IR4sTYq0BsudOsSZpOQK&usg=AFQjCNHQ_gRT2izo9j9ZRHt1vs7BWgtmKw&cad=rja

2. http://www.eetimes.com/electronics-news/4139182/Non-Transparent-Bridging-Makes-PCI-Express-HA-Friendly

3. http://www.design-reuse.com/articles/8408/non-transparent-bridging-allows-multiprocessor-design-with-pci-express.html

4. http://www.google.ru/url?sa=t&source=web&cd=2&ved=0CCYQFjAB&url=http%3A%2F%2Fwww.plxtech.com%2Fpdf%2Ftechnical%2Fexpresslane%2FNontransparentBridging.pdf&rct=j&q=Non-Transporent%20Bridging&ei=IR4sTYq0BsudOsSZpOQK&usg=AFQjCNHOjTiobXgalN_cjMmxs5tIHZaffQ&cad=rja

5. http://www.google.ru/url?sa=t&source=web&cd=5&ved=0CEMQFjAE&url=ftp%3A%2F%2Fdownload.intel.nl%2Fdesign%2Fintarch%2FPAPERS%2F323328.pdf&rct=j&q=Non-Transporent%20Bridging&ei=IR4sTYq0BsudOsSZpOQK&usg=AFQjCNGTjUE-PTFKXLwg9xSW1WTVTXxhRA&cad=rja

Having examined the main components of the PCI Express bus, we can now describe the communication medium based on direct switching of PCI Express channels. This solution constitutes the invention.

The general block diagram of the communication medium based on direct switching of PCIe bus channels is shown in Fig. 4.

The communication medium consists of:

- switch 8;

- computing modules 9;

- interface boards 10, which are connected to the PCIe bus of the computing modules;

- network cables 11, by which the interface boards 10 are connected to switch 8.

Switch 8 (Fig. 5), which is a PCI Switch 2, forwards packets between the computing modules according to the destination address. From a logical point of view, all communications in the system are point-to-point and connect all devices directly.

A computing module is a computer that has a PCIe bus on its motherboard.

Interface board 10 (Fig. 6) is a multiport interface board with two ports: port 12 operates in transparent port mode and port 13 in non-transparent port mode. The transparent port is used only to connect the Master (the main Root Complex) to the switch. All other nodes (Slaves) connect to the switch through the non-transparent port. All the logic of the interface board resides in the PCIe switch 2 (PCIe Switch in Fig. 6). A more detailed block diagram of the PCIe switch is shown in Fig. 7. The logic shown in Fig. 7 is implemented by our own firmware image, which is loaded with a programmer. During firmware programming, the required registers of the NT port are configured: the memory BARs, whose size equals the size of the local portion. The local portion is a region of system memory at a known fixed address that is set aside when the operating system (OS) kernel boots on the computing module.

This communication medium requires the following node boot order: first all the slaves, then the switch, and finally the master node.

Powering on the slave computing modules 9: while the BIOS boots on a computing module 9, it initializes the PCIe bus and all devices connected to it and allocates address space for them. Later in the boot process, the operating system (Linux is used) allocates address space in RAM for the PCIe bus and its devices. Note that the Internal Endpoint of the NTB on the interface board (Fig. 7) appears to the computing module as an endpoint device on its local PCIe bus.

Next, the switch is started by applying power to it. At this stage no configuration or memory allocation takes place on any computing module; it simply takes some time for the switch to become operational.

After that, power is applied to the master computing module. As in the slave modules, the BIOS first initializes its bus and the devices on it; but because the master is connected to the switch through the transparent port of its interface board, it sees both the switch and the External Endpoints (Fig. 7) of the interface boards of all slave computing modules. In other words, the Master sees them as endpoint devices on its local bus, and the BIOS accordingly initializes the bus and devices and assigns address space to them. The OS then allocates address and I/O port space in its RAM. The master's PCIe address space now contains N-1 NT memory BARs (where N is the total number of computing modules), one for the External Endpoint of each slave's interface board, each the size of the local portion. The master can point each of these NT ports (access windows to the slaves) at the address of the local portion in the corresponding slave node. From then on, any access by the master to such a BAR lands in the local portion of the corresponding slave.
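The master's view after this step can be sketched as a table of N-1 access windows. The sketch assumes, purely for illustration, that the windows are packed back to back starting at one base address; in a real system the BIOS and OS choose the layout. The function names and numeric values are hypothetical.

```python
def master_windows(n_modules, first_bar, portion_size):
    """Return (start, end) for the N-1 access windows in the master's PCIe space.

    Assumption for illustration: windows are contiguous from first_bar.
    """
    return [
        (first_bar + i * portion_size, first_bar + (i + 1) * portion_size)
        for i in range(n_modules - 1)
    ]


def route(addr, windows):
    """Find which slave an address hits and the offset in its local portion."""
    for slave, (start, end) in enumerate(windows):
        if start <= addr < end:
            return slave, addr - start
    raise ValueError(f"address {addr:#x} hits no access window")


# Eight computing modules, 256 MiB local portions (illustrative values).
windows = master_windows(n_modules=8, first_bar=0x8000_0000, portion_size=0x1000_0000)
slave, offset = route(0xA000_0100, windows)
print(slave, hex(offset))  # prints 2 0x100
```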

Each slave now has one NT memory BAR (the Internal Endpoint) in its PCIe address space, the size of the local portion. Each slave can point the corresponding NT port (its access window to the master) at the address of the local portion in the master.

At this stage the master sees all the slaves, and each slave sees the master.

Now let us make the slaves visible to one another. To do this, we choose in the master's PCIe address space the minimal naturally aligned address range that covers all the access windows to the slaves, and call it the cross-access area. In each NT port we request one more NT memory BAR (the cross-access window), the size of this area, and point it at the address of the area in the master. Each slave can now see any other slave in the cross-access area through its cross-access window. During cross-access the data does not physically pass through the master: the master only provides the address space to the slaves, while the routing tables that PCI Express builds in hardware at initial start-up send all traffic along the shortest path.
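Choosing the cross-access area can be sketched as follows. The sketch takes "naturally aligned" to mean a range whose size is a power of two and whose base is a multiple of that size, which is the usual requirement for a PCI BAR; the function name and the window addresses are assumptions for illustration.

```python
def cross_access_area(windows):
    """Smallest naturally aligned (base, size) range covering all windows.

    windows is a list of (start, end) pairs with end exclusive.
    """
    lo = min(start for start, _ in windows)
    hi = max(end for _, end in windows)
    size = 1
    while True:
        base = lo & ~(size - 1)  # round lo down to a multiple of size
        if base + size >= hi:
            return base, size
        size <<= 1               # try the next power of two


# Seven 256 MiB access windows packed from 0x8000_0000 (illustrative layout).
windows = [(0x8000_0000 + i * 0x1000_0000, 0x8000_0000 + (i + 1) * 0x1000_0000)
           for i in range(7)]
base, size = cross_access_area(windows)
print(hex(base), hex(size))  # prints 0x80000000 0x80000000
```

Alignment can make the area larger than the sum of the windows: the same seven windows starting at 0xC000_0000 would need an 8 GiB area based at address 0.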

Any computing module can now directly access the memory of all slave computing modules (the access mode used is Master DMA). Note also that, in the base software, a read from remote memory is implemented as a software request for a counter-write: the reader asks the owner of the data to write it back.
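The read-by-counter-write scheme can be illustrated with a toy model in which nodes can only write into each other's local portions. The `Node` class, its request list, and the method names are hypothetical; they stand in for whatever mechanism the base software actually uses.

```python
class Node:
    """A computing module with a local portion exposed to its peers (toy model)."""

    def __init__(self, name, portion_size=64):
        self.name = name
        self.portion = bytearray(portion_size)  # memory reachable by remote writes
        self.requests = []                      # stand-in for a request mailbox

    def write_to(self, peer, offset, data):
        """Remote write: store data directly into the peer's local portion."""
        peer.portion[offset:offset + len(data)] = data

    def request_read(self, peer, src_offset, length, dst_offset):
        """Ask the peer to write its data back instead of reading it directly."""
        peer.requests.append((self, src_offset, length, dst_offset))

    def serve_requests(self):
        """Answer each pending read request with a counter-write."""
        while self.requests:
            requester, src, length, dst = self.requests.pop(0)
            self.write_to(requester, dst, bytes(self.portion[src:src + length]))


a, b = Node("a"), Node("b")
b.portion[0:4] = b"data"
a.request_read(b, src_offset=0, length=4, dst_offset=8)
b.serve_requests()
print(bytes(a.portion[8:12]))  # prints b'data'
```

Only writes cross the fabric in this model, which matches the text's point that a remote write is the primitive operation and a read is built on top of it.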

Below is a description of an example implementation: the MVS-Express cluster computer system, which uses this communication medium, and the K-100 computer. The operating principle of the communication medium is explained using the block diagram of the MVS-Express computer (Fig. 8).

The MVS-Express computer comprises the following hardware:

- 8 computing nodes, each with the following characteristics:

  - Processors: 2 x Opteron 2382; 7 cores available to the user's task;

  - RAM: 16 GB;

  - Disk: SATA, 320 GB;

  - Network card: Gigabit Ethernet;

  - Interface board;

  - Video card: nVidia GeForce 285GTX, 240 GPU cores;

  - ATX server chassis;

- Control machine:

  - Processor: Intel Core2Duo E8400;

  - Disk: SATA, 320 GB;

  - DVD drive;

  - ATX chassis;

  - Network cards: Gigabit Ethernet;

- Gigabit Ethernet switch;

- PCI Express switch.

The nodes, the control machine, and the switches are housed in 24U 600×800 cabinets.

In this block diagram (Fig. 8), the communication medium based on the PCI Express (PCIe) bus consists of a PCI Express switch, interface boards inserted into a PCI Express slot of each computing node, and network cables (standard 10 Gb Ethernet cables) that connect the computing nodes to the PCI Express switch. Because the communication medium is built on the PCI Express bus, which has a tree structure, one node must be designated the root of the tree (the main Root). One of the computing nodes is therefore designated as this root and is called the Master; the remaining N-1 nodes are Slaves.

The PCIe switch has N-1 Downstream ports, to which the interface boards of the Slave nodes are connected, and one Upstream port, to which the interface board of the Master node is connected.

Each interface board has three ports: one configured as a non-transparent bridge (Downstream Non-Transparent bridge), a second as a transparent bridge (Downstream Transparent bridge), and an Upstream port. The first port is used to connect Slave computing nodes to the Downstream ports of the PCIe switch, the second to connect the Master node to the Upstream port of the PCIe switch, and the third plugs into the PCIe bus on the motherboard of the computing node. All interface boards are functionally identical; they differ only in which Downstream port is used for the connection to the switch.

Network connection for access from the kiam.ru network and the Internet.

Software

- Operating system: SuSE Linux Enterprise Server 10 SP2;

- System for Managing the Passage of User Tasks (SUPPZ);

- C and Fortran compilers;

- Remote access server over the SSH/SCP protocols.

Main characteristics of the PCIe-based communication network:

- throughput: up to 700 MB/s;

- latency: 1.2 µs;

- word issue time: ~70 ns;

- word read time: ~2.5 µs.
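To get a feel for these figures, the rough estimate below combines them in a simple linear cost model (latency plus size divided by throughput). The model itself is an assumption made for illustration and is not taken from the measurements; it also assumes 1 MB = 10^6 bytes.

```python
# Figures quoted in the list above; the linear model is an assumption.
THROUGHPUT_BYTES_PER_S = 700e6  # "up to 700 MB/s", taking 1 MB = 10**6 bytes
LATENCY_S = 1.2e-6              # 1.2 µs
WORD_ISSUE_S = 70e-9            # ~70 ns
WORD_READ_S = 2.5e-6            # ~2.5 µs


def transfer_time(n_bytes):
    """Estimated time to move n_bytes: fixed latency plus streaming time."""
    return LATENCY_S + n_bytes / THROUGHPUT_BYTES_PER_S


def break_even_size():
    """Message size at which streaming time equals the fixed latency."""
    return LATENCY_S * THROUGHPUT_BYTES_PER_S


print(f"4 KiB transfer: {transfer_time(4096) * 1e6:.2f} us")        # about 7.05 us
print(f"break-even size: {break_even_size():.0f} bytes")            # about 840 bytes
print(f"read/issue ratio: {WORD_READ_S / WORD_ISSUE_S:.1f}")        # about 35.7
```

Under this model, messages much shorter than roughly 840 bytes are dominated by latency. The large gap between word read and word issue times is consistent with the design choice of building reads on top of remote writes.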

The present invention is industrially applicable: working systems have been built and tested, and they have shown strong results that fully confirm that the technical result can be achieved.

Claims

A cluster system comprising a master computing module, which is a computer with a PCIe bus on its motherboard, connected through an interface board attached to the PCIe bus of the master computing module to a switch, whose outputs are connected to slave computing modules, each having a PCIe bus on its motherboard, through interface boards each attached to the PCIe bus of the corresponding slave computing module; each interface board has two ports, one of which is able to operate in transparent port mode and the other in non-transparent port mode; the master computing module is connected to the transparent port of the one interface board that is connected to the switch, and the slave computing modules are connected to the non-transparent ports of the other interface boards that are connected to the switch; the switch is configured so that, when the slave computing modules are switched on, the PCIe bus and the slave computing modules connected to it are initialized, with address space allocated in RAM for each of them in the non-transparent port of the corresponding interface board, and, when the master computing module is switched on, the PCIe bus and the master computing module connected to it are initialized, with address space allocated in RAM for them in the transparent port of the corresponding interface board; and the switch is able to form, in the PCIe bus address space of the master computing module, an address range that covers all access windows to the slave computing modules, and to form, in each non-transparent port of an interface board, a window mapped to the address of that range in the master computing module, so that each slave computing module can access the memory of the other slave computing modules through the address space.