TW201327144A

TW201327144A - Method for managing cloud server system

Info

Publication number: TW201327144A
Application number: TW100147601A
Authority: TW
Inventors: Ying-Chih Lu
Original assignee: Inventec Corp
Priority date: 2011-12-21
Filing date: 2011-12-21
Publication date: 2013-07-01

Abstract

A method for managing a cloud server system is provided. In the method, whether any of a plurality of node apparatuses of the cloud server system is abnormal is detected. When one of the node apparatuses is abnormal, a hardware address of the abnormal node apparatus is obtained. Location information of the abnormal node apparatus is looked up in a node database according to the hardware address, and the abnormal node apparatus is isolated from the cloud server system. Moreover, a light emitting unit of the abnormal node apparatus is enabled according to the location information.

Description

Method for managing a cloud server system

The present invention relates to a cloud server system, and more particularly to a method for managing a cloud server system that uses a light emitting unit to indicate an abnormality.

Servers are now widely used by enterprises. Beyond applications that combine the Internet and the telecommunications industry, they have also reached deep into everyday life, for example in finance, online banking, and online credit card use. All of these rely on the powerful computing capability of servers to keep data highly confidential and difficult to crack.

There are many types of cloud server systems today; the more common ones are rack servers and tower servers. A rack server is a tower server with an optimized structure, designed mainly to occupy as little space as possible. Many professional network devices (such as switches, routers, and hardware firewalls) also adopt a rack-mounted structure; most are flat, much like drawers. In general, a rack server is 19 inches wide, and its height is measured in units of U (1U = 1.75 inches = 44.45 mm). Standard servers usually come in 1U, 2U, 3U, 4U, 5U, and 7U heights.
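The rack-unit conversion quoted above (1U = 1.75 inches = 44.45 mm) can be checked with a short calculation. The function name `rack_height_mm` is chosen here for illustration only and does not come from the original text.

```python
# Rack-unit arithmetic from the text: 1U = 1.75 inches = 44.45 mm.
INCHES_PER_U = 1.75
MM_PER_INCH = 25.4  # exact by definition of the inch


def rack_height_mm(units: int) -> float:
    """Return the height in millimetres of a server that is `units` U tall."""
    return units * INCHES_PER_U * MM_PER_INCH


# Print the height of a few of the standard sizes named in the text.
for u in (1, 2, 4):
    print(f"{u}U = {u * INCHES_PER_U} in = {rack_height_mm(u):.2f} mm")
```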

At present, the node devices in a cabinet generally have a high availability (HA) function, which provides redundant, fault-tolerant backup: after one node device fails, another can immediately take over the related resources and continue to provide the corresponding services. However, when a node device must be replaced or upgraded because of an abnormality or for other reasons, the replacement is performed manually. When the number of node devices is large, finding the node device to be replaced among them is quite difficult. Moreover, after the replacement, the node needs to be added back to the cloud server system automatically.

The invention provides a method for managing a cloud server system in which a prompt from a light emitting unit lets the manager easily identify the node device to be maintained.

Specifically, the invention provides a method for managing a cloud server system. The method is applicable to a cloud server system, for example a container data center that provides IaaS (Infrastructure as a Service), where the cloud server system includes a plurality of node devices. In the method, whether any of the node devices is abnormal is detected. When an abnormality is detected in one of the node devices, a first hardware address of the abnormal node device is obtained. Then, according to the first hardware address, node related information of the abnormal node device is looked up in a node database; the node related information records location information of the abnormal node device. The abnormal node device is also isolated from the cloud server system. In addition, a light emitting unit of the abnormal node device is enabled according to the location information.

In an embodiment of the invention, the location information is the node location of the abnormal node device in the cabinet or the network address of the abnormal node device.

In an embodiment of the invention, in the step of enabling the light emitting unit of the abnormal node device according to the location information, a command may be transmitted to the baseboard management controller (BMC) of the abnormal node device according to the location information, so that the baseboard management controller lights the light emitting unit in a first color.

In an embodiment of the invention, after it is detected that the abnormal node device has been replaced with another node device, the light emitting unit of the replacement node device is lit in a second color.

In an embodiment of the invention, after it is detected that the abnormal node device has been replaced with another node device, the node related information of the node device is re-acquired through a network management module. More specifically, a second hardware address is received from the replacement node device, and a network address is allocated to that node device. Then, an instruction is transmitted to the node device according to the network address, so that the node related information of the node device is obtained through the baseboard management controller of the node device. The node related information is then updated in the node database.

In an embodiment of the invention, in the step of obtaining the node related information through the baseboard management controller of the node device, when the baseboard management controller receives the instruction, it restarts the central processing unit of the node device, so that the node related information is obtained through the baseboard management controller during the restart.

Based on the above, the invention uses the light emitting unit to show the manager where the abnormal node device is located, making it easier for the manager to replace the node device.

To make the above features and advantages of the invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.

FIG. 1 is a block diagram of a cloud server system according to an embodiment of the invention. In this embodiment, the cloud server system includes at least one cabinet (container). Because all cabinets have the same composition, one cabinet 100 is taken as an example for convenience of description. The cabinet 100 generally includes a plurality of racks, each rack includes a plurality of slots, and each slot includes a plurality of nodes. In addition, a switch 140 is disposed in the cabinet and is coupled to each node device.

Referring to FIG. 1, the cabinet 100 includes n nodes, in which n node devices are respectively disposed. These node devices can be classified into three node types: a service resource pool (service pool) 110, a computing resource pool (computing pool) 120, and a storage resource pool (storage nodes pool) 130. The service resource pool 110 includes i node devices 110_1~110_i, the computing resource pool 120 includes j node devices 120_1~120_j, and the storage resource pool 130 includes k node devices 130_1~130_k.

In this embodiment, each of the node devices is provided with a light emitting unit; these are the light emitting units 111_1~111_i, 121_1~121_j, and 131_1~131_k. The light emitting unit is, for example, a light emitting diode (LED), but is not limited thereto.

The service resource pool 110 provides service types such as a database service, a virtual resource provisioning service, a physical installer service, a physical manager service, a virtual manager service, an application programming interface (API) service, a storage manager service, a load balance service, and a security service. The computing resource pool 120 provides computing services, and the storage resource pool 130 provides storage services.

In this embodiment, an anomaly detection module 112 is installed in the node device 110_2 of the service resource pool 110 and monitors whether an abnormality occurs in the cloud server system. In other embodiments, the anomaly detection module 112 may be installed in another node device of the service resource pool 110, or in a separate server outside the cabinet 100. In addition, the node device 110_1 in the service resource pool 110 provides the database service; it hosts a node database 113 that stores the node related information of each node device in the cabinet 100.

The node related information records information about every node device. It includes a plurality of entries, each of which holds the data of one node device. Each entry includes network card information, processor information, memory information, hard disk information, a node location, node type information, and a service type. Specifically, a node device generally has a system network card and a baseboard management controller (BMC) network card. The network card information includes the media access control (MAC) address, Internet Protocol (IP) address, and bandwidth (in megabits per second, Mbps) of the BMC network card, as well as the MAC address, IP address, and bandwidth of the system network card. The processor information includes the processor model and operating frequency. The memory information includes the size of the memory modules. The hard disk information includes the carrier number, hard disk type, hard disk capacity, hard disk rotational speed (revolutions per minute, RPM), and hard disk cache capacity. The node location includes the rack number, slot number, and node number. The node type information indicates whether the corresponding node device belongs to the service resource pool 110, the computing resource pool 120, or the storage resource pool 130. The service type records the type of service provided by the corresponding node device.
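One entry of the node related information can be pictured as a small record type. The following is a minimal sketch, not the schema used by the invention: the class and field names (`NodeRecord`, `NicInfo`, `DiskInfo`, `node_db`) are assumptions made for illustration, and the database is modeled as a plain in-memory dictionary keyed by the BMC MAC address. The sample values are taken from the replacement-node example later in the description; the memory size and service type are not stated there and are left as placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class NodeType(Enum):
    SERVICE = "service resource pool"
    COMPUTING = "computing resource pool"
    STORAGE = "storage resource pool"


@dataclass
class NicInfo:
    mac: str             # MAC address
    ip: str              # IP address
    bandwidth_mbps: int  # bandwidth in Mbps


@dataclass
class DiskInfo:
    carrier: int    # carrier number
    disk_type: str  # e.g. "SAS"
    capacity: str   # e.g. "1TB"
    rpm: int        # rotational speed in RPM
    cache: str      # cache capacity, e.g. "16 MB"


@dataclass
class NodeRecord:
    bmc_nic: NicInfo
    system_nic: NicInfo
    cpu_model: str
    cpu_mhz: int
    memory_size: str
    disks: List[DiskInfo]
    location: Tuple[int, int, int]  # (rack number, slot number, node number)
    node_type: NodeType
    service_type: str


# Values follow the replacement-node example in the description;
# memory_size and service_type are not stated there, so they are placeholders.
record = NodeRecord(
    bmc_nic=NicInfo("00:A0:D1:EC:F8:F0", "10.1.0.19", 100),
    system_nic=NicInfo("00:A0:D1:EA:34:F0", "10.1.0.20", 1000),
    cpu_model="Intel(R) Xeon(R) CPU E5540",
    cpu_mhz=2260,
    memory_size="(not stated)",
    disks=[DiskInfo(c, "SAS", "1TB", 7200, "16 MB") for c in (1, 1, 2, 2)],
    location=(6, 0, 0),
    node_type=NodeType.STORAGE,
    service_type="(not stated)",
)

# Hypothetical node database: an in-memory dict keyed by the BMC MAC address.
node_db: Dict[str, NodeRecord] = {record.bmc_nic.mac: record}
print(node_db["00:A0:D1:EC:F8:F0"].location)
```

Keying the dictionary by the BMC MAC address mirrors the lookup described for step S210, where the hardware address is the search key.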

The management method is described below with reference to the above cloud server system. FIG. 2 is a flowchart of a method for managing a cloud server system according to an embodiment of the invention. Referring to FIG. 1 and FIG. 2 together, in this embodiment the anomaly detection module 112 monitors whether any of the node devices is abnormal. Causes of a node device abnormality include, for example, node device failure, predicted node device failure, unexpected removal of a node device, node device maintenance, forced replacement of a node device, and addition of a node device, but are not limited thereto.

When the anomaly detection module 112 detects that one of the node devices is abnormal, the hardware address of the abnormal node device is obtained, as shown in step S205. For example, the MAC address of the BMC of the abnormal node device is obtained.

Next, in step S210, the node related information of the abnormal node device is looked up in the node database 113 according to the hardware address. The node related information records the location information of the abnormal node device; the location information is, for example, the node location of the abnormal node device in the cabinet 100 or the network address of the abnormal node device. The node location is the physical location in the cabinet 100, that is, the rack number, slot number, and node number. The network address is, for example, the IP address of the BMC.

In step S215, the abnormal node device is isolated from the cloud server system, for example according to an isolation mechanism.

In addition, in step S220, the light emitting unit of the abnormal node device is enabled according to the location information. For example, a command is transmitted to the BMC of the abnormal node device according to the location information, so that the BMC lights the light emitting unit in a first color (for example, red). After it is detected that the abnormal node device has been replaced with another node device, the light emitting unit of the replacement node device is lit in a second color (for example, green).
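Steps S205 to S220 can be sketched as a single handler. This is an illustrative simulation only: `ClusterManager`, `FakeBmcBus`, and `NodeInfo` are hypothetical names, the node database is a dictionary, isolation is reduced to recording the MAC address in a set, and the LED command is recorded in memory rather than sent to a real BMC. The sample MAC address, IP address, and location are the ones used in the worked example of the description.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass
class NodeInfo:
    bmc_mac: str
    bmc_ip: str
    location: Tuple[int, int, int]  # (rack number, slot number, node number)


class FakeBmcBus:
    """Hypothetical stand-in for the BMC command channel; records LED colors."""

    def __init__(self) -> None:
        self.led_color: Dict[str, str] = {}

    def set_led(self, bmc_ip: str, color: str) -> None:
        self.led_color[bmc_ip] = color


class ClusterManager:
    def __init__(self, node_db: Dict[str, NodeInfo], bus: FakeBmcBus) -> None:
        self.node_db = node_db
        self.bus = bus
        self.isolated: Set[str] = set()

    def handle_abnormal_node(self, bmc_mac: str) -> Optional[NodeInfo]:
        # S205: the detector supplies the hardware (BMC MAC) address.
        # S210: look up the node related information by that address.
        info = self.node_db.get(bmc_mac)
        if info is None:
            return None  # unknown node: nothing to isolate or light
        # S215: isolate the node (here: just remember it as isolated).
        self.isolated.add(bmc_mac)
        # S220: use the location information (the BMC IP) to light the LED red.
        self.bus.set_led(info.bmc_ip, "red")
        return info

    def handle_replaced_node(self, info: NodeInfo) -> None:
        # After a replacement is detected, light the new node's LED green.
        self.bus.set_led(info.bmc_ip, "green")


bus = FakeBmcBus()
db = {"00:A0:D1:EC:F8:BA": NodeInfo("00:A0:D1:EC:F8:BA", "10.1.0.19", (6, 0, 0))}
manager = ClusterManager(db, bus)
found = manager.handle_abnormal_node("00:A0:D1:EC:F8:BA")
print(found.location, bus.led_color["10.1.0.19"])
```

In a real deployment the `set_led` call would be replaced by whatever BMC command the hardware supports; the original text does not name that command.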

Accordingly, the manager can find the abnormal node device by observing the color of the light emitting units. The manager can then pull out the abnormal node device and insert a working node device. For example, the abnormal node device may be replaced with a higher-performance node device, or the faulty hardware in the abnormal node device may be replaced with hardware that operates normally.

It is worth noting that after it is detected that the abnormal node device has been replaced with another node device, the node related information of the node device can be re-acquired through a network management module. The network management module is, for example, a server with a Dynamic Host Configuration Protocol (DHCP) server function. In this embodiment, the network management module is disposed in a separate host outside the cabinet 100. In other embodiments, the network management module may instead be provided in any node device of the service resource pool 110 in the cabinet 100.

FIG. 3 is a flowchart of a method for obtaining node related information according to an embodiment of the invention. Referring to FIG. 3, in step S305 the network management module receives the MAC address of the BMC from the node device and allocates an IP address to the BMC of the node device.

Next, in step S310, the network management module transmits an instruction to the BMC according to the IP address. The instruction is, for example, an OEM (Original Equipment Manufacturer) command of the Intelligent Platform Management Interface (IPMI).

When the BMC receives the instruction, as shown in step S315, it restarts the central processing unit of the node device so that the node related information of the node device is obtained through the BMC during the restart. This is because the processor information, network card information, memory information, hard disk information, and node location are acquired dynamically. The central processing unit is therefore rebooted into the Extensible Firmware Interface (EFI) shell, so that the basic input/output system (BIOS) collects this information during the power-on self-test (POST) and passes it to the BMC.

Then, in step S320, the BMC responds to the instruction by transmitting the node related information to the network management module, which stores it in the node database 113. In addition, the network management module may determine, according to the node related information, the service type to be deployed on the node device, and may transmit the node related information to a cloud deployment program for deployment of the cloud operating system.
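The sequence S305 to S320 can be sketched as follows. Everything here is a simulation under stated assumptions: `NetworkManager` and `FakeBmc` are hypothetical names, address allocation is a simple list-based pool rather than a real DHCP exchange, the IPMI OEM command is modeled as a direct method call, and the host reboot is represented by a counter. The sample MAC address, IP address, and location follow the worked example in the description.

```python
from typing import Dict, List


class FakeBmc:
    """Hypothetical stand-in for the BMC of a newly inserted node device."""

    def __init__(self, mac: str, inventory: Dict[str, object]) -> None:
        self.mac = mac
        self.ip = ""
        self.inventory = inventory
        self.host_reboots = 0

    def collect_node_info(self) -> Dict[str, object]:
        # S315: restart the host CPU so that the BIOS gathers the inventory
        # during POST and hands it to the BMC (simulated by a counter here).
        self.host_reboots += 1
        # S320: reply with the node related information.
        return dict(self.inventory)


class NetworkManager:
    """Hypothetical network management module with a toy address pool."""

    def __init__(self, ip_pool: List[str]) -> None:
        self.free_ips = list(ip_pool)
        self.leases: Dict[str, str] = {}
        self.node_db: Dict[str, Dict[str, object]] = {}

    def assign_ip(self, bmc: FakeBmc) -> str:
        # S305: receive the BMC MAC address and allocate an IP address to it.
        if bmc.mac not in self.leases:
            if not self.free_ips:
                raise RuntimeError("address pool exhausted")
            self.leases[bmc.mac] = self.free_ips.pop(0)
        bmc.ip = self.leases[bmc.mac]
        return bmc.ip

    def register_replacement(self, bmc: FakeBmc) -> Dict[str, object]:
        ip = self.assign_ip(bmc)        # S305
        info = bmc.collect_node_info()  # S310 (send instruction) to S320 (reply)
        info["bmc_ip"] = ip
        self.node_db[bmc.mac] = info    # store in the node database
        return info


manager = NetworkManager(ip_pool=["10.1.0.19"])
new_bmc = FakeBmc(
    "00:A0:D1:EC:F8:F0",
    {"location": (6, 0, 0), "node_type": "storage resource pool"},
)
info = manager.register_replacement(new_bmc)
print(info["bmc_ip"], info["location"], new_bmc.host_reboots)
```

This sketch only adds the new record; whether the stale record of the removed node device is deleted at the same time is not specified in the original text.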

With current technology, a BMC can be preset at the factory so that, once its initial code has finished executing, it automatically sends its MAC address via DHCP to the network management module with the DHCP server function in order to obtain an IP address for the BMC. In this embodiment, the node related information is obtained by the network management module. In other embodiments, however, it may be obtained by the anomaly detection module 112. For example, the anomaly detection module 112 may issue an IPMI OEM command to the BMC at the BMC's IP address to obtain the node related information.

For example, Table 1 shows the node related information of the node device at node location (6,0,0), where (6,0,0) means rack number 6, slot number 0, and node number 0. Table 2 shows the node related information at node location (6,0,0) after the node device has been replaced with another node device.

When the anomaly detection module 112 detects that the node device at node location (6,0,0) is abnormal, it obtains the MAC address of the BMC of the abnormal node device, namely "00:A0:D1:EC:F8:BA". The node database 113 is then queried with the MAC address "00:A0:D1:EC:F8:BA" to obtain the node related information shown in Table 1. Then, according to the IP address "10.1.0.19" of the BMC of the abnormal node device, a command is sent to the abnormal node device to light its light emitting unit in red.

When the manager sees the red light, the manager unplugs the abnormal node device and installs another node device in its place. Assume that in the replacement node device the BMC network card has the MAC address "00:A0:D1:EC:F8:F0" and a bandwidth of 100 Mbps, the system network card has the MAC address "00:A0:D1:EA:34:F0" and a bandwidth of 1000 Mbps, the processor model is "Intel(R) Xeon(R) CPU E5540" with an operating frequency of 2260 MHz, and the node type is the storage resource pool. Assume also that this node device has four hard disks, with the hard disk information of each disk written as (carrier number, hard disk type, hard disk capacity, hard disk rotational speed, hard disk cache capacity): (1,SAS,1TB,7200,16 MB), (1,SAS,1TB,7200,16 MB), (2,SAS,1TB,7200,16 MB), and (2,SAS,1TB,7200,16 MB).

After the node device is installed in the cabinet, the BMC, on executing its initial code, obtains the IP address "10.1.0.19" for the BMC network card via DHCP; the system network card obtains the IP address "10.1.0.20" through a system network boot. Steps S305~S320 described above then yield the node related information shown in Table 2.

Another example is given below to describe the architecture of the cabinet 100 and to explain how the BMC obtains the node location. FIG. 4 is a schematic diagram of a cabinet architecture according to an embodiment of the invention. Referring to FIG. 4, the cabinet 100 includes r racks, namely rack 0 to rack r-1. Rack 0 is taken as an example here; the other racks have a similar architecture and are not described again. Rack 0 includes s slots, namely slot 0 to slot s-1. For convenience of description, slot 0 is taken as an example; the other slots have a similar architecture. Slot 0 includes n nodes, namely node 0 to node n-1. Taking node 0 as an example, a node device 400 is disposed in node 0.

For convenience of description, only some components of the node device 400 are shown. The node device 400 includes a central processing unit 410, a control chip 420, a basic input/output system (BIOS) chip 430, a BMC 440, a memory module 450, a node number module 460, a system network card 470, a light emitting unit 480, and a BMC network card 490. The central processing unit 410 is coupled to the memory module 450 and, through the control chip 420, to the BIOS chip 430, the BMC 440, the node number module 460, and the system network card 470. The BMC 440 is coupled to the light emitting unit 480 and the BMC network card 490.

The central processing unit 410 runs the hardware and firmware of the node device 400 and processes data in software. The control chip 420 is the bridge through which the central processing unit 410 exchanges information with other components. In this embodiment, the control chip 420 includes a north bridge chip and a south bridge chip. In other embodiments, the control chip 420 is, for example, a south bridge chip, and the north bridge chip may be integrated with the central processing unit 410. The memory module 450 is, for example, a dual in-line memory module (DIMM). The node number module 460 stores the node number of the node in which the node device 400 is located; here, the node number in the node number module 460 is 0.

In addition, rack 0 includes a rack number module 401 that stores the rack number; here, the rack number stored in the rack number module 401 is 0. Slot 0 includes an expander 403 and a slot number module 405. The expander 403 is coupled to the rack number module 401, the slot number module 405, and the control chip 420 of the node device 400. In node 0, the control chip 420 of the node device 400 obtains, through the expander 403, rack number 0 and slot number 0 where the node device 400 is located, and learns from the node number module 460 that its node number is 0. Registers may be provided in the control chip 420 to store the rack number, slot number, and node number; for example, three registers are provided. The BIOS chip 430 or the BMC 440 can then read the node location of the node device 400 in the cabinet 100 from the control chip 420.

Assume that a node location (that is, a physical location in the cabinet 100) is written as (r,s,n), where r is the rack number, s is the slot number, and n is the node number. The node location of the node device 400 is then (0,0,0). Likewise, the registers of the control chip of every other node device store the node location of that device.
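The (r,s,n) notation maps naturally onto a three-field tuple built from the three register values. The sketch below assumes the registers are read elsewhere and arrive as a sequence of three integers in the order rack, slot, node; the names `NodeLocation` and `location_from_registers` are illustrative and not from the original text.

```python
from typing import NamedTuple, Sequence


class NodeLocation(NamedTuple):
    rack: int
    slot: int
    node: int

    def __str__(self) -> str:
        # Render in the (r,s,n) form used in the description.
        return f"({self.rack},{self.slot},{self.node})"


def location_from_registers(registers: Sequence[int]) -> NodeLocation:
    """Build a node location from three register values: rack, slot, node."""
    if len(registers) != 3:
        raise ValueError("expected three register values: rack, slot, node")
    rack, slot, node = registers
    return NodeLocation(rack, slot, node)


print(location_from_registers([0, 0, 0]))  # location of node device 400
print(location_from_registers([6, 0, 0]))  # location used in the Table 1 example
```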

Here, the expander 403 is coupled, for example, to the rack number module 401 and the slot number module 405 through general purpose input (GPI) interfaces, and to the control chip 420 of the node device 400 through an Inter-Integrated Circuit (I²C) bus. The control chip 420 is likewise coupled to the node number module 460 through a GPI interface, for example. In other words, the numbers held by the rack number module 401, the slot number module 405, and the node number module 460 can be read through GPI interfaces.
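The original text does not say how a number module presents its number on the GPI pins. One common arrangement, assumed here purely for illustration, is a set of pins strapped high or low so that the pin levels spell the number in binary. Under that assumption, decoding could look like this; `decode_gpi` is a hypothetical helper.

```python
from typing import Sequence


def decode_gpi(levels: Sequence[int]) -> int:
    """Decode GPI pin levels into a number.

    Assumption (not from the original text): each pin carries one bit,
    with levels[0] as the least significant bit.
    """
    value = 0
    for bit, level in enumerate(levels):
        if level not in (0, 1):
            raise ValueError(f"pin {bit} has invalid level {level!r}")
        value |= level << bit
    return value


# Pins read as low, high, high encode binary 110, i.e. rack number 6.
print(decode_gpi([0, 1, 1]))
```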

In summary, in the above embodiments the light emitting unit shows the manager where the abnormal node device is located, making it easier for the manager to replace the node device. Furthermore, after a new node device is installed, its node related information can be obtained automatically for subsequent management such as automatic deployment, for example installing the operating system and services the node device requires and adding it to the operation of the cloud system.

Although the invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Anyone with ordinary knowledge in the art may make minor changes and refinements without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the appended claims.

100: Cabinet

110_1~110_i, 120_1~120_j, 130_1~130_k, 400: Node device

110: Service resource pool

120: Computing resource pool

130: Storage resource pool

111_1~111_i, 121_1~121_j, 131_1~131_k, 480: Light emitting unit

112: Anomaly detection module

113: Node database

140: Switch

401: Rack number module

403: Expander

410: Central processing unit

420: Control chip

430: BIOS chip

440: BMC

450: Memory module

460: Node number module

470: System network card

490: BMC network card

S205~S220: Steps of the method for managing a cloud server system

S305~S320: Steps of the method for obtaining node related information

FIG. 1 is a block diagram of a cloud server system according to an embodiment of the invention.

FIG. 2 is a flowchart of a method for managing a cloud server system according to an embodiment of the invention.

FIG. 3 is a flowchart of a method for obtaining node related information according to an embodiment of the invention.

FIG. 4 is a schematic diagram of a cabinet architecture according to an embodiment of the invention.

Claims

1. A method for managing a cloud server system, applicable to a cloud server system that includes a plurality of node devices, the method comprising: detecting whether any of the node devices is abnormal; when an abnormality is detected in one of the node devices, obtaining a first hardware address of the abnormal node device; looking up node related information of the abnormal node device in a node database according to the first hardware address, wherein the node related information records location information of the abnormal node device; isolating the abnormal node device from the cloud server system; and enabling a light emitting unit of the abnormal node device according to the location information.

2. The method for managing a cloud server system according to claim 1, wherein the location information is a node location of the abnormal node device in a cabinet or a network address of the abnormal node device.

3. The method for managing a cloud server system according to claim 1, wherein the step of enabling the light emitting unit of the abnormal node device according to the location information comprises: transmitting a command to a baseboard management controller of the abnormal node device according to the location information, so that the baseboard management controller lights the light emitting unit in a first color.

4. The method for managing a cloud server system according to claim 3, further comprising: after detecting that the abnormal node device has been replaced with another node device, lighting the light emitting unit of the replacement node device in a second color.

5. The method for managing a cloud server system according to claim 1, wherein, after it is detected that the abnormal node device has been replaced with another node device, the node related information of the node device is re-acquired through a network management module, and the step of re-acquiring comprises: receiving a second hardware address from the node device and allocating a network address to the node device; transmitting an instruction to the node device according to the network address, so as to obtain the node related information of the node device through a baseboard management controller of the node device; and updating the node related information in the node database.

6. The method for managing a cloud server system according to claim 5, wherein the step of obtaining the node related information of the node device through the baseboard management controller of the node device comprises: when the baseboard management controller receives the instruction, restarting a central processing unit of the node device, so as to obtain the node related information through the baseboard management controller during the restart.