WO2001080007A2 - Methods and apparatus for robust startup of a computer system having redundant components
- Publication number
- WO2001080007A2 (PCT/US2001/011990)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- processor
- module
- pair
- input
- boot
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1415—Saving, restoring, recovering or retrying at system level
- G06F11/142—Reconfiguring to eliminate the error
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1415—Saving, restoring, recovering or retrying at system level
- G06F11/1441—Resetting or repowering
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2023—Failover techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2035—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant without idle spare hardware
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1415—Saving, restoring, recovering or retrying at system level
- G06F11/1417—Boot up procedures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1415—Saving, restoring, recovering or retrying at system level
- G06F11/1438—Restarting or rejuvenating
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1629—Error detection by comparing the output of redundant processing systems
- G06F11/1633—Error detection by comparing the output of redundant processing systems using mutual exchange of the output between the redundant processing components
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1629—Error detection by comparing the output of redundant processing systems
- G06F11/1641—Error detection by comparing the output of redundant processing systems where the comparison is not performed by the redundant processing components
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1629—Error detection by comparing the output of redundant processing systems
- G06F11/165—Error detection by comparing the output of redundant processing systems with continued operation after detection of the error
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2023—Failover techniques
- G06F11/2028—Failover techniques eliminating a faulty processor or activating a spare
Definitions
- the present invention relates to methods and apparatus for robust operation of a fault-
- the present invention relates to methods and apparatus for recovering from an operational failure
- a simple system reset may fail to identify intermittent hardware problems.
- a redundant, fault-tolerant system may include multiple CPUs
- a single misbehaving central processing unit may sometimes boot properly, masking a system error and causing the error to be irreproducible. In these cases, the system cannot be examined to determine the cause of the failure.
- the present invention is directed to methods and apparatus for robust operation of a fault-tolerant computer system with redundant components. It provides methods and apparatus for booting a computer system with redundant hardware and/or software components in a deterministic fashion. Individual hardware and/or software components are selected and a boot process is performed using those selected components. Booting in this manner allows application programs written for traditional machines to be used without modification. Further, modifications to boot software are rendered minimal or non-existent using this scheme. Moreover, booting individual processor-I/O controller pairs allows system faults to be isolated and detected in a deterministic fashion. The present invention also provides a user-configurable mechanism for instructing a computer system to take increasingly severe steps in order to return the computer system to operational status without destroying the data stored in processor registers or computer memory. The methods and apparatus disclosed are particularly useful for fault-tolerant computer systems using standard operating systems.
- the present invention relates to a method for deterministically booting a fault-tolerant computer having a plurality of processors and one or more input-output controllers.
- a first processor/input-output controller pair is chosen and an attempt is made to boot the chosen pair. In the event that the attempt to boot the chosen pair fails, a new boot pair is selected.
- the present invention relates to a method for deterministically booting a fault-tolerant computer having a plurality of processor boards and one or more input-output controller boards.
- a first processor/input-output controller board pair is chosen and an attempt is
- the present invention relates to an apparatus for deterministically
- the apparatus includes a plurality of processors, at least one
- a memory element storing a list of
- processor/controller pairs and a control module in communication with each element.
- control module retrieves a first processor/controller pair identifier from the memory element
- second identifier is retrieved from the memory element and an attempt is made to boot the second boot pair identified.
- the present invention relates to an apparatus for deterministically
- booting a fault-tolerant system composed of individual hardware or software objects.
- a set of hardware and/or software components is selected and a boot process is performed using this set
- the present invention relates to a method for recovering from a failure of a fault-tolerant system that includes the plurality of processors and one or more input-output
- a non-responsive processor is identified. One of the processors is selected from the plurality of processors and its execution is halted. The non-responsive processor is then restarted
- the present invention relates to an apparatus for recovering from the failure of a processor in a fault-tolerant system.
- the apparatus includes a plurality of processors, at least one input-output controller in communication with the processors, and a
- control module in communication with each of these elements.
- the control module detects that a processor is non-responsive, halts execution of the other processors in the plurality, selects a
- FIG. 1 is a block diagram of an embodiment of a traditional computer system
- FIG. 2 is a block diagram of an embodiment of a redundant, fault-tolerant computer
- FIG. 3 is a block diagram showing an embodiment of auxiliary connections between
- FIGs. 4 and 4A are block diagrams depicting an embodiment of the steps to be taken
- FIGs. 5 and 5A are screen shots depicting exemplary embodiments of user interfaces for
- a typical computer 14 as known in the prior art includes a central processor 20, a main memory unit 22 for storing programs and/or data, an input/output (I/O) controller 24, a display device 26, and a data bus 42 coupling these components to allow
- the memory 22 may include random access memory (RAM) and read only memory (ROM) chips.
- RAM random access memory
- ROM read only memory
- a keyboard 32 (e.g., an alphanumeric keyboard and/or a musical keyboard), a mouse 34, and, in some embodiments, a joystick 12.
- the computer 14 typically also has a hard disk drive 36 and a floppy disk drive 38 for
- floppy disks such as 3.5-inch disks.
- Other devices 40 also can be part of the computer
- output devices e.g., printer or plotter
- optical disk drives for receiving
- one or more computer programs define the operational capabilities of the system 10. These programs can be loaded
- Applications may be caused to run by double clicking a related icon displayed on the display
- controlling software program(s) and all of the data utilized by the program(s) are stored on one or more of the computer's storage mediums such as the hard drive 36, CD-ROM 40, etc.
- System bus 42 allows data to be transferred between the various units in the computer 14.
- processor 20 may retrieve program data from memory 22 over system bus 42.
- Various system busses 42 are standard in computer systems 14, such as the Video Electronics Standards Association Local Bus (VESA Local Bus), the Industry Standard Architecture (ISA) bus
- ISA Industry Standard Architecture bus
- EISA Extended Industry Standard Architecture bus
- MCA Micro Channel Architecture bus
- PCI Peripheral Component Interconnect bus
- busses may be used to provide access to different units of the system.
- a system 14 may use a PCI bus to connect a processor 20 to peripheral devices 30, 36, 38 and concurrently connect the processor 20 to main memory 22 using an MCA bus. It is immediately apparent from FIG. 1 that such a traditional computer system 14 is highly sensitive to any single point of failure. For example, if main memory unit 22 fails to
- a redundant, fault-tolerant system may be provided with any one
- Configurations include dual redundant systems, which include duplicates of certain hardware units found in FIG. 1, and triply redundant configurations, which include three of each unit shown in FIG. 1. In either case, redundant central processing units 20 and main memory units 22 run in "lock step," that is, each processor runs identical copies of the
- registers provided by the replicated processors 20 should be identical at all times.
- one embodiment of a redundant, fault-tolerant system 14' is
- processors 20, 20', 20" (generally 20) and at least two input-output controllers 24, 24' (generally 24). As shown in FIG. 2, system 14' may include more than two
- I/O devices more I/O devices.
- four redundant system busses 42, 42', 42" and 42'" are used to interconnect each processor 20 and I/O controllers 24.
- processors 20 are selected from the "x86" family of processors manufactured
- the x86 family of processors includes the 80286 processor, the 80386 processor, the 80486 processor, and the Pentium, Pentium II, Pentium III,
- processors are selected from the "680x0" family of
- processors includes the 68000, 68020, 68030, and 68040 processors.
- Other processor families include the Power PC line of processors manufactured by the Motorola Corporation, the Alpha line of processors manufactured by Compaq Corporation of Houston, Texas, and the Crusoe line of processors manufactured by Transmeta Corporation of Santa Clara, California.
- Each processor 20 may include logic that implements fault-tolerant support.
- the fault-tolerant logic may be included on the chip itself.
- the CPU 20 is a processor board that includes a processor, associated memory, and fault-tolerant logic.
- the fault-tolerant logic can be implemented as a separate set of logic on processor board 20.
- the fault-tolerant logic may be provided as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a programmable logic device (PLD), or a read-only memory device (ROM).
- ASIC application specific integrated circuit
- FPGA field programmable gate array
- EEPROM electrically erasable programmable read-only memory
- PROM programmable read-only memory
- PLD programmable logic device
- ROM read-only memory device
- Each input-output controller may also include fault-tolerant logic that monitors transactions on the system busses 42 to aid in determining a processor failure.
- the I/O controller boards 24 also provide support for the display 26, input devices 30 and mass storage such as floppy drives 38, hard drives, and CD-ROM devices.
- the embodiment shown in FIG. 2 includes a front panel 52 that provides an interface to these input and output devices.
- the front panel may serve as an adapter between the I/O controllers 24 and, for example, a universal serial bus (USB) used by keyboard and mouse input devices, or a video connector (EGA, VGA, or SVGA) used for connecting displays to the system 14'.
- Each I/O controller 24 includes service management logic which performs various system management functions such as: monitoring the operational status of the system; performing on
- the service management logic (including a processor boot sequence).
- the service management logic includes a connection for communicating with other customer
- the service management logic is provided as a separate board that is in
- a service is provided to I/O controller 24.
- In one particularly preferred embodiment, a service management board including all service management logic connects to I/O controller 24 via a PCI slot.
- the service management logic (referred to hereafter as SML) may be provided with a
- In FIG. 3, a block diagram shows the connection between SML units 50, 50' (generally 50) and the I/O controllers 24, 24' and processors 20, 20', 20" of the system 14'.
- each SML 50 is connected to each of the other units by redundant auxiliary busses 60, 60' in addition to redundant busses 42.
- Auxiliary busses 60, 60' may be any bus that allows the SMLs 50 to control and query the processors 20 and I/O controllers 24.
- the SMLs are configured to control and query the processors 20 and I/O controllers 24.
- the boot process begins by powering on the SMLs (step 402),
- step 410 determining whether or not the system requires booting
- SMLs 50 are provided with power separate from the
- the SML is a portion of an I/O controller
- a SML uses auxiliary busses 60, 60' to determine if other SMLs exist in the system (step
- the SMLs exchange messages over the auxiliary busses 60, 60' in order to determine which SML will function as the primary SML. The determination may include many factors, including: whether or not a service management logic unit has been previously inserted in the system to be powered
- the identity of the primary SML may be "hardwired.”
- SML identifies with which I/O controller 24 it is associated (step 408).
- the SML 50 uses this
- SML 50 during the boot process. For example, if the I/O controller with which the SML 50 is associated is not selected for booting, then the SML 50 associated with the booting I/O controller
- boot status messages will be directed to the SML 50 on the booting I/O controller, even if that SML 50 is not the primary SML 50.
- SMLs 50 can exchange messages to negotiate which SML 50 is the primary SML 50. If an SML 50 is already functioning in the system as primary, then a
- In one embodiment, the SMLs 50 negotiate to determine which SML 50 is the primary SML 50 using the following rules:
- the SML 50 in I/O board slot 0 becomes the primary SML 50.
- the SML 50 in I/O board slot 1 becomes secondary.
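The negotiation rules above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the `Sml` class and `negotiate_primary` function are hypothetical names, and the rule that an SML already acting as primary keeps its role is an assumption drawn from the truncated sentence about an existing primary.

```python
from dataclasses import dataclass


@dataclass
class Sml:
    slot: int                 # I/O board slot the SML sits in
    is_primary: bool = False  # True if this SML already acts as primary


def negotiate_primary(smls):
    """Return the SML that should act as primary."""
    if not smls:
        raise ValueError("no SML present")
    # Assumption: an SML already functioning as primary keeps the role.
    for sml in smls:
        if sml.is_primary:
            return sml
    # Otherwise the SML in the lowest slot (slot 0 when present) becomes
    # primary; any other SML remains secondary.
    winner = min(smls, key=lambda s: s.slot)
    winner.is_primary = True
    return winner
```

With two freshly powered SMLs in slots 0 and 1, the sketch picks the slot 0 unit; if the slot 1 unit is already primary, it is left in charge.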
- a service management logic unit, in this embodiment, will not boot the system if it was explicitly shut down by an administrator, for example, if the administrator used a "power off" command to shut down the system. Whether or not a system has been explicitly shut down by an administrator may be stored in non-volatile memory (not shown in the drawings) that the SML 50 may query.
- If an SML 50 determines that it should not boot the system 14', it transitions to a state in which it monitors the system (step 412). This state is described in greater detail below.
- an SML 50 may query a non-volatile memory element and discover that the system 14'
- the boot process shown in FIG. 4A may be commenced by an initializing SML 50.
- FIG. 5A is a screen shot showing an exemplary embodiment for providing such commands to the system administrator by the primary SML 50.
- system administration commands are grouped as a set of "tabs" and displayed to
- The administrator selects the tab containing the desired operations.
- FIG. 5A depicts an embodiment in which a "System Control" tab 54 provides four controls for a system: a "Power On" command 56 (depicted in gray to indicate the system is currently running); an
- System information 64 as well as information concerning the primary SML 66, is provided
- Although FIG. 5A depicts an embodiment using NETSCAPE NAVIGATOR, manufactured by Netscape Communications of Mountain View, California, any browser may be used, including MICROSOFT INTERNET EXPLORER,
- a boot list is a list of component systems allowing the system to boot.
- boot components may include processors, I/O controllers, BIOS, and other software (both application and system).
- a boot list is an ordered list of processor-I/O controller pairs.
- the boot list includes "heartbeat" values associated with each boot pair. Heartbeat values are used by an SML 50 during system operation to determine if a processor 20 is functioning properly. Heartbeats are described in greater detail below.
- the boot list may be stored in a data structure that associates processor identification values with I/O controller values.
- the data structure includes an additional field to associate heartbeat timer values with each boot pair.
- the data structure may be stored on each SML 50 in a system 14'.
- the data structure is stored in a non-volatile, erasable memory element, such as an EEPROM, that is accessible using auxiliary busses 60, 60'.
- the SML 50 may use a hard-coded default list.
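A boot list of this kind might be modelled as shown below. This is only a sketch under stated assumptions: the `BootPair` fields, the default values, and the `load_boot_list` helper are hypothetical, and falling back to the default when nothing is stored is an assumption about when the hard-coded list applies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BootPair:
    processor_id: int           # identifies a processor 20
    io_controller_id: int       # identifies an I/O controller 24
    heartbeat_timeout_s: float  # heartbeat timer value for this pair


# Hypothetical hard-coded default list; the values are illustrative only.
DEFAULT_BOOT_LIST: List[BootPair] = [
    BootPair(0, 0, 30.0),
    BootPair(1, 0, 30.0),
    BootPair(0, 1, 30.0),
]


def load_boot_list(stored: Optional[List[BootPair]]) -> List[BootPair]:
    """Return the stored boot list, or a copy of the default if none exists."""
    return list(stored) if stored else list(DEFAULT_BOOT_LIST)
```

Keeping the list ordered matters, because the order defines which pair is tried first.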
- FIG. 5B depicts a screen shot of an exemplary user interface allowing a system administrator to modify the default boot list.
- the user interface is browser based and provides information to the administrator regarding the system 14' and SML 50 currently active.
- If the graphical user interface shown in FIG. 5B is used to create a boot list, it is saved to the non-volatile memory element. In one embodiment, once a boot list is determined, whether by retrieving a list from a
- the SML 50 determines available processors 20 and
- the SML 50 may transmit a message over auxiliary busses 60, 60' to determine this information.
- Processors 20 and I/O controller 24 respond to the message
- the SML 50 concludes that a processor 20 or I/O controller does not
- the SML 50 to skip pairs in the boot list if they reference units not present in the system 14'.
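Skipping pairs that reference absent units can be illustrated with a short filter. This is a sketch, assuming boot pairs are plain `(processor_id, io_controller_id)` tuples and that the sets of present units were already gathered from the responses to the query message; `usable_pairs` is a hypothetical name.

```python
def usable_pairs(boot_list, present_processors, present_io_controllers):
    """Keep, in order, only the pairs whose units are both present."""
    return [
        (proc, ctrl)
        for (proc, ctrl) in boot_list
        if proc in present_processors and ctrl in present_io_controllers
    ]
```

The original ordering of the boot list is preserved, so the boot sequence stays deterministic.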
- the SML 50 provides system clocks to the processors 20 and the I/O controllers 24 (step 452). In other embodiments system clocks
- step 452 may be skipped.
- Using auxiliary busses 60, 60', the SML 50 asserts a reset signal associated with each
- the SML 50 takes any other steps necessary at
- the SML releases reset from the processor 20 and the I/O controller 24 identified in the boot list as the first boot pair while holding reset active for all other system units (step 458). This allows the selected boot pair to boot in a manner consistent with a traditional computer.
- the SML 50 monitors the boot process of the selected boot pair to determine if the boot process
- the SML 50 monitors the progress of the boot
- heartbeats are transmitted over system busses 42. Failure to receive a heartbeat
- the SML 50 selects a new boot pair from the boot list (step 462) and attempts to boot that processor-I/O controller pair.
- BIOS Basic Input/Output System
- the SML 50 indicates that the system 14' was unable to boot.
- the SML 50 removes all power from the processors 20 and the I/O controllers 24
- the BIOS transmits a message to the SML indicating that the operating system has booted properly. In this case, the SML transitions to a monitoring state (step 464). After successfully booting the first processor-I/O pair the SML
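The boot sequence described above (release one pair from reset, watch whether it boots, and fall through to the next pair on failure) might be sketched as follows. This is an illustration only: `deterministic_boot` and the `try_boot` callback are hypothetical, and the callback stands in for releasing reset on one pair, holding all other units in reset, and monitoring boot progress.

```python
def deterministic_boot(boot_list, try_boot):
    """Try each boot pair in list order.

    Returns the first pair that boots, or None if every pair fails,
    in which case the caller would report that the system could not boot.
    """
    for pair in boot_list:
        # Only this pair is released from reset; try_boot models that
        # step plus the monitoring of the pair's boot progress.
        if try_boot(pair):
            return pair
    return None
```

Because pairs are always tried in the same order, a unit that fails to boot fails reproducibly, which is what makes the fault observable.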
- the SML 50 enters a monitoring state (steps 412 or 464). In this state the SML 50 monitors heartbeat signals from each of the processors 20 to determine operation status
- a failure to receive a heartbeat signal from a processor 20 during a predetermined period indicates that a failure has occurred.
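Heartbeat monitoring of this kind can be sketched with a timestamp per processor. The `HeartbeatMonitor` class below is hypothetical, uses a single timeout for all processors for simplicity, and takes the current time as an argument so that the logic is easy to exercise.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each processor."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # the predetermined period
        self.last_seen = {}         # processor id -> time of last heartbeat

    def record(self, processor_id, now):
        """Note that a heartbeat arrived from processor_id at time now."""
        self.last_seen[processor_id] = now

    def failed(self, now):
        """Return the processors whose last heartbeat is older than the timeout."""
        return sorted(
            pid for pid, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A processor only counts as failed once the full period has elapsed without a heartbeat, so a single late signal within the window is tolerated.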
- the SML 50 consults a non-volatile memory element to determine what actions, if any, to take.
- the memory element may be the same memory element discussed above that stores the boot list, or a separate memory element may be provided that is accessible via the auxiliary busses 60, 60'.
- the memory element stores a value that indicates one of six actions for the SML 50 to take upon heartbeat failure: (1) no action; (2) normal interrupt; (3) non-maskable interrupt; (4) stop processor from executing; (5) system reboot; or (6) deterministic boot.
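The stored value can be read as an upper bound on how severe a recovery step the SML may take. The sketch below assumes that the SML tries each step in increasing severity up to that bound; the `RecoveryAction` enum, the `recover` function, and the `attempt` callback are hypothetical names, and the escalation order for the higher levels is an assumption.

```python
from enum import IntEnum


class RecoveryAction(IntEnum):
    NO_ACTION = 1
    NORMAL_INTERRUPT = 2
    NON_MASKABLE_INTERRUPT = 3
    STOP_PROCESSOR = 4
    SYSTEM_REBOOT = 5
    DETERMINISTIC_BOOT = 6


def recover(max_action, attempt):
    """Try increasingly severe actions, never exceeding max_action.

    attempt(action) returns True if that action restored operation.
    Returns the action that worked, or None if nothing permitted worked.
    """
    for action in RecoveryAction:  # members iterate in definition order
        if action is RecoveryAction.NO_ACTION:
            continue               # nothing to try; the failure is only logged
        if action > max_action:
            break                  # more severe than the configured limit
        if attempt(action):
            return action
    return None
```

A stored value of `NO_ACTION` therefore results in no recovery attempt at all, while `SYSTEM_REBOOT` permits everything short of a deterministic boot.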
- the SML 50 logs the failure but
- a memory value indicating "normal interrupt” restricts recovery attempts by the SML 50 to issuing normal interrupts to the processor 20 or processors 20 that have ceased to transmit a
- the SML 50 issues an interrupt to a target processor 20 via the auxiliary busses 60, 60'. If the processor's operating system is able to process the interrupt, it responds by restarting heartbeat transmission.
- the SML 50 issues interrupts to the processor or processors such that the processors resume lockstep operation. For example, interrupts may be issued to processors simultaneously which should avoid breaking lockstep.
- the SML 50 simply logs this failure. In other words,
- the SML 50 alerts an administrator that the system 14' will not respond.
- SML 50 to issuing normal and non-maskable interrupts to the processor 20 or processors 20 that
- the SML 50 issues a non-maskable interrupt to a target processor 20 via the I/O controller 24. If multiple processors 20 are hung, non-maskable interrupts are issued to all processors 20 in lockstep to avoid breaking processor lockstep. If the processor's operating system is able to process the non-maskable interrupt, it responds by restarting heartbeat
- the SML 50 must revoke the previously issued normal interrupt.
- the SML 50 simply logs this failure. In other embodiments, the SML 50 alerts an administrator that the system 14' will not
- a memory value indicating that processor execution should be suspended allows the
- processor 20 Processor and memory state of the suspended processor is not destroyed. If
- the state of the suspended processor 20 may be dumped for
- the state of the suspended processor may be replaced with state from one of the operational processors 20, or both. If this step fails to restore the system 14' to operational
- the SML 50 may dump the state of the suspended processor 20 for analysis by a system
- a memory value indicating "system reboot” allows the SML 50 to attempt to reboot the
- the reboot process is similar to the reboot process described in connection with FIGs. 4 and 4A, except that the
- suspended processor 20 is skipped during reboot of the boot pairs listed in the boot list.
- the SML 50 maintains an index to identify the processor-I/O boot pair in the boot list that last rebooted successfully. During the reboot process, this index is
- the state of the suspended processor 20 may be dumped for analysis, the state of the suspended processor 20 may be replaced with the state of one of the operational processors, or both.
- SML 50 may dump the state of the suspended processor 20 for analysis by a system
- a memory value indicating "deterministic boot" allows the SML 50 to abandon the state
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Hardware Redundancy (AREA)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2001257027A AU2001257027A1 (en) | 2000-04-14 | 2001-04-12 | Methods and apparatus for robust startup of a computer system having redundant components |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US55000400A | 2000-04-14 | 2000-04-14 | |
US09/549,733 US6691225B1 (en) | 2000-04-14 | 2000-04-14 | Method and apparatus for deterministically booting a computer system having redundant components |
US09/549,733 | 2000-04-14 | ||
US09/550,004 | 2000-04-14 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2001080007A2 true WO2001080007A2 (en) | 2001-10-25 |
WO2001080007A3 WO2001080007A3 (en) | 2002-12-12 |
Family
ID=27069212
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2001/011990 WO2001080007A2 (en) | 2000-04-14 | 2001-04-12 | Methods and apparatus for robust startup of a computer system having redundant components |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU2001257027A1 (en) |
WO (1) | WO2001080007A2 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2415805A (en) * | 2004-07-03 | 2006-01-04 | Diehl Bgt Defence Gmbh & Co Kg | Monitoring a fault-tolerant computer architecture at PCI bus level |
WO2011001685A1 (en) * | 2009-07-01 | 2011-01-06 | Panasonic Corporation | Secure boot method and secure boot apparatus |
CN102331786A (en) * | 2011-07-18 | 2012-01-25 | 北京航空航天大学 | A dual computer cold backup system for attitude and orbit control |
WO2021118901A1 (en) * | 2019-12-10 | 2021-06-17 | Cisco Technology, Inc. | Fault isolation and recovery of cpu cores for failed secondary asymmetric multiprocessing instance |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4377000A (en) * | 1980-05-05 | 1983-03-15 | Westinghouse Electric Corp. | Automatic fault detection and recovery system which provides stability and continuity of operation in an industrial multiprocessor control |
US5627962A (en) * | 1994-12-30 | 1997-05-06 | Compaq Computer Corporation | Circuit for reassigning the power-on processor in a multiprocessing system |
ES2169608T3 (en) * | 1998-05-19 | 2002-07-01 | Siemens Ag | CONTROL SYSTEM FOR CONTROLLING THE OPERATION OF A DISTRIBUTED SYSTEM. |
-
2001
- 2001-04-12 AU AU2001257027A patent/AU2001257027A1/en not_active Abandoned
- 2001-04-12 WO PCT/US2001/011990 patent/WO2001080007A2/en active Application Filing
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2415805A (en) * | 2004-07-03 | 2006-01-04 | Diehl Bgt Defence Gmbh & Co Kg | Monitoring a fault-tolerant computer architecture at PCI bus level |
WO2011001685A1 (en) * | 2009-07-01 | 2011-01-06 | Panasonic Corporation | Secure boot method and secure boot apparatus |
CN102449634A (en) * | 2009-07-01 | 2012-05-09 | 松下电器产业株式会社 | Secure boot method and secure boot apparatus |
US8892862B2 (en) | 2009-07-01 | 2014-11-18 | Panasonic Corporation | Secure boot method for executing a software component including updating a current integrity measurement based on whether the software component is enabled |
CN102331786A (en) * | 2011-07-18 | 2012-01-25 | 北京航空航天大学 | A dual computer cold backup system for attitude and orbit control |
CN102331786B (en) * | 2011-07-18 | 2013-05-08 | 北京航空航天大学 | Dual-computer cold-standby system of attitude and orbit control computer |
WO2021118901A1 (en) * | 2019-12-10 | 2021-06-17 | Cisco Technology, Inc. | Fault isolation and recovery of cpu cores for failed secondary asymmetric multiprocessing instance |
US11531607B2 (en) | 2019-12-10 | 2022-12-20 | Cisco Technology, Inc. | Fault isolation and recovery of CPU cores for failed secondary asymmetric multiprocessing instance |
US11847036B2 (en) | 2019-12-10 | 2023-12-19 | Cisco Technology, Inc. | Fault isolation and recovery of CPU cores for failed secondary asymmetric multiprocessing instance |
US12222830B2 (en) | 2019-12-10 | 2025-02-11 | Cisco Technology, Inc. | Fault isolation and recovery of CPU cores for failed secondary asymmetric multiprocessing instance |
Also Published As
Publication number | Publication date |
---|---|
WO2001080007A3 (en) | 2002-12-12 |
AU2001257027A1 (en) | 2001-10-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6691225B1 (en) | | Method and apparatus for deterministically booting a computer system having redundant components |
US7577871B2 (en) | | Computer system and method having isolatable storage for enhanced immunity to viral and malicious code infection |
KR100620216B1 (en) | | Network-expanded basic input/output system that enables remote management of computers without a functioning operating system |
US7003775B2 (en) | | Hardware implementation of an application-level watchdog timer |
US7840796B2 (en) | | Booting to a recovery/maintenance environment |
US6807643B2 (en) | | Method and apparatus for providing diagnosis of a processor without an operating system boot |
US6880110B2 (en) | | Self-repairing computer having protected software template and isolated trusted computing environment for automated recovery from virus and hacker attack |
US7487343B1 (en) | | Method and apparatus for boot image selection and recovery via a remote management module |
US20040078679A1 (en) | | Autonomous boot failure detection and recovery |
US6246666B1 (en) | | Method and apparatus for controlling an input/output subsystem in a failed network server |
US20100162043A1 (en) | | Method, Apparatus, and System for Restarting an Emulated Mainframe IOP |
US20080301272A1 (en) | | Quorum-based power-down of unresponsive servers in a computer cluster |
JPH11504459A (en) | | Enhanced BIOS adapted for remote diagnostic repair |
GB2328045A (en) | | Data processing system diagnostics |
CN100383748C (en) | | Policy-based responses to system errors that occur during OS runtime |
EP1119809A1 (en) | | Process monitoring in a computer system |
US20030051127A1 (en) | | Method of booting electronic apparatus, electronic apparatus and program |
US6275930B1 (en) | | Method, computer, and article of manufacturing for fault tolerant booting |
US20050044207A1 (en) | | Service processor-based system discovery and configuration |
JP2002215399A (en) | | Computer system |
JP2003173272A (en) | | Information processing system, information processing device and maintenance center |
WO2001080007A2 (en) | | Methods and apparatus for robust startup of a computer system having redundant components |
US12405848B2 (en) | | Error correction dynamic method to detect and troubleshoot system boot failures |
US6438689B1 (en) | | Remote reboot of hung systems in a data processing system |
TW200521837A (en) | | Method for switching to boot multi-processor computer system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AK | Designated states | Kind code of ref document: A2; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
 | AL | Designated countries for regional patents | Kind code of ref document: A2; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
 | REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |
 | AK | Designated states | Kind code of ref document: A3; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
 | AL | Designated countries for regional patents | Kind code of ref document: A3; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
 | 122 | EP: PCT application non-entry in European phase | |
 | NENP | Non-entry into the national phase | Ref country code: JP |