CN1568467B - Exactly once cache framework

Info

Publication number: CN1568467B
Application number: CN02820026.8A
Authority: CN
Inventors: Dean B. Jacobs; Eric Halpern
Original assignee: BEA Systems Inc
Current assignee: Oracle International Corp
Priority date: 2001-09-06
Filing date: 2002-09-05
Publication date: 2010-06-16
Anticipated expiration: 2022-09-05
Also published as: WO2003023633A1; EP1433073A1; EP1433073A4; AU2002332845B2; CN1568467A; JP2005502957A

Abstract

A system for managing objects in a clustered network includes a file system (212) that contains at least one copy of a data object (214). The system may include a plurality of cluster servers in communication with the file system (212). A lead server is selected that contains a distributed consensus algorithm for selecting a master server (206) and that uses multicast when executing rounds of the algorithm. The selected master server (206) may contain a copy (208) of the data object (214), for example in a local cache, and provides access to that local copy (208) to any other server in the cluster. Changes to items hosted by the master server (206) may also be written to the file system (212). If the master server (206) becomes unable to host the object, a new master server may be selected using the distributed consensus algorithm, and the other servers (216, 218) are notified of the new master server by multicast.

Description

Exactly once cache framework

This application claims priority to the following applications, which are incorporated herein by reference:

U.S. Provisional Patent Application No. 60/317,718, entitled "EXACTLY ONCE CACHE FRAMEWORK," filed September 6, 2001 by Dean Bernard Jacobs and Eric Halpern. U.S. Patent Application entitled "EXACTLY ONCE CACHE FRAMEWORK," filed September 4, 2002 by Dean Bernard Jacobs and Eric Halpern.

U.S. Provisional Patent Application No. 60/317,566, entitled "EXACTLY ONCE JMS COMMUNICATION," filed September 6, 2001 by Dean Bernard Jacobs and Eric Halpern.

U.S. Provisional Patent Application entitled "EXACTLY ONCE JMS COMMUNICATION," filed September 4, 2002 by Dean Bernard Jacobs and Eric Halpern.

Copyright notice

A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.

Cross-referenced cases:

The following U.S. patent application is cross-referenced and incorporated herein by reference:

U.S. Patent Application No. 60/305,986, entitled "DATA REPLICATION PROTOCOL," filed July 16, 2001 by Dean Bernard Jacobs, Reto Kramer, and Ananthan Bala Srinivasan.

Technical field

The present invention relates to techniques for distributing objects among servers in a network cluster.

Background

In distributed computer systems, several servers and/or network nodes often must work together. These servers and nodes must be coordinated when certain network information has to be shared among them so that they can function as a single entity. The methods commonly used to coordinate such devices are very expensive in terms of resources and efficiency.

Because several messages pass between the nodes, some form of synchronization is usually needed to bring the nodes into agreement. Such a synchronization requirement is undesirable in a clustered network environment, and many cluster environments simply avoid imposing one. In some applications, however, this agreement is necessary.

In some cases where agreement is necessary, there is a device to which the cluster wants exclusive access. One such device is a transaction log on a file system. Whenever a transaction is in progress, certain objects must be stored persistently so that, if a failure occurs, the persistently stored objects can be recovered.

For objects that must be stored in a single location, a transaction monitor typically runs on each server in the cluster or domain and uses the local file system to access the object. Each server can have its own transaction manager, so persistence poses little problem. Because each server has its own transaction manager, no coordination is needed either.

For example, there may be a cluster of three servers, each with a transaction manager. One of these servers may suffer a failure or other problem that makes it unavailable to the cluster. Because the failed server is the only server with access to its particular transaction log, no transaction in that log can be recovered until the server is again available to the cluster. Recovering the log is difficult, or at least inefficient, because the server may take a long time to resolve the problem. Serious server problems might include events such as a shorted motherboard or a burned-out power supply.

Summary of the invention

The present invention includes a system for managing objects, such as objects stored on servers in a network or cluster. The system includes a data source, application, or service, such as a file system or a Java Message Service component, located inside or outside the cluster. The system includes several servers that communicate with the file system or application, such as over a high-speed network connection.

The system includes a lead server, such as one agreed upon by the other servers. The lead server may be contained in a hardware or software cluster. The system includes an algorithm for selecting the lead server from among the servers, which may be, for example, an algorithm built into a hardware cluster machine. The lead server in turn includes a distributed consensus algorithm, such as the Paxos algorithm, for selecting a master server. The algorithm used to select the lead server may be the same as, or different from, the algorithm used to select the master server.

The master server includes a copy of an item or object, such as one stored in a local cache. The master server provides access to this local copy to any server on the network or in the cluster. The master server may also serve as the sole access point to an object stored in the file system, or as the sole access point to an application or service. Any change made to an item cached, hosted, or owned by the master server may also be written to the file system, application, or service.

If the master server becomes unable to host the object, a new host is selected using the distributed consensus algorithm, after which the data needed for the object is fetched from the file system or service. The other servers in the cluster are notified that a new server is hosting the object. The servers may be notified by any suitable means, such as a point-to-point connection or multicast.

Brief description of the drawings

Figure 1 is a diagram of a distributed object system in accordance with one embodiment of the present invention;

Figure 2 is a diagram of another distributed object system in accordance with one embodiment of the present invention;

Figure 3 is a flowchart of a method for selecting a master server in accordance with the present invention;

Figure 4 is a flowchart of a method for selecting a new master server in accordance with the present invention;

Figure 5 is a flowchart of a method for using a lead server in accordance with the present invention;

Figure 6 is a diagram of a JMS message store system in accordance with one embodiment of the present invention;

Figure 7 is a block diagram of the components of a computer system used in accordance with the present invention.

Detailed description

Systems in accordance with the present invention can provide solutions to availability problems, such as when a server that owns a data object becomes unavailable to the server cluster. One such solution allows another server in the cluster to take over ownership of the data object. A problem arises, however, in making the data object accessible to both servers without replicating it on both.

If the cluster uses a file system, data store, or database (referred to collectively below as a "file system") to store data persistently, and more than one server can access that file system, then a second server can automatically take over the task of accessing a data object if the first server that owns the object encounters a problem. Alternatively, an algorithm used by the cluster, or by a server in the cluster, may instruct a server to take over ownership of the item. Another fundamental problem, however, is getting the cluster to agree on which server currently owns a resource or object, that is, reaching "consensus" among the servers.

Figure 1 shows an example of a cluster system 100 in accordance with the present invention, in which an object such as a transaction log 114 is stored in a file system 112. All servers 106, 116, 118 in the cluster 110 can access the file system 112, but only one of them can access the log 114 at any given time. A master server 106 among the servers in the cluster 110 "owns" or "hosts" the log 114, such as by storing a copy 108 of the log 114 or by having sole access to the log 114 in the file system 112. Any other server 116, 118 in the cluster 110 can access the copy 108 of the log, and/or can access the log 114 through the master server 106. For example, a client or browser 102 may make a request over the network 104 that is directed to a server 116 in the cluster 110. That server can access the copy 108 of the transaction log on the master server 106 over the network 104. If the transaction log must be updated, the copy 108 is updated together with the original log 114 on the file system 112.

A server can "own" or "host" a data object when, for example, it acts as the repository for that object. It may do so by storing a copy of the data object in a local cache and making that copy available to the other servers in the cluster, or by being the only server that can directly access the object in the file system, so that all other servers in the cluster must go through the master server to reach the object. This guarantees that the object exists "exactly once" in the server cluster.

Figure 3 shows a process 300 for establishing the hosting of an object. A master server is selected using a distributed consensus algorithm, such as the Paxos algorithm (302). Such an algorithm is called a "distributed consensus" algorithm because the servers in a cluster must generally reach agreement, or consensus, on how objects are distributed among the cluster servers.

If the hosted object is, for example, to be cached on the master server, a copy of the data object is fetched from the file system, transferred to the master server, and stored as an object in a local cache (304). The other servers on the network or in the appropriate cluster are then notified, such as by the master server, that a local copy of the object exists on the master server and that this local copy is to be used in processing future network requests (306).

In the Paxos algorithm, one example of a distributed consensus algorithm, a server is selected as master or lead server by a network server that leads a series of "consensus rounds." In each consensus round, a new master or lead server is proposed. Rounds continue until a majority or quorum of servers accepts the proposed server. Any server can propose a master or lead server by starting a round, although the system can be configured so that the lead server always starts the round that selects a master server. Rounds for different selections can proceed at the same time. A round is therefore identified by a round number, or by a pair of values in which one value refers to the round and the other to the server leading it.

The steps of such a round are as follows, although other steps and/or approaches may suit certain situations or applications. First, the lead server starts a round by sending a "Collect" message to the other servers in the cluster. The Collect message gathers information from the servers about the previous rounds in which they took part. If there have been previous consensus rounds for this particular selection process, the Collect message also tells the servers not to commit selections from those rounds. Once the lead server has gathered responses from, for example, at least half of the cluster servers, it can decide which value to propose for the next round and send that proposal to the cluster servers as a "Begin" message. For the lead server to choose a value to propose in this way, it must receive initial value information from the servers.

When a server receives the Begin message from the lead server, it responds with an "Accept" message, indicating that it accepts the proposed master or lead server. If the lead server receives Accept messages from a majority or quorum of servers, it sets its output value to the value proposed in the round. If the lead server does not receive a majority or quorum of acceptances ("consensus") within a specified period, it can start a new round. If the lead server does reach consensus, it can notify the cluster or network servers that the proposed server has been selected. This notification can be delivered to the network servers by any suitable technique, such as point-to-point connections or multicast.

The agreement condition of the consensus approach can be guaranteed by proposing selections that take into account information about previous rounds. This information must come from at least a majority of the network servers, so that any two rounds have at least one participating server in common.

The lead server can choose a value by asking each server for the number of the latest round in which that server accepted a value, and possibly for the accepted value itself. Once the lead server has this information from a majority or quorum of servers, it can choose for the new round the value of the latest round among the responses. If no server took part in a previous round, the lead server can choose an initial value. If, for example, a server responds that the last round it accepted was x and the current round is y, that server is stating that it will not accept any round between x and y, which preserves consistency.
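The round described above can be sketched in Python. This is a minimal single-process illustration under stated assumptions, not the patented implementation: the `Acceptor` class, its field names, and the `run_round` function are hypothetical, round 0 stands for "nothing accepted yet," and message passing is modeled as direct method calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acceptor:
    """One cluster server taking part in consensus rounds (hypothetical model)."""
    name: str
    reachable: bool = True
    promised_round: int = 0               # highest round seen in a Collect
    accepted_round: int = 0               # 0 means nothing accepted yet
    accepted_value: Optional[str] = None

    def on_collect(self, rnd):
        """Reply to a Collect with (last accepted round, its value), or None."""
        if not self.reachable or rnd <= self.promised_round:
            return None
        self.promised_round = rnd         # refuse anything older from now on
        return (self.accepted_round, self.accepted_value)

    def on_begin(self, rnd, value):
        """Accept the proposal unless a newer round has already been promised."""
        if not self.reachable or rnd < self.promised_round:
            return False
        self.promised_round = rnd
        self.accepted_round, self.accepted_value = rnd, value
        return True


def run_round(rnd, initial_value, acceptors):
    """Lead one round; return the chosen value, or None if no quorum was reached."""
    quorum = len(acceptors) // 2 + 1
    replies = [r for r in (a.on_collect(rnd) for a in acceptors) if r is not None]
    if len(replies) < quorum:
        return None                       # the lead server may start a new round
    # Propose the value of the latest accepted round, else the initial value.
    latest_round, latest_value = max(replies, key=lambda reply: reply[0])
    value = latest_value if latest_round > 0 else initial_value
    accepts = sum(a.on_begin(rnd, value) for a in acceptors)
    return value if accepts >= quorum else None
```

In this sketch, a later round that starts with a different initial value still returns the value accepted earlier, because a majority of replies always includes at least one server that took part in the earlier round.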

A sample interaction between the lead server of a round and the network servers includes the following messages:

(1) "Collect": a message sent to the servers announcing that a new round "r" is starting. The message can take the form m = ("Collect", r).

(2) "Last": a message sent from a network server to the lead server, giving the last round "a" in which that server accepted a value and the value "v" of that round. The message can take the form m = ("Last", r, a, v).

(3) "Begin": a message sent to the servers announcing the value proposed for round r. The message can take the form m = ("Begin", r, v).

(4) "Accept": a message sent from a server to the lead server, accepting the value proposed for round r. The message can take the form m = ("Accept", r).

(5) "Success": a message sent to the servers announcing that the value v has been selected for round r. The message can take the form m = ("Success", r, v).

(6) "Ack": a message sent from a server to the lead server, acknowledging that the decision for round r has been received. The message can take the form m = ("Ack", r).
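The six message forms listed above can be written down as simple typed tuples. The following Python sketch is illustrative only: the class names mirror the message names in the list, but the field types, the `wire_form` helper, and the `successful_round` function are assumptions made for this example.

```python
from typing import NamedTuple, Optional


class Collect(NamedTuple):
    r: int                  # round being started


class Last(NamedTuple):
    r: int                  # round being answered
    a: int                  # last round in which the sender accepted a value
    v: Optional[str]        # value accepted in round a, if any


class Begin(NamedTuple):
    r: int
    v: str                  # value proposed for round r


class Accept(NamedTuple):
    r: int


class Success(NamedTuple):
    r: int
    v: str                  # value selected in round r


class Ack(NamedTuple):
    r: int


def wire_form(message):
    """Render a message as the tuple m = (name, fields...) used in the list above."""
    return (type(message).__name__, *message)


def successful_round(r, v):
    """Messages exchanged between the lead server and one server when round r succeeds."""
    return [
        ("lead -> server", Collect(r)),
        ("server -> lead", Last(r, 0, None)),   # 0/None: nothing accepted before
        ("lead -> server", Begin(r, v)),
        ("server -> lead", Accept(r)),
        ("lead -> server", Success(r, v)),
        ("server -> lead", Ack(r)),
    ]
```

For instance, `wire_form(Begin(4, "server-b"))` yields `("Begin", 4, "server-b")`, matching the form m = ("Begin", r, v).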

There may be a file system that is separate from the servers and is located inside or outside a hardware or software cluster. The file system stores the transaction log persistently, such as by storing the log on a first disk and copying it to a second disk within the file system. If the first disk crashes, the file system can hide the crash from the cluster and/or servers and obtain the log information from the second disk. The file system can also choose to copy the log to a third disk that serves as a backup for the second disk.

From the point of view of the servers in the cluster, the file system can be a single resource. In one embodiment, the servers care only that a single server owns the file system at any given time.

Another example in accordance with the present invention involves a cache in a server cluster. For reasons such as network performance, it may be desirable in a clustered environment for a single cache to represent a data object to the servers in the cluster. Keeping items in a single cache can be advantageous because servers in the cluster can access the cache without continually going back to persistent storage. Fetching items that are already in memory can greatly improve the efficiency of the system, because going to a database or file system is comparatively time consuming.

One problem with a single cache, however, is that the objects held in memory must be guaranteed to match the objects stored on disk in the file system. One reason for requiring this consistency is to ensure that any operation or computation performed on a cached object produces the correct result. Another reason is that the cache must be rebuilt from the file system if it crashes or otherwise becomes corrupted or unavailable.

There are at least two basic ways to handle this kind of caching in a cluster, although other ways may work for at least some applications. One way is to replicate the cache in several places. This approach is problematic because any change to a cached item requires every server holding a replica of the cache to agree to the change, or at least to learn of it. This has proven very expensive in terms of both resources and performance.

An alternative approach in accordance with the present invention assigns particular servers to be the owners of the caches in the cluster, and all access to a cache goes through its owning server. Any server in the cluster can host such a cache. Each server can host one cache, several caches, or none. The caches can be hosted on a single server or spread among some or all of the servers in the cluster. The cluster itself can be any appropriate cluster, such as a hardware cluster or a group of servers designated by a software application as belonging to a given "software" cluster.

A transaction log and/or a cache can each be regarded as an example of a kind of object that resides somewhere on the system. It may be desirable to guarantee that any such object exists only once in a cluster and that it is always available. It may also be desirable to guarantee that, if the server hosting the object fails, the object can be recovered on another server and made available to the cluster.

Figure 4 shows a method 400 for recovery. In this method, it is determined whether the master server can continue to host the object (402), for example by determining whether the server is still available to the network. If not, a new host is selected using the distributed consensus algorithm; this selection can follow the same method used to select the original host (404). A copy of the data object is fetched from the file system, provided to the new host, and stored in a local cache (406). The other servers on the network or in the appropriate cluster are notified that the new master server holds a local copy of the object and that this local copy is to be used in processing any future network requests (408).
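The recovery steps of method 400 can be sketched as follows. This Python example is a simplified illustration under stated assumptions: the `Server` class and the `choose_host` and `recover` functions are hypothetical names, the file system is modeled as a dictionary, and `choose_host` is only a placeholder for the distributed consensus step (it picks the live server with the lowest name rather than running consensus rounds).

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """A cluster server with a local cache and a view of who hosts each object."""
    name: str
    alive: bool = True
    cache: dict = field(default_factory=dict)
    host_of: dict = field(default_factory=dict)


def choose_host(servers):
    """Placeholder for the consensus step: pick the live server with the lowest name."""
    live = [s for s in servers if s.alive]
    if not live:
        raise RuntimeError("no server is available to host the object")
    return min(live, key=lambda s: s.name)


def recover(object_name, current_host, servers, file_system):
    """Return the server hosting object_name, selecting a new host if needed."""
    if current_host is not None and current_host.alive:      # step 402
        return current_host
    new_host = choose_host(servers)                          # step 404
    new_host.cache[object_name] = file_system[object_name]   # step 406
    for server in servers:                                   # step 408
        if server.alive:
            server.host_of[object_name] = new_host.name
    return new_host
```

Calling `recover` with `current_host=None` also covers the initial hosting case, since the same selection, caching, and notification steps apply.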

Systems and methods in accordance with the present invention can define objects that exist in exactly one place in a cluster and can guarantee that these objects always exist. From a server's point of view, it does not matter whether an object such as a transaction log is mirrored or replicated, for example by the file system. From a server's point of view, there is always a persistent store that any server in the cluster can access. The system can periodically check that an object exists, or can assign ownership of an object for only a short period so that the object is frequently reassigned, which ensures that it exists on some machine on the network or in the cluster.

A hardware cluster comprises a group of machines, each of which can run multiple servers. A file system can sit behind each machine. The servers in a hardware cluster are typically fixed by the hardware configuration, so they can make decisions quickly and deal with server failures inside the hardware cluster. The size of a hardware cluster is limited by the physical hardware of the machines that contain the servers. Servers in a hardware cluster can be used as servers in a software cluster and can also act as network servers, since the individual servers on a machine are available to the network.

A shared file system for one of these machines can be made available to all servers in the cluster, such as over a high-speed network. The file system can also be redundant. In one embodiment, this redundancy is achieved by using multiple data disks in the file system. In such a redundant implementation, an object is copied across several disks whenever it is written to the file system. Viewed as a "black box," the file system can withstand the failure of any one disk and still give every server in the cluster access to the data items.
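The redundant "black box" behavior described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions: the `MirroredStore` class and its method names are hypothetical, disks are modeled as in-memory dictionaries, and a failed disk is simply skipped.

```python
class MirroredStore:
    """Toy file system that copies every object to all of its healthy disks."""

    def __init__(self, disk_count=2):
        self.disks = [dict() for _ in range(disk_count)]
        self.failed = set()                 # indexes of disks that have crashed

    def fail_disk(self, index):
        """Simulate a crash of one disk."""
        self.failed.add(index)

    def write(self, key, value):
        """Write the object to every disk that is still healthy."""
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                disk[key] = value

    def read(self, key):
        """Read from the first healthy disk; callers never see which one."""
        for index, disk in enumerate(self.disks):
            if index not in self.failed and key in disk:
                return disk[key]
        raise KeyError(key)
```

Servers in this sketch call only `read` and `write`, so the loss of a single disk stays hidden from them.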

Provided that the objects kept in memory are always backed by a reliable, persistent storage mechanism, it is possible to build what is referred to here as an "exactly once" framework in accordance with the present invention. For example, there may be an object that represents a transaction log. Whenever a call is made to that object, the corresponding transaction log is updated. This may involve a call that reads from or writes to a database. The object representing the transaction log resides on one server in the cluster, such as the master server. As long as at least one server in the cluster is up and running, the exactly-once framework ensures that, if the server hosting the log fails, another server can take over ownership of the log.

There may also be an object that represents a cache. Whenever the cache is updated, the update can also be written back to persistent storage. When one of the servers needs to use a data item, that server is required to go through this object. If the server hosting the object that represents the cache fails, the object is recovered on another server. The recovered object can retrieve all of the information it needs from persistent storage.

The exactly-once framework can serve as a memory buffer used by the cluster. The framework provides a single cache that represents data in the system and is backed by reliable, persistent storage. Whenever data is read from the cache, the read can be performed without accessing persistent storage. When an update is written to the cache, however, it must be written through to persistent storage so that the system can recover if a failure occurs.
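The read and write behavior described above corresponds to a write-through cache. The Python sketch below is a minimal illustration, not the framework itself: `CountingStore` and `WriteThroughCache` are hypothetical names, and the store counts its reads only to make cache hits visible.

```python
class CountingStore:
    """Stand-in for persistent storage; counts reads so cache hits are visible."""

    def __init__(self):
        self.data = {}
        self.reads = 0

    def load(self, key):
        self.reads += 1
        return self.data[key]

    def save(self, key, value):
        self.data[key] = value


class WriteThroughCache:
    """Single cache in front of a persistent store."""

    def __init__(self, store):
        self.store = store
        self.entries = {}

    def read(self, key):
        """Serve from memory when possible; fall back to the store on a miss."""
        if key not in self.entries:
            self.entries[key] = self.store.load(key)
        return self.entries[key]

    def write(self, key, value):
        """Persist first, then update memory, so a crash never loses an update."""
        self.store.save(key, value)
        self.entries[key] = value
```

A replacement cache created on another server with the same store starts empty and refills itself from persistent storage on its first reads, which is the recovery path the text describes.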

One important aspect of the exactly-once framework is the way it abstracts an approach that can vary by application and/or implementation. A new type of distributed object is created, which may be called an "exactly-once object." An exactly-once object is, for example, a locally cached copy of a data item in the file system, or the sole access point to that data item for the servers in the cluster. The underlying techniques that implement this abstraction are also important.

Systems in accordance with the present invention can use any of several approaches to distributed consensus, such as the approach using the Paxos algorithm described above. The algorithm chosen should give multiple and/or distributed nodes an efficient way to agree on a value for an object. It can also be chosen so that it keeps working even if nodes fail and/or return during the agreement process.

A common approach in network clusters is to use reliable broadcast, in which every message is guaranteed to be delivered to each intended recipient, or at least to each intended server that is functioning. This approach makes the system hard to parallelize, because reliable broadcast requires a recipient to acknowledge receipt of a message before the sender moves on to the next message or recipient. A distributed algorithm that uses multicast gives a weaker guarantee, because multicast does not guarantee that every server receives a message. Multicast does simplify the approach, however, so that the system can work in parallel: a single message is multicast to all cluster servers concurrently, without waiting for a response from each server. A server that does not receive a multicast message can later pull the information from the lead server or from another cluster server or network server. As used herein, a network server can refer to any server on the network, whether it is in a hardware cluster, in a software cluster, or outside any cluster.

Another important aspect of the exactly-once framework is that it reduces the difficulty of reaching agreement. According to the present invention, the performance of a distributed consensus implementation can be improved by multicasting the messages of the distributed consensus algorithm. This approach can minimize the message exchange and/or network traffic needed for all the servers to agree with one another.

Several approaches to multicasting can be used. In the first approach, referred to as "one-phase distribution", the lead server multicasts a message to all other servers on the network, for example in a round of the Paxos algorithm or to announce that a new master has been selected for an object. In this approach the lead server needs to send only one message, which reaches every server that is available on the network. A server that is temporarily disconnected from the network must find out the identity of the new master after it comes back on the network.
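
The one-phase approach, including the catch-up step for a server that missed the announcement, might be sketched as follows in Python. This is an in-process simulation only: real multicast is replaced by a loop over connected servers, and the names `ClusterServer`, `LeadServer`, `announce_master`, and `rejoin` are assumptions introduced for illustration.

```python
class ClusterServer:
    def __init__(self, name):
        self.name = name
        self.connected = True
        self.master_for = {}        # object name -> name of the hosting server

    def on_multicast(self, obj, master):
        self.master_for[obj] = master


class LeadServer(ClusterServer):
    def __init__(self, name, servers):
        super().__init__(name)
        self.servers = servers

    def announce_master(self, obj, master):
        """One-phase distribution: send one message and await no acknowledgements."""
        self.master_for[obj] = master
        for server in self.servers:
            if server.connected:    # disconnected servers simply miss the message
                server.on_multicast(obj, master)

    def lookup(self, obj):
        return self.master_for.get(obj)


def rejoin(server, lead, objects):
    """A server that missed announcements pulls the current masters after reconnecting."""
    server.connected = True
    for obj in objects:
        server.master_for[obj] = lead.lookup(obj)


a, b = ClusterServer("A"), ClusterServer("B")
lead = LeadServer("lead", [a, b])

b.connected = False                   # B is temporarily off the network
lead.announce_master("cache", "A")
missed = b.master_for.get("cache")    # B never saw the announcement

rejoin(b, lead, ["cache"])            # B catches up by asking the lead server
```

The sender does no per-recipient bookkeeping here; the cost of a missed message is pushed to the server that missed it.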

In another multicast approach, referred to as "two-phase distribution", the lead server uses an appropriate algorithm to pre-select a master server. Before assigning an object to that master, however, the lead server contacts every other server in the cluster to determine whether the servers agree with the choice of the new master. The lead server can contact each server over a point-to-point connection, or it can send a multicast request and then wait for each server to respond. If any server does not agree with the selected master, the lead server uses the algorithm to pre-select a new master. The lead server then sends another multicast request, carrying the identity of the newly pre-selected master, in another round.

If every server agrees with the pre-selected master, the lead server assigns the object to the master server. The lead server then multicasts a commit message to inform the servers that the change has taken effect, so that they update their information accordingly.
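
The two-phase approach can be sketched in Python as a propose-then-commit loop. The sketch is illustrative only: `Member`, `two_phase_distribute`, the fixed candidate order, and the `rejects` set used to model a server's disagreement are hypothetical, and direct method calls stand in for the point-to-point or multicast messages.

```python
class Member:
    def __init__(self, name, rejects=()):
        self.name = name
        self.rejects = set(rejects)   # candidates this server will not agree to
        self.master = None            # committed master, as known to this server

    def vote(self, candidate):
        return candidate not in self.rejects

    def commit(self, master):
        self.master = master


def two_phase_distribute(candidates, members):
    """Return the committed master, or None if no candidate is accepted by all."""
    for candidate in candidates:
        # Phase 1: propose the pre-selected master and wait for every reply.
        if all(m.vote(candidate) for m in members):
            # Phase 2: the object would be assigned to the master here,
            # after which the commit message is sent to every server.
            for m in members:
                m.commit(candidate)
            return candidate
        # At least one server disagreed: pre-select another master and try again.
    return None


members = [Member("A"), Member("B", rejects={"A"}), Member("C")]
chosen = two_phase_distribute(["A", "B", "C"], members)
```

Compared with one-phase distribution, no server learns of a master that another server has refused, at the price of an extra round of messages per attempt.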

An exactly-once framework can also use a "leasing" mechanism. With this mechanism, an algorithm, for example a distributed consensus algorithm, is used to get the cluster servers to agree on a lead server. Once selected, the lead server is responsible for assigning exactly-once objects to the various servers in the cluster. The system can be set up so that the cluster servers always agree on a new lead server if the existing lead server fails.

While the lead server is active, it can know about every exactly-once object that needs to exist in the system. The lead server can decide which server will host each object and can then "lease" the object to the selected server. When an object is leased to a server, that server owns or hosts the object for a certain period of time, the lease period. The lead server can be configured to renew these leases periodically. This ensures that a server does not have its lease renewed if it has failed, has become disconnected in any way, or is otherwise unable to operate properly within the cluster.

The biggest problem with a distributed system in the event of a failure is that it is hard to tell a server that has failed from one that is merely not responding. Any server that has become disconnected from the network, for whatever reason, should no longer host objects. Even though such a server cannot reach the cluster, it still knows that it must stop hosting all of its objects once the lease period has elapsed. While the server is unavailable to the cluster, its leases are not renewed.

The lead server likewise knows that if it cannot reach the master server for a certain amount of time, the master server will have given up ownership of the object. The lease period can be any suitable length of time, such as a few seconds. The lease period can be the same for all objects in the cluster, or it can vary from object to object.
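
The leasing behavior described in the preceding paragraphs can be sketched in Python with explicit logical timestamps. The names `Lease` and `LeaseManager`, the five-unit lease period, and the single shared clock are assumptions made for illustration; a real deployment would also have to allow for clock differences between servers.

```python
class Lease:
    def __init__(self, obj, holder, now, period):
        self.obj = obj
        self.holder = holder
        self.period = period
        self.expires_at = now + period

    def renew(self, now):
        self.expires_at = now + self.period

    def is_valid(self, now):
        return now < self.expires_at


class LeaseManager:
    """Lead-server side: grants leases and renews them for reachable holders."""

    def __init__(self, period):
        self.period = period
        self.leases = {}

    def grant(self, obj, holder, now):
        self.leases[obj] = Lease(obj, holder, now, self.period)
        return self.leases[obj]

    def renew_all(self, reachable, now):
        # Only holders the lead server can still reach get their leases renewed.
        for lease in self.leases.values():
            if lease.holder in reachable and lease.is_valid(now):
                lease.renew(now)

    def reassign_expired(self, new_holder, now):
        # Once a lease has lapsed, the old holder has given up the object,
        # so it is safe to lease the object to another server.
        moved = []
        for obj, lease in list(self.leases.items()):
            if not lease.is_valid(now):
                self.grant(obj, new_holder, now)
                moved.append(obj)
        return moved


mgr = LeaseManager(period=5)
mgr.grant("cache", "A", now=0)
mgr.renew_all(reachable={"A"}, now=3)        # A is reachable: lease now runs to t=8
mgr.renew_all(reachable=set(), now=7)        # A is unreachable: no renewal
still_a = mgr.leases["cache"].is_valid(7)    # A may keep hosting until t=8
moved = mgr.reassign_expired("B", now=9)     # lease lapsed, so B takes over
```

Because both sides reason from the same expiry time, the lead server never needs to decide whether server A has crashed or is merely slow: it only waits for the lease to run out.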

Systems that use an exactly-once framework can also be more tightly coupled. Operating systems often provide special mechanisms that are closer to the hardware and offer more control. One problem with this approach, however, is that it is limited by the available hardware. A hardware cluster, for example, may contain on the order of 16 servers. Because these systems need some form of tight hardware coupling, the number of servers that can be included in a cluster is limited.

An exactly-once framework, on the other hand, can handle larger clusters than a dedicated hardware cluster can. The framework allows some of the quality of service obtained from a dedicated cluster to be leveraged, thereby allowing larger clusters. Different qualities of service can include, for example, whether messages are delivered by a reliable protocol such as point-to-point connections, or by a less reliable but more resource-friendly protocol such as multicast. An advantage of an exactly-once framework is the ability to balance scalability against fault tolerance, so that users can adapt the system to the needs of a particular application.

Existing systems such as hardware clusters attempt a high-availability solution by presenting a single machine to the cluster and backing it with a second machine. If the first machine fails, a "buddy" takes over, and any software that was running on the first machine is moved to the second.

An exactly-once framework according to the present invention can place the lead server on a server in one of these hardware clusters, so that failure of the lead server is handled more quickly than a failure in a software cluster. The lead server can nevertheless issue leases to servers regardless of whether those servers are in the hardware cluster or in a software cluster. This arrangement allows the lead server to recover more quickly while still allowing the software cluster, which includes the hardware cluster, to be larger than the hardware cluster.

One such system 200 is shown in FIG. 2. Hardware cluster 218 comprises a single machine containing multiple servers 220, 222, 224. The hardware cluster can be used to select a lead server 220 from among the servers on that machine, for example to improve efficiency. Once lead server 220 is selected, it can select a master server 206 for object 214 in file system 212, which may be located inside or outside software cluster 210. File system 212 can itself replicate object 214 as a second object 216 on another disk of the file system, thereby providing persistence. The new master server 206 pulls object 214 from file system 212 and caches a copy 208 of the object on master server 206. When a server such as 206, 216, or 220 receives a request from a browser or client 202 over network 204, and that server needs access to the cached copy 208 of the object, the server knows to contact master server 206.

A method 500 for using such a system is shown in FIG. 5. A lead server is selected using an algorithm of the hardware cluster (502). The algorithm can be, for example, an algorithm specific to the hardware cluster machine, or a distributed consensus algorithm that requires agreement only among the hardware cluster servers. A master server is then pre-selected using a distributed consensus algorithm, such as the Paxos algorithm, on the lead server (504). The identity of the pre-selected master is then multicast to the other servers in the software cluster, which includes the hardware cluster. The lead server receives agreement or disagreement from each server that is currently operational and connected to the cluster (508). If the servers agree with the pre-selected master, a commit message is multicast to the cluster servers to inform them that the pre-selected master now hosts the object; otherwise, a new master is pre-selected and the process starts again.

An exactly-once framework can be used, for example, to process or cache transaction records. Such a framework can also be used, for example, to define a management server as an exactly-once object and to lease the management server so that it never goes down.

FIG. 6 shows another example of a clustered system 600 according to the present invention, in which object 608 serves as the message store for a Java Message Service (JMS) 612. All of the servers 606, 614, 616 in cluster 610 can use JMS, but they must send messages to message store 608, and retrieve any messages from message store 608, over network 604. One of the servers in cluster 610, master server 606, "owns" or "hosts" message store 608. A client or browser 602 can make a request over network 604 that is directed to server 616 in cluster 610. Server 616 can access JMS only by sending a message over network 604 to message store 608 on master server 606.
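
The single point of access in FIG. 6 can be illustrated with a small Python sketch. It does not use the JMS API: `MessageStore`, `Server`, and the shared `registry` dictionary that tells each server which server hosts the store are hypothetical stand-ins, and direct method calls replace the network hops. The server names reuse the reference numerals from the figure purely as labels.

```python
from collections import deque


class MessageStore:
    """Stand-in for the single message store hosted on the master server."""

    def __init__(self):
        self._queue = deque()

    def put(self, message):
        self._queue.append(message)

    def take(self):
        return self._queue.popleft() if self._queue else None


class Server:
    def __init__(self, name, registry):
        self.name = name
        self.registry = registry     # shared view of which server hosts the store
        self.store = None            # set only on the master server

    def send(self, message):
        # Every server, the master included, goes through the master's store.
        self.registry["master"].store.put((self.name, message))

    def receive(self):
        return self.registry["master"].store.take()


registry = {}
s606, s614, s616 = (Server(n, registry) for n in ("606", "614", "616"))
s606.store = MessageStore()          # server 606 hosts the one and only store
registry["master"] = s606

s616.send("hello")
s614.send("world")
first_msg = s606.receive()
second_msg = s616.receive()
```

Since only one store exists, message order is the same no matter which server reads from it.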

FIG. 7 shows a block diagram 700 of a computer system that can be used as a component of the present invention or to implement the methods of the present invention. The computer system of FIG. 7 includes a processor unit 704 and main memory 702. Processor unit 704 can contain a single microprocessor, or it can contain a plurality of microprocessors so that the computer system is configured as a multi-processor system. Main memory 702 stores, in part, instructions and data for execution by processor unit 704. If the present invention is implemented wholly or partly in software, main memory 702 can store the executable code when in operation. Main memory 702 can include banks of dynamic random access memory (DRAM), high-speed cache memory, and other types of memory known in the art.

The system shown in FIG. 7 further includes a mass storage device 706, a peripheral device 708, a user input device 712, a portable storage medium drive 714, a graphics subsystem 718, and an output display 716. For simplicity, the components in FIG. 7 are shown connected by a single bus 720. It will be apparent to one of ordinary skill in the art, however, that the components can be connected through one or more data transport means. For example, processor unit 704 and main memory 702 can be connected by a local microprocessor bus, and mass storage device 706, peripheral device 708, portable storage medium drive 714, and graphics subsystem 718 can be connected by one or more input/output (I/O) buses. Mass storage device 706, which can be implemented with a magnetic disk drive, an optical disk drive, or other drives known in the art, is a non-volatile storage device for storing data and instructions for use by processor unit 704. In one embodiment, mass storage device 706 stores the software that implements the present invention so that it can be loaded into main memory 702.

Portable storage medium drive 714 operates with a portable non-volatile storage medium, such as a floppy disk, to input and output data and code to and from the computer system of FIG. 7. In one embodiment, the system software that implements the present invention is stored on such a portable medium and is input to the computer system through portable storage medium drive 714. Peripheral device 708 can include any type of computer support device, such as an input/output (I/O) interface, that adds functionality to the computer system. For example, peripheral device 708 can include a network interface for connecting the computer system to a network, as well as other networking hardware such as a modem, a router, or other hardware known in the art.

User input device 712 provides part of the user interface. User input device 712 can include an alphanumeric keypad for entering alphanumeric and other information, or a pointing device such as a mouse, trackball, stylus, or cursor direction keys. To display textual and graphical information, the computer system of FIG. 7 includes graphics subsystem 718 and output display 716. Output display 716 can include a cathode ray tube (CRT) display, a liquid crystal display (LCD), or another suitable display device. Graphics subsystem 718 receives textual and graphical information and processes it for output to display 716. In addition, the system of FIG. 7 includes an output device 710. Examples of suitable output devices include speakers, printers, network interfaces, monitors, and other output devices known in the art.

The components contained in the computer system of FIG. 7 are those typically found in computer systems suitable for use with certain embodiments of the present invention, and they are intended to represent a broad category of such computer components known in the art. The computer system of FIG. 7 can therefore be a personal computer, workstation, server, minicomputer, mainframe computer, or any other computer. Computer system 700 can also use different bus configurations, networked platforms, multi-processor platforms, and so on. Any suitable operating system can be used, including Unix, Linux, Windows, Macintosh OS, and Palm OS.

The foregoing description of the preferred embodiments of the present invention has been provided for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations will be apparent to one of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various embodiments and with the modifications that are suited to the particular use contemplated. The scope of the invention is defined by the following claims and their equivalents.

Claims

1. A system for managing objects on a network, comprising:

a plurality of network servers, each network server in communication with a network data source; and

a lead server in the plurality of network servers, the lead server containing a distributed consensus algorithm for selecting a master server from among the plurality of network servers, the master server containing an object related to a data item in the network data source, such that any of the plurality of network servers that needs to access the data item can access the object on the master server.

2. The system of claim 1, wherein the network servers are selected from the group consisting of hardware cluster servers and software cluster servers.

3. The system of claim 1, wherein the distributed consensus algorithm comprises rounds of messages between the lead server and the plurality of network servers, the rounds continuing until a majority of the plurality of network servers agree on the master server.

4. The system of claim 1, wherein the master server contains a data object, the data object comprising a copy of data from the network data source.

5. The system of claim 1, wherein the master server contains a data object that serves as the sole point of access to the data item in the network data source.

6. The system of claim 1, wherein the data item is a transaction record.

7. The system of claim 1, wherein the distributed consensus algorithm is the Paxos algorithm.

8. A system for managing objects on a network, comprising:

a lead server in said plurality of network servers, the lead server containing a distributed consensus algorithm for selecting a master server from among said plurality of network servers, the master server containing a copy of a data item located in a network data source, such that any of said plurality of network servers that needs to access the data item can access the copy on the master server.

9. A system for managing objects on a network, comprising:

a lead server in said plurality of network servers, the lead server containing a distributed consensus algorithm for selecting a master server from among said plurality of network servers, the master server containing the sole point of access to a data item located in a network data source, such that any of said plurality of network servers that needs to access the data item must do so through the master server.

10. A system for managing objects on a network, comprising:

a file system containing at least one copy of a data item;

a plurality of servers in communication with the file system;

a lead server in the plurality of servers, the lead server containing a distributed consensus algorithm for selecting a master server from among the plurality of servers; and

a master server in the plurality of servers, the master server containing a local copy of the data item, the master server allowing any of the plurality of servers to access the local copy, and updating the copy of the data item in the file system whenever the local copy is updated.

11. The system of claim 10, wherein the master server is further adapted to store the local copy in a local cache.

12. The system of claim 10, wherein the plurality of servers comprises a cluster.

13. The system of claim 10, wherein the file system replicates the data item on a plurality of disks.

14. A system for managing objects on a network, comprising:

a plurality of servers in communication with said file system;

a hardware cluster comprising hardware cluster servers in the plurality of servers, the hardware cluster containing an algorithm for selecting a lead server from among the hardware cluster servers;

a lead server in the hardware cluster servers, the lead server containing a distributed consensus algorithm for selecting a master server from among the plurality of servers; and

a master server in the plurality of servers, the master server containing a local copy of a data item, the master server allowing any of the plurality of servers to access the local copy, and updating the copy of the data item in the file system whenever the local copy is updated.

15. The system of claim 14, wherein the master server is in the hardware cluster.

16. A method for managing objects on a network, comprising:

selecting a master server from among a plurality of network servers using a distributed consensus algorithm;

transferring a copy of a data item from a file system to the master server; and

notifying the other network servers that the master server containing the copy of the data item is to be used in processing network requests.

17. The method of claim 16, further comprising the step of updating the data item in the file system when the copy on the master server is modified.

18. The method of claim 16, further comprising the step of restricting the other network servers to accessing the file system through the master server.

19. The method of claim 16, further comprising the step of ensuring that only one copy of the data item exists outside the file system.

20. The method of claim 16, further comprising the step of ensuring that a copy of the data item always exists outside the file system.

21. The method of claim 16, further comprising the step of selecting a new master server from among the plurality of network servers using the distributed consensus algorithm if the master server no longer hosts the object.

22. The method of claim 16, further comprising the step of, if the master server no longer hosts the object, transferring a copy of the data item from the file system to a new master server, the new master server being selected using the distributed consensus algorithm.

23. The method of claim 16, further comprising the step of, if the master server no longer hosts the object, notifying the other network servers that a new master server containing the data item is to be used in processing network requests.

24. A framework system for managing objects on a network, comprising:

a plurality of servers, each server capable of caching a data object;

a file system containing at least one copy of a data item;

a distributed consensus algorithm for selecting a master server from among the plurality of servers, the master server being used to cache a copy of the data object; and

a distribution system for notifying the servers on the network that the master server contains the copy of the data object.

25. A method for leasing an object to a server on a network, comprising:

leasing a data object to a master server, wherein the master server owns or hosts the data object for a lease period;

pulling a copy of a data item from a file system to the master server; and

notifying the other network servers that the master server contains the copy of the data item for use in processing network requests.

26. The method of claim 25, further comprising the step of periodically renewing the lease of the data object to the master server for another lease period.

27. The method of claim 25, further comprising the step of leasing the data object to a new master server for a lease period once the lease period on the master server has expired.

28. A method for leasing an object to a server on a network, comprising:

selecting a lead server from among a plurality of hardware cluster servers in a hardware cluster;

selecting a master server from among a plurality of network servers using a distributed consensus algorithm on the lead server;

assigning a data object to the master server, the assigned master server having sole access to the data item for a specified period;

29. A method for assigning ownership of an object on a network, comprising:

selecting a master server from among a plurality of network servers using a distributed consensus algorithm on said lead server;

assigning a data object to the master server, the assigned master server having sole access to the data object on the network;

30. A method for ensuring the existence of an object in a cluster, comprising:

providing access to a data object using a master server in a plurality of servers;

if the master server can no longer provide access to the data object, selecting a new master server from among the plurality of servers using a distributed consensus algorithm;

sending to the new master server the information it needs to provide access to the data item; and

notifying the other servers in the plurality of servers that the new master server provides access to the data object.

31. A method for distributing objects in a cluster, comprising:

selecting a master server from among a plurality of network servers using a distributed consensus algorithm;

assigning a data object to the master server, the assigned master server having sole access to a data item; and

multicasting a notice to the other servers in the cluster to notify them that the master server containing the copy of the data item is to be used in processing network requests.

32. A method for distributing objects in a cluster, comprising:

contacting each server in the cluster to determine whether a selected master server is acceptable to those servers; and

if all the servers in the cluster agree that the selected master server is acceptable, multicasting a notice to the other servers in the cluster to notify them that responsibility has been transferred to the selected new master server.

33. A system for distributing objects in a cluster, comprising:

means for selecting a master server from among a plurality of network servers using a distributed consensus algorithm;

means for pulling a copy of a data item from a file system to the master server; and

means for notifying the other network servers that the master server containing the copy of the data item is to be used in processing network requests.

34. A method for managing objects on a network, comprising:

selecting a master server from among a plurality of network servers in a software cluster using the Paxos algorithm; and

assigning a data object to the master server, wherein the master server provides sole access to the data object in the network.

35. The method of claim 34, further comprising the step of pulling the data for the data object from a file system.

36. The method of claim 34, further comprising the step of notifying the other network servers that the master server containing the object related to the data item is to be used in processing network requests.

37. The method of claim 34, further comprising the step of multicasting the identity of a new master server to the other network servers.

38. The method of claim 34, wherein the step of selecting a master server from among a plurality of network servers in a software cluster using the Paxos algorithm comprises multicasting round information to the other network servers.

39. A system for distributing objects in a cluster, comprising:

means for assigning a JMS object to a master server, the assigned master server providing sole access to the JMS object; and

means for notifying other network servers that the master server provides sole access to JMS.

40. A method for managing objects on a network, comprising:

assigning a JMS message store to a master server, the JMS message store existing only on the master server.

41. The method of claim 40, further comprising the step of notifying the other network servers that the master server hosts the JMS message store.

42. The method of claim 40, further comprising the step of multicasting the identity of a new master server to another network server.

43. The method of claim 40, wherein the step of selecting a master server from a plurality of network servers in a software cluster using a Paxos algorithm comprises multicasting round information to another network server.

44. A system for managing JMS objects on a network, comprising:

a plurality of network servers; and

a boot server among the plurality of network servers, the boot server containing a distributed consensus algorithm for selecting a master server from the plurality of network servers, the master server containing a JMS object such that any other of the plurality of network servers that needs to access the JMS object must access it on the master server.

45. The system of claim 44, wherein the network servers are selected from the group consisting of hardware cluster servers and software cluster servers.

46. The system of claim 44, wherein the distributed consensus algorithm comprises rounds of messages between the master server and the plurality of servers, the rounds continuing until a majority of the plurality of network servers agree on a master server.

47. A method for leasing a JMS object to a server on a network, comprising:

selecting a master server from a plurality of hardware cluster servers using a distributed consensus algorithm;

leasing the JMS object to the master server, wherein the master server owns or hosts the JMS object for the lease period; and

notifying the other network servers that the master server hosts the JMS object.

48. The method of claim 47, further comprising the step of periodically renewing the lease of the JMS object to the master server for another lease period.

49. The method of claim 48, further comprising the step of leasing the JMS object to a new master server for a lease period once the lease period on the master server has expired.

50. A method for leasing a JMS object to a server on a network, comprising:

selecting a master server from a plurality of network servers using a distributed consensus algorithm on the boot server;

assigning the JMS object to the master server, the assigned master server being able to provide the sole access to the JMS object for a specified period of time; and

notifying the other network servers that the master server hosts the JMS object.

51. A method for distributing objects on a network, comprising:

assigning a JMS object to a master server, the assigned master server being able to provide the sole access to the JMS object; and

notifying the other network servers that the master server hosts the JMS object.
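
By way of illustration only, the leasing behavior recited above (an object leased to a single master server for a lease period, periodic renewal, reassignment after expiry, and notification of the other servers) might be sketched as follows. This is a minimal single-process sketch, not the claimed implementation: the names `Cluster`, `Lease`, `lease`, `renew`, and `host_of` are hypothetical, the notification is simulated by updating an in-memory view per server rather than by a real multicast, and the selection of the master server by a distributed consensus algorithm is assumed to have happened elsewhere.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Lease:
    """Hypothetical record of which server hosts an object, and until when."""
    object_id: str
    master: str
    expires_at: float


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for a cluster of network servers."""
    servers: List[str]
    lease_period: float
    now: Callable[[], float]  # injected clock, so behavior is deterministic
    leases: Dict[str, Lease] = field(default_factory=dict)
    # What each server currently believes about who hosts each object.
    views: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def _notify(self, object_id: str, master: str) -> None:
        # Stand-in for a multicast: every server records the hosting master.
        for server in self.servers:
            self.views.setdefault(server, {})[object_id] = master

    def lease(self, object_id: str, master: str) -> bool:
        """Lease the object to `master` unless another server holds a live lease."""
        current = self.leases.get(object_id)
        if current and current.master != master and current.expires_at > self.now():
            return False
        self.leases[object_id] = Lease(object_id, master, self.now() + self.lease_period)
        self._notify(object_id, master)
        return True

    def renew(self, object_id: str, master: str) -> bool:
        """Extend a lease that is still live and held by `master`."""
        current = self.leases.get(object_id)
        if current is None or current.master != master or current.expires_at <= self.now():
            return False
        current.expires_at = self.now() + self.lease_period
        return True

    def host_of(self, object_id: str) -> Optional[str]:
        """Return the server holding a live lease on the object, if any."""
        current = self.leases.get(object_id)
        if current is None or current.expires_at <= self.now():
            return None
        return current.master
```

In this sketch, a second server cannot obtain the object while the first server's lease is live, which models the sole-access property; once the lease lapses without renewal, a new master server can take the lease and the other servers' views are updated.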