WO2016052665A1 - Distributed storage system - Google Patents
Distributed storage system
- Publication number
- WO2016052665A1 (PCT/JP2015/077853)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- node
- data
- drive
- redundant code
- nodes
- Prior art date
- Legal status
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1076—Parity data used in redundant arrays of independent storages, e.g. in RAID systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0614—Improving the reliability of storage systems
- G06F3/0619—Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/064—Management of blocks
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M13/00—Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
- H03M13/29—Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes combining two or more codes or code structures, e.g. product codes, generalised product codes, concatenated codes, inner and outer codes
- H03M13/2906—Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes combining two or more codes or code structures, e.g. product codes, generalised product codes, concatenated codes, inner and outer codes using block codes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2211/00—Indexing scheme relating to details of data-processing equipment not covered by groups G06F3/00 - G06F13/00
- G06F2211/10—Indexing scheme relating to G06F11/10
- G06F2211/1002—Indexing scheme relating to G06F11/1076
- G06F2211/1028—Distributed, i.e. distributed RAID systems with parity
Definitions
- the present invention relates to a distributed storage system.
- a Server SAN type storage system that creates a storage pool by connecting a large number of general-purpose servers via a network is expected to spread in the future.
- the ServerSAN storage system is considered to be an effective solution in a system that aims at high-performance analysis by installing a high-speed SSD in a server node for large-scale big data analysis or the like.
- US Pat. No. 7,546,342 (Patent Document 1) exists as background art in this technical field.
- The publication states: "The relative importance of each file associated with a website is calculated. Using this relative importance, a plurality of subsets of content to be distributed to multiple computing devices in a computer cluster, such as a server array or a peer-to-peer network, is computed, the subsets including encoded messages created using an erasure coding scheme in packets that contain portions of one or more files. When a file is retrieved, a certain number of clearly identifiable encoded messages are obtained from the devices based on this scheme, and the file is recreated from these messages. Sites are thereby retrieved significantly faster, and reliability is improved without requiring every computing device to have a large amount of storage or bandwidth" (see Abstract).
- A conventional Server SAN type storage system uses a local storage device directly connected to each server node as the final storage location, and protects data by distributing write data and its redundant data to a plurality of server nodes. Specifically, the write data from the host is divided into a plurality of data blocks, a redundant code is generated from the divided data blocks by erasure coding, and the divided data blocks and the redundant code are evenly distributed to a plurality of server nodes.
- the conventional Server SAN type storage system distributes the write data received from the host to a plurality of server nodes. Therefore, when the application program reads data from the Server SAN type storage, the data block is transferred on the network between the server nodes. Therefore, the throughput of the network becomes a bottleneck, and the access latency to the data may increase as compared with the case where the data does not pass through the network.
- A typical example of the present invention is a distributed storage system that includes a plurality of nodes communicating via a network and a plurality of storage devices, and in which a first node group including at least three nodes is defined in advance.
- Each node of the first node group transmits data stored in the storage device it manages to another node belonging to the first node group. A first node of the first node group generates a redundant code using a combination of data received from two or more other nodes of the first node group, and stores the generated redundant code in a storage device different from the storage devices that store the data from which it was generated. Among the redundant codes generated by the first node, at least two redundant codes are generated from data combinations whose combinations of logical addresses differ from each other.
- An overview of write processing in a distributed storage system is shown.
- An example of mapping images of a plurality of protection layers in a distributed storage system is shown.
- An example of a system configuration of a distributed storage system is shown.
- Information for controlling the distributed storage system is shown.
- the structural example of a virtual volume management table is shown.
- the structural example of a pool volume management table is shown.
- the structural example of a drive management table is shown.
- the structural example of a drive status management table is shown.
- the structural example of a node state management table is shown.
- the structural example of a site state management table is shown.
- the structural example of a page mapping table is shown.
- the structural example of a page load frequency table is shown.
- the structural example of a page load distribution table is shown.
- the structural example of a static mapping table is shown.
- the structural example of a geostatic mapping table is shown.
- the structural example of a consistent hashing table is shown.
- the structural example of a log structure mapping table is shown.
- A configuration example of the local area control table 214 is shown.
- An example of cache information is shown.
- the mapping image of a site protection layer is shown.
- The node state transitions in a distributed storage system are shown.
- The site state transitions in a distributed storage system are shown.
- An example of the logical configuration of the virtual provisioning layer in one node of the distributed storage system is shown.
- An example of page mapping of a plurality of nodes in a distributed storage system is shown.
- A flowchart of read processing of a distributed storage system is shown.
- A flowchart of synchronous write processing is shown.
- A flowchart of asynchronous write processing is shown.
- the flowchart of a destage process is shown.
- the flowchart of the process of capacity depletion management is shown.
- the concept of capacity depletion management processing is shown.
- The flowchart of an evacuation rebuild process is shown.
- the flowchart of a data resync process is shown.
- the flowchart of a rearrangement and a rebalance process is shown.
- An example of the determination method of the self threshold value in rearrangement is shown.
- An example of the determination method of the self threshold value in rearrangement is shown.
- The flowchart of a structure change process is shown.
- An example of stripe type addition and stripe rearrangement when a node is added will be described.
- Embodiment 2 shows an example of a hardware configuration of a distributed storage system.
- Embodiment 2 shows a method for improving the efficiency of transfer between nodes for redundancy.
- a data restoration method in the method for improving the efficiency of transfer between nodes for redundancy described with reference to FIG. 29 will be described.
- In Embodiment 3, an example of a hardware configuration of a distributed storage system is shown.
- the table structure managed with a drive for control of a storage system is shown.
- a communication interface between a computer node and a flash drive is shown.
- In Embodiment 3, the flowchart of the process in which a computer node reads the newest data from the D drive is shown.
- In Embodiment 3, old data read processing is shown.
- In Embodiment 3, the flowchart of the process in which a computer node writes data to the D drive is shown.
- In Embodiment 3, the processing flow when data write processing to each drive is performed in parallel in synchronous write processing is shown.
- In Embodiment 3, the flowchart of a garbage collection process is shown.
- In Embodiment 4, an example of a hardware configuration of a distributed storage system is shown.
- This embodiment discloses a distributed storage system.
- the distributed storage system is configured by connecting a plurality of computer nodes each including a storage device via a network.
- the distributed storage system realizes a virtual storage system that realizes a storage pool with storage devices of a plurality of computer nodes.
- The computer node stores the write data from the host in its own storage device, and further transfers the write data to another computer node in order to protect the data when the computer node fails.
- the other computer node is called a transfer destination computer node.
- the transfer destination computer node generates a redundancy code from the write data transferred from a plurality of different computer nodes.
- the transfer destination computer node stores the generated redundant code in its own storage device.
- When an analysis application is operated by the distributed storage system according to the present invention, each computer node often stores analysis target data in the storage area of its own node. This reduces load time for data analysis, improves business agility, and reduces storage costs.
- the distributed storage system provides a virtual volume to the host.
- the distributed storage system allocates logical pages from the pool volume to virtual pages that have been write-accessed.
- the pool volume is a logical volume, and the physical storage area of the storage device is assigned to the logical storage area of the pool volume.
- the computer node selects a virtual page to which a logical page is allocated from its own storage device based on the network bandwidth of the distributed storage system and the access frequency to the computer node for each virtual page from the host. For example, the computer node determines a threshold value based on the network bandwidth of the distributed storage system, and places a logical page having a higher access frequency than the threshold value in its own storage device. As a result, it is possible to realize a page layout that allows high-speed page access while avoiding a network bottleneck.
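- As a rough sketch of the placement decision described above (the threshold derivation, function names, and units are illustrative assumptions, not the patent's method):

```python
# Hypothetical sketch: place pages hotter than a network-derived threshold on
# the local storage device, and allow colder pages to live on other nodes.

def placement_threshold(network_bandwidth_mb_s: float, page_size_mb: float) -> float:
    """Access-frequency threshold (I/O per second) assumed to be derived from
    the network bandwidth; the concrete formula is an illustrative choice."""
    return network_bandwidth_mb_s / page_size_mb

def choose_location(page_iops: float, threshold: float) -> str:
    # Pages accessed more often than the threshold stay on the own node.
    return "own_node" if page_iops > threshold else "other_node_allowed"

threshold = placement_threshold(network_bandwidth_mb_s=1000.0, page_size_mb=42.0)
print(choose_location(page_iops=50.0, threshold=threshold))   # -> 'own_node'
```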
- the computer node has an interface for an application program and a user to specify the location of the virtual page.
- the virtual page is specified by a logical address of the virtual volume, for example.
- the location of the virtual page is indicated by a computer node in which data of the virtual page is stored.
- The distributed storage system can include all of the plurality of configuration examples described in this embodiment at the same time, or may include only a part of them.
- the storage device includes one storage drive such as one HDD or SSD, a RAID device including a plurality of storage drives, and a plurality of RAID devices.
- a stripe or stripe data is a data unit from which a redundant code for data protection is generated.
- the stripe is sometimes referred to as user data in order to differentiate it from the redundant code.
- the stripe is stored in a storage device in the computer node and is used in generating redundant codes in other computer nodes.
- the stripe type is a stripe class that generates a redundant code.
- the stripe type to which the stripe belongs is determined by, for example, the logical address of the stripe and the computer node that stores the stripe.
- the stripe type number which is the stripe type identifier, indicates a group of corresponding computer nodes.
- One stripe can belong to each stripe type of different protection layers.
- the host is a computer that accesses the storage system, a processor that operates on the computer, or a program that is executed by the processor.
- FIG. 1 shows an outline of write processing of a distributed storage system according to an example of this embodiment.
- the computer nodes 101A, 101B, and 101C are included in the same computer domain (hereinafter also referred to as a domain). In the example described below, it is assumed that a domain is associated with a site.
- the computer node 101D and the computer node 101E are arranged at a different site from other computer nodes.
- the computer nodes 101A to 101E communicate via a network.
- the computer node is also simply referred to as a node.
- Each computer node of the nodes 101A to 101E includes a cache 181 and a storage drive 113.
- Each of the nodes 101A to 101E provides a volume 1303.
- the node 101A stores the write data DATA1 (1501A) received from the host in its own cache 181 and further stores it in its own storage drive 113.
- the write data DATA1 is a stripe.
- the node 101A generates a node redundancy code P from the write data DATA1 and stores it in the own storage drive 113.
- the node redundancy code is a redundancy code generated from a data unit stored in the own storage device, and is indicated by a symbol P.
- the node 101A transfers the write data DATA1 in its own cache 181 to the cache 181 of the other node 101B.
- the node 101C stores the write data DATA2 (1501B) received from the external device in its own cache 181 and further stores it in its own storage drive 113.
- the write data DATA2 is a stripe.
- the node 101C generates an intra-node redundancy code P from the write data DATA2 and stores it in the own storage drive 113.
- the node 101C transfers the write data DATA2 in its own cache 181 to the cache 181 of the other node 101B.
- the node 101B generates a site redundancy code Q (1502B) from the data DATA1 and DATA2 stored in the local cache 181 and stores it in the local storage drive 113 in order to protect the data in the event of a node failure.
- the site redundancy code is an inter-node redundancy code in the site, and is indicated by a symbol Q.
- the site redundancy code Q belongs to a different protection layer from the node redundancy code P.
- the node 101E stores the write data DATA3 (1501C) received from the host in its own cache 181 and further stores it in its own storage drive 113.
- the write data DATA3 is a stripe.
- the node 101E generates a node redundancy code P from the write data DATA3 and stores it in the own storage drive 113.
- the node 101A transfers the write data DATA1 in its own cache 181 to the cache 181 of the other node 101D.
- the node 101E transfers the write data DATA3 in its own cache 181 to the cache 181 of another node 101D.
- the node 101D generates a geo-redundant code R (1502C) from the data DATA1 and DATA3 stored in the local cache 181 and stores it in the local storage drive 113 for data protection in the event of a node failure.
- the geo-redundant code is a redundant code between nodes at different sites, and is indicated by the symbol R.
- the geo redundancy code R belongs to a different protection layer from the node redundancy code P and the site redundancy code Q.
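- A minimal sketch of the write flow in FIG. 1, assuming XOR as the redundancy operation; the class and method names (Node, host_write, generate_site_code) are illustrative, not the patent's interfaces:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks into a single redundancy code."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

class Node:
    def __init__(self, name):
        self.name = name
        self.cache = []    # stripes received from other nodes (cache 181)
        self.drive = {}    # simplified local storage (drive 113)

    def host_write(self, key, data, code_node):
        # Store the stripe locally without dividing it, then transfer it to the
        # cache of the node that will generate the site redundancy code Q.
        self.drive[key] = data
        code_node.cache.append(data)

    def generate_site_code(self, key):
        # Site redundancy code Q from stripes of two or more different nodes.
        self.drive[("Q", key)] = xor_blocks(self.cache)
        self.cache.clear()

# Nodes 101A and 101C write DATA1/DATA2; node 101B derives Q from both stripes.
a, b, c = Node("101A"), Node("101B"), Node("101C")
a.host_write("DATA1", b"\x01\x02\x03\x04", code_node=b)
c.host_write("DATA2", b"\x10\x20\x30\x40", code_node=b)
b.generate_site_code("Q1")
```

- The intra-node code P and the geo-redundant code R would follow the same pattern at their respective protection layers.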
- FIG. 2 shows an example of mapping images of multiple protection layers in a distributed storage system.
- FIG. 2 shows an image of providing redundancy between sites while also providing redundancy between nodes at the same site. For example, a first level of redundancy is achieved between the nodes in a data center, and redundancy is further established between different bases, so that data is protected in multiple layers and system reliability is improved.
- FIG. 2 only some elements are indicated by reference numerals, and reference numerals of the same type of elements are partially omitted.
- In FIG. 2, square prisms represent nodes, dashed rectangles represent sites (domains), and rectangles within the nodes represent stripes or stripe addresses (data positions).
- FIG. 2 shows four sites 102 where four nodes are located at each site.
- FIG. 2 does not show redundant codes generated from a plurality of stripes.
- the number (X_Y) in the stripe 103 indicates the identifier of the stripe type to which the stripe 103 belongs.
- X is an identifier of an inter-node stripe type (site stripe type) in the site
- Y is an identifier of an inter-site stripe type (geo stripe type).
- One stripe 103 belongs to one site stripe type and one geo stripe type.
- the stripe 1_A stored in the node 101A1 belongs to the site stripe type 1001 and the geo stripe type 1002.
- the stripes belonging to the site stripe type 1001 are the stripe 1_A of the node 101A1, the stripe 1_D of the node 101A2, and the stripe 1_C of the node 101A3.
- the node 101A4 in the same site that does not hold these stripes generates and holds redundant codes of these stripes.
- the stripes belonging to the geo stripe type 1002 are the stripe 1_A of the node 101A1, the stripe 1_A of the node 101B1, and the stripe 2_A of the node 101C2.
- the node 101D4 at a site different from these nodes generates and holds redundant codes of these stripes.
- A stripe (data unit) held by each of the plurality of nodes is transferred to one transfer destination node, and the transfer destination node generates and holds a redundant code from the transferred data units. The stripes and the redundant code are stored in different nodes, and data protection against node failure is thereby realized.
- the node that has received the host command transmits the received write data to another node without reading the old data in order to generate the site redundancy code or the geo redundancy code. Therefore, the response performance to the write command is improved.
- stripe movement for generating redundant codes is performed between caches, and the drive 113 does not intervene. Therefore, when the drive 113 uses a flash medium, the life can be improved by reducing the write amount.
- Since the node stores the stripe received from the host in its own storage device without dividing it, the read response time and network traffic are reduced. Also, transfer of redundant codes is not required, which further reduces network traffic.
- the distributed storage system may be configured with a single protection layer that generates only inter-node redundancy codes within a site or between sites.
- FIG. 3 shows a system configuration example of a distributed storage system.
- the node 101 has, for example, a general server computer configuration.
- the hardware configuration of the node 101 is not particularly limited.
- the node 101 is connected to another node 101 via the network 103 through the port 106.
- the network 103 is configured by, for example, InfiniBand or Ethernet.
- a plurality of nodes 101 form a domain 102.
- the domain 102 may correspond to, for example, a geographical region, or may correspond to the topology of the virtual or physical network 103.
- the network 104 connects a plurality of domains 102. In the following, it is assumed that a domain is associated with a geographically distant site.
- In the internal configuration of the node 101, a port 106, a processor package 111, and disk drives (hereinafter also referred to as drives) 113 are connected via an internal network 112.
- the processor package 111 includes a memory 118 and a processor 119.
- the memory 118 stores information necessary for control when the processor 119 processes read and write commands and executes a storage function, and also stores storage cache data.
- the memory 118 stores a program executed by the processor 119, for example.
- the memory 118 may be a volatile DRAM or a non-volatile SCM (Storage Class Memory).
- The drive 113 is, for example, a hard disk drive having an interface such as FC (Fibre Channel), SAS (Serial Attached SCSI), or SATA (Serial Advanced Technology Attachment), or an SSD (Solid State Drive).
- An SCM such as NAND, PRAM, or ReRAM may be used, or a volatile memory may be used.
- the storage device may be made nonvolatile by a battery.
- the various types of drives described above have different performance.
- the throughput performance of SSD is higher than that of HDD.
- the node 101 includes a plurality of types of drives 113.
- the node 101 classifies different types of drives 113 into drive groups having similar performances, and forms tiers 115 and 116.
- the relationship between hierarchies is defined by the performance of the hierarchies.
- the performance includes access performance and fault tolerance performance.
- the access performance of the hierarchy decreases in the order of Tier1, Tier2, and Tier3.
- each of the drive groups in each layer constitutes a RAID. Note that the number of layers illustrated in FIG. 3 is two, but the number of layers depends on the design. Further, a high access performance layer may be used as a cache.
- a drive, a RAID, a hierarchy, and a set thereof are each a storage device.
- FIG. 4 shows information for controlling the distributed storage system.
- the memory 118 stores various programs including a storage program that implements a storage function, an OS, and an interface program.
- the memory 118 may further store an application program that executes a job.
- Protection layer information 201 is information related to data protection.
- the virtual provisioning information 202 is information related to virtual volume provisioning.
- the cache information 204 is information regarding the cache 181.
- the configuration information 203 is information regarding the configuration of the distributed storage system.
- the protection layer information 201 includes static mapping tables 210, 211, and 212 for protection layer numbers 1, 2, and 3, respectively.
- the protection layer information 201 further includes a log structured mapping table 213 and a local area control table 214.
- the virtual provisioning information 202 includes a page mapping table 215, a page load frequency table 216, and a page load distribution table 217.
- the configuration information 203 includes a virtual volume management table 218, a pool volume management table 219, and a drive management table 220.
- the configuration information 203 further includes a drive status management table 221, a node status management table 222, and a site status management table 223.
- a copy of all or part of the information described above may be stored in the drive 113 synchronously or asynchronously.
- the node 101 may hold the above information for each pool, for example.
- One pool is composed of one or a plurality of logical volumes. This logical volume is also called a pool volume.
- One pool is composed of one or a plurality of hierarchies. In the example described below, the pool is composed of three layers, that is, a pool volume of three layers.
- the entity of the pool volume is a storage area of the drive 113.
- the pool volume can be assigned a storage area of a drive of another node 101.
- In each table indicating information held by the node 101, blank cells are cells from which data is omitted.
- “0x” indicates a hexadecimal number.
- the drive number is unique within the node, and the node number is unique within the site.
- the site number is unique within the system.
- FIG. 5A to 5F show configuration examples of tables indicating information included in the configuration information 203.
- FIG. 5A to 5C show management information of different storage resource types.
- FIG. 5A shows a configuration example of the virtual volume management table 218.
- the virtual volume management table 218 indicates virtual volume information.
- the virtual volume management table 218 indicates information on a virtual volume provided by the node 101 that holds the information 218.
- the node 101 receives access to the virtual volume to be provided.
- the virtual volume management table 218 may hold information on virtual volumes whose own node is not the owner in case of failure.
- the virtual volume management table 218 includes a list of size (capacity) of each virtual volume and a node number of a node (owner node) that provides each virtual volume. Furthermore, it includes information indicating whether the generation and writing of the redundant code of each protection layer is synchronous or asynchronous with the writing of the write data to the own storage device.
- the size of the virtual volume indicates not the total amount of allocated logical pages but the virtual capacity (maximum capacity) of the virtual volume. Synchronous / asynchronous information is given for each protection layer.
- FIG. 5B shows a configuration example of the pool volume management table 219.
- the pool volume management table 219 indicates pool volume information.
- the pool volume management table 219 indicates information on the pool volume provided by the node 101 that holds the information 219 and the other node 101 to which the node 101 belongs.
- the pool volume management table 219 includes information on the size (capacity) of each pool volume and the node number of the node that provides each pool volume.
- FIG. 5C shows a configuration example of the drive management table 220.
- the drive management table 220 indicates a drive assigned to each pool volume.
- the drive management table 220 indicates information of the local drive 113 included in the node 101 that holds the information 220.
- the drive management table 220 shows, for each pool volume, the type of drive (SSD, NL-SAS drive, etc.), the set of striped drive numbers (set of drive numbers constituting RAID), and the drive size (capacity). Have information. When striping is not performed, only one drive is allocated to the pool volume. Note that different areas of one drive can be assigned to different pool volumes.
- FIGS. 5D to 5F show failure management information in the distributed storage system held by each of the nodes 101.
- FIG. 5D shows a configuration example of the drive status management table 221.
- the drive status management table 221 shows the status and error count of each local drive 113 in the node 101.
- FIG. 5E shows a configuration example of the node state management table 222.
- The node state management table 222 shows the status and error count of each of the other nodes 101 in the own site 102.
- The own site 102 of the node 101 is the site 102 to which the node 101 belongs.
- When the node 101 detects a communication error with another node 101 in the own site 102, it increments the corresponding error count.
- FIG. 5F shows a configuration example of the site state management table 223.
- The site state management table 223 shows the state and error count for each site.
- the node 101 can communicate only with the representative node of the other site 102. Therefore, the error of the representative node 101 means an error of the site.
- When the processor 119 of the node 101 detects an error in communication with its own drive 113 or another node 101, it increments the error count in the management information 221 to 223 that it holds.
- When the error count of any hardware resource (drive, node, or site) reaches a first threshold, the processor 119 changes the state of that resource from the normal state to the warning state. Further, when the error count reaches a second threshold, the processor 119 changes the state of the resource from the warning state to the blocked state.
- the warning state and the blocking state are abnormal states.
- When each node 101 detects an abnormal state of any hardware resource, it notifies the other nodes 101 of the information. Specifically, the node 101 notifies all the nodes 101 in its own site 102 and the representative node 101 of each other site 102, and the representative node 101 notifies the information to the other nodes in its own site 102. Thereby, information on hardware resources in an abnormal state can be shared between the nodes. Information on drives in an abnormal state need not be shared between the nodes.
- Node 101 may share error count information. For example, when each node 101 detects a communication error with another node or another site, it updates its own management information and broadcasts the update information to the other node 101. The node 101 may determine the abnormal state based on the error count of the other node 101 in addition to the error count of the own node.
- the node 101 may count communication errors with the node 101 of the other site 102.
- the site error count is, for example, the total error count of all nodes 101 in the site 102.
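- A hedged sketch of the error-count handling described above; the concrete threshold values and the per-resource bookkeeping are assumptions, only the normal, warning, and blocked progression is from the description:

```python
NORMAL, WARNING, BLOCKED = "normal", "warning", "blocked"

class ResourceStatus:
    """State and error count of one drive / node / site (tables 221-223)."""
    def __init__(self, first_threshold=3, second_threshold=10):   # assumed values
        self.error_count = 0
        self.state = NORMAL
        self.first_threshold = first_threshold
        self.second_threshold = second_threshold

    def record_error(self):
        self.error_count += 1
        if self.error_count >= self.second_threshold:
            self.state = BLOCKED      # abnormal state: resource is closed off
        elif self.error_count >= self.first_threshold:
            self.state = WARNING      # abnormal state: shared with other nodes

drive_status = {0x43: ResourceStatus()}
drive_status[0x43].record_error()
print(drive_status[0x43].state)       # -> 'normal' (still below the first threshold)
```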
- FIGS. 6A to 6C show information included in the virtual provisioning information 202.
- FIG. 6A shows a configuration example of the page mapping table 215.
- the page mapping table 215 holds the correspondence between the virtual page of the virtual volume and the logical page of the pool volume.
- the page mapping table 215 holds virtual volume information provided by the node 101 that holds the information 215.
- the virtual page may be allocated to the logical page of the pool volume 1303B of the other node 101 via the own-system pool volume 1303C described later or directly.
- the page mapping table 215 shows the relationship between the virtual page and the own pool volume 1303C or the pool volume 1303B of the other node 101.
- the page mapping table 215 holds the first LBA (Logical Block Address) and address range of the virtual page of the virtual volume, and the first LBA of the logical page of the pool volume corresponding to the first LBA of the virtual page.
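- A simple sketch of the lookup implied by the page mapping table 215, assuming fixed-size virtual pages and a dict keyed by the first LBA of each page; the page size and volume names are illustrative:

```python
PAGE_SIZE = 0x100   # address range of one virtual page (illustrative)

# virtual page head LBA -> (pool volume, logical page head LBA)
page_mapping = {
    0x0000: ("own_pool_volume_1303C", 0x0400),
    0x0100: ("other_node_pool_volume_1303B", 0x0800),
}

def resolve(virtual_lba: int):
    """Translate a virtual-volume LBA into a pool-volume address."""
    page_head = virtual_lba - (virtual_lba % PAGE_SIZE)
    pool_volume, logical_head = page_mapping[page_head]
    return pool_volume, logical_head + (virtual_lba - page_head)

print(resolve(0x0123))  # -> ('other_node_pool_volume_1303B', 0x0823)
```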
- FIG. 6B shows a configuration example of the page load frequency table 216.
- the page load frequency table 216 holds a history of I / O frequency (access frequency) for each virtual page. Specifically, the top LBA and address range of the virtual page of the virtual volume and the access frequency for the area are held.
- the page load frequency table 216 holds information on virtual pages to which logical pages for storing user data (write data) from the pool volume are allocated. Therefore, the page load frequency table 216 also indicates the access frequency of the logical page assigned to the virtual page.
- the page load frequency table 216 holds virtual volume information provided by the node 101 that holds the table 216. In addition, the page load frequency table 216 holds information on accesses received by the node holding the table 216 from the own node and other nodes.
- the access frequency information may be acquired and managed for each access source node, or may be acquired and managed separately for read access and write access.
- the node 101 may acquire and manage access frequency information by separating sequential access and random access, or may acquire and manage access frequency information in a plurality of measurement cycles.
- FIG. 6C shows a configuration example of the page load distribution table 217.
- the page load distribution table 217 divides the access frequency for each virtual page into a plurality of levels and indicates the page amount for each level. That is, the page load distribution table 217 indicates the distribution of the page amount with respect to the access frequency (I / O frequency).
- the page load distribution table 217 shows a history of page load distribution.
- the node 101 holds a page load distribution table 217 for each of a plurality of protection layers.
- The page load distribution table 217 may hold information on the access frequency of each page within the node, information on the access frequency of each page across all nodes in the site, and information on the access frequency of each page across all nodes in the system spanning multiple sites.
- the node 101 can generate the page load distribution table 217 from the page load frequency table 216 acquired from its own node and other nodes.
- each of the plurality of nodes 101 receives access to the same virtual page. Therefore, the total access to one virtual page in all owner nodes of the virtual volume indicates all accesses to the virtual page.
- the page load distribution table 217 has a smaller amount of information than the page load frequency table 216 and basically does not depend on the storage capacity (logical page amount) of the node 101. Therefore, the page load distribution table 217 can be shared among many nodes 101. Furthermore, by adding the number of pages for each access frequency level in the plurality of nodes 101, a page load distribution across the plurality of nodes 101 such as page load distribution information of the entire site or the entire system can be generated. A page load distribution table 217 may be created for each access source node 101.
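- A sketch of how per-page frequencies (table 216) could be reduced to the level-based distribution of table 217 and summed across nodes; the level boundaries are assumed values:

```python
from collections import Counter

LEVELS = [10, 100, 1000]     # I/O-frequency boundaries between levels (assumed)

def frequency_level(iops: float) -> int:
    for level, bound in enumerate(LEVELS):
        if iops < bound:
            return level
    return len(LEVELS)

def load_distribution(page_frequencies):
    """page_frequencies: iterable of per-page access frequencies on one node."""
    return Counter(frequency_level(f) for f in page_frequencies)

# Distributions of several nodes can simply be added to obtain the site- or
# system-wide distribution, which is why table 217 is cheap to share.
node_a = load_distribution([5, 50, 2000, 120])
node_b = load_distribution([7, 8, 300])
site_total = node_a + node_b
print(site_total)
```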
- It is efficient to construct the page load frequency table 216 as two types of lists: a top list of pages with high access frequency (high load), for example using the Lossy Count Method, and an access frequency (page load) list for divided areas obtained by dividing the storage area of a node or a group of nodes into a predetermined number of sections. With only the top list of high-load pages, when the random load often found in OLTP databases spreads widely, the top list may become saturated and pages that should be included in the list may be missed.
- The node 101 may have history tables 216 and 217 for each predetermined period (for example, one week). Although this example describes a mapping table managed by LBA in block storage, the node 101 can hold similar information in generally known file storage (for example, NFS / CIFS: Network File System / Common Internet File System) or object storage (for example, REST: Representational State Transfer).
- the management information may associate the file with the page, or may associate a small area obtained by dividing the file with the page.
- The management information may associate the object with the page, or may associate a small area obtained by dividing the object with the page.
- FIGS. 7A to 7C show examples of static mapping tables in the data protection layer information 201.
- The protection layer number 1 is a protection layer within the node 101, and each node 101 holds the node static mapping table 210 of its own node. The illustration of the node static mapping table 210 is omitted.
- the tables in FIGS. 7A to 7C are held in, for example, the node 101 that belongs to the site with the site number 0 and has the node number 0.
- FIG. 7A shows a configuration example of the static mapping table 211 of protection layer number 2 (site).
- the site static mapping table 211 is information shared between the nodes 101 in the site 102.
- The site static mapping table 211 holds, for each site stripe type number, the relationship between the node numbers of the data nodes that store the corresponding stripes (user data / write data) and the node numbers of the redundant code nodes that store the redundant codes generated from those stripes.
- the site stripe type number is identification information of the stripe type in the site.
- the stripe type is a stripe class, and one or a plurality of redundant codes are generated from a plurality of stripes in the stripe type.
- a stripe is a data unit having a predetermined size.
- the stripe type number also indicates a group of nodes 101 that stores user data and redundant codes included in the stripe type.
- A redundant code is generated from multiple stripes from different data nodes belonging to the site stripe type.
- two redundant codes are generated and stored in different nodes 101, respectively.
- the number of redundant codes depends on the design.
- the multiple redundant codes are generated by, for example, Erasure Coding.
- the site static mapping table 211 may be shared between sites if there are no memory restrictions or security restrictions.
- one stripe belongs to a single site stripe type.
- stripes stored by one node may belong to different stripe types.
- For example, a certain stripe stored in the node 0x00 belongs to the site stripe type 0x0000, and another stripe belongs to the site stripe type 0x0001.
- the geo static mapping table 212A basically has the same configuration as the site static mapping table 211.
- the geo-static mapping table 212A is shared between sites.
- the geo static mapping table 212A holds the relationship between the site number of the data site where the corresponding stripe is arranged and the site number of the redundant code site where the redundant code is arranged for each geo stripe type number.
- One node 101 at each data site stores the stripe. Further, one node 101 in each redundant code site stores the redundant code.
- the consistent hashing table 212B indicates information for specifying the node 101 that stores the redundant code in the redundant code site. Each site 102 maintains its own consistent hashing table 212B. The information in the consistent hashing table 212B varies from site to site.
- The consistent hashing table 212B indicates, for each node 101 in the redundant code site, the relationship between the node number, the hash value for which that node 101 stores the redundant code (1), and the hash value for which that node 101 stores the redundant code (2).
- the hash value is calculated based on the information regarding the transfer source transferred from the other site 102 together with the stripe. The stripe is transferred to the node 101 associated with the calculated hash value, and the transfer destination node 101 generates and stores a redundant code.
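- A sketch of the node selection implied by the consistent hashing table 212B; hashing the transfer-source identifiers onto a ring of node points is an assumed realization, the patent only states that the stripe is transferred to the node associated with the calculated hash value:

```python
import bisect
import hashlib

class ConsistentHashingTable:
    def __init__(self, node_numbers, points_per_node=4):
        self._ring = []                       # sorted (hash value, node number)
        for node in node_numbers:
            for i in range(points_per_node):
                self._ring.append((self._hash(f"{node}:{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, source_site: int, source_node: int) -> int:
        """Pick the node that should generate and store the redundant code."""
        h = self._hash(f"{source_site}/{source_node}")
        i = bisect.bisect(self._ring, (h, -1)) % len(self._ring)
        return self._ring[i][1]

table = ConsistentHashingTable(node_numbers=[0, 1, 2, 3])
print(table.node_for(source_site=2, source_node=0x01))
```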
- the static mapping table described with reference to FIGS. 7A to 7C is changed when the storage location of user data (stripes) and redundant codes is changed in the spare area in the event of a node / site failure. Furthermore, it is changed when nodes or sites are added / removed.
- the nodes 101 may share the same calculation logic so that the static mapping table can be uniquely changed from the information of the failed node / site. As a result, the node 101 does not need to multicast the static mapping table after changing the held static mapping table, and the load on the network can be reduced.
- FIG. 8 shows a configuration example of the log structured mapping table 213 in the data protection layer information 201.
- the arrow represents a pointer.
- the log structured mapping table 213 includes a data mapping table 701, a geo / site / node code mapping table 702, and an inverse mapping table 703.
- the data mapping table 701 manages user data (stripes) stored in the own storage device (drive 113) by the node 101 that holds the table 701.
- the node 101 can know the storage address (physical address) of the drive 113 (physical storage device) that stores the stripe from the storage address (logical address) of the pool volume of the stripe.
- the data mapping table 701 associates the storage address (logical address) of user data (stripe) in the pool volume with the storage address (physical address) in the physical storage area of the drive 113.
- The storage address of a stripe in the pool volume is specified by the LDEV number of the pool volume and the stripe number of the stripe, and each block in the stripe is specified by the LBA offset.
- the stripe size is constant.
- the stripe number is calculated by, for example, floor (LBA / stripe length).
- the storage address of the physical storage area is specified by the drive number, LBA, and data length.
- the data mapping table 701 indicates that the data of LDEV number 0, stripe number 0, and LBA offset 0 in stripe is stored in the area of drive number 0x43, LBA0x0003, and data length 8. Further, it is indicated that the data of LDEV number 0, stripe number 0, and LBA offset 1 within stripe is stored in the area of drive number 0x42, LBA0x0007, and data length 8.
- the physical storage area further stores information indicating the state of stored data.
- the status information indicates whether the data has been copied (transferred) to the corresponding redundant code node.
- The write data (stripes) is transferred to the redundant code node for redundant code generation synchronously or asynchronously with the host write processing of the write data, according to the Sync/Async setting.
- The redundant code mapping table 702 manages the redundant codes stored in the own storage device (drive 113) by the node 101 holding the table 702.
- The managed redundant codes include the inter-site redundant code (geo redundant code R), the intra-site redundant code (site redundant code Q), and the intra-node redundant code (node redundant code P).
- the node 101 can know the physical address of the redundancy code of the stripe from the logical address of the pool volume storing the stripe.
- the redundant code mapping table 702 associates the logical address of the striped pool volume used for generating the redundant code with the physical address of the physical storage area of the drive 113 (own storage device).
- the redundant code is generated by an operation (for example, xor) based on a plurality of stripes. Therefore, logical addresses of a plurality of stripes are associated with the physical address of the redundant code.
- FIG. 8 shows an example of generating one redundant code from two stripes.
- the redundant code mapping table 702 shows the relationship between the physical address of one geo-redundant code and the logical addresses of the two stripes used to generate the geo-redundant code.
- the logical address of the stripe is indicated by an identifier of a site, a node, a pool volume, and an in-volume address.
- the geo-redundant code is stored in two address areas (blocks) in the physical storage area.
- For example, a block of the geo-redundant code is stored in the area of drive number 0x40, LBA 0x0020, and data length 8.
- the distributed storage system in this example stores data using a log structuring method.
- In the log structuring method, when data at a logical address is updated with new data, the new data is appended at a new physical address without overwriting the data at the old physical address. Unnecessary old data is deleted as appropriate.
- With the log structuring method, reading old data for updating the node redundancy code P is unnecessary, and the time for the write process to the drive 113 can be shortened.
- the distributed storage system may not implement the log structuring method.
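- A minimal sketch of the log structuring method described above (append-only physical area, mapping of latest and old addresses); the in-memory structures are illustrative:

```python
class LogStructuredStore:
    def __init__(self):
        self.log = []            # append-only physical area (list of blocks)
        self.mapping = {}        # logical address -> physical index (latest data)
        self.old = {}            # logical address -> previous physical indexes

    def write(self, logical_address, data):
        physical_index = len(self.log)       # new data always goes to a new address
        self.log.append(data)
        if logical_address in self.mapping:
            # keep the old location so it can be invalidated / garbage-collected later
            self.old.setdefault(logical_address, []).append(self.mapping[logical_address])
        self.mapping[logical_address] = physical_index

    def read(self, logical_address):
        return self.log[self.mapping[logical_address]]

store = LogStructuredStore()
store.write(0x0003, b"old")
store.write(0x0003, b"new")     # appended; the old block becomes invalid
assert store.read(0x0003) == b"new"
```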
- the log structured mapping table 213 holds the relationship between the logical address and the physical address of the latest data, information on the relationship between the logical address and the physical address of the old data, and data generation management information.
- The generation management information of a redundant code generated from a plurality of stripes indicates the generation information of each stripe used for generating the redundant code.
- a data guarantee code (write sequence number, CRC, etc.) may be added to the data mapping table 701 or the redundant code mapping table 702. By adding this information, it is possible to check data consistency only by referring to the information in this mapping table once at the time of address conversion.
- the reverse mapping table 703 is a reverse conversion table of the above tables 701 and 702. That is, the reverse mapping table is referred to for conversion from the physical area address to the pool volume address.
- the inverse mapping table 703 includes a table 732 indicating logical addresses corresponding to the address areas 731 for storing data in the physical area.
- Each table 732 includes a data type (stripes / geocode / site code / node code), number of indexes (number of references), update date / time, reference (corresponding pool volume area, site number, node number, etc.).
- FIG. 8 shows, as an example, information on a logical address corresponding to a physical address that stores a geo-redundant code.
- This example is consistent with the example of the redundant code mapping table 702 in FIG.
- the data type is a geo-redundant code and the number of indexes is 2. This indicates that two stripes are used to generate the geo-redundant code.
- the reference indicates the storage logical address of the stripe used for generating the geo-redundant code.
- the logical address is indicated by a site number, a node number, an LDEV number, a stripe number, and an offset.
- the redundant code of various stripe combinations can be appropriately managed.
- the node may add update information to the reverse mapping table 703 in synchronization with user data drive writing. This makes it possible to recover data when an unexpected power loss occurs.
- the node 101 may store in the memory 118 and update the reverse mapping table 703 in the drive 113 asynchronously with the drive write of the user data.
- the reverse mapping table 703 may hold a write sequence number.
- the reverse mapping table 703 may hold old data information in addition to the latest data information.
- FIG. 9 shows a configuration example of the local area control table 214.
- an arrow represents a pointer.
- the local area control table 214 includes an effective list 801A, an invalid list 801B, a free list 801C, and a local area amount table 802.
- the local area control table 214 manages the area of the drive 113 in the node 101.
- The arrows in the lists 801A to 801C are pointers.
- each area is indicated by a drive number and an LBA in the drive.
- the valid list 801A is a list of valid areas.
- the effective area is an area for storing the latest user data or the latest redundant data.
- the blocks LBA0, 4, and 5 each store valid data.
- the invalid list 801B is a list of invalid areas.
- the invalid area is an area for storing old user data or old redundant codes.
- An old and invalid redundant code is a redundant code in which all stripes used to generate the redundant code are invalid.
- the blocks LBA1, 3, and 7 each store invalid data.
- the free list 801C is a list of unused areas.
- The local area amount table 802 manages the target usage amount, the actual usage amount, and the valid area amount for each of the stripe types, the node redundant code, the site redundant code, the geo redundant code, and the spare area.
- the node 101 holds a local area amount table 802 for each hierarchy. Each entry in the local area amount table 802 may indicate the total amount of all layers. By individually managing the amounts of the stripe type and the redundant code, the amounts for various data can be appropriately controlled.
- the processor 119 updates the local area control table 214 synchronously or asynchronously with the host I / O.
- the local area amount table 802 holds only the stripe type entry to which the own node 101 belongs.
- the local area amount table 802 may include an entry for stripe type data to which the own node 101 does not belong in order to manage the usage amount of data transferred from the other node 101.
- FIG. 10 shows an example of the cache information 204.
- Each node 101 holds unique cache information 204.
- the cache information 204 includes a data dirty queue 900, a code dirty queue 901, a clean queue 902, a free queue 903, and an intermediate dirty queue 904. Dirty queues 900, 901, and 904 indicate data on the cache 181 that has not been reflected in the drive 113.
- the cell in the queue indicates an entry, and the entry information corresponds to the information in the cache bitmap table 905 and stores information selected from the cache bitmap table 905. Arrows in the queue represent pointers that connect the entries. The black circle is the starting point.
- the data dirty queue 900 indicates write data (stripes) from the host stored in the local drive 113. Write data belongs to one of the site stripe types.
- the data dirty queue 900 includes a queue for each site stripe type to which the node 101 belongs as a data node.
- the code dirty queue 901 indicates a stripe for generating a redundant code on the cache 181 that has not been reflected in the drive 113.
- the stripe and the redundant code generated from the stripe are dirty data.
- the code dirty queue 901 includes a queue for stripes received from other nodes for redundant code generation. Since the node 101 belongs to a plurality of protection layers, a stripe type queue of different protection layers is prepared. In the example of FIG. 10, site stripe type and geo stripe type queues are shown. A dirty queue for each set of stripe type and data position (node) is used.
- Each queue belongs to the corresponding stripe type and indicates a list of data stored in the physical area of the corresponding node.
- the queue of “SITE STRIPETYPE # 0, 0” is a queue for data that belongs to the site stripe of the site stripe type number 0 and is stored in the node of the node number 0.
- the intermediate dirty queue 904 indicates an intermediate code on the cache 181 that has not been reflected in the drive 113.
- The intermediate code is data generated from the new stripe and the old stripe, for example the xor of the new stripe and the old stripe.
- The intermediate code is difference data between the new stripe and the old stripe, and the node 101 can use it to update the redundant code of the old stripe stored in the drive 113 to the redundant code of the new stripe. Details of how to use the intermediate code will be described later.
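- A worked sketch of the intermediate-code update, assuming XOR as the redundancy operation:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

old_stripe   = b"\x01\x02\x03\x04"
new_stripe   = b"\x11\x02\x03\x04"
other_stripe = b"\xa0\xb0\xc0\xd0"            # stripe from another node

old_code = xor_bytes(old_stripe, other_stripe)   # redundant code already on the drive

# The data node sends only the intermediate code (new xor old) ...
intermediate = xor_bytes(new_stripe, old_stripe)

# ... and the code node applies it to the stored code without re-reading the
# other stripes.
new_code = xor_bytes(old_code, intermediate)
assert new_code == xor_bytes(new_stripe, other_stripe)
```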
- the configuration of the intermediate dirty queue 904 is the same as the redundant code queue in the dirty queue 901. That is, in this example, a queue for each set of stripe type and data position (node) is used. Since the node 101 belongs to a plurality of protection layers, a stripe type queue of different protection layers is prepared. In the example of FIG. 10, site stripe type and geo stripe type queues are shown.
- the clean queue 902 indicates data on the cache 181 that has been reflected in the drive 113.
- the free queue 903 indicates an area of the cache 181 that is not used.
- the cache bitmap table 905 includes a logical address of data, a cache address (location on the memory), a size, a dirty bitmap, and a staging bitmap. For example, one entry indicates information of one slot of a predetermined size in the cache 181.
- the logical address corresponds to the logical address of the stripe described with reference to FIG.
- the stripe logical address transferred from the other node 101 includes, for example, a site number, a node number, an LDEV number, an LBA, and an offset.
- the dirty bitmap indicates which part of the area is dirty.
- the staging bitmap indicates which part of the area has been staged on the cache 181. For example, one bit corresponds to one block of the drive 113.
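- A sketch of one entry of the cache bitmap table 905, assuming one bit per drive block; the slot and block sizes are illustrative:

```python
from dataclasses import dataclass

BLOCKS_PER_SLOT = 8

@dataclass
class CacheSlotEntry:
    logical_address: tuple              # e.g. (site, node, LDEV, LBA, offset)
    cache_address: int                  # location of the slot on the memory
    size: int
    dirty_bitmap: int = 0               # blocks not yet reflected to the drive
    staging_bitmap: int = 0             # blocks already staged on the cache

    def mark_dirty(self, block: int):
        self.dirty_bitmap |= 1 << block
        self.staging_bitmap |= 1 << block

entry = CacheSlotEntry(logical_address=(0, 0x01, 0, 0x0100, 0),
                       cache_address=0x8000, size=BLOCKS_PER_SLOT * 512)
entry.mark_dirty(3)
```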
- FIG. 11 shows a mapping image of the site protection layer (layer number 2). Basically, the mapping images of the node protection layer (layer number 1) and the geo protection layer (layer number 3) are the same.
- the stripe type cycle number is represented as c, the redundant code number (parity number) as p, and the stripe number (data number) as d.
- In this example, the number of cycles c is 5, the number of redundant codes p is 1, and the number of stripes d is 3.
- one redundant code is generated from a maximum of three stripes and stored in a node in the site stripe type.
- the redundant code is generated from any number of stripes of 3 or less.
- Table 621 shows stripe type data nodes and redundant code nodes. Each column corresponds to a node with node numbers 0 to 8. The capacity of the physical storage area of each of the nodes with node numbers 0 to 8, represented by a cylinder 622, is indicated by the height of the cylinder.
- the numbers in the cells indicate the stripe type numbers.
- the cells in the D area indicate the stripe type number to which the data node belongs.
- the cell in the Q region indicates the stripe type number to which the redundant code node belongs.
- the S area cell indicates the stripe type number to which the spare node belongs and the type of data to be stored (stripe / redundancy code).
- the spare node is a node that temporarily stores data of the node when a failure occurs in the node. Thereby, the redundancy is recovered in the event of a node failure.
- the stripe type number of the write data is determined by the stripe number of the write data and the node number of the node that receives and stores the write data. Specifically, the node 101 determines the stripe number as (logical address value of the write data ÷ stripe size). In this example, the logical address is the logical address in the pool volume; a logical address in the virtual volume may be used instead. Further, the node 101 calculates the row number of the write data as (stripe number mod c).
- the node 101 refers to the site static mapping table 211 of layer number 2 and determines the stripe type number from its own node number and the calculated row number. For example, the node 101 selects the entries that include its own node number as a data node in the site static mapping table 211, and determines the site stripe type number of the entry at the position indicated by the row number as the site stripe type number of the write data.
- the node 101 also refers to the site static mapping table 211 of layer number 2 and determines the redundant code node of the stripe type to which the write stripe belongs. This point will be described again in the description of the write processing described later.
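- The following is a minimal sketch of this determination (the stripe size, cycle number, and table contents below are illustrative; the mapping entries reuse the node and stripe type examples given for FIG. 11, and the dictionary layout is only a stand-in for the site static mapping table 211):
```python
STRIPE_SIZE = 1 << 20   # assumed stripe size (1 MiB)
C = 5                   # stripe type cycle number in this example

# (node number, row number) -> site stripe type number (illustrative excerpt)
SITE_STATIC_MAPPING = {(0, 0): 0, (5, 0): 0, (7, 0): 0,
                       (1, 4): 13, (3, 4): 13, (8, 4): 13}
# site stripe type number -> redundant code node number (illustrative excerpt)
REDUNDANT_CODE_NODE = {0: 1, 13: 4}

def stripe_type_of(own_node: int, logical_address: int) -> int:
    stripe_number = logical_address // STRIPE_SIZE   # logical address / stripe size
    row_number = stripe_number % C                   # stripe number mod c
    return SITE_STATIC_MAPPING[(own_node, row_number)]

def redundant_code_node_of(own_node: int, logical_address: int) -> int:
    return REDUNDANT_CODE_NODE[stripe_type_of(own_node, logical_address)]

# Write data at row 0 on node 5 belongs to stripe type 0, whose code node is node 1.
assert stripe_type_of(5, 0) == 0
assert redundant_code_node_of(5, 0) == 1
```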
- the stripe with row number 0 in the nodes with node numbers 0, 5, and 7 belongs to the stripe type with stripe type number 0.
- the stripe with row number 4 in the nodes with node numbers 1, 3, and 8 belongs to the stripe type with stripe type number 13.
- the node number of the redundant code node belonging to the stripe type of stripe type number 0 is 1, and the node number of the redundant code node belonging to the stripe type of stripe type number 13 is 4.
- Some nodes store redundant codes of multiple stripe types.
- the stripe distribution in the D region is uniform.
- the number of stripe-type data nodes may be changed depending on the storage capacity of the nodes. Further, when the total number of nodes is small or when a fraction occurs, the number of redundant codes of some stripe types may be reduced. Different stripe types may be made redundant by different algorithms.
- the redundant code node in the stripe type is selected from a node different from the data node of the stripe type. Data writes from the data node are concentrated on the redundant code node. Therefore, redundant code nodes are selected so that redundant codes are arranged as evenly as possible. Thereby, the lifetime of the node 101 is leveled. This is particularly effective when the drive 113 is an SSD. When the lifetimes are uneven among the nodes, the arrangement of the redundant codes Q may be changed so as to be leveled.
- the spare node is a temporary storage destination for recovering redundancy when a node failure occurs.
- the spare node for storing the redundant code is selected from a node different from the data nodes of the same stripe type. In the example of FIG. 11, a failure has occurred in the node with node number 6.
- the spare node associated with the stripe type number of the stripe or redundant code temporarily stores the corresponding stripe or redundant code.
- the node with the node number 0 stores the stripe with the stripe type number 2 stored in the node with the node number 6.
- the node with the node number 7 stores the redundancy code Q with the stripe type number 3 stored in the node with the node number 6.
- Data recovery is performed by a node that stores the data or a different node. Data (stripes and redundant codes) stored in the spare node is returned from the spare node to one node at the time of node recovery or node expansion.
- the stripe type does not depend on the LDEV number of the pool volume, but is determined by the address in the pool volume. Data at the same in-volume address in different pool volumes belong to the same stripe type. The address area of one pool volume is classified into a plurality of stripe types. As will be described later, the redundant code node selects any number of arbitrary stripes from the stripes in the same stripe type, without depending on the in-volume addresses of the stripes, and generates a redundant code from the selected stripes.
- FIG. 12A shows the state transition of the node 101 in the distributed storage system.
- FIG. 12B shows the state transition of the site 102 in the distributed storage system. Basically, the state transition is the same in each protection layer.
- the normal state indicates the initial state and the normal state during operation.
- the state shifts to a rebuilding state when a drive failure occurs.
- the node 101 can accept application I / O by collection read / write during rebuilding.
- In the blocked state, the node 101 is down and I/O cannot be executed. However, the drive 113 may not have failed. In this case, the data can be restored and the blocked state can be returned to the normal state by a data resync that reflects only the data newly written while the node 101 was blocked.
- FIG. 13 shows a logical configuration example of the virtual provisioning layer in one node 101 of the distributed storage system.
- the virtual volumes 1301A and 1301B are virtual storage areas recognized by the host (same node or other nodes), and are volumes that are targeted when a read or write command is issued from the host.
- the pool 1306 is composed of one or more pool volumes.
- the pool 1306 includes pool volumes 1303A to 1303F.
- the pool 1306 may include a pool volume of another node.
- the pool volumes 1303A to 1303F are constituted by storage areas of the drive 113.
- the processor 119 configures a logical pool volume by managing the correspondence between the logical address of the pool volume and the physical address of the drive 113. Details will be described later.
- the storage administrator can create a plurality of virtual volumes on the pool 1306 by an instruction to the processor 119 via the input / output device.
- the processor 119 allocates a real storage area from the pool 1306 only to a storage area for which a write command is issued in the virtual volume.
- the virtual volume 1301A includes virtual pages 1302A, 1302B, and 1302C, and logical pages 1304A, 1304E, and 1304C are assigned to the virtual pages 1302A, 1302B, and 1302C, respectively.
- the virtual volume 1301B includes virtual pages 1302D and 1302E, and logical pages 1304D and 1304F are allocated to them.
- Logical pages are dynamically assigned to virtual pages. For example, when a write command is issued for the first time to the virtual page 1302A of the virtual volume 1301A, the processor 119 associates it with an unused area (logical page 1304A) of the pool volume 1303A (association 1305A). Also for the next read / write instruction to the same page, the processor 119 executes I / O processing for the logical page 1304A of the pool volume 1303A based on the association 1305A.
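- A minimal sketch of this on-demand allocation follows (the class and identifiers below are hypothetical simplifications of the page mapping table 215 and the pool's unused logical pages):
```python
class ThinPool:
    """Allocates a logical page to a virtual page on its first write only."""

    def __init__(self, free_logical_pages):
        self.free = list(free_logical_pages)   # unused logical pages in the pool
        self.page_map = {}                     # virtual page -> logical page

    def resolve(self, virtual_page):
        if virtual_page not in self.page_map:
            if not self.free:
                raise RuntimeError("pool exhausted")
            self.page_map[virtual_page] = self.free.pop(0)
        return self.page_map[virtual_page]

pool = ThinPool(free_logical_pages=["1304A", "1304B"])
assert pool.resolve(("1301A", "1302A")) == "1304A"   # first write allocates
assert pool.resolve(("1301A", "1302A")) == "1304A"   # later I/O reuses the same page
```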
- the node 101 can appear to the host as if it is executing an I / O process (access process) for the virtual volume.
- the limited storage area can be used efficiently.
- the processor 119 cancels the association between the virtual page and the logical page, and manages the logical page as an unused page. Thereby, the limited storage area can be used more efficiently.
- the pool 1306 includes a plurality of levels 115, 116, and 117.
- the pool 1306 has three tiers: an SSD tier 115 (TIER1), a SAS tier 116 (TIER2), and a SATA tier 117 (TIER3).
- the performance of the SSD hierarchy 115 is the highest, and the performance of the SATA hierarchy 117 is the lowest.
- Pool volumes are classified into the tiers 115, 116, and 117, and each pool volume belongs to one of the tiers.
- the pool volume 1303A belongs to the tier 115, the pool volumes 1303B and 1303C belong to the tier 116, and the pool volumes 1303D and 1303E belong to the tier 117.
- Each virtual page has the characteristics of I / O processing from the host. For example, there are virtual pages with high I / O frequency (access frequency) and virtual pages with low frequency. This is called access locality.
- a virtual page with a high I / O frequency is arranged in an upper layer, that is, a virtual page with a high I / O frequency is assigned to a logical page in an upper layer. Thereby, the performance of the whole system can be improved.
- the virtual page is also expressed as being assigned to a tier or assigned to a pool volume.
- In page promotion, the data of the logical page 1304C is copied to the unused logical page 1304B, and the correspondence between the virtual page 1302C and the logical page 1304C (1305C) is changed to a correspondence between the virtual page 1302C and the logical page 1304B (1305B). Page demotion can be performed in the same way.
- Graph 271 represents the distribution of I / O frequency (I / O load) of a page.
- the processor 119 can create load distribution data indicating the graph 271 from the page load distribution table 217.
- a distribution line 1309 is a line that represents the number of I / Os of each page when pages are arranged in the order of high I / O frequency. That is, a page with a large number of I / Os is located on the left side, and a page with a low I / O frequency is located on the right side.
- the tier assignment threshold values 1308A and 1308B are threshold values that determine which I / O frequency page is assigned to which tier.
- the performance of the entire system can be improved by allocating pages with a high I / O frequency to the upper layer. Therefore, virtual pages are allocated from the upper hierarchy in the order of high I / O frequency.
- the initial values of the tier allocation thresholds 1308A and 1308B may be 0, for example.
- the processor 119 allocates a page belonging to the page range 1310A having the highest I / O frequency to the SSD hierarchy 115 from the intersection of the hierarchy allocation threshold 1308A and the distribution line 1309.
- the processor 119 allocates pages belonging to the range 1310B, from the intersection of the hierarchy allocation threshold 1308A and the distribution line 1309 to the intersection of the hierarchy allocation threshold 1308B and the distribution line 1309, to the SAS hierarchy 116.
- the processor 119 allocates to the SATA hierarchy 117 from the intersection of the hierarchy assignment threshold 1308B and the distribution line 1309 to the page with the minimum I / O frequency.
- the storage administrator may specify the values of the tier assignment thresholds 1308A and 1308B, and the processor 119 may calculate the values of the tier assignment thresholds 1308A and 1308B.
- the processor 119 may determine a tier allocation threshold that defines the tier based on the virtual page I / O frequency distribution, the tier capacity, and the drive performance of the tier.
- the drive performance of a tier is defined in advance by the amount of IO data per unit time in the tier, for example.
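- One possible way to derive such thresholds is sketched below: pages are sorted by I/O frequency and packed into the highest tier until either its capacity or its drive performance would be exceeded, and the boundary frequency becomes the threshold. This packing rule is an assumption for illustration; the document leaves the exact calculation open.
```python
def tier_thresholds(page_iops, tiers):
    """page_iops: per-page I/O frequencies.
    tiers: (capacity_in_pages, performance_in_iops) pairs, fastest tier first."""
    pages = sorted(page_iops, reverse=True)
    thresholds = []
    i = 0
    for capacity, performance in tiers[:-1]:     # the lowest tier takes the rest
        used_pages, used_iops = 0, 0.0
        while (i < len(pages) and used_pages < capacity
               and used_iops + pages[i] <= performance):
            used_iops += pages[i]
            used_pages += 1
            i += 1
        # boundary I/O frequency between this tier and the next one
        thresholds.append(pages[i] if i < len(pages) else 0.0)
    return thresholds

# Two thresholds (1308A, 1308B) for a three-tier pool, illustrative numbers only.
print(tier_thresholds([500, 300, 120, 80, 10, 5],
                      tiers=[(2, 1000), (2, 500), (10, 200)]))   # -> [120, 10]
```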
- FIG. 14 shows an example of page mapping of a plurality of nodes in the distributed storage system.
- the distributed storage system provides virtual volumes 1301A to 1301C.
- the node 101A provides a virtual volume 1301A.
- the node 101B provides virtual volumes 1301A and 1301B.
- the node 101N provides a virtual volume 1301C.
- the node 101 (arbitrary nodes 101A to 101N) can hold two types of volumes.
- One is a pool volume 1303A composed of the storage area of the local drive 113. Data stored in the pool volume 1303A is arranged in the local drive 113.
- the other one is a volume 1303C for straight mapping the pool volume 1303B of another node 101.
- the volume 1303C is managed as a pool volume.
- the node 101 can perform I / O processing of the other pool volume 1303B via the pool volume 1303C.
- the node 101 converts the access destination address of the volume 1303C into the address of the other-system pool volume 1303B, and transmits a command to the other-system node 101.
- the node 101 holds an address mapping table (not shown) between the local pool volume 1303C and the other pool volume 1303B.
- the processor 119 maps a virtual page having a large host access amount in the own system to the own pool volume 1303A, and maps a virtual page having a large host access amount in the other system to the pool volume 1303B of the other system. This shortens the response time to the host.
- the data of the virtual page assigned to the other system pool volume 1303B is stored in the other system drive 113.
- Each node 101 selects the number of other-system pool volumes to be mapped and the virtual pages to be allocated to other-system pool volumes based on the network performance and the own-system drive performance of each tier, and places logical pages so that the network does not become a bottleneck. Details of this arrangement will be described later with reference to FIGS. 23, 24A, and 24B.
- the distributed storage system may manage the storage capacity of the entire system, and increase or decrease the number of pool volumes of each node 101 according to the page usage capacity of the virtual volume.
- the node 101 may use the pool volume 1303A as a virtual volume by straight mapping. Thereby, the memory usage of the mapping table can be reduced, and the performance and availability can be improved.
- FIG. 15 shows a flowchart of the read processing of the distributed storage system.
- the processor 119 refers to the page mapping table 215 to determine whether or not the access destination virtual page is unassigned to the pool volume for the designated address of the received read instruction (S501).
- the designated address is designated by a virtual volume number and a logical address, for example.
- LBA is represented by the start LBA and the block length.
- the processor 119 determines whether or not exclusion is necessary (S506).
- the processor 119 refers to the virtual volume management table 218 and determines that exclusion is not necessary when the owner node of the virtual volume is only the own node.
- the processor 119 acquires the exclusion (S507), and determines again whether the virtual page is unallocated to the pool volume (S508).
- the processor 119 specifies a representative node uniquely determined from the read address using a hash function, requests arbitration from the representative node, and the representative node performs arbitration.
- the processor 119 releases the exclusion (S512), and proceeds to step S502.
- the processor 119 returns zero data (S509), and determines whether or not exclusion is necessary (S510) as in the determination in step S506. If exclusion is necessary (S510: Y), since the exclusion has already been acquired, the processor 119 releases the exclusion (S511).
- In step S501, when the virtual page has been allocated (S501: N) and the virtual page is allocated to the own pool volume (S502: Y), the processor 119 secures its own cache area, reads the data from the pool volume, and returns the read data (S504).
- the processor 119 refers to the pool volume management table 219 and external connection management information (not shown) to determine whether or not a virtual page is allocated to the own pool volume.
- In securing the cache area, the processor 119 refers to the cache information 204 and identifies the cache area associated with the target logical address. If there is no corresponding cache area, the processor 119 secures a new area from the free queue 903. When the free queue 903 is empty, the processor 119 secures a new area from the clean queue 902. When the clean queue 902 is empty, the processor 119 destages the area in the dirty queue 900, 901, or 904 and changes it to a free area.
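- A minimal sketch of this slot-securing order (the deque objects and the destage callback are hypothetical stand-ins for the queues of the cache information 204):
```python
from collections import deque

def secure_cache_slot(free_q: deque, clean_q: deque, dirty_q: deque, destage):
    if free_q:
        return free_q.popleft()           # unused area
    if clean_q:
        return clean_q.popleft()          # clean data is already on the drive
    slot = dirty_q.popleft()              # dirty data must be written out first
    destage(slot)
    return slot

free_q, clean_q, dirty_q = deque(), deque(["slot3"]), deque(["slot7"])
assert secure_cache_slot(free_q, clean_q, dirty_q, destage=lambda s: None) == "slot3"
assert secure_cache_slot(free_q, clean_q, dirty_q, destage=lambda s: None) == "slot7"
```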
- the processor 119 transfers a read command to the other node 101 (S505).
- the processor 119 does not cache read data in its own system. That is, if the virtual page allocation destination is another node, the processor 119 does not cache the read data in the local memory 118 (read-through), and the other node 101 caches the read data.
- FIG. 16 shows a flowchart of the synchronous write process. This process is executed when a write command is issued from a host (for example, an application program). In addition to storing write data in the local pool volume, this process writes write data to other nodes 101 to generate site redundancy codes (inter-node redundancy codes) and geo-redundancy codes (inter-site redundancy codes). Forward.
- the processor 119 of the node 101 that has received the write command determines whether the page is unallocated (S601). Specifically, the processor 119 refers to the page mapping table 215 and searches for the corresponding pool volume number and LBA from the specified address (virtual volume number and LBA) of the write command. The processor 119 determines whether the virtual page is unallocated based on the presence / absence of corresponding address information.
- a plurality of applications are activated, and at least one node in the system operates each application.
- a data read request is often issued to a node that has received a data write command. Therefore, in this application, when a node receives a write request, the node preferentially stores the write request data in the storage area of the node. As a result, the probability of being able to read from the node in response to a read request increases, and it becomes possible to respond to the read request at high speed.
- the system throughput may be improved by distributing the data to a large number of nodes.
- the allocation destination storage area may be changed using a technique such as round robin according to the performance of the network 103 or the performance of the drive 113 connected to the node 101.
- the above allocation policy is not only based on an index of performance, but when a flash is used as the drive 113, an index such as a lifetime may be used to improve cost effectiveness.
- the processor 119 executes a process of allocating the virtual page to the pool volume.
- the processor 119 first determines whether or not the update of the page mapping table 215 needs to be exclusive (S611). The reason for obtaining the exclusion is to prevent a plurality of different pool volume areas from being allocated to the virtual page when the virtual page is simultaneously allocated in the other node 101.
- the processor 119 refers to the virtual volume management table 218 and determines that exclusion is necessary when the owner nodes include a node other than the own node, and that exclusion is not necessary when the owner node is only the own node. When it is determined that exclusion is necessary (S611: Y), the processor 119 acquires the exclusion (S612).
- the exclusive acquisition method is the same as the method shown in the read processing described with reference to FIG.
- the processor 119 determines again whether or not the virtual page is unallocated (S613). This is because after determining whether or not a virtual page has been allocated in step S601, it is possible that the exclusion has been acquired by another node before the exclusion is acquired in step S612.
- the processor 119 determines a pool volume to which the virtual page is allocated (S614). The processor 119 first checks whether there is a free page in the own pool volume.
- Specifically, the processor 119 refers to the target amount and the usage amount in the local area amount table 802, and determines whether the usage amount is smaller than the target amount in the entry of the write data stripe type.
- the processor 119 allocates the virtual page to the own pool volume.
- the node 101 holds local area tier management information (not shown), and selects the highest-order tier pool volume including empty pages.
- the processor 119 mounts the pool volume of the other system (other node) locally and allocates a page to the area.
- the processor 119 allocates a virtual page to the pool volume (S615). Specifically, the processor 119 updates the correspondence relationship in the page mapping table 215.
- In this way, the storage area of another node is used, the performance of the node that received the write request is prevented from degrading, and the capacity efficiency and performance of the entire system are maintained.
- the processor 119 determines whether or not exclusion is necessary (S616). This determination is the same as in step S611. When exclusion is necessary (S616: Y), the processor 119 releases the acquired exclusion (S618). If exclusion is not necessary (S616: N), the processor 119 proceeds to step S602.
- the processor 119 determines whether the logical address (virtual page) in the virtual volume of the write instruction is assigned to the own pool volume with reference to the page mapping table 215 (step 602).
- the processor 119 transfers a write command to the other node 101 (S603).
- the other node 101 executes the write process according to this flowchart.
- the processor 119 does not cache write data in its own system.
- the processor 119 starts write processing for each protection layer (S604 to S610).
- a distributed storage system is composed of three protection layers. They are, for example, a node protection layer, a site protection layer, a geo protection layer.
- the processor 119 repeats the process three times with three layers.
- the node protection layer is set to synchronous write.
- the processor 119 determines whether or not the layer is a synchronous write target (S604). Specifically, the processor 119 makes a determination with reference to the Sync / Async field corresponding to the write target virtual volume in the virtual volume management table 218.
- If the layer is not a synchronous write target (S604: N), the processor 119 records “incomplete” in the status field of the data mapping table 701 without transferring the write data (stripes) to the other nodes 101.
- the status field indicates the status of each protection layer. Data on the cache 181 whose status field indicates “incomplete” is maintained until transfer.
- the processor 119 determines whether all the protection layers have been completed (S608), and if completed, ends this processing. If not completed (S608: N), the processor 119 repeats the processing of the next protection layer from step S604. If it is a synchronous write target (S604: Y), the processor 119 secures a cache in its own cache area 181 (S605). The method is the same as the method described with reference to FIG.
- the processor 119 determines whether or not to transfer the intermediate code (S606).
- the intermediate code represents an update difference between the old data (the latest data so far) and the new data (data to be written this time). For example, in the case of redundant data corresponding to RAID 5, the intermediate code is the xor value of old data and new data.
- the processor 119 may generate a plurality of xor results obtained by multiplying matrix coefficients.
- Some criteria can be used as criteria for determining whether intermediate code transfer is necessary. For example, the processor 119 determines that intermediate code transfer is necessary when the remaining amount of the redundant code area of the transfer destination node 101 is less than a threshold value. As a result, the redundant code required at the transfer destination node can be reliably stored. The processor 119 acquires information on the local area amount of the transfer destination node 101 from the transfer destination node 101.
- the processor 119 may generate an intermediate code when the response reduction effect at the time of a cache hit is small in its own system. For example, the processor 119 transfers the intermediate code when the writing mode is set in the own system, when a predetermined low-latency drive is used in the own system, when the own system is under a load higher than a threshold, or when the communication distance between nodes is longer than a threshold.
- the processor 119 transfers the intermediate code when the write life of the drive 113 is sufficient.
- the processor 119 destages the write data from the cache 181 to the drive 113 and then returns a completion response to the host.
- If it is determined that intermediate code transfer is necessary (S606: Y), the processor 119 generates an intermediate code from the stripe (write data) on the cache 181 and the old stripe read from the drive 113 (S609), and writes the intermediate code to the cache 181 of the target node (transfer destination node) (S610).
- the processor 119 specifies the target node (transfer destination node) of the intermediate code by the following method.
- the processor 119 calculates a row number (value on the vertical axis of the D region in FIG. 11) by the following formula.
- the method for calculating the row number is the same as the method for calculating the row number of the stripe described with reference to FIG. 11: row number = (address value ÷ stripe size) mod c.
- the processor 119 determines the stripe type number (the number in the cell in FIG. 11) by referring to the static mapping table of the protection layer from the calculated row number and own node number.
- the processor 119 refers to the static mapping table of the protection layer and determines the transfer destination node 101 from the stripe type number.
- the processor 119 transfers the intermediate code to the destination of the transfer destination node 101 together with its own address information (site number, node number, LDEV number, LBA, TL (Transfer Length)) and an identifier indicating the intermediate code.
- the LDEV number is an identifier of the pool volume.
- the processor 119 refers to the static mapping table 211 of the layer number 2 and determines the redundant code node that finally stores the site redundant code Q as the transfer destination node.
- the processor 119 refers to the static mapping table 212A of the layer number 3 and determines the transfer destination site (the storage site of the geo-redundant code R). For example, the representative node 101 of the site is set in advance, and the processor 119 transfers the intermediate code together with the accompanying data to the representative node 101.
- the representative node 101 calculates a hash value from the transfer source address information using a hash function.
- the representative node 101 refers to the consistent hashing table 212B and determines the transfer destination node 101 from the calculated hash value.
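- A sketch of this hash-based selection is shown below. The use of SHA-1 and a virtual-node ring is an assumption standing in for the consistent hashing table 212B; only the idea of hashing the transfer source address information and mapping the hash to a node is taken from the description above.
```python
import bisect
import hashlib

def h32(key: str) -> int:
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big")

class ConsistentHashTable:
    def __init__(self, node_numbers, vnodes=16):
        points = sorted((h32(f"node{n}:{v}"), n)
                        for n in node_numbers for v in range(vnodes))
        self.ring = [p for p, _ in points]
        self.owner = [n for _, n in points]

    def destination(self, site, node, ldev, lba) -> int:
        key = f"{site}:{node}:{ldev}:{lba}"      # transfer source address information
        i = bisect.bisect(self.ring, h32(key)) % len(self.ring)
        return self.owner[i]

table = ConsistentHashTable(node_numbers=[0, 1, 2, 3])
print(table.destination(site=1, node=0, ldev=5, lba=0x1000))   # geo code node number
```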
- the node 101 is the final storage node (redundant code node) of the geo-redundant R.
- the data transfer method via the representative node 101 has a problem that it requires two times of data transfer, concentration of access to the representative node 101, and reduced availability due to a failure of the representative node 101. Therefore, a plurality of representative nodes 101 may be prepared and the transfer destination representative node 101 may be selected in round robin.
- the processor 119 may determine another site node that directly stores the geo-redundant code R instead of the representative node 101. Specifically, the transfer source node 101 holds a transfer destination site consistent hashing table 212B in advance, and the processor 119 determines the transfer destination node 101 according to the table.
- the distributed storage system may periodically update without performing close synchronization by exclusive update. In that case, it is determined whether the destination node 101 that received the intermediate code from the other site is the correct destination by referring to the consistent hashing table 212B of the local site. Received data may be transferred.
- the processor 119 of the transfer destination node 101 calculates the xor of the intermediate code and the dirty data, and updates the data in the cache 181.
- the processor 119 of the transfer destination node 101 connects the cache information related to the intermediate code to the intermediate dirty queue 904.
- the transfer destination node 101 may update the data on the cache 181 by calculating the xor of the intermediate code from different transfer sources that are the sources of the same redundant code.
- If it is determined in step S606 that the intermediate code is not to be transferred (S606: N), the processor 119 writes the write data to the cache 181 of the target node (transfer destination) (S607).
- the write data is basically stored preferentially in the node that received the access, redundancy is ensured on the cache of the transfer destination node, and the storage capacity required for redundant codes is reduced while maintaining redundancy, which improves efficiency.
- the method for determining the transfer destination node 101 and the data transfer method are the same as those in step S610.
- the transfer source node 101 transfers write data together with its own address information (site number, node number, LDEV number, LBA, TL) and an identifier indicating normal data to the transfer destination node.
- the processor 119 connects the cache information corresponding to the write data to the corresponding code dirty queue 901.
- the conventional Erasure Coding method may be adopted for writing the write data to the pool volume of another system instead of the pool volume of the own system.
- write data is divided into stripes, redundant data is generated from the divided data, and the divided data and redundant data are distributed and stored in a plurality of nodes.
- the application of the conventional Erasure Coding method may be limited only to data for which the network does not become a bottleneck due to reads from other systems.
- FIG. 17 shows a flowchart of asynchronous write processing. This process is executed asynchronously with the host I / O, and transfers data that has not yet been transferred to another system in the protection layer in which Async is specified. Steps S702 to S708 in FIG. 17 are the same as steps S605 to S608 in FIG. Here, only the difference will be described.
- the processor 119 refers to the page mapping table 215 and executes this process for all registered virtual volumes.
- the processor 119 determines whether the target virtual page is an asynchronous write target (S701). Specifically, the processor 119 checks the state of the area on the pool volume corresponding to the virtual page in the data mapping table 701. If the protection layer is in an “incomplete” state, the processor 119 determines that it is an asynchronous write target (S701: Y), and proceeds to step S702.
- When the processing of all virtual pages is completed (S709: Y), the processor 119 ends this flow.
- the processor 119 may periodically execute asynchronous write processing or may always execute it.
- the processor 119 may dynamically change the execution frequency and the data transfer rate of this process according to the page amount in the “incomplete” state.
- FIG. 18 shows a flowchart of the destage processing. This process is executed asynchronously with the host I / O when there is dirty data on the cache 181, that is, data not reflected in the drive 113.
- since the generation of redundant data is basically completed within each node (redundant data is generated between data from different senders within the node), the amount of inter-node communication for generating redundant data can be reduced.
- since the redundant data destinations are distributed among many nodes by the static mapping table 211, the destage processing can be efficiently distributed.
- there are two types of dirty data in the cache 181. One is write data to be stored in the local drive 113. The other is data transferred from another node 101 for redundant data generation. The data transferred from another node includes intermediate codes.
- Dirty data is managed by a data dirty queue 900, a code dirty queue 901, and an intermediate dirty queue 904.
- the flowchart in FIG. 18 shows the destage of dirty data managed by the data dirty queue 900 and the code dirty queue 901.
- the processor 119 refers to the data dirty queue 900 and the code dirty queue 901 to find target dirty data.
- the processor 119 determines whether the target data is write data to be stored in the local drive 113 (S801). When the target data is indicated by the data dirty queue 900, the target data is write data.
- When the target data is write data (S801: Y), the processor 119 writes the write data to the local drive 113 (S808). The data is stored in a log structured format. When the write data is stored in the log structured format in the drive 113, the processor 119 records the correspondence between the logical address in the pool volume and the physical address in the drive 113 in the data mapping table 701.
- the processor 119 records the correspondence between the logical address in the pool volume and the physical address in the drive 113 in the reverse mapping table 703. If there is no free space in the drive 113, the processor 119 may execute data write management to the drive 113 after executing the capacity depletion management processing described with reference to FIG.
- the processor 119 determines whether all dirty data has been processed (S806). If all dirty data has been processed (S806: Y), the processor 119 ends this flow.
- the processor 119 finds dirty data of the same stripe type (S802).
- the processor 119 acquires a plurality of stripes transferred from different nodes 101 including the target data in the target data queue in the code dirty queue 901.
- the processor 119 acquires as many X stripes as possible in accordance with a data protection policy designated by the user (XDYP: redundant data number Y with respect to maximum Data number X). User designation of the data protection policy will be described later with reference to FIG.
- the processor 119 selects as many stripes as possible within a range not exceeding the number of data node IDs indicated by the site static mapping table 211 or the geo static mapping table 212A. As a result, redundancy that satisfies user specifications is performed as much as possible.
- the transfer source nodes of the stripes to be selected are all different.
- the processor 119 selects a stripe from all data nodes. In selecting a stripe, the logical address at the transfer source is not limited.
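- A minimal sketch of this dynamic selection, assuming a single XOR parity (Y = 1) for brevity; the data layout is hypothetical:
```python
def select_stripes(dirty, max_data_x):
    """dirty: list of (transfer source node, stripe bytes) for one stripe type."""
    chosen, used_nodes = [], set()
    for node, stripe in dirty:
        if node not in used_nodes:            # transfer sources must all differ
            chosen.append(stripe)
            used_nodes.add(node)
        if len(chosen) == max_data_x:         # at most X stripes per redundant code
            break
    return chosen

def xor_code(stripes):
    code = bytearray(len(stripes[0]))
    for s in stripes:
        for i, byte in enumerate(s):
            code[i] ^= byte
    return bytes(code)

dirty = [(0, b"\x01\x01"), (5, b"\x02\x02"), (0, b"\x08\x08"), (7, b"\x04\x04")]
picked = select_stripes(dirty, max_data_x=3)   # the second stripe from node 0 is skipped
assert xor_code(picked) == b"\x07\x07"
```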
- the processor 119 may store single target data in the drive 113 as a redundant code.
- in the redundant data storage destination node, only data that has already been destaged at the transfer source node should be used to generate redundant data; the redundant data storage destination node may perform its destage after that.
- For this purpose, the data may be transferred to the redundant data storage destination node at the destage timing in the transfer source node. Further, the data on the cache may be prevented from being overwritten when it is updated (for example, by storing it in a log buffer format).
- the processor 119 may find dirty data from queues of the same stripe type in the intermediate dirty queue 904.
- the processor 119 calculates the corresponding redundant code and intermediate code xor stored in the drive 113 and updates the redundant code.
- When the updated redundant code is generated only from stripes of nodes 101 different from the transfer source node 101 of the target data, the processor 119 generates a new redundant code from the target data and the updated redundant code.
- the processor 119 may select a stripe for generating a redundant code so that the ratio of old data (old stripe) is as large as possible. If the processor 119 can generate a redundant code using only the old stripe, the processor 119 selects the stripe as such. By increasing the ratio of old data in redundant code generation, the time when the redundant code becomes invalid data can be advanced, and the free capacity of the redundant code storage area can be increased efficiently.
- the processor 119 calculates a redundant code from the selected stripe and writes it in the drive 113 (S803).
- the writing to the drive 113 is an additional writing in the log structured format basically in the same manner as in step S808. Thereby, reading of old data is omitted, and high-speed and efficient redundant code generation and drive write processing are realized.
- the processor 119 records the correspondence relationship between the calculated physical area of the redundant code and the pool volume page in the redundant code mapping table 702, not in the data mapping table 701.
- the processor 119 further records the correspondence between the logical address in the pool volume and the physical address in the drive 113 in the reverse mapping table 703. Since the redundant code is generated from a plurality of stripes, the mapping table has a plurality of references to one physical address.
- After writing the redundant code to the drive 113, the processor 119 notifies the transfer source node 101 (S805).
- the transfer source node 101 changes the state of the target layer of the target data in the data mapping table 701 to “completed”.
- the status field is referred to in order to determine whether or not the data is to be retransferred when a node failure occurs.
- the processor 119 ends this flow.
- When two or more redundant codes are generated, the node that generates the first redundant code refers to the static mapping table 211 to identify the nodes that generate the second and subsequent redundant codes, and notifies those nodes of the set of data addresses used to generate the first redundant code.
- the nodes that generate the second and subsequent redundant codes can maintain the MDS property and enable data restoration by generating the second and subsequent redundant codes from the notified data address set.
- Alternatively, a method may be considered in which the node that generates the first redundant code also generates the second and subsequent redundant codes and transfers them to the corresponding nodes.
- At the destage of an intermediate code, the processor 119 generates a new redundant code from the old redundant code stored in the drive 113 and the intermediate code, and overwrites the old redundant code in the drive 113 with the new redundant code. Because of the overwriting, the mapping table does not change. Updating the redundant code by the intermediate code requires reading the old data, but can reduce the local area usage at the redundant code node.
- the processor 119 calculates xor of all intermediate codes to generate a new intermediate code, and updates the redundant code with the new intermediate code.
- the intermediate code corresponding to the same redundant code includes different generation data of the same logical address and intermediate codes of different nodes 101.
- For example, when the redundant code corresponds to A xor B, examples of intermediate codes corresponding to the same redundant code are intermediate code A xor A′, intermediate code B xor B′, and intermediate code A′ xor A′′.
- Here, data A′′ is the latest data and data A is the oldest data, while data B is new data and data B′ is old data.
- the processor 119 can know the physical address of the redundant code of the intermediate code selected from the intermediate dirty queue 904 using the redundant code mapping table 702. Further, the processor 119 can specify the logical address of the intermediate code corresponding to the redundant code using the reverse mapping table 703.
- RAID 6 using Reed-Solomon codes (Galois coefficients A1 to A3) is taken as an example.
- the processor 119 selects the dirty data X1 to X3 from the dirty queue 901, and calculates the redundant code P1 or P2 by the following equation.
- P1 = X1 xor X2 xor X3
- P2 = (X1 * A1) xor (X2 * A2) xor (X3 * A3)
- the redundant codes P1 and P2 are respectively written to new areas of the own storage device.
- the processor 119 extracts new intermediate dirty data M1 and M2 corresponding to the old redundant data P1 ′ or P2 ′ written to the local drive 113 from the intermediate dirty queue 904.
- the number of intermediate dirty data is not necessarily two.
- the processor 119 calculates the new intermediate code MP1 or MP2 by the following equation.
- MP1 = M1 xor M2
- MP2 = (M1 * A1) xor (M2 * A2)
- the processor 119 calculates a new redundant code P1 or P2 by the following equation.
- P1 = P1′ xor MP1
- P2 = P2′ xor MP2
- the new redundant codes P1 and P2 are overwritten in the old area (P1 ′, P2 ′).
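- The following is a worked sketch of these equations. Arithmetic is carried out in GF(2^8) with the polynomial 0x11d commonly used for Reed-Solomon based RAID 6; the concrete coefficients A1 to A3 are illustrative assumptions, since the document does not fix them.
```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return p

A1, A2, A3 = 1, 2, 4            # assumed Galois coefficients (distinct, non-zero)

def codes(x1, x2, x3):
    p1 = x1 ^ x2 ^ x3
    p2 = gf_mul(x1, A1) ^ gf_mul(x2, A2) ^ gf_mul(x3, A3)
    return p1, p2

# Old stripes (single bytes for brevity) and their redundant codes.
x1_old, x2, x3 = 0x10, 0x20, 0x30
p1_old, p2_old = codes(x1_old, x2, x3)

# X1 is updated; only the intermediate code M1 travels to the redundant code node.
x1_new = 0x55
m1 = x1_old ^ x1_new
mp1, mp2 = m1, gf_mul(m1, A1)   # MP1 = M1, MP2 = M1 * A1 (single intermediate code here)
p1_new, p2_new = p1_old ^ mp1, p2_old ^ mp2

assert (p1_new, p2_new) == codes(x1_new, x2, x3)   # overwriting P1', P2' yields P1, P2
```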
- the redundant code node 101 dynamically selects a stripe from stripes in one stripe type, and generates a redundant code from the selected stripe. Thereby, a redundant code can be efficiently generated from the transferred stripe without reading the existing redundant code.
- the dynamic selection of stripes in this example is a selection in which at least one of the combination of stripes to be selected and the number of stripes is indefinite.
- a stripe is selected independently from both the number of stripes and the address combination, but one of them may be fixed.
- the address in the address combination is an address specified by a node, a volume, and an in-volume address.
- the log structuring method may not be applied to the redundant code drive write.
- for a new redundant code generated from the same address combination as an old redundant code, the node 101 may overwrite the old redundant code with the new redundant code instead of appending the new redundant code to the local area.
- a redundant code having an address combination different from that of all existing redundant codes is added to the local area.
- a redundant code is generated only from stripes within a predefined stripe type.
- the system may generate a redundant code from any combination of stripes without defining a stripe type.
- FIG. 19 shows a flowchart of the capacity depletion management process. This process tries to erase data when the amount of data on the drive 113 exceeds the set target amount. Thereby, necessary data can be stored in a limited area.
- the types of data to be erased are write data (stripes) and redundant codes. This processing may be performed asynchronously with the host I / O. The relationship between the usage amount and the target amount is shown in the local area amount table 802.
- the processor 119 refers to the local area amount table 802 and checks whether the usage amount of the selected target data type exceeds the target amount (S901). When the usage amount of the target data type has exceeded (S901: Y), the processor 119 determines whether or not the target data type is a redundant code type (S902).
- the data type is classified into a redundant code type, a write data type (stripe type), and a data type on the spare area. Further, the redundant code type is classified into each type of node redundant code, site redundant code, and geo redundant code, and the write data type is classified into each site stripe type.
- the processor 119 refers to the invalid list 801B and the log structured mapping table 213 and searches for an invalid redundant code of the target redundant code type (S907).
- an invalid redundant code is a redundant code for which all of the stripes from which it was calculated are invalid. All of those source stripes are old data that have been updated, so the redundant code can be erased.
- the processor 119 releases the area (S908).
- in the area release, the relationship between the physical address of the target area and the logical address of the pool volume in the redundant code mapping table 702 is deleted, the target area is removed from the invalid list 801B and reconnected to the free list 801C, and the area usage of the corresponding redundant code type is reduced.
- the processor 119 executes a redundant code merging process (S909). This process can reduce the amount of redundant code used.
- the processor 119 refers to the log structured mapping table 213 and acquires the logical address and generation information of the stripes that make up the redundant code.
- the processor 119 acquires X ′, Y ′, and L ′ from the other node 101.
- the processor 119 can reduce the use amount of the redundant code by releasing the areas of the redundant codes P1 and P2 and writing a new redundant code P3 in the drive 113.
- redundant codes may be preferentially selected for merging so that the reduction in the usage amount of the redundant code area is maximized.
- the processor 119 rechecks whether the usage amount by the target redundant code type exceeds the target amount (S910). When the usage amount exceeds the target amount (S910: Y), the processor 119 executes a rebalance process (S906). As will be described later, the rebalancing process adjusts the page usage amount between pool volumes. For example, the data is moved to a pool volume in another hierarchy or a pool volume (another pool volume) of another node 101. After executing the rebalance, the processor 119 proceeds to step S901. If the usage amount of the target redundant code type does not exceed the target amount (S910: N), the processor 119 proceeds to step S901.
- the processor 119 determines whether there is erasable write data (stripes) of the target stripe type (S903). An erasable stripe is an old stripe that has been updated, that is, an invalid stripe. The processor 119 searches the invalid list 801B and the log structured mapping table 213 for invalid stripes of the stripe type.
- the processor 119 performs a redundant code cleanup process (S904). This process cleans up the redundant code corresponding to the stripe to be erased. Cleanup of both site redundancy code and geo redundancy code is performed. Specifically, the following steps are executed in each protection layer.
- the processor 119 inquires of the redundant code node 101 of the erase target stripe whether there is a redundant code including the erase target stripe.
- the target stripe is designated by, for example, a site number, a node number, an LDEV number, and an LBA.
- the processor 119 transmits the erasure target stripe to the redundant code node 101. If there is no redundant code, the process ends.
- the redundant code node 101 generates a new redundant code by erasing the erasure target stripe from the current redundant code with the received erasure target stripe. For example, the redundant code node 101 calculates xor of the stripe to be erased and the old redundant code, and generates a new redundant code. The redundant code node 101 overwrites the old redundant code stored in the drive 113 with the new redundant code.
- the update of the redundant code accompanying the above-mentioned stripe erasure prevents the redundancy of the other stripes of the redundant code from decreasing due to the erasure of one of the stripes from which the redundant code was generated.
- When the redundant code node erases a redundant code, it may inquire whether the stripes corresponding to that redundant code are the latest version.
- the stripe is specified by a logical address indicated by the reverse mapping table 703. If the corresponding stripe is the latest version, the redundant code node regenerates a new redundant code of the stripe.
- in step S905, the processor 119 releases the target area. This is the same as step S908. After that, the processor 119 returns to step S901.
- the processor 119 erases the stripe in the flowchart of FIG. 19, then erases the redundant code, and further executes rebalancing.
- the stripe erasure and the redundant code erasure order may be reversed. If the amount used is less than or equal to the target amount in any step, the subsequent steps are unnecessary.
- FIG. 20 shows the concept of capacity depletion management processing. This figure shows the redundant code cleanup process.
- the node 101A transfers the stripe 781 to be written to the node 101B (T212).
- the nodes 101C and 101D transfer the stripes 782 and 783 to the node 101B. Transfer stripes 781 to 783 are denoted by Z, D, and J, respectively.
- X′′ is an old stripe of the node 101A.
- X′, X′′, and the like represent past data (invalid data), and X represents current data.
- a redundant code generated only from past stripes does not need to be kept and can be deleted.
- the redundant code generated from the stripe set including the current stripe cannot be erased.
- however, a past stripe used to generate a redundant code cannot simply be erased from the drive, because the other stripes could then no longer be restored from that redundant code.
- therefore, before erasing such a past stripe, the node transmits the stripe to the node storing the redundant code of that stripe so that the redundant code can be cleaned up.
- For example, a redundant code X′′ xor C xor H exists in the node 101B.
- the node 101A transmits the past stripe X′′ to the node 101B before erasing the past stripe X′′ (T202).
- the node 101B calculates C xor H as (X′′ xor C xor H) xor X′′ from the past stripe X′′ and the redundant code X′′ xor C xor H. Thereafter, the node 101A erases the past stripe X′′ from the drive 113.
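- A tiny sketch of this exchange (byte values are arbitrary examples; xor_bytes is a hypothetical helper):
```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

x_past, c, h = b"\x11\x11", b"\x22\x22", b"\x44\x44"
code_101b = xor_bytes(xor_bytes(x_past, c), h)     # X'' xor C xor H held by node 101B

# T202: node 101A sends X''; node 101B strips it out of the stored redundant code,
# after which node 101A may safely erase X'' from its drive 113.
code_after_cleanup = xor_bytes(code_101b, x_past)
assert code_after_cleanup == xor_bytes(c, h)
```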
- FIG. 21 shows a flowchart of the evacuation rebuild process. This process is executed by each node 101 that should deal with an abnormality when an abnormality occurs in the distributed storage system.
- the processor 119 of each node 101 can detect the occurrence of an abnormality by referring to the state control table for each protection layer, specifically, the drive state management table 221, the node state management table 222, and the site state management table 223. As described above, information about an abnormality detected by any one of the nodes 101 is shared in the system.
- the node 101 determines whether or not an abnormal resource (drive, node, site, etc.) is blocked (S211). There are three types of resource states. “Normal” state, “Blocked” state and “Warning” state. The node 101 can determine the state of the abnormal resource by referring to the state management table for each protection layer.
- a node for rebuilding data held by the resource is set in advance.
- Each node 101 holds information indicating the resources for which it serves as a spare node and the data to be rebuilt, and the processor 119 rebuilds the necessary data when it detects the blocked state of a resource for which it is responsible.
- when the processor 119 determines that the state of the abnormal resource is blocked (S211: Y), it executes a priority rebuild (S212).
- the rebuild is executed in order, starting from the stripe type with the lowest redundancy in the protection layer.
- the node 101 refers to the static mapping tables 210 to 212 of each protection layer to know the stripe type whose redundancy is reduced due to the loss of the data stored in the error resource and the number of redundancy.
- the nodes 101 notify each other of the processing to be executed and its progress, and each node waits for the completion of priority rebuilds of lower redundancy by the other nodes 101. For example, the node 101 waits until the other nodes 101 complete the rebuilding of a stripe type with redundancy 0 before starting the rebuilding of a stripe type with redundancy 1. Thereby, an increase in the rebuild time of the redundancy-0 stripe type caused by the rebuild of the redundancy-1 stripe type can be avoided.
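- A minimal sketch of this ordering (the dictionary of remaining redundancy per stripe type is a hypothetical stand-in for what the static mapping tables and state management tables would provide):
```python
def rebuild_phases(remaining_redundancy):
    """remaining_redundancy: stripe type number -> redundancy left after the failure.
    Returns groups of stripe types, lowest redundancy first; each group must be
    completed (on all nodes) before the next group is started."""
    phases = {}
    for stripe_type, redundancy in remaining_redundancy.items():
        phases.setdefault(redundancy, []).append(stripe_type)
    return [sorted(phases[r]) for r in sorted(phases)]

# After a node failure, stripe types 2 and 3 lost all redundancy, type 5 kept one level.
assert rebuild_phases({2: 0, 3: 0, 5: 1}) == [[2, 3], [5]]
```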
- the spare node that holds the rebuilt data in its own storage device reads the redundant code and stripe and rebuilds the data.
- another node may rebuild the data and transfer it to the spare node.
- alternatively, the redundant code may be changed without rebuilding the data at the spare node.
- for example, the spare node writes zero data, and the redundant code node generates a new redundant code from the zero data and the stripes of the old redundant code other than the lost stripe.
- the redundant code of the upper protection layer lost due to the blocked resource is regenerated. For example, when a failure occurs in a drive of a certain node, the node 101 regenerates the site redundancy code and the geo redundancy code in the node 101. The node 101 requests the other nodes 101 to transfer the stripes necessary for generating the site redundant code and the geo redundant code. The node 101 can specify the node that holds the stripe from the redundant code mapping table 702 and the reverse mapping table 703.
- site redundancy code and geo redundancy code may be made redundant. Although the overhead due to redundancy (processor processing time, storage capacity, flash media lifetime consumption, etc.) increases, inter-node communication in the event of a drive failure becomes unnecessary.
- the node updates the registered node of the stripe type in the static mapping tables 210 to 212 with the spare node after the priority rebuild is executed.
- Each node 101 checks whether the redundancy of all stripe types of the protection layer has been recovered (S213). The nodes 101 notify each other of the completion of data recovery. When the redundancy of all stripe types in the protection layer is restored, the process proceeds to step S214. When the processing has not been completed in all layers (S214: N), the distributed storage system executes again from step S211 for a higher protection layer.
- the distributed storage system reviews the owner of the virtual volume (S215). Specifically, when a certain node 101 becomes blocked, another predetermined node 101 takes over the virtual volume held by that node 101.
- If it is determined in step S211 that the resource is not blocked (S211: N), that is, if the state is “warning”, the node 101 determines whether data saving is necessary (S216). This necessity is determined based on the degree of risk of data loss in the distributed storage system.
- for example, when the system redundancy is 2 and two or more drives are in the “warning” state, the amount of data to be saved can be reduced, and the saving is efficient, if priority is given to stripe types that contain many stripes in the warning state.
- the system redundancy is the minimum redundancy number in the entire system.
- the node 101 determines that it is necessary to save in step S216 when N or more resources are in the “warning” state.
- N is an integer set in advance based on the system redundancy.
- the node 101 executes priority saving (S217).
- the data save destination is the same as that for rebuilding.
- the save data may be overwritten each time a warning occurs, as in the LRU cache.
- the execution priority is determined based on the redundancy number of the stripe type, but the node 101 may determine the execution priority based on the redundancy number of the stripe and the redundant code.
- the stripe and the redundancy code belong to a plurality of protection layers, and the total redundancy number thereof is the redundancy number of the data. As a result, the system redundancy can be increased as the rebuild / save process proceeds.
- the owner of the virtual volume is distributed in advance in order to continue the processing at another node (site). For example, different nodes in the site and nodes at other sites are set as the owner of the same virtual volume.
- rebuilding and saving processing may be executed across protection layers. For example, when a certain drive fails and the rebuild process is executed, the rebuild process is executed in the node, and at the same time, the drive data is repaired using the redundant code between the nodes. As a result, data can be read from more drives at the same time, and rebuilding can be executed at high speed. Whether to recover across the protection layers may be adjusted according to the network load, allowable load, and the like.
- FIG. 22 shows a flowchart of the data resync process. This process is executed as a resurrection process or a copyback process upon power interruption.
- the copy back process is a process of copying from the spare area data after resource replacement to a new resource after rebuilding. After execution of this process is completed, the resource state becomes normal.
- the processor 119 of the node 101 executing this process determines whether the process to be executed is a restoration process (S221). Specifically, the processor 119 determines whether the own node is a new node or is in a state of being recovered from a failure such as a power failure. When recovering from the failure, the processor 119 determines that the process is a restoration process (S221: Y).
- for example, the processor 119 holds, as shared information in the distributed storage system, a correspondence table of node numbers and identifiers uniquely determined for each node, such as the MAC address of a LAN controller, and refers to the correspondence table to determine whether the own node is registered in the storage system.
- the processor 119 inspects an area that needs to be recovered. As a specific method of checking the area requiring recovery, for the redundant code, the state of the data mapping table 701 of the other node 101 is referred to, and the stripe of the unreflected redundant code is acquired from the other node 101. When the redundant code is rebuilt in the spare area, the processor 119 acquires the redundant code.
- the other node 101 manages the difference written after the failure occurs in a bitmap.
- the processor 119 recovers by copying back only the difference from the spare area. Further, the processor 119 may identify the last update time with reference to its own reverse mapping table 703 and request the other node 101 for valid data written after the last update time. In this way, the processor 119 determines the write data (stripes) to be recovered and the redundant code, and executes the area recovery process (S225).
- the processor 119 executes a copy back process.
- the processor 119 copies back the write data (stripe) and the redundant code rebuilt in the spare area.
- the processor 119 executes this processing for each protection layer hierarchy. For higher layers, only redundant code copying is performed. When the processing is completed for all layers (S227: Y), this flow ends.
- FIG. 23 shows a flowchart of the rearrangement process. This process optimizes the page layout of the distributed storage system. This process is used when adding a new resource to a distributed storage system, when reducing resources, when some pool volume capacity is exhausted, at regular intervals to review changes in load, etc. It is executed by each related node 101.
- the processor 119 calculates an overall threshold of the pool based on the page load distribution table 217 indicating the total I / O load of each virtual page (S231).
- the total I / O load of a virtual page is the total load due to host access in all owner nodes of the virtual page.
- an I / O load caused by host access to a virtual page in each owner node is called a local load.
- the I / O load of the virtual page is represented by, for example, I / O frequency.
- the overall threshold value can be calculated by the same method as the hierarchical allocation threshold value in the description of FIG.
- Each overall threshold indicates the boundary page I / O frequency between hierarchies.
- the capacity and I / O performance of each tier in the pool are determined from the capacity and I / O performance of all pool volumes of each tier. Pool volume hierarchy, capacity, and I / O performance are managed by management information (not shown).
- the processor 119 calculates a self-system threshold in each tier based on the page load distribution table 217 indicating the total I / O load of each virtual page and the page load distribution table 217 indicating the self-system load of the self-node. (S232).
- the own-system threshold indicates the boundary I / O frequency of the virtual page in which the data is arranged in the own node in the virtual page in the hierarchy determined by the overall threshold.
- FIGS. 24A and 24B each show an example of a self-threshold determination method.
- the way of viewing the graphs in FIGS. 24A and 24B is the same as that of the graph 271 in FIG.
- the vertical axis represents the page I / O load indicated by the page I / O frequency
- the horizontal axis represents the virtual pages arranged in order of the local I / O load.
- FIGS. 24A and 24B show all I / O load lines 241 and own system I / O load lines 242 in one layer, respectively.
- the virtual page assigned to each tier is determined by the total I / O load of the virtual page and the overall threshold.
- 24A and 24B show the I / O load distribution of a virtual page assigned to one virtual in the virtual page of which the own node 101 is the owner.
- the virtual page owned by the own node 101 can include a virtual page assigned to another pool volume in addition to a virtual page assigned to the own pool volume.
- 24A and 24B show the own system threshold value 246, respectively.
- a virtual page with a local I / O load higher than the local threshold 246 is assigned to the local pool volume.
- the data of the virtual page currently allocated to the other system pool volume is moved to the own system drive 113.
- the virtual page of the own system I / O load with the own system threshold value 246 or less is allocated to the own system pool volume or the other system pool volume.
- the processor 119 determines that the virtual page currently assigned to the other system pool volume is directly assigned to the other system pool volume.
- the processor 119 determines whether or not the virtual page currently allocated to the local pool volume moves the data to another node 101 (rebalance) according to the free capacity of the local pool volume. Details will be described later.
- 24A and 24B show a capacity limit 243, a drive performance limit 244, and an allowable network limit 245, respectively.
- the processor 119 determines the self-system threshold 246 so that the virtual pages allocated to the self-system pool volume are within these limit values.
- the processor 119 determines the page I / O load at the intersection of the capacity limit 243, the drive performance limit 244, and the minimum value of the allowable network limit 245 and the own system I / O load line 242 as the own system threshold 246. decide.
- the drive performance limit 244 is the minimum value
- the allowable network limit 245 is the minimum value.
- the capacity limit 243 indicates a capacity limit at which the own system can be arranged.
- the capacity limit 243 is determined by a predetermined formula from the own pool volume capacity of the node 101 and the page size.
- the drive performance limit 244 is determined so that the size of all virtual pages allocated to the local pool volume is within the local pool volume capacity.
- the local pool volume capacity is the capacity of the pool volume formed from the local drive 113.
- the drive performance limit 244 is determined by a predetermined formula from the access performance of the own pool volume and the total I / O load line 241.
- the access performance of the pool volume is indicated by, for example, an I / O amount per unit time.
- the drive performance limit 244 is determined so that the total I / O load of the virtual pages allocated to the own pool volume is within the access performance of the own pool volume.
- the hatched area in FIG. 24A indicates the total sum of all I / O loads of virtual pages allocated to the local pool volume.
- the hatched area indicates the total of other system I / O loads, that is, (total I / O load ⁇ own system I / O load).
- the allowable network limit 245 is determined by a predetermined formula from the sum of the other system I / O loads and the local network performance.
- the network performance is indicated by, for example, an I / O amount per unit time.
- the node 101 When assigning a virtual page to the next system pool volume, the node 101 receives another system access of the virtual page via the network. Therefore, the processor 119 determines the network limit 245 so that the other system I / O load is within the own system network performance.
- the drive performance limit 244 As described above, by determining the own system threshold based on the drive performance and the network performance, it is possible to suppress the occurrence of a bottleneck in data transfer in the host I / O. In particular, by using the drive performance limit 244, occurrence of a bottleneck on the network due to data arranged in another node can be effectively suppressed.
- the capacity limit 243 is essential, but the drive performance limit 244 and the allowable network limit 245 may not be used.
- the processor 119 reviews the pool volume configuration in the pool (S233).
- the processor 119 calculates the total capacity and the total I / O load of virtual pages (own system virtual pages) allocated to the own pool volume in each tier in determining the own system threshold value in step S232.
- the processor 119 determines the number of pool volumes 1303C to be mapped to the other-system pool volume 1303B based on these values and the capacity and performance of the local drive 113 in each tier. When the capacity or performance of the local drive 113 is insufficient with respect to the total capacity of the own virtual page or the total I / O load, the processor 119 increases the number of pool volumes 1303C.
- the processor 119 sequentially selects the virtual pages of the virtual volume in which the node 101 is the owner, and repeatedly executes the following steps.
- the processor 119 determines whether it is necessary to move the data of the virtual page from the other pool volume to the own pool volume (S234). Specifically, the processor determines the tier of the virtual volume from the overall threshold, and further determines whether to allocate the virtual page to the local pool volume from the local threshold. As described above, the processor 119 determines that a virtual page having an I / O load larger than the own system threshold is allocated to the own system pool volume. The processor 119 determines that it is not necessary to allocate a virtual page with an I / O load equal to or less than the own system threshold to the own volume.
- the processor 119 transfers the data of the virtual page from the other system pool volume to the local system pool. Determine that you need to move to a volume.
- the processor 119 allocates the data of the virtual page to the own pool. It is determined that there is no need to move to the volume.
- the processor 119 moves the data of the virtual page to the local pool volume (local drive 113) (S235).
- the movement includes a necessary hierarchy movement of the virtual page.
- Step 1 stages the data in its own cache 181.
- Step 2 changes the pool volume area corresponding to the virtual page in the page mapping table 215 to the own pool volume.
- Step 3 destages data to the own pool volume.
- Step 4 frees the cache area.
- Step 5 clears the page area of the other allocated pool volume that was originally allocated (for example, zero data write) to make it free. That is, in this step, the area is connected to the free list 801C of the local area control table 214, and the usage amount and the effective amount of the local area amount table 802 are reduced.
- each node 101 determines a virtual page to be moved to its own pool volume using the self threshold, when the virtual page is owned by a plurality of nodes 101, One node holding the data is determined.
- the node 101 that currently hold the virtual page data determines to allocate the virtual page to the own pool volume
- the data is moved to the other node 101. Therefore, the node 101 that is different from the node 101 that holds the virtual page data and that is finally determined to allocate the current virtual page to the local pool volume holds the virtual page data.
- the processor 119 determines whether or not tier movement is necessary (S236). If it is determined that the virtual page needs to be allocated to the local pool volume, the virtual page is currently allocated to the local pool volume, and the current tier is different from the tier determined from the overall threshold, the processor 119 It is determined that hierarchy movement is necessary.
- the processor 119 executes the hierarchy movement (S237).
- a specific method for moving the hierarchy is realized by a method basically similar to step S235.
- the processor 119 determines whether or not the virtual page needs to be rebalanced (S238).
- the virtual page rebalance moves the data of the virtual page from the current pool volume to the other pool volume.
- the processor 119 determines that the virtual page does not need to be allocated to the local pool volume and the local pool volume to which the virtual page is currently allocated is exhausted, the processor 119 allocates the virtual page to the other pool. It is determined that rebalancing assigned to the volume is necessary.
- the processor 119 refers to the local area amount table 802 of the hierarchy, and determines whether the area of the entry of the virtual page is exhausted (insufficient). For example, when the value obtained by subtracting the effective amount from the target amount is less than the threshold value, it is determined that the area is depleted.
- the processor 119 moves the data of the virtual page from the own pool volume (own node) to the other pool volume (other node) (S239).
- a specific method of rebalancing page movement is realized by a method basically similar to step S235.
- the processor 119 makes an inquiry to the other node 101 or acquires the information of the local area amount table 802 from the other node, and selects the other node 101 having an undepleted area for storing the data of the virtual page. .
- the determination of whether or not a certain node 101 has an area that is not depleted is based on the local area amount table 802 of the hierarchy in the node 101.
- the destination node 101 is selected from, for example, the owner node of the virtual page and the node belonging to the stripe type of the virtual page.
- FIG. 25A shows a flowchart of the configuration change process. This process is executed when the configuration of the distributed storage system is changed. For example, each node executes when a resource is newly added to the distributed storage system.
- the processor 119 changes the static mapping table of the protection layer (S251). For example, when a node is added, each node 101 in the site protection layer increases the number of stripe types and changes the data node and redundant code node of each of the plurality of stripe types. For example, one node 101 determines a new node configuration of the stripe type, and each node 101 updates the static mapping table accordingly.
- the node 101 changes a part of the stripe node corresponding to a part of the stripe type in the current mapping table 211 to a node to be newly added, and selects a plurality of the part nodes to form a new stripe type. include.
- FIG. 25B shows an example of stripe type addition and stripe rearrangement when a node is added.
- Nodes 101A to 101D are existing nodes, and node 101E is an additional node.
- the rectangle in each node indicates the data position (address) of the stripe, and the number in the rectangle indicates the stripe type number.
- Stripe type 1 to stripe type 5 are existing stripe types, and stripe type 6 is an additional stripe type.
- the stripe address of the node 101E does not belong to any stripe type, and the rectangle is empty.
- the stripe type to which some stripe addresses of the nodes 101A, 101C, and 101D that are part of the existing nodes belong has been changed to the stripe type 6.
- a part of the stripe address of the added node 101E is assigned to the stripe types 2, 3, and 4 changed in the existing node.
- the redundant code node is determined most so that the usage amount of the site redundant code Q is as uniform as possible between the added node and the existing node.
- each node 101 recalculates the target amount in the local area amount table 802 (S252).
- the target amount recalculation determines each site stripe type, the redundancy code of each protection layer, and the target capacity of the spare area.
- the target capacity of the redundant code of each protection layer is determined by, for example, the following equation according to the data protection policy (XDYP: maximum number of data X, redundant code number Y) specified by the user (described in FIG. 27).
- Target capacity Total capacity x Max (Y / number of resources, Y / (X + Y)) (However, the number of resources> Y)
- the total capacity is the total capacity of the local area of the node 101
- Max (A, B) is the maximum value of A and B
- the number of resources is the number of resources in the protection layer.
- the number of resources is the number of drives in the node
- the number of resources is the number of nodes in the site.
- the target amount of the spare area is a fixed value, and the target amount of each site stripe type is equal to the remaining capacity of the entire capacity.
- the redundant code rebalance is executed (S253).
- the processor 119 executes page rebalancing and rearrangement (S254).
- page rearrangement is executed for a newly added node or drive.
- a specific method is as described with reference to FIG.
- the target amount may be gradually reduced by a known method such as feedback control. With this configuration, it is possible to control data to be arranged in each node constituting the system while considering the performance of the entire system.
- FIG. 26 shows an example of a command line management I / F.
- an application program 2601, an API 2603, and a storage device 2602 implemented by software are operating.
- the application program 2601 designates the virtual page in the virtual volume to be allocated to the own logical page to the storage device 2602 through the API 2603. For example, the application program 2601 designates a virtual page by a virtual volume number, LBA, and data length. Thereby, it is possible to specify in units of pages.
- the storage apparatus 2602 refers to the page mapping table 215 to determine the logical page node assigned to the designated virtual page.
- the storage device 2602 reads the corresponding data from the other node, and Assign the specified virtual page to the logical page and store the data in the local drive. If no page is allocated to the storage area specified by the API 2603, data is stored in the local drive when a new page is allocated in response to a write request.
- the logical page that is next used by the application program 2601 in the own system can be arranged in advance in the own system, and an optimal page arrangement for the application can be realized.
- the node 101 may accept designation of a virtual page in a virtual volume to be allocated to a local logical page (local storage device) from a user via a user interface. As described above, the virtual page is indicated by the identifier of the virtual volume and the logical address in the virtual volume. Further, the node 101 may accept an instruction to assign a virtual page to a logical page of another node.
- FIG. 27 shows an example of the GUI management I / F of the distributed storage system.
- the GUI 2701 is an I / F for the user to perform various settings of the present distributed storage system.
- the node 101 receives various settings from the user via the input / output device.
- the GUI 2701 accepts resource designation (2702A to C) for each protection layer and enables hierarchical setting. For example, when the site 2702A is designated, the GUI 2701 accepts selection of each node (2702B) of the designated site. When a node is designated, the GUI 2701 receives a setting for a volume (2702C) in the designated node.
- Network performance is information on network bandwidth.
- AUTO is designated, each node 101 automatically determines the network bandwidth from the result of measuring the network bandwidth.
- each node uses the designated network bandwidth in determining the page arrangement.
- the failure threshold indicates the number of errors for determining that the resource is blocked when a communication error to the resource occurs.
- Takeover designates a takeover destination resource when a failure occurs in the resource. Multiple takeover destinations can be selected. If the user does not specify a takeover destination, the storage system may automatically select it.
- Protection policy is a setting that can be specified for each protection layer.
- Data protection policy (XDYP: maximum number of data X, number of redundant codes Y) can be specified for each protection layer.
- XDYP maximum number of data X, number of redundant codes Y
- the storage system uses values close to these in the actual configuration.
- each virtual volume there is synchronous / asynchronous information. For each virtual volume, whether to copy synchronously or asynchronously for each protection layer can be specified. Copy invalidation of each protection layer can be specified.
- a setting is made to invalidate the copy of the geo-protection layer.
- the virtual volume cannot be rebuilt when the site fails, and the rebuild when the site fails is skipped.
- the cache mode can be selected from “write” and “write back”.
- the write mode the write data is stored in the cache and simultaneously reflected in the drive, and then the write completion is notified to the host (application program).
- the write back mode when the write data is stored in the cache, the write completion is notified to the host (application program).
- the node that mounts the virtual volume is set according to the specification of the used node. This setting is reflected in the virtual volume management table 218.
- FIG. 28 shows a hardware configuration example of the distributed storage system.
- the difference between the configuration examples illustrated in FIG. 1 is that the back-end switch 2801 is shared among a plurality of nodes 101.
- the drive 113 shared via the back-end switch 2801 is a local drive that can be accessed by each node 101 sharing the back-end switch 2801 without going through other nodes, and is managed by each node 101. As described above, one drive 113 can be included in the plurality of nodes 101 via the back-end switch 2801.
- the shared range may be defined as a domain, and data protection may be multidimensional within and between domains.
- a domain may be defined in a relatively wide section according to the transfer bandwidth.
- FIG. 29 shows a method for improving the efficiency of transfer between nodes for redundancy.
- the transfer amount increases in proportion to the redundancy with respect to the write amount for the node. For example, in the example of FIG. 1, in order to recover data when a two-node failure occurs, write data is transferred from one node to the cache memory 181 of the two nodes.
- the write data DATA1 (1501A) written in the node 101A is transferred to the cache memories 181 of the nodes 101B and 101D. That is, in this example, network transfer twice as much as the write amount for the node occurs.
- a method for reducing the transfer amount for generating redundant codes in other nodes will be described.
- FIG. 29 shows an example in which data is protected with a 2D2P redundant configuration in four nodes from the nodes 101A to 101D. That is, this system has redundancy that can recover all data in the event of a two-node failure.
- the node 101A divides the received write data having a long data length into two blocks (d1, d2 blocks) 2901, 2902, and further, two intra-node redundancy codes, two parities (p, q parity) 2903, 2904 is generated. Parity is also a data block.
- a data block is a broad term including a data unit.
- the p parity 2901 and the q parity 2902 are primary redundant codes (Class 1 Code).
- the node 101A distributes and copies the write data and parity to the caches (buffers) of the nodes 101B to 101D.
- a combination of one or more data blocks is a data block.
- one write data block (d2 block) 2902 and two parities (p, q parity) 2903 and 2904 are distributed and copied to the three nodes 101B to 101D.
- the necessary redundancy is obtained (data recovery in the event of a two-node failure), so the synchronous write process is completed.
- each of the nodes 101B to 101D divides the received write data into two blocks (d1 and d2 blocks), and further generates p and q parities.
- Each of the nodes 101B to 101D distributes and copies one write data block (d2 data block) and two parities (p, q parity) to the caches (buffers) of the other three nodes.
- Each node stores a data block (write data or parity) from each of the other three nodes in a cache.
- Each of the nodes 101A to 101D asynchronously generates secondary redundant codes (x1, y1 parity) from the data blocks (respectively write data or parity) aggregated from the other three nodes, and sends them to the local drive. Write and free cache.
- the redundant code (x1, y1 parity) is called Class2 Code. Class2 Code corresponds to the redundant code described in FIG.
- the node 101C receives the p parity 2903 from the node 101A, the p parity 2905 from the node 101B, and the q parity 2906 from the node 101D.
- Node 101C generates x1 parity 2908 and y1 parity 2909 from them, writes them to the local drive, and releases the cache.
- each of the nodes 101A to 101D writes the write data (d1 + d2) to the local drive and releases the cache.
- the node 101A writes the d1 block 2901 and the d2 block 2902 to the local drive, and releases the cache.
- write data (d1 + d2) is transferred to the other two nodes in order to enable data recovery in the event of a two-node failure.
- a part of the write data (d2) and the primary redundant code (p, q parity) generated from the write data are transferred to another node. Therefore, it is possible to improve the efficiency of data transfer between nodes while maintaining the required redundancy. All the stripe data (d1 + d2) are stored in the local drive.
- FIG. 29 shows an example of a 2D2P redundant configuration, but the method of this example can be applied to an arbitrary mDnP configuration (m and n are natural numbers).
- the write data (mD) is stored in the local drive, and data in a state where the redundancy is reduced by 1 (redundancy is n ⁇ 1) is transferred to another node.
- the write data (d1 + d2 + d3) is stored in the local drive, and the data blocks d2, d3, p, q are transferred to different nodes, respectively.
- the set of data blocks to be transferred is not limited to this.
- the data blocks d1, d2, d3, and p may be transferred to another node.
- a stripe is dynamically selected from the stripes in one stripe type described in the method of the present embodiment and the first embodiment, a redundant code is generated from the selected stripe, and information about them is stored in metadata (for example, a log structure).
- metadata for example, a log structure.
- the redundancy processing of this example may be executed only when the data length is larger than the threshold (sequential write).
- the threshold for example, the method shown in FIG. 1 is applied.
- the system may add information indicating whether or not the Class2 Code generation method is applied to metadata (for example, the log structured mapping table 213), and switch the data processing according to the information.
- Class 1 code may be written to the local drive as intra-node parity to improve the efficiency of parity generation processing.
- FIG. 30 shows a data restoration method in the method for improving the efficiency of transfer between nodes for redundancy described with reference to FIG.
- FIG. 30 shows an example in which the write data is restored when the nodes 101A and 101B fail.
- the nodes 101C and 101D restore the Class1 code from the Class2 code, respectively, and further restore the user data of the nodes 101A and 101B from the Class1 code.
- the node 101C restores the p parity of the nodes 101A and 101B from the q parity of the node 101D acquired from the node 101D and the local x1, y1 parity.
- the node 101D generates the q parity of the node 101D from the user data (local user data) of the node 101D (if the parity is stored locally, it may be used instead).
- the node 101D restores the q parity of the nodes 101A and 101B from the q parity of the node 101C acquired from the node 101C and the local x1, y1 parity.
- the node 101C generates the q parity of the node 101C from the write data of the node 101C.
- the node 101C restores the user data d1 and d2 of the node 101A from the q parity of the node 101A acquired from the node 101D and the restored p parity of the node 101A.
- the node 101D restores the user data d1 and d2 of the node 101B from the p parity of the node 101B acquired from the node 101C and the restored q parity of the node 101B.
- the write data can be recovered by the two-stage restoration process.
- FIG. 31 shows a hardware configuration example of a distributed storage system.
- the main difference from the configuration example shown in FIG. 3 is that the back-end ports of the computer nodes 101 connected by the network 104 are connected to a plurality of flash drives 3105 via a virtual or physical network 103. It is.
- One or more computer nodes 101 are installed in one site.
- the computer node 101 can communicate with each of the flash drives 3105 via the network 103 without going through other computer nodes, and can be used as a local drive.
- One flash drive 3105 communicates only with one computer node 101.
- the back-end network 103 may interconnect a plurality of computer nodes 101, and the computer nodes 101 connected to the back-end network 103 communicate using the back-end network 103.
- the external network 104 is used for communication between nodes not connected by the back-end network 103.
- a flash drive 3105 as an example of a storage drive includes an I / F 3101 for connecting to the computer node 101, a buffer memory 3102 for temporarily storing data, an internal processor 3103 for controlling the flash drive 3105, and a plurality of data storing data.
- the flash memory 3104 is configured.
- FIG. 32 shows an outline of this example.
- a parity generation process and a data storage process in a log structured format are performed by a flash drive.
- the computer node can perform the write process without being aware of the redundant code generation and the log structuring format, so that the write process time can be shortened.
- the computer node 101 uses the static mapping table (for example, the site static mapping table 211) described in the first embodiment to determine the drive that stores the write data and the redundant code.
- a drive is determined.
- two D drives 3219, P1 drives 3220, and P2 drives 3221 shown in FIG. 32 correspond to one stripe type data drive and redundant code drive.
- the computer node 101 selects an entry in the static mapping table based on the write data access destination (for example, volume identifier and in-volume address) from the host, and sets the multiple drives indicated by the entry to the write data and redundant code. Determine the drive to store.
- the computer node 101 transfers the write data to the computer node 101 at another site.
- the host program is executed in the computer node 101, for example.
- the computer node 101 writes data to one drive (D drive) 3219 for storing the write data and one drive (P1 drive) 3220 for storing the main parity when writing the write data to the drive. (Overwrite).
- the computer node 101 writes to the D drive 3219 using a normal write command (D_WRITE) (3210), and sends the data to the medium (LBA area) 3204 via the data buffer 3202 of the D drive 3219.
- D_WRITE normal write command
- the computer node 101 issues a parity write command (P_WRITE) to the P1 drive 3220, and writes the data together with the storage destination information of the data stored in the D drive 3219 (3211).
- P_WRITE parity write command
- the P1 drive 3220 After writing data to the parity generation buffer 3203, the P1 drive 3220 generates a P1 parity 3207 inside the drive and writes the P1 parity 3207 to the medium 3204.
- the P1 drive 3220 dynamically combines the data blocks written in the parity generation buffer 3203 to generate the P1 parity 3227 as described for the stripe-type redundant code generation in the first embodiment.
- the P1 drive 3220 writes the data storage destination information that generated the P1 parity 3207 to the metadata storage area 3205 as metadata 3209.
- the computer node 101 transfers to the drive (P2 drive) 3221 for storing the sub-parity (P2 parity) which is the second and subsequent parity in addition to the D drive 3219 and the P1 drive 3220.
- the P2 drive 3221 stores data in the parity generation buffer 3203, and dynamically combines the data blocks written in the parity generation buffer 3203 to generate P2 parity 3227.
- the parity data block combinations generated by the P1 drive 3220 and the P2 drive 3221 must be the same.
- the P1 drive 3220 After the P1 drive 3220 generates the P1 parity, the P1 drive 3220 notifies the P2 drive 3221 of the combination of the data blocks that generated the P1 parity via the computer node 101 (P_GET, P_PUSH) (3215). Thereafter, the P2 drive 3221 generates P2 parity with the notified combination of data blocks.
- the computer node 101 reads the latest data 3206 from the D drive 3219 with a normal read command (D_READ) when reading the latest data (3212). Further, the computer node 101 reads the old data 3208 from the D drive 3219 by a read command (OLD_D_READ) for reading the old data 3208 (3213).
- D_READ normal read command
- OLD_D_READ read command
- the computer node 101 monitors the used amount (free capacity) of the drives 3219 to 3221 and performs a garbage collection process as necessary in order to secure an area for writing in the log structured format.
- the capacity management job 3201 of the computer node 101 issues a command (STAT_GET) for acquiring the drive usage (free capacity) after the write is completed or periodically, and monitors and detects the drive usage (drive free capacity) ( 3214).
- STAT_GET the command for acquiring the drive usage (free capacity) after the write is completed or periodically, and monitors and detects the drive usage (drive free capacity) ( 3214).
- the usage amount is larger than the threshold value (the free space is smaller than the threshold value) and the drive free space is exhausted, the computer node 101 performs a garbage collection process.
- a command (SEARCH) for searching for a deletion target parity is issued to the P2 drive 3221 (3218), and the storage destination information of the deletion target parity and the information of the data constituting the deletion target parity are acquired from the drive 3221.
- the data constituting the parity is the latest data from the parity configuration data information, and the latest data is transferred to the P1 drive 3220 to be re-dirty.
- the parity configuration data information indicates information on each data block used in generating the parity.
- a command (INVALID) for deleting the parity and invalidating the old data is issued (3217), and the old data is deleted.
- FIG. 33 shows a table structure managed by the drive 3105 for control of the storage system.
- the flash memory 3104 stores a logical-physical conversion table 3301, a log conversion table 3302, which is information related to a log structure, a parity data conversion table 3307, which is information related to data protection, a data parity conversion table 3308, and an address identifier free queue 3309.
- the logical-physical conversion table 3301 indicates a correspondence relationship between the logical address 3302 provided to the computer node 101 by the drive 3105 and the physical address 3303 of the data stored in the physical storage area.
- the log conversion table 3304 shows a correspondence relationship between an address identifier 3305 for uniquely identifying data and log information 3306 storing logical-physical conversion information.
- the drive 3105 assigns and manages an address identifier using the updated logical-physical conversion information as log information.
- Information of data constituting the parity held by another drive is held by an address identifier.
- the parity-data conversion table 3307 includes the address (LBA, data length) of the physical storage area storing the parity of the own drive and the address (drive number, LBA, data address) of the data of the other drive that generated the parity. (Data length, address identifier).
- the logical address of the data storage destination on a plurality of other drives corresponds to one parity.
- the data of the logical address can include the address of the old data. For this reason, the address identifier is stored at the same time so that the storage destination of the data that generated the parity can be uniquely determined.
- the data-parity conversion table 3308 is an inverse conversion table of the above-described parity data conversion table. A correspondence relationship between an address (LBA, drive number) of a physical storage area storing data of another drive and an address of a physical storage area storing the parity of the own drive is shown.
- the drive 3105 uses the data-parity conversion table 3308 to specify the address of the physical storage area that stores the parity necessary for restoring the data on the other drive. Is identified. Further, the parity-data conversion table 3307 can identify the address of the physical storage area that stores the data of another drive necessary for data recovery.
- the address identifier free queue 3309 is a queue that is used when write processing to be described later is executed in parallel, and stores unused address identifiers.
- the computer node 101 acquires (dequeues) the address identifier from the head of the address identifier free queue 3309 and issues a data write process to the drive 3105 together with the address identifier.
- the drive 3105 stores the log information in the log conversion table 3304 with the specified address identifier. Also, the computer node 101 registers (enqueues) the invalidated address identifier at the end of the address identifier free queue 3309 when the old data is invalidated.
- FIG. 34 shows a communication interface between the computer node 101 and the flash drive 3105.
- the D_WRITE command 3401 uses the drive number, LBA, and data transfer length of the D drive 3219 as arguments and writes to the D drive 3219. Thereafter, an address identifier that is metadata of a log structure is output.
- the address identifier is an invariant identifier associated with the data stored in the drive. Specifically, the address identifier is an identifier unique within the drive, which is assigned to mapping information between logical addresses and physical addresses within the drive.
- the P_WRITE command 3402 uses the drive number, data transfer length, and data storage information of the P1 drive 3220 or P2 drive 3221 storing the parity as arguments and writes to the drive.
- the data storage information includes a drive number of the D drive, an LBA, and an address identifier.
- the D_READ command 3403 reads the latest data from the D drive 3219 using the drive number, LBA, and data transfer length as arguments.
- the OLD_D_READ command 3404 is a command for reading old data from the D drive 3219 using the drive number, address identifier, and data transfer length as arguments.
- the P_GET command 3405 uses the drive number of the D drive as an argument, and outputs unreported parity configuration data information to the P2 drive 3221 with the parity generated by the asynchronous destage processing from the P1 drive 3220 specified by the argument.
- the parity configuration data information includes a drive number, an LBA, and an address identifier of the D drive of each data block used for generating the parity.
- the P_PUSH command 3406 uses the drive number of the P1 drive 3220 and the parity configuration data information as arguments, and notifies the P2 drive 3221 of the parity configuration data information.
- the parity configuration data information includes a drive number, an LBA, and an address identifier of the D drive.
- the STAT_GET command 3407 uses the drive number as an argument and outputs information on the usage of the drive specified by the argument.
- the STAT_GET command 3407 is used for monitoring the drive capacity depletion.
- the INVALID command 3408 invalidates old data that is no longer needed, using the drive number and address identifier of the D drive 3219 as arguments during the garbage collection process.
- the SEARCH command 3409 requests the P1 drive 3220 to search for the deletion target parity during the garbage collection process, and outputs the deletion target parity information and the parity configuration data information of the deletion target parity as a search result.
- the deletion target parity information is composed of the drive number and LBA of the P1 drive 3220
- the deletion target parity configuration data information is composed of the drive number of the D drive, LBA, an address identifier, and information indicating whether the data is the latest data.
- FIG. 35 shows a flowchart of processing in which the computer node 101 reads the latest data from the D drive 3219. This process is executed when a read command is received from the host (S3501).
- the processor 119 of the computer node 101 that has received the read command from the host checks whether data exists in the cache (S3502). If there is data on the cache (S3502: Y), the processor 119 returns the data on the cache to the host (S3510).
- the processor 119 secures the cache (S3503), and then issues a D_READ command to the D drive 3219 (S3504).
- the D drive 3219 When the D drive 3219 receives the D_READ command (S3505), the D drive 3219 refers to the logical-physical conversion table 3301 and acquires the physical address storing the data (S3506). Next, the D drive 3219 reads data from the flash memory (medium) 3104 (S3507), and returns the result to the computer node 101 (S3508). Upon receiving the result of D_READ from the D drive 3219 (S3509), the computer node 101 returns the result to the host (S3510).
- FIG. 36 shows old data read processing.
- the computer node 101 first issues an OLD_D_READ command to the D drive 3219 (S3601).
- the D drive 3219 receives the OLD_D_READ command (S3602)
- the D drive 3219 acquires a physical address storing old data corresponding to the designated address identifier from the log conversion table 3304 (S3603).
- the D drive 3219 reads old data from the flash memory (medium) 3104 (S3604), and returns the result to the computer node 101 (S3605).
- the computer node 101 receives the result of OLD_D_READ from the D drive 3219 (S3606).
- FIG. 37 shows a flowchart of processing in which the computer node 101 writes data to the D drive 3219.
- the write process includes two processes. One process is a synchronous write process until a write result is returned to the host. Another process is an asynchronous write process in which parity is generated from data accumulated in the parity generation buffer in the drive and stored in the medium.
- the synchronous write process will be described. This process is executed when a write command is received from the host.
- the write data is stored in the D drive 3219 and the data is written to the drives (P1 drive 3220 and P2 drive 3221) that generate parity as a set with the address identifier.
- the processor 119 of the computer node 101 When the processor 119 of the computer node 101 receives a write command from the host (S3701), it issues a D_WRITE command to the D drive 3219 (S3702).
- the D_WRITE command includes write data.
- the D drive 3219 Upon receiving the D_WRITE command (S 3703), the D drive 3219 writes the write data to the flash memory (medium) 3104 in a log structure format (S 3704), and the D drive 3219 further stores the metadata (logical / physical conversion table 3301). And the log conversion table 3304) are updated (S3705).
- the D drive 3219 returns the address identifier of the data storage destination to the computer node 101 (S3706).
- the computer node 101 Upon receiving the D_WRITE result from the D drive 3219 (S3707), the computer node 101 issues a P_WRITE command to the P1 drive 3220 as a set together with the data storage information in the D drive 3219 (S3708).
- the P1 drive 3220 Upon receiving the P_WRITE command (S3709), the P1 drive 3220 stores the write data in the parity generation buffer 3203 of the drive (S3710), and returns the result to the computer node 101 (S3711).
- the computer node 101 Upon receiving the result of the P_WRITE command from the P1 drive 3220 (S3712), the computer node 101 issues a P_WRITE command to the P2 drive 3221 together with the data storage information in the D drive 3219 (S3713).
- the P2 drive 3221 Upon receiving the P_WRITE command (S3714), the P2 drive 3221 writes the write data to the parity generation buffer 3203 (S3715), and returns the result to the computer node 101 (S3716). Upon receiving the result of the P_WRITE command from the P2 drive 3221 (S3717), the computer node 101 returns the result to the host (S3718).
- the P1 drive 3220 dynamically selects a data block from the data stored in the parity generation buffer 3203 to generate P1 parity (S3720).
- the metadata parity-data conversion table 3307 and data-parity conversion table 3308) is updated (S3721), and the P1 parity is written to the flash memory (medium) 3104 (S3722).
- the computer node 101 acquires the parity configuration data information of P1 parity from the P1 drive 3220 by the P_GET command (S3723, S3724).
- the computer node 101 notifies the P2 drive 3221 of the parity configuration data information acquired from the P1 drive 3220 using the P_PUSH command (S3725).
- the P2 drive 3221 Upon receiving the P_PUSH command from the computer node 101, the P2 drive 3221 generates P2 parity based on the received parity configuration data information (S3726), and metadata (P2 parity parity-data conversion table 3307 and data-parity conversion table). 3308) is updated (S3727), and the P2 parity is written to the flash memory (medium) 3104 (S3728).
- FIG. 38 shows a processing flow when data write processing to each drive is performed in parallel in the synchronous write processing.
- the computer node 101 issues the write command to the drives 3220 and 3221 that generate the parity without waiting for the response of the D drive 3219 by specifying the address identifier to be used to the drives 3219 to 3221. This is the point.
- the D_WRITE2 command 3805 for designating and writing the address identifier is used instead of the D_WRITE command 3401.
- the D_WRITE2 command 3805 is a command for writing to the D drive 3219 using the drive number, LBA, data transfer length, and address identifier of the D drive 3219 as arguments.
- the computer node 101 When the computer node 101 receives a write command from the host (S3701), it acquires an address identifier from the head of the address identifier free queue 3309 (S3801), and updates the head pointer of the address identifier free queue 3309 (S3802). Next, the computer node 101 issues a D_WRITE2 command to the D drive 3219 using the acquired address identifier as an argument (S3803).
- the computer node 101 designates the acquired address identifier in the data storage information to the P1 drive 3220 and the P2 drive 3221 and issues a P_WRITE command (S3708, S3713).
- the D drive 3219 stores log information in the log conversion table 3304 using the designated address identifier.
- the P1 drive 3220 and the P2 drive 3221 perform the same processing as in FIG. 37, and then return the results to the computer node 101 (S3703 to S3706, S3709 to S3711, S3714 to S3716).
- the computer node 101 stands by until results are received from all the drives 3219 to 3221 (S3804). Upon receiving the results from all the drives 3219 to 3221, the computer node 101 returns the results to the host (S3718).
- the P1 drive 3220 and the P2 drive 3221 generate a parity asynchronously and store it in the flash memory (medium) 3104 in the same manner as the processing described in S3719 to S3728 of FIG. As described above, the response time to the host can be shortened by performing write processing in parallel on each drive.
- FIG. 39 shows a flowchart of the garbage collection process. This process deletes unnecessary data when the amount of data stored in the drive exceeds a preset target capacity (threshold). Thereby, necessary data can be stored in a limited area.
- the types of data to be erased are write data and parity. This process may be executed in synchronization with the host I / O or may be executed asynchronously with the host I / O.
- the computer node 101 checks whether or not the usage amount of the D drive 3219 exceeds the target amount (S3901). Specifically, the computer node 101 determines from the monitoring result of the capacity management job 3201 based on whether the usage amount exceeds the target capacity. Note that the monitoring result of the capacity management job 3201 may be managed by the local area amount table 802.
- the computer node 101 When the drive usage exceeds the target capacity (S3901: Y), the computer node 101 starts garbage collection processing. In the garbage collection process, the computer node 101 issues a SERCH command for searching for the P1 parity to be deleted to the P1 drive 3220 that stores the P1 parity generated from the data of the D drive 3219 that has detected the capacity depletion.
- the P1 drive 3220 When the P1 drive 3220 receives the SERCH command, the P1 drive 3220 refers to the parity-data conversion table 3307 and searches for the P1 parity having the drive number specified by the argument as the parity configuration data information. When the target P1 parity is found, the P1 drive 3220 next refers to the data-parity conversion table 3308 to check whether the search result data is old data.
- the P1 drive 3220 determines that the P1 parity is a deletion target parity.
- the P1 drive 3220 confirms whether all the data constituting the P1 parity is new or old with reference to the data-parity conversion table 3308, and sends the result (deletion target parity and deletion target parity configuration data information) to the computer node 101. ) Is returned (S3902).
- the computer node 101 confirms the new and old information of each data constituting the P1 parity from the returned deletion target parity configuration data information, and determines whether or not the deletion target P1 parity can be immediately deleted (S3903).
- the computer node 101 deletes the P1 parity (S3906), and further stores the data constituting the P1 parity by the INVALID command. It is deleted from the previous D drive 3219 (S3907).
- the computer node 101 registers (enqueues) an INVALID address identifier at the end of the address identifier free queue 3309 upon receiving the result of the INVALID command.
- the computer node 101 also instructs the P2 drive 3221 to delete the P2 parity configured by the same data combination.
- the computer node 101 reads the latest data from the D drive 3219 by the D_READ command, and the P1 drive 3220 and the P2 drive by the P_WRITE command.
- the data storage information and the set are written to 3221 (S3905, S3908).
- the computer node 101 deletes the old P1 parity and the old P2 parity from the P1 drive 3220 and the P2 drive 3221 (S3906, S3909), and deletes the old data from the D drive 3219 by the INVALID command (S3907). The above processing is repeated to delete the parity and data.
- the P1 drive 3220 generates a new P1 parity, updates the metadata, and stores the new P1 parity in the flash memory (medium) 3104 by the asynchronous write processing described with reference to FIG.
- the P2 drive 3221 also generates new P2 parity, updates the metadata, and stores the new P2 parity in the flash memory (medium) 3104 by asynchronous write processing.
- FIG. 40 shows a hardware configuration example of the distributed storage system.
- the difference from the third embodiment is that a parity generation processing unit is installed in the computer node 101.
- the parity generation processing unit can be implemented by hardware or software.
- the storage system includes a plurality of computer nodes 101, and each computer node 101 includes a parity generation processing unit 4006 having a function of generating parity.
- Each computer node 101 is connected to the host computer 4001 via the front-end network 4002, the computer nodes 101 are connected via the internal network 4003, and the computer node 101 and the drive 3105 are connected via the back-end network 4004. Has been. A plurality of computer nodes 101 can access one drive 3105.
- FIG. 41 shows an outline of this example.
- the difference from the third embodiment is that since the computer node performs the parity generation process, the P1 drive 3220 and the P2 drive 3221 do not need to generate the parity asynchronously. For this reason, when the number of parities is 2 or more, it is not necessary to notify the P2 parity parity configuration data information to the P2 drive 3221, the processing load on the computer node 101 and the drives 3219 to 3221 can be reduced, and the write processing time can be shortened. .
- the data received from the host is stored in the parity generation buffer 4101 in the computer node 101, and the parity generation process is requested from the parity generation buffer 4101 to the parity generation processing unit 4006 (4101).
- the parity generation processing unit 4006 generates parity, and writes the generated parity to the drive that stores the parity (4102).
- the difference from the third embodiment in the garbage collection process is that when the latest data is included in the data constituting the parity to be deleted, the latest data is read from the D drive 3219 and then the data is sent to the parity generation processing unit 4006. The point of transfer is to generate new parity.
- the read process is the same as in the third embodiment.
- FIG. 42 shows a communication interface between the computer node 101 and the drives 3219 to 3221.
- the P_WRITE2 command 4201 uses the array of the drive number, LBA, data transfer length, and parity configuration data information as arguments, and writes the parity to the drive.
- the parity configuration data information includes a drive number, an LBA, and an address identifier. That is, the P_WRITE2 command 4201 writes a plurality of data storage destinations as parity configuration data information to the drive together with the parity.
- the write process of this example includes a synchronous write process and an asynchronous write process, as in the third embodiment.
- FIG. 43 shows a flowchart of the synchronous write process in this example.
- the D drive 3219 Upon receiving the D_WRITE command (S4303), the D drive 3219 writes the data to the flash memory (medium) 3104 (S4304), updates the metadata (the logical / physical conversion table 3301 and the log conversion table 3304) (S4305), The result (address identifier) is returned to the computer node 101 (S4306).
- the computer node 101 when the computer node 101 receives the result from the D drive 3219 (S4307), the computer node 101 stores the data in the parity generation buffer 4101 in the computer node 101 (S4308) and returns the result to the host (S4309).
- the synchronous write processing uses the address identifier free queue 3309 and the D_WRITE2 command 3805 as described with reference to FIG. 38 to write data in the D drive 3219 and store data 4101 in the parity generation buffer in parallel. It may be executed.
- FIG. 44 shows a flowchart of asynchronous write processing in this example.
- the computer node 101 performs asynchronous write processing (S4401).
- the main processing unit 4405 of the computer node 101 transfers the parity generation target data from the data accumulated in the parity generation buffer 4101 to the parity generation processing unit 4006 (S4402).
- the main processing unit 4405 is realized by, for example, the processor 119 that operates according to a program.
- the parity generation processing unit 4006 receives data (S4403), it stores the received data in its internal buffer (S4404).
- the parity generation processing unit 4006 generates P1 parity and P2 parity with the received data (S4405), and transfers the generated parity to the main processing unit 4405 (S4406).
- the main processing unit 4405 When the main processing unit 4405 receives the P1 parity and the P2 parity from the parity generation processing unit 4006 (S4407), the main processing unit 4405 writes the data information constituting the parity to the P1 drive 3220 and the P2 drive 3221 by the P_WRITE command (S4408).
- the P1 drive 3220 Upon receiving the P_WRITE command (S4409), the P1 drive 3220 writes parity to the flash memory (medium) 3104 (S4410), and updates the metadata (parity-data conversion table 3307 and data-parity conversion table 3308) (S4411). ), The result is returned to the computer node 101 (S4412).
- the P2 drive 3221 also performs the same processing as the P1 drive 3220, and returns the result to the computer node 101 (S4413 to S4416).
- the main processing unit 4405 receives the results from the P1 drive 3220 and the P2 drive 3221, the main processing unit 4405 ends the processing (S4417).
- FIG. 45 shows a flowchart of the garbage collection process of this embodiment. Steps S4201 to S4204 and S4207 correspond to steps S3901 to S3904 and S3907.
- Steps S4501 and S4206 are executed for the P1 parity storage drive and the P2 parity storage drive.
- When a predetermined number of data blocks accumulate in the parity generation buffer 4101, or when a predetermined time elapses, the computer node 101 performs the asynchronous write processing described with reference to FIG. 44 to generate a new parity, and then writes the parity to the drive.
- the correspondence between the redundancy code and the data addressing is managed in each node.
- The data protection technique may also be implemented by preparing two types of virtual spaces and dynamically changing the correspondence between them. Specifically, the system prepares a first virtual space that is provided to the upper logical unit, and a second virtual space that is statically associated with the redundant codes and the data storage addresses in the physical storage area. By dynamically associating the first virtual space with the second virtual space, the system generates redundant codes from the data sent from a plurality of nodes.
- the system shares a write destination pointer among a plurality of nodes constituting the stripe type.
- The write destination pointer represents the current write position, on the premise that data is appended incrementally in a log format to the second virtual space shared among the plurality of nodes.
- The system controls the correspondence between the first virtual space and the second virtual space so that the data from the plurality of nodes are written, with their write destination pointers aligned, as the set of data blocks from which the redundant code corresponding to a predetermined area of the second virtual space is composed.
- The data protection technology and data placement technology of the present disclosure dynamically generate a redundant code, on the cache, from a set of data units (data blocks) transferred from a plurality of different nodes. That is, data of the same stripe type are arbitrarily selected from the data managed by the code dirty queue 901 (S802 in FIG. 18); as a result, the combination of logical addresses of the data blocks constituting an inter-node redundant code generated by one node is not fixed to a single combination, and two or more combinations are allowed.
- In the present disclosure, by managing each data block in association with its transfer source address, redundant code generation from a dynamic combination of logical addresses is allowed. Furthermore, the number of data blocks used for redundant code generation is not limited to a specific value and can be changed dynamically. With this configuration, a data arrangement that allows high-speed local access can be realized while avoiding network bottlenecks and providing data protection with small overhead. Further, when the drive is an SSD, the write amount can be reduced and a long drive life can be achieved.
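As an illustration of this dynamic selection, the sketch below (with hypothetical structures; the code dirty queue 901 is only described at the level of the figures) picks an arbitrary, variable-sized set of dirty blocks of one stripe type, XORs them into a redundant code, and records the transfer-source addresses together with the code, so that later codes are free to use different combinations:

```python
import random
from functools import reduce

def generate_internode_code(code_dirty_queue, stripe_type, max_blocks=4):
    """Pick an arbitrary combination of same-stripe-type dirty blocks and
    generate one redundant code, keeping the source-address metadata."""
    candidates = [e for e in code_dirty_queue if e["stripe_type"] == stripe_type]
    if not candidates:
        return None, []
    chosen = random.sample(candidates, k=min(max_blocks, len(candidates)))
    # XOR the (equal-sized) data blocks together; the number of blocks is not fixed.
    code = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                  (e["data"] for e in chosen))
    # Remember which (node, logical address) each constituent block came from.
    sources = [(e["node_id"], e["lba"]) for e in chosen]
    return code, sources
```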
- The data protection technology and data placement technology disclosed herein enable both a data placement suitable for local reads and data protection, and can avoid network bottlenecks. Furthermore, since the management information of the data stored in the local storage device can be held within the own system, the information on virtual volumes and pool volumes can be confined to sharing among a small number of nodes, reducing the amount of information to be shared. This provides high scalability independent of the number of nodes, and the high scalability in turn reduces the network cost of system construction.
- The multiple functions in the distributed storage system can be implemented independently of one another. For example, the distributed storage system does not have to implement all of the redundant code generation function, the rearrangement function, and the function of accepting designation of a page arrangement node.
- the node configuration is not limited to the above computer configuration.
- the node protection layer may be omitted.
- Only one of the site protection layer and the geo protection layer may be implemented.
- The present invention is not limited to the above-described embodiments, and various modifications are included.
- The drive 113 shown in FIG. 3 does not need to be in the chassis of the computer node 101; it suffices that each processor recognizes the storage device as a management target of its own system.
- the above-described embodiments have been described in detail for easy understanding of the present invention, and are not necessarily limited to those having all the configurations described. Further, a part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. Further, it is possible to add, delete, and replace other configurations for a part of the configuration of each embodiment.
- each of the above-described configurations, functions, processing units, and the like may be realized by hardware by designing a part or all of them with, for example, an integrated circuit.
- Each of the above-described configurations, functions, and the like may be realized by software by interpreting and executing a program that realizes each function by the processor.
- Information such as programs, tables, and files for realizing each function can be stored in a recording device such as a memory, a hard disk, or an SSD, or a recording medium such as an IC card or an SD card.
- control lines and information lines indicate what is considered necessary for the explanation, and not all control lines and information lines on the product are necessarily shown. In practice, it may be considered that almost all the components are connected to each other.
- The storage system includes one or more computers and a plurality of storage drives. The one or more computers determine a data drive that stores a write data block and a first redundant code drive that stores a redundant code of the write data block, and transmit the write data block to each of the data drive and the first redundant code drive. The data drive stores the write data block in a storage medium. The first redundant code drive generates a redundant code using a plurality of write data blocks received from the one or more computers, and stores the redundant code in a storage medium.
- The first redundant code drive determines, based on the write destination of each received write data block, the stripe type to which that write data block belongs, and generates a redundant code from a plurality of write data blocks included in the same stripe type.
- The first redundant code drive further receives storage location information of the write data blocks from the one or more computers, and manages the relationship between the storage location of the redundant code and the storage locations of the write data blocks.
- The one or more computers further transmit the write data block, together with its storage location information, to a second redundant code drive. The second redundant code drive acquires configuration information indicating the data blocks used by the first redundant code drive in generating its redundant code, and generates a redundant code using the data blocks selected according to the configuration information.
- the storage system includes a computer and a plurality of storage drives,
- the computer determines a data drive for storing a write data block and a redundant code drive for storing a redundant code of the write data block,
- the computer sends the write data block to the data drive,
- the data drive stores the write data block in a storage medium;
- the computer generates a redundant code using the write data,
- the computer transmits the redundant code and configuration information indicating information of a data block used in the generation of the redundant code to the redundant code drive,
- the redundant code drive stores the redundant code in a storage medium;
- the redundant code drive manages the relationship between the storage location of the redundant code and the storage location of the write data block.
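To make the drive-side behavior described in the preceding paragraphs concrete, the following is a minimal sketch (class and field names are assumptions, and the fixed grouping threshold is illustrative) of a redundant code drive that groups received write data blocks by stripe type, generates a code per group, and keeps the code-to-data location mapping:

```python
from collections import defaultdict
from functools import reduce

class RedundantCodeDrive:
    """Sketch only: groups forwarded write data blocks by stripe type and
    records which data storage locations each generated code covers."""

    def __init__(self, stripe_width=3):
        self.stripe_width = stripe_width
        self.pending = defaultdict(list)   # stripe type -> buffered (data, location)
        self.code_to_data = {}             # parity location -> data storage locations
        self.next_parity_lba = 0

    def receive(self, data, storage_location, stripe_type):
        """Buffer a forwarded write data block together with its storage location."""
        self.pending[stripe_type].append((data, storage_location))
        if len(self.pending[stripe_type]) >= self.stripe_width:
            self._generate_code(stripe_type)

    def _generate_code(self, stripe_type):
        blocks = self.pending.pop(stripe_type)
        # XOR the equal-sized data blocks of one stripe type into a redundant code.
        code = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                      (d for d, _ in blocks))
        parity_lba = self.next_parity_lba
        self.next_parity_lba += 1
        # Manage the relationship between the code location and the data locations.
        self.code_to_data[parity_lba] = [loc for _, loc in blocks]
        # (writing `code` to the storage medium is omitted in this sketch)
```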
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Human Computer Interaction (AREA)
- Quality & Reliability (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Probability & Statistics with Applications (AREA)
- Computer Security & Cryptography (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Detection And Correction Of Errors (AREA)
- Hardware Redundancy (AREA)
Abstract
Description
Overview
This embodiment discloses a distributed storage system. The distributed storage system is configured by connecting, via a network, a plurality of computer nodes each including a storage device. The distributed storage system realizes a virtual storage system that provides a storage pool out of the storage devices of the plurality of computer nodes.
In the present disclosure, a storage device encompasses a single storage drive such as one HDD or SSD, a RAID apparatus including a plurality of storage drives, and a plurality of RAID apparatuses. A stripe (stripe data) is a data unit from which a redundant code for data protection is generated. A stripe is sometimes called user data to distinguish it from a redundant code. A stripe is stored in a storage device within a computer node and is also used in the generation of redundant codes in other computer nodes.
(address value / stripe size) mod c
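Read as the mapping from a write address to its stripe type number, this gives, for example (values chosen only for illustration; the text does not fix them), with a 1 MiB stripe size and c = 4 stripe types:

```python
STRIPE_SIZE = 1 << 20   # 1 MiB, illustrative only
C = 4                   # number of stripe types, illustrative only

def stripe_type(address: int) -> int:
    # (address value / stripe size) mod c
    return (address // STRIPE_SIZE) % C

assert stripe_type(0x00530000) == 1   # stripe row 5 -> stripe type 5 mod 4 = 1
```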
The processor 119 selects the dirty data X1 to X3 from the dirty queue 901 and calculates the redundant code P1 or P2 by the following expressions.
P1 = X1 xor X2 xor X3
P2 = (X1 * A1) xor (X2 * A2) xor (X3 * A3)
The redundant codes P1 and P2 are each written to a new area of the local storage device.
The processor 119 extracts, from the intermediate dirty queue 904, the new intermediate dirty data M1 and M2 corresponding to the old redundant data P1' or P2' already written to the local drive 113. The number of pieces of intermediate dirty data is not limited to two. The processor 119 calculates the new intermediate code MP1 or MP2 by the following expressions.
MP1 = M1 xor M2
MP2 = (M1 * A1) xor (M2 * A2)
P1 = P1' xor MP1
P2 = P2' xor MP2
The new redundant codes P1 and P2 are overwritten onto the old areas (P1' and P2').
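These expressions can be read as an XOR parity P1 and a coefficient-weighted parity P2 over a Galois field; the sketch below assumes GF(2^8) with the 0x11d reduction polynomial and arbitrary coefficients A1 to A3, neither of which the text specifies:

```python
from functools import reduce

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) with the 0x11d polynomial (an assumption)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
    return p

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def scale(block: bytes, coeff: int) -> bytes:
    return bytes(gf_mul(x, coeff) for x in block)

def make_codes(blocks, coeffs):
    """P1 = X1 xor X2 xor X3; P2 = (X1*A1) xor (X2*A2) xor (X3*A3)."""
    p1 = reduce(xor, blocks)
    p2 = reduce(xor, (scale(x, a) for x, a in zip(blocks, coeffs)))
    return p1, p2

def update_codes(p1_old, p2_old, intermediates, coeffs):
    """MP1 = M1 xor M2; MP2 = (M1*A1) xor (M2*A2); then P = P' xor MP."""
    mp1 = reduce(xor, intermediates)
    mp2 = reduce(xor, (scale(m, a) for m, a in zip(intermediates, coeffs)))
    return xor(p1_old, mp1), xor(p2_old, mp2)
```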
= total capacity × Max(Y ÷ number of resources, Y ÷ (X + Y))
(where number of resources > Y)
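For illustration only (these numbers are not from the text): with a total capacity of 100 TB, a data-to-code ratio of X : Y = 3 : 1, and 8 resources, Max(1/8, 1/4) = 1/4, so 25 TB would be set aside:

```python
def code_area_size(total_capacity, x, y, resources):
    # = total capacity x Max(Y / number of resources, Y / (X + Y)), with resources > Y
    assert resources > y
    return total_capacity * max(y / resources, y / (x + y))

print(code_area_size(100, x=3, y=1, resources=8))   # -> 25.0 (TB)
```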
FIG. 29 shows a method for making the inter-node transfer for redundancy more efficient. In the method described above, the transfer amount increases in proportion to the redundancy level relative to the amount written to a node. For example, in the example of FIG. 1, in order to recover data upon a two-node failure, write data is transferred from one node to the cache memories 181 of two nodes.
(Log structure (drive) + parity generation (drive) offload method)
FIG. 31 shows a hardware configuration example of the distributed storage system. The main difference from the configuration example shown in FIG. 3 is that the back-end ports of the computer nodes 101 connected by the network 104 are connected to a plurality of flash drives 3105 via a virtual or physical network 103. One or more computer nodes 101 are installed at one site.
FIG. 32 shows an outline of this example. In this example, the parity generation processing and the data storage processing in the log-structured format are performed by the flash drives. As a result, a computer node can perform write processing without being aware of redundant code generation or the log-structured format, so the write processing time can be shortened.
FIG. 33 shows the table structures managed by the drive 3105 for controlling the storage system. The flash memory 3104 stores the logical-physical conversion table 3301 and the log conversion table 3302, which are information on the log structure, the parity-data conversion table 3307 and the data-parity conversion table 3308, which are information on data protection, and the address identifier free queue 3309.
FIG. 34 shows the communication interface between the computer node 101 and the flash drive 3105. The D_WRITE command 3401 takes the drive number of the D drive 3219, the LBA, and the data transfer length as arguments, and writes the data to the D drive 3219. It then outputs an address identifier, which is metadata of the log structure.
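A minimal sketch (the data structures are assumptions; the table names follow FIG. 33) of how a D drive could service D_WRITE in a log-structured way: append the data, update the logical-physical conversion table, record the log entry, and hand back an address identifier:

```python
class DDrive:
    """Sketch of the log-structured D_WRITE path inside the drive."""

    def __init__(self):
        self.log = []                 # append-only medium (flash) stand-in
        self.logical_physical = {}    # logical-physical conversion table 3301
        self.log_table = {}           # log conversion table 3302
        self.next_address_id = 0      # stand-in for the address identifier free queue 3309

    def d_write(self, lba: int, data: bytes) -> int:
        physical = len(self.log)
        self.log.append(data)                   # write to the medium
        self.logical_physical[lba] = physical   # latest data for this LBA
        address_id = self.next_address_id
        self.next_address_id += 1
        self.log_table[address_id] = physical   # lets OLD_D_READ find old data later
        return address_id                       # returned to the computer node
```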
(Reading the latest data)
FIG. 35 shows a flowchart of the process in which the computer node 101 reads the latest data from the D drive 3219. This process is executed when a read command is received from the host (S3501).
FIG. 36 shows the read processing of old data. In the old-data read processing, the computer node 101 first issues an OLD_D_READ command to the D drive 3219 (S3601). Upon receiving the OLD_D_READ command (S3602), the D drive 3219 obtains, from the log conversion table 3304, the physical address storing the old data corresponding to the specified address identifier (S3603).
FIG. 37 shows a flowchart of the process in which the computer node 101 writes data to the D drive 3219. The write processing includes two processes. One is the synchronous write processing, which lasts until the write result is returned to the host. The other is the asynchronous write processing, which generates parity from the data accumulated in the parity generation buffer inside the drive and stores it on the medium.
FIG. 39 shows a flowchart of the garbage collection process. This process erases unnecessary data when the amount of data stored in the drive exceeds a preset target capacity (threshold), so that the necessary data can be kept in a limited area. The types of data to be erased are write data and parity. This process may be executed synchronously or asynchronously with host I/O.
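A sketch of the trigger and the basic compaction idea, reusing the DDrive fields from the sketch above (thresholds are assumptions, and the parity-side cleanup described in the text is omitted here): when the stored amount exceeds the target capacity, only log entries still referenced by the logical-physical conversion table are kept.

```python
def garbage_collect(drive, target_capacity):
    """Erase unnecessary write data from the log once the threshold is exceeded."""
    if len(drive.log) <= target_capacity:
        return
    live_physical = set(drive.logical_physical.values())   # latest data only
    new_log, remap = [], {}
    for old_phys, data in enumerate(drive.log):
        if old_phys in live_physical:                       # keep live data, drop the rest
            remap[old_phys] = len(new_log)
            new_log.append(data)
    drive.log = new_log
    drive.logical_physical = {lba: remap[p] for lba, p in drive.logical_physical.items()}
    drive.log_table = {aid: remap[p] for aid, p in drive.log_table.items() if p in remap}
```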
(Log structure (drive) + parity generation (controller) offload method)
FIG. 40 shows a hardware configuration example of the distributed storage system. The difference from the third embodiment is that a parity generation processing unit is implemented inside the computer node 101. The parity generation processing unit can be implemented in hardware or in software. The storage system is configured to include a plurality of computer nodes 101, and each computer node 101 internally includes a parity generation processing unit 4006 having the function of generating parity.
FIG. 41 shows an outline of this example. The difference from the third embodiment is that, because the computer node performs the parity generation processing, the P1 drive 3220 and the P2 drive 3221 do not need to generate parity asynchronously with I/O. Therefore, when the number of parities is two or more, the parity configuration data information of the P1 parity does not need to be notified to the P2 drive 3221, so the processing load on the computer node 101 and the drives 3219 to 3221 can be reduced and the write processing time can be shortened.
FIG. 42 shows the communication interface between the computer node 101 and the drives 3219 to 3221. In place of the P_WRITE command 3402 of the third embodiment, there is a P_WRITE2 command 4201.
(Synchronous write processing)
As in the third embodiment, the write processing of this example includes synchronous write processing and asynchronous write processing. FIG. 43 shows a flowchart of the synchronous write processing in this example. First, upon receiving a write command from the host (S4301), the computer node 101 issues a D_WRITE command to the D drive 3219 (S4302).
FIG. 44 shows a flowchart of the asynchronous write processing in this example. When, as a result of repeated synchronous write processing, a predetermined number of data blocks have accumulated in the parity generation buffer 4101 or a predetermined time has elapsed, the computer node 101 performs the asynchronous write processing (S4401).
FIG. 45 shows a flowchart of the garbage collection process of this embodiment. Steps S4201 to S4204 and S4207 correspond to steps S3901 to S3904 and S3907.
(1)
The storage system includes one or more computers and a plurality of storage drives,
the one or more computers determine a data drive that stores a write data block and a first redundant code drive that stores a redundant code of the write data block,
the one or more computers transmit the write data block to each of the data drive and the first redundant code drive,
the data drive stores the write data block in a storage medium, and
the first redundant code drive generates a redundant code using a plurality of write data blocks received from the one or more computers and stores the redundant code in a storage medium.
(2)
The first redundant code drive
determines, based on the write destination of each received write data block, the stripe type to which that write data block belongs, and
generates a redundant code from a plurality of write data blocks included in the same stripe type.
(3)
The first redundant code drive
further receives storage location information of the write data block from the one or more computers, and
manages the relationship between the storage location of the redundant code and the storage location of the write data block.
(4)
The one or more computers further transmit the write data block, together with storage location information of the write data block, to a second redundant code drive, and
the second redundant code drive acquires configuration information indicating the data blocks used by the first redundant code drive in generating its redundant code, and generates a redundant code using the data blocks selected according to the configuration information.
(5)
The storage system includes a computer and a plurality of storage drives,
the computer determines a data drive that stores a write data block and a redundant code drive that stores a redundant code of the write data block,
the computer transmits the write data block to the data drive,
the data drive stores the write data block in a storage medium,
the computer generates a redundant code using the write data,
the computer transmits, to the redundant code drive, the redundant code and configuration information indicating the data blocks used in generating the redundant code,
the redundant code drive stores the redundant code in a storage medium, and
the redundant code drive manages the relationship between the storage location of the redundant code and the storage location of the write data block.
Claims (15)
- A distributed storage system comprising:
a plurality of nodes that communicate via a network,
wherein the distributed storage system further includes a plurality of storage devices,
a first node group including at least three or more nodes is defined in advance,
each node of the first node group transmits data to be stored in the storage device it manages to other nodes belonging to the first node group,
a first node of the first node group receives data from two or more other nodes of the first node group,
the first node generates redundant codes, each using a combination of data received from the two or more other nodes,
the first node stores each generated redundant code in a storage device different from the storage devices storing the data from which that redundant code was generated, and
among the redundant codes generated by the first node, the data combinations of at least two redundant codes differ in the combination of logical addresses of their constituent data. - The distributed storage system according to claim 1, wherein
each node of the first node group generates an intra-node redundant code from the data to be stored in the storage device it manages. - The distributed storage system according to claim 1, wherein
the first node
includes a cache,
temporarily stores the data received from the two or more other nodes in the cache,
selects data from the data temporarily stored in the cache, and
generates one redundant code from the selected data. - The distributed storage system according to claim 1, wherein
the first node manages each redundant code in association with logical address information, at each transmission-source node, of the data used to generate that redundant code. - The distributed storage system according to claim 1, wherein
the number of pieces of data from which each redundant code is generated is not fixed. - The distributed storage system according to claim 1, wherein
among the plurality of nodes, a second node group and a third node group each including at least three or more nodes are defined in advance,
a second node belonging to the second node group
generates a second-level redundant code using data received from a node belonging to the first node group and a node belonging to the third node group, and
stores the second-level redundant code in a storage device managed by the second node. - The distributed storage system according to claim 1, wherein
after the area storing the redundant codes reaches a threshold, the first node
selects a first redundant code and a second redundant code stored in the area,
merges the first redundant code and the second redundant code to generate a third redundant code of data transmitted only from mutually different nodes, and
erases the first redundant code and the second redundant code and stores the third redundant code in the area. - The distributed storage system according to claim 1, wherein
a second node that belongs to the first node group and has transmitted first data to the first node transmits the first data to the first node before erasing the first data from the storage device managed by the second node, and
the first node uses the first data to update a first redundant code generated using the first data. - The distributed storage system according to claim 1, wherein
a second node that belongs to the first node group and has transmitted first data to the first node
generates intermediate data using the first data and update data of the first data, and transmits the intermediate data to the first node, and
the first node uses the intermediate data to update the redundant code generated using the first data. - The distributed storage system according to claim 2, wherein
the first node
divides the data to be stored in the storage device it manages and generates an intra-node redundant code,
transmits at least part of the divided data and the intra-node redundant code to other nodes of the first node group, and
a combination of data used for a redundant code generated by the first node includes an intra-node redundant code transmitted from another node. - The distributed storage system according to claim 1, wherein
the nodes that generate redundant codes using the data stored by the plurality of nodes belonging to the first node group are distributed within the first node group. - A data control method executed in one node of a distributed storage system including a plurality of nodes that communicate via a network, wherein
the distributed storage system further includes a plurality of storage devices,
a first node group including at least three or more nodes is defined in advance, and
the data control method includes:
transmitting data to be stored in a storage device managed by the node to other nodes belonging to the first node group;
generating redundant codes, each using a combination of data received from two or more other nodes belonging to the first node group; and
storing each generated redundant code in a storage device different from the storage devices storing the data from which that redundant code was generated,
wherein, among the generated redundant codes, the data combinations from which at least two redundant codes are generated differ in the combination of logical addresses of their constituent data. - The method according to claim 12, further comprising
generating an intra-node redundant code from the data to be stored in the managed storage device. - The method according to claim 12, further comprising:
temporarily storing the data received from the two or more other nodes in a cache;
selecting data from the data temporarily stored in the cache; and
generating one redundant code from the selected data. - The method according to claim 12, further comprising
managing each redundant code in association with logical address information, at each transmission-source node, of the data used to generate that redundant code.
Priority Applications (12)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010024851.5A CN111258500B (zh) | 2014-09-30 | 2015-09-30 | 分布式存储系统以及数据控制方法 |
| GB1614431.3A GB2545290B (en) | 2014-09-30 | 2015-09-30 | Distributed storage system |
| JP2016552148A JP6752149B2 (ja) | 2014-09-30 | 2015-09-30 | 分散型ストレージシステム |
| DE112015000710.5T DE112015000710B4 (de) | 2014-09-30 | 2015-09-30 | Verteiltes Speichersystem |
| US15/120,840 US20160371145A1 (en) | 2014-09-30 | 2015-09-30 | Distributed storage system |
| CN202010024845.XA CN111190552B (zh) | 2014-09-30 | 2015-09-30 | 分布式存储系统 |
| CN201580010284.5A CN106030501B (zh) | 2014-09-30 | 2015-09-30 | 系统、方法以及分布式存储系统 |
| US15/662,510 US10185624B2 (en) | 2014-09-30 | 2017-07-28 | Distributed storage system |
| US16/108,265 US10496479B2 (en) | 2014-09-30 | 2018-08-22 | Distributed storage system |
| US16/680,772 US11036585B2 (en) | 2014-09-30 | 2019-11-12 | Distributed storage system |
| US17/326,504 US11487619B2 (en) | 2014-09-30 | 2021-05-21 | Distributed storage system |
| US17/969,763 US11886294B2 (en) | 2014-09-30 | 2022-10-20 | Distributed storage system |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2014/076105 WO2016051512A1 (ja) | 2014-09-30 | 2014-09-30 | 分散型ストレージシステム |
| JPPCT/JP2014/076105 | 2014-09-30 |
Related Child Applications (3)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US15/120,840 A-371-Of-International US20160371145A1 (en) | 2014-09-30 | 2015-09-30 | Distributed storage system |
| US15/662,510 Continuation US10185624B2 (en) | 2014-09-30 | 2017-07-28 | Distributed storage system |
| US16/108,265 Division US10496479B2 (en) | 2014-09-30 | 2018-08-22 | Distributed storage system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2016052665A1 true WO2016052665A1 (ja) | 2016-04-07 |
Family
ID=55629607
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2014/076105 Ceased WO2016051512A1 (ja) | 2014-09-30 | 2014-09-30 | 分散型ストレージシステム |
| PCT/JP2015/077853 Ceased WO2016052665A1 (ja) | 2014-09-30 | 2015-09-30 | 分散型ストレージシステム |
Family Applications Before (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2014/076105 Ceased WO2016051512A1 (ja) | 2014-09-30 | 2014-09-30 | 分散型ストレージシステム |
Country Status (6)
| Country | Link |
|---|---|
| US (6) | US20160371145A1 (ja) |
| JP (5) | JP6752149B2 (ja) |
| CN (3) | CN111258500B (ja) |
| DE (1) | DE112015000710B4 (ja) |
| GB (1) | GB2545290B (ja) |
| WO (2) | WO2016051512A1 (ja) |
Cited By (20)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2018179073A1 (ja) * | 2017-03-28 | 2018-10-04 | 株式会社日立製作所 | ストレージシステム、コンピュータ読み取り可能な記録媒体、システムの制御方法 |
| KR101907786B1 (ko) | 2017-04-28 | 2018-10-12 | 한국과학기술원 | 캐시 메모리를 이용한 다수의 보조 노드 무선 통신에서 노드간 협력을 통해 사용자의 체감 효과를 향상시키는 분산 저장 방법 및 장치 |
| JP2019191997A (ja) * | 2018-04-26 | 2019-10-31 | 株式会社日立製作所 | ストレージシステム、ストレージシステムの制御方法及び管理ノード |
| JP2020052919A (ja) * | 2018-09-28 | 2020-04-02 | 株式会社日立製作所 | ストレージ装置、管理方法及びプログラム |
| US10628349B2 (en) | 2017-03-24 | 2020-04-21 | Hitachi, Ltd. | I/O control method and I/O control system |
| JP2020091901A (ja) * | 2017-11-30 | 2020-06-11 | 株式会社日立製作所 | 記憶システム及びその制御方法 |
| CN111367868A (zh) * | 2018-12-26 | 2020-07-03 | 北京奇虎科技有限公司 | 一种文件获取请求的处理方法和装置 |
| JP2020107082A (ja) * | 2018-12-27 | 2020-07-09 | 株式会社日立製作所 | ストレージシステム |
| JP2020144459A (ja) * | 2019-03-04 | 2020-09-10 | 株式会社日立製作所 | ストレージシステム、データ管理方法、及びデータ管理プログラム |
| JP2020154626A (ja) * | 2019-03-19 | 2020-09-24 | 株式会社日立製作所 | 分散ストレージシステム、データ管理方法、及びデータ管理プログラム |
| CN111788558A (zh) * | 2018-03-20 | 2020-10-16 | 华睿泰科技有限责任公司 | 用于检测具有故障域的分布式存储设备中的位衰减的系统和方法 |
| JP2021114137A (ja) * | 2020-01-20 | 2021-08-05 | 株式会社日立製作所 | データ分析を支援するシステム及び方法 |
| JP2021521701A (ja) * | 2018-04-13 | 2021-08-26 | アンスティテュ・ミーヌ・テレコム | データを符号化及び復号化する方法及び装置 |
| JP2021522577A (ja) * | 2018-05-09 | 2021-08-30 | インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation | ホスト認識更新書き込みの方法、システム、およびコンピュータ・プログラム |
| US11128535B2 (en) | 2019-03-19 | 2021-09-21 | Hitachi, Ltd. | Computer system and data management method |
| US20220027048A1 (en) * | 2016-01-22 | 2022-01-27 | Netapp, Inc. | Garbage Collection Pacing in a Storage System |
| KR20220149231A (ko) * | 2021-04-30 | 2022-11-08 | 계명대학교 산학협력단 | 데이터 분산 저장 방법 및 상기 방법을 수행하는 컴퓨팅 장치 |
| US11494089B2 (en) | 2019-12-23 | 2022-11-08 | Hitachi, Ltd. | Distributed storage system, data control method and storage medium |
| JP2023106886A (ja) * | 2022-01-21 | 2023-08-02 | 株式会社日立製作所 | ストレージシステム |
| US12068926B2 (en) | 2019-09-24 | 2024-08-20 | Ntt Communications Corporation | Display control system, display method, and program |
Families Citing this family (160)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11188665B2 (en) | 2015-02-27 | 2021-11-30 | Pure Storage, Inc. | Using internal sensors to detect adverse interference and take defensive actions |
| US10437677B2 (en) * | 2015-02-27 | 2019-10-08 | Pure Storage, Inc. | Optimized distributed rebuilding within a dispersed storage network |
| CN106649406B (zh) * | 2015-11-04 | 2020-04-28 | 华为技术有限公司 | 一种自适应存储文件的方法和装置 |
| JP6597231B2 (ja) * | 2015-11-27 | 2019-10-30 | 富士通株式会社 | 演算装置、プログラム、情報処理方法 |
| US10423362B2 (en) * | 2015-11-30 | 2019-09-24 | International Business Machines Corporation | Utilizing multiple dispersal algorithms to encode data for storage in a dispersed storage network |
| US10592128B1 (en) * | 2015-12-30 | 2020-03-17 | EMC IP Holding Company LLC | Abstraction layer |
| US10459638B2 (en) * | 2016-02-22 | 2019-10-29 | Hitachi Ltd. | Computer system that generates group information and redundant code based on user data and changes the group information and redundant code based on transmission data, control method for computer system, and recording medium |
| US10331371B2 (en) * | 2016-02-23 | 2019-06-25 | International Business Machines Corporation | Determining maximum volume size |
| US10204045B2 (en) * | 2016-08-30 | 2019-02-12 | International Business Machines Corporation | Data file handling in a volatile memory |
| CN106569742B (zh) * | 2016-10-20 | 2019-07-23 | 华为技术有限公司 | 存储管理方法及存储设备 |
| CN106528330A (zh) * | 2016-10-29 | 2017-03-22 | 华为技术有限公司 | 一种数据备份方法、节点及数据备份系统 |
| US11301144B2 (en) | 2016-12-28 | 2022-04-12 | Amazon Technologies, Inc. | Data storage system |
| US10514847B2 (en) | 2016-12-28 | 2019-12-24 | Amazon Technologies, Inc. | Data storage system with multiple durability levels |
| US10484015B2 (en) | 2016-12-28 | 2019-11-19 | Amazon Technologies, Inc. | Data storage system with enforced fencing |
| US11010064B2 (en) | 2017-02-15 | 2021-05-18 | Amazon Technologies, Inc. | Data system with flush views |
| US12326841B1 (en) * | 2017-03-15 | 2025-06-10 | Amazon Technologies, Inc. | Background incremental deletion cleanup techniques at storage services |
| US10705911B2 (en) * | 2017-04-24 | 2020-07-07 | Hewlett Packard Enterprise Development Lp | Storing data in a distributed storage system |
| US10732893B2 (en) * | 2017-05-25 | 2020-08-04 | Western Digital Technologies, Inc. | Non-volatile memory over fabric controller with memory bypass |
| JP6807457B2 (ja) * | 2017-06-15 | 2021-01-06 | 株式会社日立製作所 | ストレージシステム及びストレージシステムの制御方法 |
| JP7032631B2 (ja) * | 2017-07-04 | 2022-03-09 | 富士通株式会社 | 送受信システム、送受信システムの制御方法、及び送信装置 |
| US10761743B1 (en) | 2017-07-17 | 2020-09-01 | EMC IP Holding Company LLC | Establishing data reliability groups within a geographically distributed data storage environment |
| US10817388B1 (en) | 2017-07-21 | 2020-10-27 | EMC IP Holding Company LLC | Recovery of tree data in a geographically distributed environment |
| US10684780B1 (en) | 2017-07-27 | 2020-06-16 | EMC IP Holding Company LLC | Time sensitive data convolution and de-convolution |
| CN107577433B (zh) * | 2017-09-13 | 2020-09-22 | 苏州浪潮智能科技有限公司 | 一种存储介质和文件数据的迁移方法、装置及设备 |
| US10481979B2 (en) * | 2017-09-28 | 2019-11-19 | Intel Corporation | Storage system, computing system, and methods thereof |
| US10379948B2 (en) | 2017-10-02 | 2019-08-13 | Western Digital Technologies, Inc. | Redundancy coding stripe based on internal addresses of storage devices |
| US10474528B2 (en) * | 2017-10-02 | 2019-11-12 | Western Digital Technologies, Inc. | Redundancy coding stripe based on coordinated internal address scheme across multiple devices |
| US10880040B1 (en) | 2017-10-23 | 2020-12-29 | EMC IP Holding Company LLC | Scale-out distributed erasure coding |
| CN109725830B (zh) * | 2017-10-27 | 2022-02-08 | 伊姆西Ip控股有限责任公司 | 管理独立磁盘冗余阵列的方法、设备和存储介质 |
| US11061622B2 (en) | 2017-11-13 | 2021-07-13 | Weka.IO Ltd. | Tiering data strategy for a distributed storage system |
| CN107807797B (zh) * | 2017-11-17 | 2021-03-23 | 北京联想超融合科技有限公司 | 数据写入的方法、装置及服务器 |
| TWI656442B (zh) * | 2017-11-30 | 2019-04-11 | 慧榮科技股份有限公司 | 用來於一記憶裝置中進行存取控制之方法以及記憶裝置及其控制器 |
| US10379950B2 (en) * | 2017-11-30 | 2019-08-13 | Western Digital Technologies, Inc. | Updating write-in-place storage devices |
| US11314635B1 (en) * | 2017-12-12 | 2022-04-26 | Amazon Technologies, Inc. | Tracking persistent memory usage |
| CN108459824B (zh) * | 2017-12-19 | 2021-05-04 | 西安华为技术有限公司 | 一种数据修改写方法及装置 |
| US10382554B1 (en) | 2018-01-04 | 2019-08-13 | Emc Corporation | Handling deletes with distributed erasure coding |
| TWI643066B (zh) * | 2018-01-15 | 2018-12-01 | 慧榮科技股份有限公司 | 用來於一記憶裝置中重新使用關於垃圾收集的一目的地區塊之方法、記憶裝置及其控制器以及電子裝置 |
| CN110071949B (zh) * | 2018-01-23 | 2022-05-24 | 阿里巴巴集团控股有限公司 | 一种跨地理区域管理计算应用的系统、方法和装置 |
| US10817374B2 (en) | 2018-04-12 | 2020-10-27 | EMC IP Holding Company LLC | Meta chunks |
| US10579297B2 (en) | 2018-04-27 | 2020-03-03 | EMC IP Holding Company LLC | Scaling-in for geographically diverse storage |
| US10936196B2 (en) * | 2018-06-15 | 2021-03-02 | EMC IP Holding Company LLC | Data convolution for geographically diverse storage |
| US11023130B2 (en) * | 2018-06-15 | 2021-06-01 | EMC IP Holding Company LLC | Deleting data in a geographically diverse storage construct |
| US10929229B2 (en) | 2018-06-21 | 2021-02-23 | International Business Machines Corporation | Decentralized RAID scheme having distributed parity computation and recovery |
| US20190050161A1 (en) * | 2018-06-21 | 2019-02-14 | Intel Corporation | Data storage controller |
| US10719250B2 (en) | 2018-06-29 | 2020-07-21 | EMC IP Holding Company LLC | System and method for combining erasure-coded protection sets |
| US10409511B1 (en) | 2018-06-30 | 2019-09-10 | Western Digital Technologies, Inc. | Multi-device storage system with distributed read/write processing |
| US10725941B2 (en) | 2018-06-30 | 2020-07-28 | Western Digital Technologies, Inc. | Multi-device storage system with hosted services on peer storage devices |
| CN108958660B (zh) * | 2018-07-02 | 2021-03-19 | 深圳市茁壮网络股份有限公司 | 分布式存储系统及其数据处理方法和装置 |
| US11115490B2 (en) * | 2018-07-31 | 2021-09-07 | EMC IP Holding Company LLC | Host based read cache for san supporting NVMEF with E2E validation |
| WO2020028812A1 (en) * | 2018-08-03 | 2020-02-06 | Burlywood, Inc. | Power loss protection and recovery |
| US10592144B2 (en) | 2018-08-03 | 2020-03-17 | Western Digital Technologies, Inc. | Storage system fabric with multichannel compute complex |
| JP6949801B2 (ja) | 2018-10-17 | 2021-10-13 | 株式会社日立製作所 | ストレージシステム及びストレージシステムにおけるデータ配置方法 |
| US10922178B2 (en) * | 2018-10-31 | 2021-02-16 | Hewlett Packard Enterprise Development Lp | Masterless raid for byte-addressable non-volatile memory |
| US11436203B2 (en) | 2018-11-02 | 2022-09-06 | EMC IP Holding Company LLC | Scaling out geographically diverse storage |
| TWI704456B (zh) * | 2018-11-22 | 2020-09-11 | 慧榮科技股份有限公司 | 資料儲存裝置與資料存取方法 |
| US10901635B2 (en) | 2018-12-04 | 2021-01-26 | EMC IP Holding Company LLC | Mapped redundant array of independent nodes for data storage with high performance using logical columns of the nodes with different widths and different positioning patterns |
| US11119683B2 (en) | 2018-12-20 | 2021-09-14 | EMC IP Holding Company LLC | Logical compaction of a degraded chunk in a geographically diverse data storage system |
| US10931777B2 (en) | 2018-12-20 | 2021-02-23 | EMC IP Holding Company LLC | Network efficient geographically diverse data storage system employing degraded chunks |
| DE102018133482A1 (de) * | 2018-12-21 | 2020-06-25 | Interroll Holding Ag | Förderanordnung mit Sensoren mit Busdatenkodierung |
| US10892782B2 (en) | 2018-12-21 | 2021-01-12 | EMC IP Holding Company LLC | Flexible system and method for combining erasure-coded protection sets |
| CN111367461B (zh) * | 2018-12-25 | 2024-02-20 | 兆易创新科技集团股份有限公司 | 一种存储空间管理方法及装置 |
| US11023307B2 (en) * | 2019-01-03 | 2021-06-01 | International Business Machines Corporation | Automatic remediation of distributed storage system node components through visualization |
| US10768840B2 (en) * | 2019-01-04 | 2020-09-08 | EMC IP Holding Company LLC | Updating protection sets in a geographically distributed storage environment |
| US11023331B2 (en) | 2019-01-04 | 2021-06-01 | EMC IP Holding Company LLC | Fast recovery of data in a geographically distributed storage environment |
| US10942827B2 (en) | 2019-01-22 | 2021-03-09 | EMC IP Holding Company LLC | Replication of data in a geographically distributed storage environment |
| US20200241781A1 (en) | 2019-01-29 | 2020-07-30 | Dell Products L.P. | Method and system for inline deduplication using erasure coding |
| US10901641B2 (en) | 2019-01-29 | 2021-01-26 | Dell Products L.P. | Method and system for inline deduplication |
| US10972343B2 (en) | 2019-01-29 | 2021-04-06 | Dell Products L.P. | System and method for device configuration update |
| US10979312B2 (en) | 2019-01-29 | 2021-04-13 | Dell Products L.P. | System and method to assign, monitor, and validate solution infrastructure deployment prerequisites in a customer data center |
| US10942825B2 (en) | 2019-01-29 | 2021-03-09 | EMC IP Holding Company LLC | Mitigating real node failure in a mapped redundant array of independent nodes |
| US10911307B2 (en) | 2019-01-29 | 2021-02-02 | Dell Products L.P. | System and method for out of the box solution-level configuration and diagnostic logging and reporting |
| US11442642B2 (en) * | 2019-01-29 | 2022-09-13 | Dell Products L.P. | Method and system for inline deduplication using erasure coding to minimize read and write operations |
| US10866766B2 (en) | 2019-01-29 | 2020-12-15 | EMC IP Holding Company LLC | Affinity sensitive data convolution for data storage systems |
| US10846003B2 (en) | 2019-01-29 | 2020-11-24 | EMC IP Holding Company LLC | Doubly mapped redundant array of independent nodes for data storage |
| US10936239B2 (en) | 2019-01-29 | 2021-03-02 | EMC IP Holding Company LLC | Cluster contraction of a mapped redundant array of independent nodes |
| US11210232B2 (en) * | 2019-02-08 | 2021-12-28 | Samsung Electronics Co., Ltd. | Processor to detect redundancy of page table walk |
| US11171671B2 (en) * | 2019-02-25 | 2021-11-09 | Samsung Electronics Co., Ltd. | Reducing vulnerability window in key value storage server without sacrificing usable capacity |
| US10944826B2 (en) | 2019-04-03 | 2021-03-09 | EMC IP Holding Company LLC | Selective instantiation of a storage service for a mapped redundant array of independent nodes |
| US11029865B2 (en) | 2019-04-03 | 2021-06-08 | EMC IP Holding Company LLC | Affinity sensitive storage of data corresponding to a mapped redundant array of independent nodes |
| US11119686B2 (en) | 2019-04-30 | 2021-09-14 | EMC IP Holding Company LLC | Preservation of data during scaling of a geographically diverse data storage system |
| US11113146B2 (en) | 2019-04-30 | 2021-09-07 | EMC IP Holding Company LLC | Chunk segment recovery via hierarchical erasure coding in a geographically diverse data storage system |
| US11121727B2 (en) | 2019-04-30 | 2021-09-14 | EMC IP Holding Company LLC | Adaptive data storing for data storage systems employing erasure coding |
| US11748004B2 (en) | 2019-05-03 | 2023-09-05 | EMC IP Holding Company LLC | Data replication using active and passive data storage modes |
| US11169723B2 (en) | 2019-06-28 | 2021-11-09 | Amazon Technologies, Inc. | Data storage system with metadata check-pointing |
| CN110333770B (zh) * | 2019-07-10 | 2023-05-09 | 合肥兆芯电子有限公司 | 存储器管理方法、存储器存储装置及存储器控制电路单元 |
| US11209996B2 (en) | 2019-07-15 | 2021-12-28 | EMC IP Holding Company LLC | Mapped cluster stretching for increasing workload in a data storage system |
| US11023145B2 (en) | 2019-07-30 | 2021-06-01 | EMC IP Holding Company LLC | Hybrid mapped clusters for data storage |
| US11449399B2 (en) | 2019-07-30 | 2022-09-20 | EMC IP Holding Company LLC | Mitigating real node failure of a doubly mapped redundant array of independent nodes |
| US10963345B2 (en) * | 2019-07-31 | 2021-03-30 | Dell Products L.P. | Method and system for a proactive health check and reconstruction of data |
| US11372730B2 (en) | 2019-07-31 | 2022-06-28 | Dell Products L.P. | Method and system for offloading a continuous health-check and reconstruction of data in a non-accelerator pool |
| US11328071B2 (en) | 2019-07-31 | 2022-05-10 | Dell Products L.P. | Method and system for identifying actor of a fraudulent action during legal hold and litigation |
| US11609820B2 (en) | 2019-07-31 | 2023-03-21 | Dell Products L.P. | Method and system for redundant distribution and reconstruction of storage metadata |
| US11775193B2 (en) | 2019-08-01 | 2023-10-03 | Dell Products L.P. | System and method for indirect data classification in a storage system operations |
| US11228322B2 (en) | 2019-09-13 | 2022-01-18 | EMC IP Holding Company LLC | Rebalancing in a geographically diverse storage system employing erasure coding |
| US11449248B2 (en) | 2019-09-26 | 2022-09-20 | EMC IP Holding Company LLC | Mapped redundant array of independent data storage regions |
| US11422741B2 (en) | 2019-09-30 | 2022-08-23 | Dell Products L.P. | Method and system for data placement of a linked node system using replica paths |
| US11604771B2 (en) | 2019-09-30 | 2023-03-14 | Dell Products L.P. | Method and system for data placement in a linked node system |
| US11360949B2 (en) * | 2019-09-30 | 2022-06-14 | Dell Products L.P. | Method and system for efficient updating of data in a linked node system |
| US11481293B2 (en) | 2019-09-30 | 2022-10-25 | Dell Products L.P. | Method and system for replica placement in a linked node system |
| CN112667529B (zh) * | 2019-10-16 | 2024-02-13 | 戴尔产品有限公司 | 网络织物存储系统 |
| US11435910B2 (en) | 2019-10-31 | 2022-09-06 | EMC IP Holding Company LLC | Heterogeneous mapped redundant array of independent nodes for data storage |
| US11288139B2 (en) | 2019-10-31 | 2022-03-29 | EMC IP Holding Company LLC | Two-step recovery employing erasure coding in a geographically diverse data storage system |
| US11119690B2 (en) | 2019-10-31 | 2021-09-14 | EMC IP Holding Company LLC | Consolidation of protection sets in a geographically diverse data storage environment |
| US11435957B2 (en) | 2019-11-27 | 2022-09-06 | EMC IP Holding Company LLC | Selective instantiation of a storage service for a doubly mapped redundant array of independent nodes |
| US11144220B2 (en) | 2019-12-24 | 2021-10-12 | EMC IP Holding Company LLC | Affinity sensitive storage of data corresponding to a doubly mapped redundant array of independent nodes |
| US11349500B2 (en) * | 2020-01-15 | 2022-05-31 | EMC IP Holding Company LLC | Data recovery in a geographically diverse storage system employing erasure coding technology and data convolution technology |
| US10992532B1 (en) * | 2020-01-15 | 2021-04-27 | EMC IP Holding Company LLC | Automated network configuration changes for I/O load redistribution |
| US11347419B2 (en) * | 2020-01-15 | 2022-05-31 | EMC IP Holding Company LLC | Valency-based data convolution for geographically diverse storage |
| US11231860B2 (en) | 2020-01-17 | 2022-01-25 | EMC IP Holding Company LLC | Doubly mapped redundant array of independent nodes for data storage with high performance |
| US11210002B2 (en) * | 2020-01-29 | 2021-12-28 | Samsung Electronics Co., Ltd. | Offloaded device-driven erasure coding |
| US11435948B2 (en) * | 2020-01-31 | 2022-09-06 | EMC IP Holding Company LLC | Methods and systems for user space storage management |
| US11119858B1 (en) | 2020-03-06 | 2021-09-14 | Dell Products L.P. | Method and system for performing a proactive copy operation for a spare persistent storage |
| US11416357B2 (en) | 2020-03-06 | 2022-08-16 | Dell Products L.P. | Method and system for managing a spare fault domain in a multi-fault domain data cluster |
| US11175842B2 (en) | 2020-03-06 | 2021-11-16 | Dell Products L.P. | Method and system for performing data deduplication in a data pipeline |
| US11301327B2 (en) | 2020-03-06 | 2022-04-12 | Dell Products L.P. | Method and system for managing a spare persistent storage device and a spare node in a multi-node data cluster |
| US11281535B2 (en) | 2020-03-06 | 2022-03-22 | Dell Products L.P. | Method and system for performing a checkpoint zone operation for a spare persistent storage |
| CN113835637B (zh) * | 2020-03-19 | 2024-07-16 | 北京奥星贝斯科技有限公司 | 一种数据的写入方法、装置以及设备 |
| US12086445B1 (en) * | 2020-03-23 | 2024-09-10 | Amazon Technologies, Inc. | Maintaining partition-level parity data for improved volume durability |
| US11507308B2 (en) | 2020-03-30 | 2022-11-22 | EMC IP Holding Company LLC | Disk access event control for mapped nodes supported by a real cluster storage system |
| JP7419956B2 (ja) * | 2020-04-28 | 2024-01-23 | オムロン株式会社 | 情報処理装置、情報処理方法およびプログラム |
| US11182096B1 (en) | 2020-05-18 | 2021-11-23 | Amazon Technologies, Inc. | Data storage system with configurable durability |
| US11418326B2 (en) | 2020-05-21 | 2022-08-16 | Dell Products L.P. | Method and system for performing secure data transactions in a data cluster |
| US11288229B2 (en) | 2020-05-29 | 2022-03-29 | EMC IP Holding Company LLC | Verifiable intra-cluster migration for a chunk storage system |
| US11831773B1 (en) | 2020-06-29 | 2023-11-28 | Amazon Technologies, Inc. | Secured database restoration across service regions |
| US11675747B2 (en) * | 2020-07-10 | 2023-06-13 | EMC IP Holding Company, LLC | System and method for early tail-release in a log structure log using multi-line PLB structure supporting multiple partial transactions |
| CN114063890A (zh) * | 2020-08-05 | 2022-02-18 | 宇瞻科技股份有限公司 | 数据备份方法及存储装置 |
| US11681443B1 (en) | 2020-08-28 | 2023-06-20 | Amazon Technologies, Inc. | Durable data storage with snapshot storage space optimization |
| US11693983B2 (en) | 2020-10-28 | 2023-07-04 | EMC IP Holding Company LLC | Data protection via commutative erasure coding in a geographically diverse data storage system |
| TWI739676B (zh) * | 2020-11-25 | 2021-09-11 | 群聯電子股份有限公司 | 記憶體控制方法、記憶體儲存裝置及記憶體控制電路單元 |
| JP7244482B2 (ja) * | 2020-12-16 | 2023-03-22 | 株式会社日立製作所 | ストレージ管理システム、管理方法 |
| US11983440B2 (en) * | 2020-12-30 | 2024-05-14 | Samsung Electronics Co., Ltd. | Storage device including memory controller implementing journaling and operating method of the memory controller |
| CN112711386B (zh) * | 2021-01-18 | 2021-07-16 | 深圳市龙信信息技术有限公司 | 存储装置的存储容量检测方法、设备及可读存储介质 |
| US11847141B2 (en) | 2021-01-19 | 2023-12-19 | EMC IP Holding Company LLC | Mapped redundant array of independent nodes employing mapped reliability groups for data storage |
| US11625174B2 (en) | 2021-01-20 | 2023-04-11 | EMC IP Holding Company LLC | Parity allocation for a virtual redundant array of independent disks |
| US11860802B2 (en) * | 2021-02-22 | 2024-01-02 | Nutanix, Inc. | Instant recovery as an enabler for uninhibited mobility between primary storage and secondary storage |
| US11726684B1 (en) * | 2021-02-26 | 2023-08-15 | Pure Storage, Inc. | Cluster rebalance using user defined rules |
| CN115237667B (zh) * | 2021-04-23 | 2026-01-02 | 伊姆西Ip控股有限责任公司 | 用于存储管理的方法、电子设备和计算机程序产品 |
| EP4323874A1 (en) * | 2021-04-26 | 2024-02-21 | Huawei Technologies Co., Ltd. | Memory system and method for use in the memory system |
| JP7266060B2 (ja) * | 2021-04-30 | 2023-04-27 | 株式会社日立製作所 | ストレージシステムの構成変更方法及びストレージシステム |
| US11354191B1 (en) | 2021-05-28 | 2022-06-07 | EMC IP Holding Company LLC | Erasure coding in a large geographically diverse data storage system |
| US11449234B1 (en) | 2021-05-28 | 2022-09-20 | EMC IP Holding Company LLC | Efficient data access operations via a mapping layer instance for a doubly mapped redundant array of independent nodes |
| JP7253007B2 (ja) * | 2021-05-28 | 2023-04-05 | 株式会社日立製作所 | ストレージシステム |
| US11567883B2 (en) | 2021-06-04 | 2023-01-31 | Western Digital Technologies, Inc. | Connection virtualization for data storage device arrays |
| US11507321B1 (en) * | 2021-06-04 | 2022-11-22 | Western Digital Technologies, Inc. | Managing queue limit overflow for data storage device arrays |
| JP7520773B2 (ja) * | 2021-06-16 | 2024-07-23 | 株式会社日立製作所 | 記憶システムおよびデータ処理方法 |
| JP7412397B2 (ja) | 2021-09-10 | 2024-01-12 | 株式会社日立製作所 | ストレージシステム |
| JP7342089B2 (ja) * | 2021-11-09 | 2023-09-11 | 株式会社日立製作所 | 計算機システム及び計算機システムのスケールアウト方法 |
| US11874821B2 (en) | 2021-12-22 | 2024-01-16 | Ebay Inc. | Block aggregation for shared streams |
| JP7412405B2 (ja) | 2021-12-23 | 2024-01-12 | 株式会社日立製作所 | 情報処理システム、情報処理方法 |
| JP2023100222A (ja) | 2022-01-05 | 2023-07-18 | 株式会社日立製作所 | システム構成管理装置、システム構成管理方法、及びシステム構成管理プログラム |
| JP2023136323A (ja) | 2022-03-16 | 2023-09-29 | 株式会社日立製作所 | ストレージシステム |
| JP2023136081A (ja) | 2022-03-16 | 2023-09-29 | キオクシア株式会社 | メモリシステムおよび制御方法 |
| US12164785B2 (en) | 2022-05-31 | 2024-12-10 | Samsung Electronics Co., Ltd. | Storage device preventing loss of data in situation of lacking power and operating method thereof |
| US12292800B2 (en) * | 2022-07-12 | 2025-05-06 | Dell Products L.P. | Restoring from a temporary backup target in an intelligent destination target selection system for remote backups |
| CN116340214B (zh) * | 2023-02-28 | 2024-01-02 | 中科驭数(北京)科技有限公司 | 缓存数据存读方法、装置、设备和介质 |
| CN116027989B (zh) * | 2023-03-29 | 2023-06-09 | 中诚华隆计算机技术有限公司 | 一种基于存储管理芯片对文件集进行存储的方法及系统 |
| US12222865B2 (en) * | 2023-05-23 | 2025-02-11 | Microsoft Technology Licensing, Llc | Cache for identifiers representing merged access control information |
| CN116703651B (zh) * | 2023-08-08 | 2023-11-14 | 成都秦川物联网科技股份有限公司 | 一种智慧燃气数据中心运行管理方法、物联网系统和介质 |
| US12443494B2 (en) * | 2024-03-12 | 2025-10-14 | Netapp, Inc. | Prevention of residual data writes after non-graceful node failure in a cluster |
| CN119336251B (zh) * | 2024-09-06 | 2025-11-28 | 深圳大普微电子股份有限公司 | 混合映射结构的构建方法及相关装置 |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2004126716A (ja) * | 2002-09-30 | 2004-04-22 | Fujitsu Ltd | 広域分散ストレージシステムを利用したデータ格納方法、その方法をコンピュータに実現させるプログラム、記録媒体、及び広域分散ストレージシステムにおける制御装置 |
| JP2008511064A (ja) * | 2004-08-25 | 2008-04-10 | インターナショナル・ビジネス・マシーンズ・コーポレーション | データ復旧のためのパリティ情報の記憶 |
| JP2012033169A (ja) * | 2010-07-29 | 2012-02-16 | Ntt Docomo Inc | バックアップシステムにおける符号化を使用して、ライブチェックポインティング、同期、及び/又は復旧をサポートするための方法及び装置 |
Family Cites Families (95)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2888401B2 (ja) * | 1992-08-03 | 1999-05-10 | インターナショナル・ビジネス・マシーンズ・コーポレイション | 冗長ディスクドライブアレイに対する同期方法 |
| US5388108A (en) * | 1992-10-23 | 1995-02-07 | Ncr Corporation | Delayed initiation of read-modify-write parity operations in a raid level 5 disk array |
| US5519849A (en) * | 1992-12-07 | 1996-05-21 | Digital Equipment Corporation | Method of reducing the complexity of an I/O request to a RAID-4 or RAID-5 array |
| US5689678A (en) * | 1993-03-11 | 1997-11-18 | Emc Corporation | Distributed storage array system having a plurality of modular control units |
| JP2000039970A (ja) * | 1998-07-24 | 2000-02-08 | Nec Software Kobe Ltd | ディスクアレイシステムの二重障害防止制御方式 |
| US6460122B1 (en) | 1999-03-31 | 2002-10-01 | International Business Machine Corporation | System, apparatus and method for multi-level cache in a multi-processor/multi-controller environment |
| US20010044879A1 (en) * | 2000-02-18 | 2001-11-22 | Moulton Gregory Hagan | System and method for distributed management of data storage |
| US6826711B2 (en) * | 2000-02-18 | 2004-11-30 | Avamar Technologies, Inc. | System and method for data protection with multidimensional parity |
| DE60029747D1 (de) | 2000-03-31 | 2006-09-14 | St Microelectronics Srl | Regelungsverfahren für den Stromfluss des Antriebssystems für bürstenlose Motoren, insbesondere während der Schaltphasen |
| US20020138559A1 (en) * | 2001-01-29 | 2002-09-26 | Ulrich Thomas R. | Dynamically distributed file system |
| US20020169827A1 (en) * | 2001-01-29 | 2002-11-14 | Ulrich Thomas R. | Hot adding file system processors |
| US20020124137A1 (en) * | 2001-01-29 | 2002-09-05 | Ulrich Thomas R. | Enhancing disk array performance via variable parity based load balancing |
| US20020174296A1 (en) * | 2001-01-29 | 2002-11-21 | Ulrich Thomas R. | Disk replacement via hot swapping with variable parity |
| US6862692B2 (en) * | 2001-01-29 | 2005-03-01 | Adaptec, Inc. | Dynamic redistribution of parity groups |
| US7054927B2 (en) * | 2001-01-29 | 2006-05-30 | Adaptec, Inc. | File system metadata describing server directory information |
| US6990667B2 (en) * | 2001-01-29 | 2006-01-24 | Adaptec, Inc. | Server-independent object positioning for load balancing drives and servers |
| US6990547B2 (en) * | 2001-01-29 | 2006-01-24 | Adaptec, Inc. | Replacing file system processors by hot swapping |
| US20020191311A1 (en) * | 2001-01-29 | 2002-12-19 | Ulrich Thomas R. | Dynamically scalable disk array |
| JP2003131818A (ja) * | 2001-10-25 | 2003-05-09 | Hitachi Ltd | クラスタ構成ストレージにおけるクラスタ間raid構成 |
| JP4211282B2 (ja) * | 2002-05-14 | 2009-01-21 | ソニー株式会社 | データ蓄積方法及びデータ蓄積システム、並びに、データ記録制御装置、データ記録指令装置、データ受信装置及び情報処理端末 |
| JP4172259B2 (ja) * | 2002-11-26 | 2008-10-29 | ソニー株式会社 | 情報処理装置および方法、並びにコンピュータ・プログラム |
| JP2004180092A (ja) * | 2002-11-28 | 2004-06-24 | Sony Corp | 情報処理装置および情報処理方法、並びにコンピュータ・プログラム |
| US7159150B2 (en) * | 2002-12-31 | 2007-01-02 | International Business Machines Corporation | Distributed storage system capable of restoring data in case of a storage failure |
| US7546342B2 (en) | 2004-05-14 | 2009-06-09 | Microsoft Corporation | Distributed hosting of web content using partial replication |
| US7296180B1 (en) * | 2004-06-30 | 2007-11-13 | Sun Microsystems, Inc. | Method for recovery of data |
| US7552356B1 (en) * | 2004-06-30 | 2009-06-23 | Sun Microsystems, Inc. | Distributed data storage system for fixed content |
| US7328303B1 (en) * | 2004-06-30 | 2008-02-05 | Sun Microsystems, Inc. | Method and system for remote execution of code on a distributed data storage system |
| US7734643B1 (en) * | 2004-06-30 | 2010-06-08 | Oracle America, Inc. | Method for distributed storage of data |
| US7536693B1 (en) * | 2004-06-30 | 2009-05-19 | Sun Microsystems, Inc. | Method for load spreading of requests in a distributed data storage system |
| JP4519563B2 (ja) * | 2004-08-04 | 2010-08-04 | 株式会社日立製作所 | 記憶システム及びデータ処理システム |
| US7389393B1 (en) * | 2004-10-21 | 2008-06-17 | Symantec Operating Corporation | System and method for write forwarding in a storage environment employing distributed virtualization |
| US7636814B1 (en) * | 2005-04-28 | 2009-12-22 | Symantec Operating Corporation | System and method for asynchronous reads of old data blocks updated through a write-back cache |
| US7904649B2 (en) * | 2005-04-29 | 2011-03-08 | Netapp, Inc. | System and method for restriping data across a plurality of volumes |
| US7962689B1 (en) * | 2005-04-29 | 2011-06-14 | Netapp, Inc. | System and method for performing transactional processing in a striped volume set |
| US7698334B2 (en) * | 2005-04-29 | 2010-04-13 | Netapp, Inc. | System and method for multi-tiered meta-data caching and distribution in a clustered computer environment |
| US7617370B2 (en) * | 2005-04-29 | 2009-11-10 | Netapp, Inc. | Data allocation within a storage system architecture |
| US7743210B1 (en) * | 2005-04-29 | 2010-06-22 | Netapp, Inc. | System and method for implementing atomic cross-stripe write operations in a striped volume set |
| US8001580B1 (en) * | 2005-07-25 | 2011-08-16 | Netapp, Inc. | System and method for revoking soft locks in a distributed storage system environment |
| US7376796B2 (en) * | 2005-11-01 | 2008-05-20 | Network Appliance, Inc. | Lightweight coherency control protocol for clustered storage system |
| US7716180B2 (en) * | 2005-12-29 | 2010-05-11 | Amazon Technologies, Inc. | Distributed storage system with web services client interface |
| US20070214183A1 (en) * | 2006-03-08 | 2007-09-13 | Omneon Video Networks | Methods for dynamic partitioning of a redundant data fabric |
| US8082362B1 (en) * | 2006-04-27 | 2011-12-20 | Netapp, Inc. | System and method for selection of data paths in a clustered storage system |
| JP4902403B2 (ja) * | 2006-10-30 | 2012-03-21 | 株式会社日立製作所 | 情報システム及びデータ転送方法 |
| JP5137413B2 (ja) * | 2006-11-28 | 2013-02-06 | 株式会社日立製作所 | 半導体記憶装置 |
| US8151060B2 (en) | 2006-11-28 | 2012-04-03 | Hitachi, Ltd. | Semiconductor memory system having a snapshot function |
| WO2008126324A1 (ja) * | 2007-03-30 | 2008-10-23 | Fujitsu Limited | アクセス制御プログラム、アクセス制御装置およびアクセス制御方法 |
| WO2008136074A1 (ja) * | 2007-04-20 | 2008-11-13 | Fujitsu Limited | 2重化組み合わせ管理プログラム、2重化組み合わせ管理装置、および2重化組み合わせ管理方法 |
| US7827350B1 (en) * | 2007-04-27 | 2010-11-02 | Netapp, Inc. | Method and system for promoting a snapshot in a distributed file system |
| US8341623B2 (en) * | 2007-05-22 | 2012-12-25 | International Business Machines Corporation | Integrated placement planning for heterogenous storage area network data centers |
| US7975102B1 (en) * | 2007-08-06 | 2011-07-05 | Netapp, Inc. | Technique to avoid cascaded hot spotting |
| JP4386932B2 (ja) * | 2007-08-17 | 2009-12-16 | 富士通株式会社 | ストレージ管理プログラム、ストレージ管理装置およびストレージ管理方法 |
| JP4519179B2 (ja) * | 2008-02-25 | 2010-08-04 | 富士通株式会社 | 論理ボリューム管理プログラム、論理ボリューム管理装置、論理ボリューム管理方法、および分散ストレージシステム |
| US8302192B1 (en) * | 2008-04-30 | 2012-10-30 | Netapp, Inc. | Integrating anti-virus in a clustered storage system |
| CN101291347B (zh) * | 2008-06-06 | 2010-12-22 | 中国科学院计算技术研究所 | 一种网络存储系统 |
| US7992037B2 (en) * | 2008-09-11 | 2011-08-02 | Nec Laboratories America, Inc. | Scalable secondary storage systems and methods |
| US8495417B2 (en) * | 2009-01-09 | 2013-07-23 | Netapp, Inc. | System and method for redundancy-protected aggregates |
| US8572346B2 (en) * | 2009-02-20 | 2013-10-29 | Hitachi, Ltd. | Storage system and method for efficiently utilizing storage capacity within a storage system |
| CN101488104B (zh) * | 2009-02-26 | 2011-05-04 | 北京云快线软件服务有限公司 | 一种实现高效安全存储的系统和方法 |
| KR20120059569A (ko) * | 2009-08-21 | 2012-06-08 | 램버스 인코포레이티드 | 인-시츄 메모리 어닐링 |
| US8554994B2 (en) * | 2009-09-29 | 2013-10-08 | Cleversafe, Inc. | Distributed storage network utilizing memory stripes |
| JP4838878B2 (ja) * | 2009-12-04 | 2011-12-14 | 富士通株式会社 | データ管理プログラム、データ管理装置、およびデータ管理方法 |
| CN102656566B (zh) * | 2009-12-17 | 2015-12-16 | 国际商业机器公司 | 固态存储系统中的数据管理 |
| US8832370B2 (en) * | 2010-02-12 | 2014-09-09 | Netapp, Inc. | Redundant array of independent storage |
| US8103904B2 (en) * | 2010-02-22 | 2012-01-24 | International Business Machines Corporation | Read-other protocol for maintaining parity coherency in a write-back distributed redundancy data storage system |
| US8583866B2 (en) * | 2010-02-22 | 2013-11-12 | International Business Machines Corporation | Full-stripe-write protocol for maintaining parity coherency in a write-back distributed redundancy data storage system |
| US8103903B2 (en) * | 2010-02-22 | 2012-01-24 | International Business Machines Corporation | Read-modify-write protocol for maintaining parity coherency in a write-back distributed redundancy data storage system |
| US8756598B1 (en) * | 2010-03-31 | 2014-06-17 | Netapp, Inc. | Diskless virtual machine cloning by separately cloning a virtual drive and configuration data of a source virtual machine for combination into a cloned virtual machine |
| US9015126B2 (en) * | 2010-05-22 | 2015-04-21 | Nokia Corporation | Method and apparatus for eventually consistent delete in a distributed data store |
| JP5567147B2 (ja) * | 2010-07-16 | 2014-08-06 | 株式会社日立製作所 | 記憶制御装置、又は複数の当該記憶制御装置を有する記憶システム |
| US8775868B2 (en) * | 2010-09-28 | 2014-07-08 | Pure Storage, Inc. | Adaptive RAID for an SSD environment |
| US8463991B2 (en) * | 2010-09-28 | 2013-06-11 | Pure Storage Inc. | Intra-device data protection in a raid array |
| US20120084504A1 (en) * | 2010-10-01 | 2012-04-05 | John Colgrove | Dynamic raid geometries in an ssd environment |
| US9128626B2 (en) | 2010-10-01 | 2015-09-08 | Peter Chacko | Distributed virtual storage cloud architecture and a method thereof |
| US20120084507A1 (en) * | 2010-10-01 | 2012-04-05 | John Colgrove | Multi-level protection with intra-device protection in a raid array based storage system |
| US9430330B1 (en) * | 2010-12-29 | 2016-08-30 | Netapp, Inc. | System and method for managing environment metadata during data backups to a storage system |
| US8645799B2 (en) * | 2010-12-31 | 2014-02-04 | Microsoft Corporation | Storage codes for data recovery |
| CN103299265B (zh) * | 2011-03-25 | 2016-05-18 | 株式会社日立制作所 | 存储系统和存储区域分配方法 |
| US8910156B1 (en) * | 2011-04-29 | 2014-12-09 | Netapp, Inc. | Virtual machine dependency |
| US20120290714A1 (en) * | 2011-05-13 | 2012-11-15 | Nokia Corporation | Method and apparatus for providing heuristic-based cluster management |
| WO2012164714A1 (ja) * | 2011-06-02 | 2012-12-06 | 株式会社日立製作所 | ストレージ管理システム、計算機システム、及びストレージ管理方法 |
| JP2013037611A (ja) * | 2011-08-10 | 2013-02-21 | Nec Corp | 記憶装置、記憶方法およびプログラム |
| US8886910B2 (en) * | 2011-09-12 | 2014-11-11 | Microsoft Corporation | Storage device drivers and cluster participation |
| CN103019614B (zh) * | 2011-09-23 | 2015-11-25 | 阿里巴巴集团控股有限公司 | 分布式存储系统管理装置及方法 |
| US8607122B2 (en) * | 2011-11-01 | 2013-12-10 | Cleversafe, Inc. | Accessing a large data object in a dispersed storage network |
| CN103095767B (zh) * | 2011-11-03 | 2019-04-23 | 中兴通讯股份有限公司 | 分布式缓存系统及基于分布式缓存系统的数据重构方法 |
| JP2013114624A (ja) * | 2011-11-30 | 2013-06-10 | Hitachi Ltd | ストレージシステム及びプール容量縮小の制御方法 |
| US8595586B2 (en) * | 2012-04-25 | 2013-11-26 | Facebook, Inc. | Distributed system for fault-tolerant data storage |
| US9613656B2 (en) * | 2012-09-04 | 2017-04-04 | Seagate Technology Llc | Scalable storage protection |
| CN102882983B (zh) * | 2012-10-22 | 2015-06-10 | 南京云创存储科技有限公司 | 一种云存储系统中提升并发访问性能的数据快速存储方法 |
| US9086991B2 (en) * | 2013-02-19 | 2015-07-21 | Infinidat Ltd. | Solid state drive cache recovery in a clustered storage system |
| IN2014DE00404A (ja) * | 2014-02-13 | 2015-08-14 | Netapp Inc | |
| KR102190670B1 (ko) * | 2014-03-03 | 2020-12-14 | 삼성전자주식회사 | 마이그레이션 관리자를 포함하는 메모리 시스템 |
| US9294558B1 (en) * | 2014-03-31 | 2016-03-22 | Amazon Technologies, Inc. | Connection re-balancing in distributed storage systems |
| CN104008152B (zh) * | 2014-05-21 | 2017-12-01 | 华南理工大学 | 支持海量数据访问的分布式文件系统的架构方法 |
| US9727485B1 (en) * | 2014-11-24 | 2017-08-08 | Pure Storage, Inc. | Metadata rewrite and flatten optimization |
- 2014-09-30 WO PCT/JP2014/076105 patent/WO2016051512A1/ja not_active Ceased
- 2015-09-30 WO PCT/JP2015/077853 patent/WO2016052665A1/ja not_active Ceased
- 2015-09-30 US US15/120,840 patent/US20160371145A1/en not_active Abandoned
- 2015-09-30 JP JP2016552148A patent/JP6752149B2/ja active Active
- 2015-09-30 CN CN202010024851.5A patent/CN111258500B/zh active Active
- 2015-09-30 CN CN201580010284.5A patent/CN106030501B/zh active Active
- 2015-09-30 CN CN202010024845.XA patent/CN111190552B/zh active Active
- 2015-09-30 DE DE112015000710.5T patent/DE112015000710B4/de active Active
- 2015-09-30 GB GB1614431.3A patent/GB2545290B/en active Active
- 2017-07-28 US US15/662,510 patent/US10185624B2/en active Active
- 2018-08-22 US US16/108,265 patent/US10496479B2/en active Active
- 2018-12-27 JP JP2018244619A patent/JP6815378B2/ja active Active
- 2019-11-12 US US16/680,772 patent/US11036585B2/en active Active
- 2020-05-19 JP JP2020087326A patent/JP7077359B2/ja active Active
- 2021-02-15 JP JP2021021418A patent/JP7374939B2/ja active Active
- 2021-05-21 US US17/326,504 patent/US11487619B2/en active Active
- 2022-10-20 US US17/969,763 patent/US11886294B2/en active Active
- 2023-07-04 JP JP2023110180A patent/JP7595711B2/ja active Active
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2004126716A (ja) * | 2002-09-30 | 2004-04-22 | Fujitsu Ltd | 広域分散ストレージシステムを利用したデータ格納方法、その方法をコンピュータに実現させるプログラム、記録媒体、及び広域分散ストレージシステムにおける制御装置 |
| JP2008511064A (ja) * | 2004-08-25 | 2008-04-10 | インターナショナル・ビジネス・マシーンズ・コーポレーション | データ復旧のためのパリティ情報の記憶 |
| JP2012033169A (ja) * | 2010-07-29 | 2012-02-16 | Ntt Docomo Inc | バックアップシステムにおける符号化を使用して、ライブチェックポインティング、同期、及び/又は復旧をサポートするための方法及び装置 |
Cited By (35)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220027048A1 (en) * | 2016-01-22 | 2022-01-27 | Netapp, Inc. | Garbage Collection Pacing in a Storage System |
| US11561696B2 (en) * | 2016-01-22 | 2023-01-24 | Netapp, Inc. | Garbage collection pacing in a storage system |
| US10628349B2 (en) | 2017-03-24 | 2020-04-21 | Hitachi, Ltd. | I/O control method and I/O control system |
| US11150846B2 (en) | 2017-03-28 | 2021-10-19 | Hitachi, Ltd. | Storage system, computer-readable recording medium, and control method for system that reconstructs and distributes data |
| WO2018179073A1 (ja) * | 2017-03-28 | 2018-10-04 | Hitachi, Ltd. | Storage system, computer-readable recording medium, and system control method |
| KR101907786B1 (ko) | 2017-04-28 | 2018-10-12 | Korea Advanced Institute of Science and Technology | Distributed storage method and apparatus for improving the user-perceived quality of multi-helper-node wireless communication through inter-node cooperation using cache memory |
| JP2020091901A (ja) * | 2017-11-30 | 2020-06-11 | Hitachi, Ltd. | Storage system and control method therefor |
| CN111788558A (zh) * | 2018-03-20 | 2020-10-16 | Veritas Technologies LLC | Systems and methods for detecting bit rot in distributed storage devices having failure domains |
| US11914475B2 (en) | 2018-04-13 | 2024-02-27 | Institut Mines Telecom | Erasure recovery in a distributed storage system |
| JP7555821B2 (ja) | 2018-04-13 | 2024-09-25 | Institut Mines Telecom | Method and device for encoding and decoding data |
| JP2021521701A (ja) * | 2018-04-13 | 2021-08-26 | Institut Mines Telecom | Method and device for encoding and decoding data |
| JP2019191997A (ja) * | 2018-04-26 | 2019-10-31 | Hitachi, Ltd. | Storage system, storage system control method, and management node |
| JP2021522577A (ja) * | 2018-05-09 | 2021-08-30 | International Business Machines Corporation | Method, system, and computer program for host-aware update writes |
| JP7189965B2 (ja) | 2018-05-09 | 2022-12-14 | International Business Machines Corporation | Method, system, and computer program for host-aware update writes |
| JP6995728B2 (ja) | 2018-09-28 | 2022-01-17 | Hitachi, Ltd. | Storage device, management method, and program |
| JP2020052919A (ja) * | 2018-09-28 | 2020-04-02 | Hitachi, Ltd. | Storage device, management method, and program |
| CN111367868B (zh) * | 2018-12-26 | 2023-12-29 | 360 Technology Group Co., Ltd. | Method and apparatus for processing a file acquisition request |
| CN111367868A (zh) * | 2018-12-26 | 2020-07-03 | Beijing Qihoo Technology Co., Ltd. | Method and apparatus for processing a file acquisition request |
| JP2023063502A (ja) * | 2018-12-27 | 2023-05-09 | Hitachi, Ltd. | Storage system |
| US11669396B2 (en) | 2018-12-27 | 2023-06-06 | Hitachi, Ltd. | Storage system |
| JP2020107082A (ja) * | 2018-12-27 | 2020-07-09 | Hitachi, Ltd. | Storage system |
| JP7553060B2 (ja) | 2018-12-27 | 2024-09-18 | Hitachi Vantara, Ltd. | Storage system |
| US12066894B2 (en) | 2018-12-27 | 2024-08-20 | Hitachi, Ltd. | Storage system |
| US11169879B2 (en) | 2018-12-27 | 2021-11-09 | Hitachi, Ltd. | Storage system |
| JP2020144459A (ja) * | 2019-03-04 | 2020-09-10 | Hitachi, Ltd. | Storage system, data management method, and data management program |
| JP2020154626A (ja) * | 2019-03-19 | 2020-09-24 | Hitachi, Ltd. | Distributed storage system, data management method, and data management program |
| US11128535B2 (en) | 2019-03-19 | 2021-09-21 | Hitachi, Ltd. | Computer system and data management method |
| US11151045B2 (en) | 2019-03-19 | 2021-10-19 | Hitachi, Ltd. | Distributed storage system, data management method, and data management program |
| US12068926B2 (en) | 2019-09-24 | 2024-08-20 | Ntt Communications Corporation | Display control system, display method, and program |
| US11494089B2 (en) | 2019-12-23 | 2022-11-08 | Hitachi, Ltd. | Distributed storage system, data control method and storage medium |
| JP2021114137A (ja) * | 2020-01-20 | 2021-08-05 | Hitachi, Ltd. | System and method for supporting data analysis |
| KR102645033B1 (ko) | 2021-04-30 | 2024-03-07 | Keimyung University Industry-Academic Cooperation Foundation | Distributed data storage method and computing device for performing the method |
| KR20220149231A (ko) * | 2021-04-30 | 2022-11-08 | Keimyung University Industry-Academic Cooperation Foundation | Distributed data storage method and computing device for performing the method |
| JP2023106886A (ja) * | 2022-01-21 | 2023-08-02 | Hitachi, Ltd. | Storage system |
| JP7443404B2 (ja) | 2022-01-21 | 2024-03-05 | Hitachi, Ltd. | Storage system |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7374939B2 (ja) | | Distributed storage system |
| US12292792B1 (en) | | Erasure coding techniques for flash memory |
| US10459638B2 (en) | | Computer system that generates group information and redundant code based on user data and changes the group information and redundant code based on transmission data, control method for computer system, and recording medium |
| CN107209714B (zh) | | Distributed storage system and control method of distributed storage system |
| JP4818812B2 (ja) | | Flash memory storage system |
| CN110383251B (zh) | | Storage system, computer-readable recording medium, and system control method |
| JP2022512064A (ja) | | Improving available storage space in a system with varying data redundancy schemes |
| WO2013018132A1 (en) | | Computer system with thin-provisioning and data management method thereof for dynamic tiering |
| US20120278560A1 (en) | | Pre-fetching in a storage system that maintains a mapping tree |
| JP6677740B2 (ja) | | Storage system |
| WO2012016209A2 (en) | | Apparatus, system, and method for redundant write caching |
| WO2016056104A1 (ja) | | Storage device and storage control method |
| WO2019026221A1 (ja) | | Storage system and storage control method |
| KR20250138644A (ko) | | SSD virtualization through thin provisioning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15847281; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 15120840; Country of ref document: US |
| | ENP | Entry into the national phase | Ref document number: 201614431; Country of ref document: GB; Kind code of ref document: A; Free format text: PCT FILING DATE = 20150930 |
| | WWE | Wipo information: entry into national phase | Ref document number: 1614431.3; Country of ref document: GB |
| | WWE | Wipo information: entry into national phase | Ref document number: 112015000710; Country of ref document: DE |
| | ENP | Entry into the national phase | Ref document number: 2016552148; Country of ref document: JP; Kind code of ref document: A |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 15847281; Country of ref document: EP; Kind code of ref document: A1 |