HK1198065A - Variable length encoding in a storage system - Google Patents
Description
Background
Technical Field
The present invention relates to computer networks, and more particularly to maintaining mapping structures in a storage system.
Prior Art
As computer memory storage and data bandwidth increase, the amount and complexity of data that an enterprise routinely manages also increase. Large-scale distributed storage systems, such as data centers, typically run many enterprise operations. A data center, which may also be referred to as a server room, is a physical or virtual centralized repository for storing, managing, and disseminating data related to one or more enterprises. The distributed storage system may be coupled to client computers interconnected by one or more networks. Company operations may suffer if the performance of any part of the distributed storage system is poor. Thus, the distributed storage system is expected to maintain high standards for data availability and high-performance functionality.
A distributed storage system includes physical volumes, which may be hard disks, solid state devices, storage devices utilizing another storage technology, or portions of storage devices. Software applications, such as logical volume managers or disk array managers, provide a means to allocate space on mass storage arrays. In addition, the software allows a system administrator to create storage group units that include logical volumes. Storage virtualization provides an abstraction (separation) of logical storage from physical storage to access the logical storage without the end user having to identify the physical storage.
To support storage virtualization, the volume manager performs input/output (I/O) redirection by translating incoming I/O requests from end users utilizing logical addresses into new requests utilizing addresses associated with physical locations in the storage device. Since some storage devices may include additional address translation mechanisms, such as the address translation layer used in solid state storage devices, the aforementioned translation from a logical address to another address may not represent the only or final address translation. The redirection utilizes metadata stored in one or more mapping tables. In addition, information stored in one or more mapping tables may be used for deduplication and for mapping virtual sectors of a particular snapshot level to physical locations. The volume manager may maintain a consistent view of the mapping information corresponding to the virtualized storage. The address space supported may be limited by the storage capacity used to hold the mapping tables.
The technology and mechanisms associated with the selected storage disk determine the method used by the volume manager. For example, a volume manager that provides mappings at the granularity of a hard disk, hard disk partition, or Logical Unit Number (LUN) of an external storage device is limited to redirecting, locating, deduplicating, etc. at the level of large chunks of data. One example of another type of storage disk is a Solid State Disk (SSD). SSDs can emulate HDD interfaces, but SSDs utilize solid state memory rather than the electromagnetic devices seen in HDDs to store persistent data. For example, an SSD may include banks of flash memory. Accordingly, in systems that include SSDs for storage, mapping table allocation algorithms developed for HDDs may not achieve the larger address space supported by one or more mapping tables.
In view of the foregoing, it would be desirable to have a system and method that efficiently performs storage virtualization for data stored among a plurality of solid-state storage devices.
Disclosure of Invention
The present invention contemplates various embodiments of a computer system and method for efficiently managing mapping tables in a data storage system.
In one embodiment, a data storage subsystem coupled to a network receives read and write requests from client computers over the network. The data storage subsystem includes a plurality of data storage locations on a device group, wherein the device group includes a plurality of storage devices. The data storage subsystem further includes at least one mapping table. The mapping table includes a plurality of entries, where each entry includes a tuple with a key. The data storage controller is configured to encode each tuple in the mapping table with a variable length code. Further, the mapping table may be organized into a plurality of time-ordered levels, with each level including one or more mapping table entries. Further, a particular one of a plurality of encodings for a given tuple can be selected based at least in part on: the size of the given tuple before encoding, the size of the given tuple after encoding, and the time required to encode the given tuple. The present invention further contemplates embodiments in which the data storage controller is configured to perform a plurality of different encodings on a given tuple, compare the resulting encodings, and select the particular encoding that is deemed best.
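The selection among candidate encodings described above can be illustrated with a minimal sketch. The patent does not specify the encodings used; the LEB128-style varint and the fixed-width baseline below are hypothetical candidates chosen for illustration, and the selection criterion shown (smallest encoded size) is only one of the criteria the text names.

```python
def varint_encode(value: int) -> bytes:
    """LEB128-style variable-length encoding: 7 payload bits per byte,
    high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def fixed_encode(value: int, width: int = 8) -> bytes:
    """Fixed-width little-endian encoding used as a baseline candidate."""
    return value.to_bytes(width, "little")

def best_encoding(value: int) -> tuple[str, bytes]:
    """Perform each candidate encoding on the value and keep the one
    deemed best -- here, simply the smallest encoded result."""
    candidates = {
        "varint": varint_encode(value),
        "fixed8": fixed_encode(value),
    }
    name = min(candidates, key=lambda n: len(candidates[n]))
    return name, candidates[name]
```

A fuller implementation might also weigh the encoding time and the unencoded size, as the embodiment above contemplates.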
The present invention further contemplates a mapping table that stores entries whose keys correspond to virtual blocks in the system. In various embodiments, an entry corresponding to a given virtual block range stores a hash value computed over the data in the given virtual block range, and the entry stores information that facilitates locating the data comprising the block range.
The foregoing and other embodiments will become apparent upon consideration of the following description and accompanying drawings.
Drawings
FIG. 1 is a generalized block diagram illustrating one embodiment of a network architecture.
FIG. 2 is a generalized block diagram of one embodiment of a mapping table.
FIG. 3A is a generalized block diagram of one embodiment of a primary index used to access a mapping table.
FIG. 3B is a generalized block diagram of another embodiment of a primary index used to access a mapping table.
FIG. 4 is a generalized block diagram of another embodiment of a primary index and mapping table.
FIG. 5A is a generalized flow diagram illustrating one embodiment of a method for performing a read access.
FIG. 5B is a generalized flow diagram illustrating one embodiment of a method for performing a write operation.
FIG. 5C is a generalized flow diagram illustrating one embodiment of a method for encoding and storing tuples.
FIG. 5D illustrates one embodiment of tuple encoding.
FIG. 5E is a generalized flow diagram illustrating one embodiment of a method for selecting an encoding scheme.
FIG. 6 is a generalized block diagram of one embodiment of a multi-node network with a shared mapping table.
FIG. 7 is a generalized block diagram of one embodiment of a secondary index used to access a mapping table.
FIG. 8 is a generalized block diagram of one embodiment of accessing a three-level index of a mapping table.
FIG. 9 illustrates one embodiment of a method of utilizing an overlay table.
FIG. 10 is a generalized block diagram of one embodiment of a flattening operation corresponding to various levels within a mapping table.
FIG. 11 is a generalized block diagram of another embodiment of a flattening operation corresponding to various levels within a mapping table.
FIG. 12 is a generalized flow diagram illustrating one embodiment of a method for flattening various levels within a mapping table.
FIG. 13 is a generalized flow diagram illustrating one embodiment of a method for efficiently handling bulk array tasks within a mapping table.
FIG. 14 is a generalized block diagram illustrating one embodiment of a data layout architecture within a storage device.
While the invention is amenable to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and have been described in detail herein. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be appreciated, however, by one skilled in the art that the invention may be practiced without such specific details. In some instances, well-known circuits, structures, signals, computer program instructions, and techniques have not been shown in detail to avoid obscuring the present invention.
Referring to FIG. 1, a generalized block diagram of one embodiment of a network architecture 100 is shown. As described further below, one embodiment of the network architecture 100 includes client computer systems 110a-110b interconnected to each other and to data storage arrays 120a-120b via a network 180. The network 180 may be coupled to a second network 190 through a switch 140. Client computer system 110c is coupled to client computer systems 110a-110b and data storage arrays 120a-120b via network 190. In addition, network 190 may be coupled to the internet 160 or other external network through switch 150.
It should be noted that in alternative embodiments, the number and types of client computers and servers, switches, networks, data storage arrays, and data storage devices are not limited to those shown in FIG. 1. At different times, one or more clients may operate offline. Further, during operation, as users connect, disconnect, and reconnect to the network architecture 100, the various client computer connection types may change. Further, while the present specification generally discusses network attached storage, the systems and methods described herein may also be applied to direct attached storage systems and may include a host operating system configured to perform one or more aspects of the described methods. Many such alternatives are conceivable. A further description is briefly provided regarding each of the components shown in FIG. 1. An overview of some of the features provided by the data storage arrays 120a-120b will first be described.
In the network architecture 100, each data storage array 120a-120b may be used to share data between different servers and computers, such as client computer systems 110a-110b. In addition, the data storage arrays 120a-120b may be used for disk mirroring, backup and restore, data archiving and retrieval of archived data, and data migration from one storage device to another. In an alternative embodiment, one or more client computer systems 110a-110c may be associated with each other through a fast Local Area Network (LAN) to form a cluster. Such clients may share storage resources, such as cluster shared volumes residing within one of the data storage arrays 120a-120b.
Each of the data storage arrays 120a-120b includes a storage subsystem 170 for data storage. The storage subsystem 170 may include a plurality of storage devices 176a-176m. These storage devices 176a-176m may provide data storage services for client computer systems 110a-110c. Each storage device 176a-176m performs data storage using specific techniques and mechanisms. The type of techniques and mechanisms used within each of the storage devices 176a-176m may be used, at least in part, to determine algorithms for controlling and scheduling read and write operations to and from each of the storage devices 176a-176m. For example, the algorithm may locate a particular physical location corresponding to the operation. Further, the algorithm may perform input/output (I/O) redirection corresponding to the operation, deduplication in the storage subsystem 170, and support one or more mapping tables used for address redirection and deduplication.
The logic used in the preceding algorithms may be included in one or more of the following: a basic Operating System (OS) 132, a volume manager 134 within a storage subsystem controller 174, control logic within each storage device 176a-176m, and so on. Further, the logic, algorithms, and control mechanisms described herein may comprise hardware and/or software.
Each of the storage devices 176a-176m may be configured to receive read and write requests and include a plurality of data storage locations, each of which may be addressed as a row and a column in an array. In one embodiment, the data storage locations within the storage devices 176a-176m may be arranged as logically redundant storage containers or RAID arrays (redundant arrays of inexpensive/independent disks).
In some embodiments, each storage device 176a-176m may utilize different technologies for data storage than a conventional Hard Disk Drive (HDD). For example, one or more of the storage devices 176a-176m may include or be further coupled to storage comprised of solid state memory to store persistent data. In other embodiments, one or more of the storage devices 176a-176m may include or be further coupled to storage utilizing other technologies, such as spin-torque transfer technology, Magnetoresistive Random Access Memory (MRAM) technology, shingled disks, memristors, phase change memory, or other storage technologies. These different storage technologies may result in different I/O characteristics between storage devices.
In one embodiment, the included solid state memory includes Solid State Drive (SSD) technology. Technological and mechanical differences between HDD technology and SSD technology may result in differences in input/output (I/O) characteristics of the storage devices 176a-176m. Solid State Disks (SSDs) may also be referred to as solid state drives. Without moving parts or mechanical delays, SSDs may have lower read access times and latencies than HDDs. But the write performance of SSDs is typically slower than the read performance and may be significantly affected by the availability of free programmable blocks within the SSD.
Storage array efficiency may be improved by creating a storage virtualization layer between user storage and physical locations within storage devices 176a-176m. In one embodiment, the virtual layer of the volume manager is placed in the device-driver stack of the Operating System (OS), rather than within the storage device or in the network. Many storage arrays perform storage virtualization at a coarse-grained level to allow virtual-to-physical mapping table entries to be stored in memory. Such storage arrays, however, cannot integrate, for example, data compression, deduplication, and copy-on-modify operations. Many file systems support fine-grained virtual-to-physical mapping tables, but they do not support large storage arrays, such as device groups 173a-173m. Instead, a volume manager or disk array manager is used to support the device groups 173a-173m.
In one embodiment, one or more mapping tables may be stored in storage devices 176a-176m instead of memory such as RAM 172, memory medium 130, or cache within processor 122. Storage devices 176a-176m may be SSDs that utilize flash memory. The lower read access times and latencies of SSDs may allow a small number of dependent read operations to occur while servicing storage access requests from client computers. The dependent read operations may be used to access one or more indexes, one or more mapping tables, and user data during servicing of a storage access request.
In one example, I/O redirection may be performed by a dependent read operation. In another example, online deduplication may be performed by dependent read operations. In another example, rather than accessing storage locations that hold user data, bulk array tasks such as large copy, move, or zeroing operations may be performed entirely within the mapping table. Such direct mapping manipulation may substantially reduce I/O traffic and data movement within the storage devices 176a-176m. The combined time corresponding to servicing a storage access request and performing dependent read operations from the SSD may be less than the time to service a storage access request from a rotating HDD.
In addition, the information within the mapping table may be compressed. The particular compression algorithm may be selected to allow individual components, such as a key within a record of the plurality of records, to be identified. A search for a given key from among multiple compressed records may occur. In various embodiments, a search for a given key may be performed without decompressing each tuple by comparing the compressed representation of the key to the compressed information stored in the relevant fields of the tuples. If a match is found, only the matching record may be decompressed. Compressing the tuples within the records of the mapping table may also allow fine-grained mapping. This fine-grained mapping may allow direct mapping operations as an alternative to typical bulk array tasks. Further details regarding efficient storage virtualization will be discussed later.
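The search-without-decompression idea above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: it assumes the key field is compressed with a deterministic compressor (here Python's `zlib` stands in), so that compressing the probe key once yields bytes that can be compared directly against each record's stored compressed key, and only a matching record's payload is ever decompressed.

```python
import zlib

def search_compressed(records, probe_key: bytes):
    """records: list of (compressed_key, compressed_payload) pairs.
    Compress the probe key once, compare byte-for-byte against each
    stored compressed key, and decompress only the matching payload."""
    probe = zlib.compress(probe_key)
    for compressed_key, compressed_payload in records:
        if compressed_key == probe:        # no per-record decompression
            return zlib.decompress(compressed_payload)
    return None

# Building a toy compressed table of records.
records = [
    (zlib.compress(b"key-12"), zlib.compress(b"payload-12")),
    (zlib.compress(b"key-34"), zlib.compress(b"payload-34")),
]
```

The design choice here mirrors the text: equality on compressed bytes is valid only when equal keys always compress to identical bytes, which is why the compressor (and its settings) must be fixed per table.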
Also, as shown, network architecture 100 includes client computer systems 110a-110c interconnected to each other and to data storage arrays 120a-120b via networks 180 and 190. Networks 180 and 190 may include a variety of technologies including wireless connections, direct Local Area Network (LAN) connections, Wide Area Network (WAN) connections such as the internet, routers, storage area networks, ethernet, and so forth. Networks 180 and 190 may include one or more LANs that may be wireless. Networks 180 and 190 may also include Remote Direct Memory Access (RDMA) hardware and/or software, transmission control protocol/internet protocol (TCP/IP) hardware and/or software, routers, repeaters, switches, grids, and so forth. Protocols such as fibre channel, fibre channel over ethernet (FCoE), iSCSI, etc. may be used in networks 180 and 190. Switch 140 may utilize a protocol associated with both networks 180 and 190. The network 190 may interface with a collection of communication protocols used for the internet 160, such as the Transmission Control Protocol (TCP) and the Internet Protocol (IP), or TCP/IP. Switch 150 may be a TCP/IP switch.
Client computer systems 110a-110c represent a number of stationary or mobile computers, such as desktop Personal Computers (PCs), servers, server farms, workstations, laptop computers, handheld computers, Personal Digital Assistants (PDAs), smart phones, and so forth. Typically, client computer systems 110a-110c include one or more processors, each including one or more processor cores. Each processor core includes circuitry for executing instructions according to a predefined general-purpose instruction set. For example, the x86 instruction set architecture may be selected; alternatively, any other general-purpose instruction set architecture may be selected. The processor cores may access a cache memory subsystem for data and computer program instructions. The cache subsystem may be coupled to a memory architecture including Random Access Memory (RAM) and a storage device.
Each processor core and memory fabric within the client computer system may be connected to a network interface. In addition to hardware components, each client computer system 110a-110c may also include a basic Operating System (OS) stored within the memory architecture. The base OS may represent any of a variety of operating systems, such as DART. Thus, the base OS may be adapted to provide various services to end users and to provide a software framework adapted to support the execution of various programs. In addition, each client computer system 110a-110c may include a hypervisor used to support Virtual Machines (VMs). As known to those skilled in the art, virtualization may be used in desktop computers and servers to decouple software, such as an OS, completely or partially from the hardware of the system. Virtualization may provide the illusion to an end user that multiple OSs are running on the same machine, where each OS has its own resources and access to logical storage entities (e.g., LUNs) built on each data storage array 120a-120b.
Each of the data storage arrays 120a-120b may be used to share data between different servers, such as client computer systems 110a-110c. Each of the data storage arrays 120a-120b includes a storage subsystem 170 for data storage. The storage subsystem 170 may include a plurality of storage devices 176a-176m. Each of these storage devices 176a-176m may be an SSD. Controller 174 may include logic for handling received read/write requests. Random Access Memory (RAM) 172 may be used to batch operations such as received write requests. In various embodiments, when batching write operations (or other operations), a non-volatile storage (e.g., NVRAM) may be used.
The base OS 132, the volume manager 134 (or the disk array manager 134), any OS drivers (not shown), and other software stored in the memory medium 130 may provide functions for accessing files and management of those functions. The base OS 132 may be, for example, a NetApp Data storage operating system or another storage operating system. The base OS 132 and the OS drivers may comprise program instructions stored on the memory medium 130 and executable by the processor 122 to perform one or more memory access operations in the storage subsystem 170 corresponding to the received request. The system shown in FIG. 1 may generally include one or more file servers and/or block servers.
Each of the data storage arrays 120a-120b may be connected to the network 180 using a network interface 124. Similar to client computer systems 110a-110c, in one embodiment, the functionality of network interface 124 may be included in a network adapter card. The functionality of the network interface 124 may be implemented using both hardware and software. Both Random Access Memory (RAM) and Read Only Memory (ROM) may be included in the network card implementation of the network interface 124. One or more Application Specific Integrated Circuits (ASICs) may be used to provide the functionality of the network interface 124.
In addition to the foregoing, each storage controller 174 within a data storage array 120a-120b may support storage array functions such as snapshot, replication, and high availability. In addition, each storage controller 174 may support a virtual machine environment that includes multiple volumes, each volume including multiple snapshots. In one example, storage controller 174 may support hundreds of thousands of volumes, where each volume includes thousands of snapshots. In one embodiment, a volume may be mapped in fixed-size sectors, such as 4 Kilobyte (KB) pages within storage devices 176a-176m. In another embodiment, a volume may be mapped in variable sized sectors, for example, for write requests. A given volume may be identified using a volume ID, a snapshot ID, and a sector number.
An address translation table may include a plurality of entries, each of which holds a virtual-to-physical mapping corresponding to a respective data component. The mapping table may be used to map logical read/write requests from each client computer system 110a-110c to physical locations in the storage devices 176a-176m. The "physical" pointer value may be read from the mapping table during a lookup operation corresponding to a received read/write request. The physical pointer value may then be used to locate a physical location within the storage devices 176a-176m. It should be noted that the physical pointer value may be used to access another mapping table within a given one of the storage devices 176a-176m. Thus, there may be one or more levels of indirection between the physical pointer value and the target storage location.
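The two levels of indirection just described can be sketched minimally. This is an illustrative sketch, not the patent's design: the dict-based tables and the `(device_id, device_logical)` pointer shape are hypothetical stand-ins for the volume-level mapping table and a device-internal translation layer.

```python
def resolve(virtual_addr, volume_map, device_maps):
    """First level: the volume manager's table maps a virtual address
    to a 'physical' pointer naming a device and a device-logical address.
    Second level: the device's own translation table may remap it again."""
    ptr = volume_map.get(virtual_addr)
    if ptr is None:
        return None                      # unmapped virtual address
    device_id, device_logical = ptr
    return device_maps[device_id].get(device_logical)

# Toy tables: virtual address 0x10 points into device "ssd0", whose
# internal layer places logical block 7 at physical offset 4096.
volume_map = {0x10: ("ssd0", 7)}
device_maps = {"ssd0": {7: 4096}}
```

Because the second table belongs to the storage device, the volume manager's "physical" pointer remains valid even when the device moves data and updates its internal mapping.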
In another embodiment, the mapping table may include information used to deduplicate data (deduplication-table-related information). The information stored in the deduplication table may include a mapping between one or more hash values computed for a given data component and a physical pointer to a physical location in one of the storage devices 176a-176m that holds the given data component. In addition, the length of the given data component and state information corresponding to the respective entry may also be stored in the deduplication table.
Referring now to FIG. 2, a generalized block diagram of one embodiment of a mapping table is shown. As previously described, one or more mapping tables may be used for I/O redirection or translation, deduplication of duplicate copies of user data, volume snapshot mapping, and so forth. The mapping table may be stored in storage devices 176a-176m. The diagram shown in FIG. 2 represents a logical representation of one embodiment of the organization and storage of mapping tables. Each of the illustrated levels may include mapping table entries corresponding to a different time period. For example, level "1" may include information that is earlier than the information stored in level "2". Similarly, level "2" may include information that is earlier than the information stored in level "3". The information stored in the records, pages, and levels illustrated in FIG. 2 may be stored within storage devices 176a-176m in a random access manner. Further, copies of some or all of a given mapping table entry may be stored in RAM 172, in a buffer within controller 174, in memory medium 130, and in one or more caches within or coupled to processor 122. In various embodiments, each level may include a respective index corresponding to the mappings that are part of that level (as depicted later in FIG. 4). Such an index may include an identification of the mapping table entries and their storage locations within the level (e.g., an identification of a page). In other embodiments, the index associated with mapping table entries may be one or more separate entities that are not logically part of the levels themselves.
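The time-ordered levels can be illustrated with a minimal sketch. This is a hypothetical model, not the patent's structure: each level is a plain dict, only the newest level accepts writes, and a lookup scans from newest to oldest so the most recent mapping for a key wins, matching the "level 1 is earlier than level 2" ordering described above (with the newest level first in the list).

```python
class LeveledMappingTable:
    """Levels are time-ordered: levels[0] is the newest. A lookup scans
    newest to oldest so the most recent mapping for a key wins."""
    def __init__(self):
        self.levels = []                  # list of dicts, newest first

    def new_level(self):
        """Freeze the current newest level and start an empty one."""
        self.levels.insert(0, {})

    def put(self, key, value):
        if not self.levels:
            self.new_level()
        self.levels[0][key] = value       # only the newest level is mutable

    def get(self, key):
        for level in self.levels:         # newest-to-oldest scan
            if key in level:
                return level[key]
        return None
```

Older levels become read-only, which is what makes operations such as the flattening of FIGS. 10-12 possible without coordinating with in-flight writes.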
Generally, each mapping table includes a set of rows and columns. A single record may be stored as a row in the mapping table. A record may also be referred to as an entry. In one embodiment, a record stores at least one tuple comprising a key. Each tuple may (or may not) also include data fields including data such as pointers used to identify or locate data components stored in the storage subsystem 170. It should be noted that in various embodiments, the storage subsystem may include a storage device (e.g., SSD) with an internal mapping mechanism. In such an embodiment, the pointer in the tuple may not itself be an actual physical address. Instead, the pointer may be a logical address that is mapped by the storage device to a physical location within the device. Over time, this internal mapping between logical addresses and physical addresses may change. In other embodiments, the records in the mapping table may contain only key fields without additional associated data fields. Attributes associated with data components corresponding to a given record may be stored in columns or fields in the table. Status information such as a validity indicator, data age, data size, etc. may be stored in fields such as field 0 through field N shown in FIG. 2. In various embodiments, each column stores information corresponding to a given type. In some embodiments, compression techniques may be utilized for selected fields, such that a zero-bit-length field may result for a compressed representation in some cases. It should be noted that while the following discussion generally describes the mapping table as mapping addresses (e.g., virtual to physical addresses), in other embodiments the key may be a file identifier or an object identifier when the tables, methods, and mechanisms described herein are applied. For example, in such embodiments, the system may be used as a file server or an object server.
In various embodiments, the methods and mechanisms described herein may be used to service blocks, objects, and files, and dynamically move space therebetween. Many such embodiments may be envisaged.
A key is an entity in the mapping table that may distinguish one row of data from another. Each row may also be referred to as an entry or record. A key may be a single column or may consist of a set of columns used to identify a record. In some embodiments, a key may correspond to a range of values rather than a single value. For example, a key corresponding to a range may be represented as a start and end of the range, or as a start and a length, or in other ways. In addition, a range corresponding to a key may overlap with ranges or individual values corresponding to other keys. In one example, the address translation mapping table may utilize keys that include a volume Identifier (ID), an address such as a logical address or a virtual address, a snapshot ID, a sector number, and so forth. A given read/write storage access request received may identify a particular volume, sector, and length. A sector may be a logical block of data stored in a volume. Sectors may have different sizes on different volumes. The address translation mapping table may map volumes in sector-size units.
A volume Identifier (ID) may be used to access a volume table that conveys volume IDs and corresponding current snapshot IDs. This information along with the received sector number may be used to access an address translation mapping table. Thus, in such embodiments, the key value used to access the address translation mapping table is a combination of the volume ID, snapshot ID, and received sector number. In one embodiment, the records within the address translation mapping table are sorted by volume ID, followed by sector number, followed by snapshot ID. This ordering may group together different versions of data components in different snapshots. Thus, during a lookup corresponding to a storage access read request, the corresponding data component may be found with fewer read operations directed to the storage devices 176a-176m.
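The composite key and sort order just described can be sketched as follows. This is an illustrative sketch under assumed names: the `(volume_id, sector, snapshot_id)` tuple mirrors the stated sort order (volume ID, then sector number, then snapshot ID), and the physical-pointer strings are placeholders.

```python
from bisect import bisect_left

def make_key(volume_id: int, sector: int, snapshot_id: int) -> tuple:
    # Sort order from the text: volume ID, then sector, then snapshot ID,
    # so snapshot versions of the same sector sort adjacently.
    return (volume_id, sector, snapshot_id)

def lookup(sorted_records, volume_id, sector, snapshot_id):
    """sorted_records: list of (key_tuple, physical_pointer) pairs sorted
    by key. Binary-search for the exact composite key."""
    key = make_key(volume_id, sector, snapshot_id)
    keys = [k for k, _ in sorted_records]
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return sorted_records[i][1]
    return None

# Two snapshot versions of (volume 1, sector 5) sort next to each other.
recs = sorted([
    ((1, 5, 0), "ptr-a"),
    ((1, 5, 1), "ptr-b"),
    ((1, 6, 0), "ptr-c"),
])
```

Because adjacent snapshot versions land in the same neighborhood of the sorted table, a read lookup touches fewer pages, which is the benefit claimed in the paragraph above.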
The address translation map may convey a physical pointer value indicating a location within the data storage subsystem 170 where the data component corresponding to the received data storage device access request is stored. The key value may be compared to one or more key values stored in a mapping table. In the illustrated example, simpler key values such as "0", "2", "12", etc. are shown for ease of illustration. The physical pointer value may be stored in one or more fields in the corresponding record.
The physical pointer value may include a segment Identifier (ID) and a physical address identifying the storage location. A segment may be the basic allocation unit in each of the storage devices 176a-176m. A segment may have a Redundant Array of Independent Devices (RAID) level and a data type. During allocation, a segment may have one or more of the storage devices 176a-176m selected for corresponding storage. In one embodiment, a segment may be allocated an equal amount of storage space on each of the one or more selected storage devices 176a-176m. Data storage access requests may correspond to multiple sectors, potentially resulting in multiple parallel lookups. A write request may be placed in an NVRAM buffer (such as RAM 172) and a write completion acknowledgement may be sent to a respective one of the client computer systems 110a-110c. At some later time, an asynchronous process may flush the buffered write requests to the storage devices 176a-176m.
In another example, the mapping table shown in FIG. 2 may be a deduplication table. The deduplication table may utilize a key that includes a hash value determined from the data component associated with the storage access request. The initial steps of a deduplication operation may be performed concurrently with other operations, such as read/write requests, garbage collection operations, trim (immediate erase) operations, and so on. For a given write request, the data sent from one of the client computer systems 110a-110c may be a data stream, such as a byte stream. As is well known to those skilled in the art, a data stream may be divided into a sequence of fixed-length or variable-length chunks. A chunking algorithm may divide a data stream into discrete data components that may be referred to as "chunks." A chunk may be a sub-file content-addressable unit of data. In various embodiments, a table or other structure may be used to determine the specific chunking algorithm to be used for a given file type or data type. The type of a file may be determined by referring to its file name extension, separate identifying information, the content of the data itself, and so on. The resulting chunks may then be stored in one of the data storage arrays 120a-120b to allow the chunks to be shared. Such chunks may be stored separately or grouped together in a variety of ways.
In various embodiments, the chunks may be represented by a data structure that allows larger data components to be reconstructed from their chunks (e.g., a particular file may be reconstructed based on one or more smaller chunks of stored data). The corresponding data structure may record, for each chunk, the computed associated hash value, a pointer (physical and/or logical) to its location in one of the data storage arrays 120a-120b, and its length. For each data component, a corresponding hash value may be calculated using a deduplication application. For example, an algorithm such as Message-Digest algorithm 5 (MD5), a Secure Hash Algorithm (SHA), or the like may be used to compute the corresponding hash value. To determine whether a given data component corresponding to a received write request is already stored in one of the data storage arrays 120a-120b, the bits of the hash value computed for the given data component (or a subset of those bits) may be compared to the bits of the hash values of the data components stored in one or more of the data storage arrays 120a-120b.
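The chunk-and-hash check described above can be sketched as follows. This is a minimal illustration only: SHA-256 stands in for the MD5/SHA family named in the text, the fixed chunk size is an assumption, and a plain dictionary stands in for the deduplication table.

```python
# Sketch of the dedup hash check: split a byte stream into fixed-length
# chunks and compute a digest per chunk; a matching digest means the
# chunk's data is already stored and may be shared.
import hashlib

CHUNK_SIZE = 4096  # assumed fixed-length chunking

def chunk_stream(data: bytes):
    """Divide a data stream into discrete chunks."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def chunk_digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

stored = {}  # digest -> location; stands in for the deduplication table

def write_chunk(chunk: bytes, location: str) -> bool:
    """Return True if the chunk is new (must be written), False if deduplicated."""
    d = chunk_digest(chunk)
    if d in stored:
        return False          # duplicate: share the existing copy
    stored[d] = location
    return True
```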
The mapping table may include one or more levels as shown in FIG. 2. The mapping table may include 16 to 64 levels, although other numbers of levels supported in the mapping table are possible and contemplated. Three levels are shown in FIG. 2 for ease of illustration, labeled level "1", level "2", and level "N", respectively. Each level within the mapping table may include one or more partitions. In one embodiment, each partition is a 4 kilobyte (KB) page. For example, level "N" is shown to include pages 210a-210g, level "2" includes pages 210h-210j, and level "1" includes pages 210k-210n. It is contemplated that other partition sizes may also be selected for each level within the mapping table. Furthermore, one or more levels may have a single partition, which is the level itself.
In one embodiment, the multiple levels within the mapping table are sorted by time. For example, in FIG. 2, level "1" may be older than level "2". Similarly, level "2" may be older than level "N". In one embodiment, a new level may be created when a condition is detected that corresponds to the insertion of one or more new records in the mapping table. In various embodiments, when a new level is created, the number/designation given to the new level is greater than the number given to each level created earlier in time. For example, if the most recently created level was assigned the value 8, the newly created level may be assigned the value 9. In this manner, a temporal relationship between the levels may be established or determined. In one such scheme, each successively older level has a label whose integer value is one less than that of the next newer level. As may be appreciated, the numerical values need not be strictly sequential. Further, alternative embodiments may reverse the numbering scheme so that newer levels have smaller numerical designations. In addition, other embodiments may utilize non-numeric designations to distinguish between levels. Numerous such embodiments are possible and contemplated. A separate table, not shown, may be used to logically describe the mapping table. For example, each entry of the separate table may include a given level ID and a list of the page IDs stored within the given level.
The mapping table may be updated by appending new records: a new highest level is created and the new records are inserted into it. In one embodiment, a single level is created as the new highest level, and each new record is inserted into that single level. In another embodiment, the new records may be searched for duplicate keys prior to insertion into the mapping table. A single level may be created as the new highest level. When a given record storing a duplicate key is found, each record buffered ahead of the given record may be inserted into that single level. The new records may be buffered in a manner that preserves memory ordering, such as in-order completion of requests. Another single level may then be created, and the remaining new records may be inserted into this other single level unless another record storing a duplicate key is found. If such a record is found, the steps are repeated. Existing records within the mapping table that store the same key value as one of the new records are not edited or overwritten in place by the insertion of the new records.
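The append-only update scheme above, together with the newest-level-wins lookup it implies, can be sketched as follows. The dict-of-levels representation is an illustrative assumption; the patent's levels are sorted pages, not Python dictionaries.

```python
# Sketch of append-only updates: new records go into a newly created
# highest level; records in lower (older) levels are never edited in
# place. A lookup searches newest level first, so newer records
# naturally override older ones storing the same key.

levels = []  # levels[i] is a dict key -> value; later entries are newer

def insert_records(new_records):
    """Create a new highest level and insert the new records into it."""
    level = {}
    for key, value in new_records:
        level[key] = value
    levels.append(level)

def lookup(key):
    """Search from the newest level down; the newest match wins."""
    for level in reversed(levels):
        if key in level:
            return level[key]
    return None
```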
Although the levels are shown as increasing in size, with lower (older) levels larger than newer levels, higher levels may alternate between being larger and smaller than adjacent levels. The number of newer records to be inserted into the mapping table may vary over time and produce fluctuating level sizes. Lower levels may be larger than newer levels due to the flattening of lower levels. When a particular condition is detected, two or more lower levels may be flattened into a single level. Further details are provided later.
With no in-place editing of records stored in the mapping table, newer records placed in higher levels may override records storing the same key that are located in lower levels. For example, when the mapping table is accessed with a given key value, one or more levels may be found to store records holding key values matching the given key value. In such a case, the highest level of the one or more levels may be selected to provide the information stored in its corresponding record as a result of the access. Further details are provided later, including details regarding the detected conditions corresponding to the insertion of one or more new records into the mapping table and the storage of information.
In one embodiment, the entries within a given page may be sorted by key. For example, the entries may be sorted in ascending order according to the keys included in the entries. In addition, in various embodiments, the pages within a level may be sorted according to any desired order (e.g., according to key values, and so on). In the example of FIG. 2, page 210a of level N includes records sorted in ascending order according to key value. In various embodiments, one or more columns may be used to store key values. In the example of FIG. 2, two columns or fields in each tuple are shown storing key values. Using such key values, the records may then be sorted in a desired order. The sorting may be performed according to any single key value corresponding to a record or any combination of key values corresponding to a record. In the example shown, the first record stores the key values 0 and 8 in two columns, and the last record stores the key values 12 and 33. Each sorted record in page 210a between the first and last records stores a key value between 0 and 12 in the first column, and the records are arranged so that key values based (at least in part) on the first column ascend from 0 to 12. Similarly, page 210b includes sorted records, where the first record stores the key values 12 and 39 and the last record stores the key values 31 and 19. Each sorted record in page 210b between the first and last records stores a key value between 12 and 31 in the first column, and the records are arranged to store key values in ascending order from 12 to 31.
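Because the records within a page are kept in ascending key order, a lookup can binary-search the page instead of scanning it. A minimal sketch, with invented key and pointer values:

```python
# Sketch: records within a page sorted in ascending order of their
# first-column key, so a lookup can binary-search the page.
import bisect

page_keys = [0, 2, 4, 8, 12]               # sorted first-column key values
page_ptrs = ["p0", "p2", "p4", "p8", "p12"]  # corresponding pointer values

def find_in_page(key):
    """Binary-search the sorted page for a key; return its pointer or None."""
    i = bisect.bisect_left(page_keys, key)
    if i < len(page_keys) and page_keys[i] == key:
        return page_ptrs[i]
    return None
```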
In addition to the above, the pages within level N are sorted according to a desired order. In various embodiments, the pages within a level are sorted in a manner that reflects the order in which the entries within each page are sorted. For example, the pages within a level may be sorted in ascending order of key values. Because the first key value in page 210b is greater than the last key value in page 210a, page 210b follows page 210a in the sort order. Page 210g would then include entries (not shown) whose key values are greater than those included in pages 210a-210f. In this manner, all of the entries within a level are sorted according to a common scheme; the entries are simply subdivided into pages or other size units. As may be appreciated, other sorting schemes may be used as desired.
Referring now to FIG. 3A, a generalized block diagram of one embodiment of a primary index used to access a mapping table is shown. A key generator 304 may receive one or more requested data inputs 302. In one embodiment, the mapping table is an address translation directory table. A given received read/write request may identify a particular volume, sector, and length. The key generator 304 may produce a query key value 306 that includes a volume identifier (ID), a logical or virtual address, a snapshot ID, and a sector number. Other combinations are possible, and other or additional values may be utilized. Different portions of the query key value 306 may be compared to values stored in columns that may or may not be contiguous within the mapping table. In the example shown, the key value "22" is used for ease of illustration.
As previously described, a chunking algorithm and/or a segmenting algorithm associated with the key generator 304 may receive data 302 corresponding to a storage access request. These algorithms may produce one or more data components and select a hash function to compute a corresponding hash value, or query key value 306, for each data component. The resulting hash value may be used to index the deduplication table.
The primary index 310 as shown in FIG. 3A may provide location-identifying information for data stored in the storage devices 176a-176m. For example, referring again to FIG. 2, a corresponding primary index 310 (or a portion of it) may be logically included in each of level "1", level "2", and level "N". Again, each level and each corresponding primary index may be physically stored in a random-access manner within the storage devices 176a-176m.
In one embodiment, the primary index 310 may be divided into partitions, such as partitions 312a-312b. In one embodiment, the size of the partitions may range from a 4 kilobyte (KB) page to 256 KB, although other sizes are possible and contemplated. Each entry of the primary index 310 may store a key value. In addition, each entry may store a corresponding unique virtual page identifier (ID) and a level ID corresponding to the key value. Each entry may store corresponding status information, such as validity information. When the primary index 310 is accessed with a query key value, the entries within the index 310 may be searched for one or more entries that match or otherwise correspond to the key value. Information from the matching entry may then be used to locate and retrieve a mapping that identifies the storage location that is the target of the received read or write request. In other words, the index 310 identifies the locations of mappings. In one embodiment, a hit in the index provides a corresponding page ID identifying a page within the storage devices 176a-176m, where the page stores both the key value and a corresponding physical pointer value. The page identified by the corresponding page ID may be searched with the key value to find the physical pointer value.
In the example of FIG. 3A, a received request corresponds to the key "22". This key is then used to access the index 310. A search of the index 310 results in a hit on an entry within partition 312b. The matching entry in this example includes information such as page 28 and level 3. Based on this result, the desired mapping corresponding to the request is in the page identified as page 28 within level 3 of the mapping table. Using this information, the mapping table can then be accessed to retrieve the desired mapping. If an access to the primary index 310 itself requires an access to storage, then at least two storage accesses would be required in order to obtain the desired mapping. Therefore, in various embodiments as described later, portions of the primary index are cached or otherwise stored in relatively fast-access memory in order to eliminate one access to the storage devices. In various embodiments, the entire primary index for the mapping table is cached. In some embodiments, where the primary index has become too large to cache in its entirety, or is otherwise larger than desired, secondary, tertiary, or other index portions may be used in the cache to reduce its size. Secondary-type indices are discussed later. In addition to the above, in various embodiments, mapping pages corresponding to recent hits are also cached for at least some period of time. In this manner, processes that exhibit accesses with temporal locality can be serviced more rapidly (i.e., recently accessed locations will have their mappings cached and readily available).
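The two-step lookup just described can be sketched as follows. The table contents are invented for illustration, except that the "key 22 → level 3, page 28" pairing follows the example above; the pointer values are assumptions.

```python
# Sketch of the two-step lookup: the primary index maps a key to a
# (level ID, page ID); the identified page is then searched for the
# key's physical pointer value. Each dict lookup here stands in for
# one potential storage access.

primary_index = {22: (3, 28), 27: (8, 32)}             # key -> (level, page)
pages = {(3, 28): {22: "0xA1B2"}, (8, 32): {27: "0xF320"}}

def translate(key):
    hit = primary_index.get(key)
    if hit is None:
        return None                                     # index miss
    page = pages[hit]                                   # one storage access
    return page.get(key)                                # physical pointer value
```

Caching `primary_index` in fast memory is what removes the first of the two accesses, as the text notes.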
Referring now to FIG. 3B, a generalized block diagram of one embodiment of a cached primary index used to access a mapping table is shown. Circuit and logic portions corresponding to those of FIG. 3A are numbered identically. The cached primary index 314 may include copies of the information stored in each of the primary indexes 310 corresponding to the multiple levels in the mapping table. The primary index 314 may be stored in one or more of the RAM 172, buffers within the controller 174, the memory medium 130, and caches within the processor 122. In one embodiment, the primary index 314 may be sorted by key value, although other orderings are possible. The primary index 314 may also be divided into partitions, such as partitions 316a-316b. In one embodiment, the size of the partitions 316a-316b may be the same as the size of the partitions 312a-312b within the primary index 310.
Similar to the primary index 310, each entry of the primary index 314 may store one or more of a key value, a corresponding unique virtual page identifier (ID), a level ID corresponding to the key value, and status information such as validity information. When the primary index 314 is accessed with a query key value 306, it may convey a corresponding page ID identifying a page within the storage devices 176a-176m storing both the key value and a corresponding pointer value. The page identified by the corresponding page ID may be searched with the key value to find the pointer value. As shown, the primary index 314 may have multiple records storing a same key value. Therefore, multiple hits may result from a search for a given key value. In one embodiment, the hit with the highest level ID value (or whatever indicator is used to identify the youngest level or most recent entry) may be chosen. This selection of one hit from among the multiple hits may be performed by merge logic not shown here. A further description of the merge logic is provided later.
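The hit-selection rule above reduces to taking the maximum level ID among the hits. A minimal sketch of that merge step, with an assumed (level ID, page ID) record layout:

```python
# Sketch of the merge logic: when a search of the cached index returns
# several hits for the same key, the hit from the highest (newest)
# level wins, since newer records override older ones.

def merge_hits(hits):
    """hits: list of (level_id, page_id) tuples for one key; newest wins."""
    if not hits:
        return None
    return max(hits, key=lambda h: h[0])
```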
Referring now to FIG. 4, a generalized block diagram of another embodiment of a mapping table and a primary index used to access the mapping table is shown. Circuit and logic portions corresponding to those of FIG. 3A are numbered identically. The mapping table 340 may have a structure similar to the mapping table shown in FIG. 2. However, storage of a corresponding primary index 310 for each level is now shown. A copy of one or more of the primary index portions 310a-310i may be included in index copies 330 (e.g., cached copies). The copies 330 may generally correspond to the cached index depicted in FIG. 3B. The information in the index copies 330 may be stored in the RAM 172, buffers within the controller 174, the memory medium 130, and caches within the processor 122. In the embodiment shown, the information in the primary indexes 310a-310i may be stored with the pages of mappings in the storage devices 176a-176m. Also shown is a secondary index 320 which may be used to access a primary index, such as the primary index 310i shown in the figure. Similarly, accessing and updating the mapping table 340 may occur as described earlier.
The mapping table 340 includes multiple levels, such as levels "1" through "N". In the example shown, each of the levels includes multiple pages. Level "N" is shown to include pages "0" through "D", level N-1 includes pages "E" through "G", and so forth. Again, the levels within the mapping table 340 may be sorted by time. Level "N" may be younger than level "N-1", and so forth. The mapping table 340 may be accessed by at least a key value. In the example shown, the mapping table 340 is accessed with a key value "27" and a page ID "32". For example, in one embodiment, a level ID "8" may be used to identify a particular level (or "sub-table") of the mapping table 340 to search. Having identified the desired sub-table, the page ID may then be used to identify the desired page within the sub-table. Finally, the key may be used to identify the desired entry within the desired page.
As discussed above, an access to the cached index 330 may result in multiple hits. In one embodiment, the results of these multiple hits are provided to merge logic 350, which identifies which hit is used to access the mapping table 340. The merge logic 350 may represent hardware and/or software included within a storage controller. In one embodiment, the merge logic 350 is configured to identify the hit which corresponds to the most recent (newest) mapping. Such an identification could be based upon an identification of the corresponding level for an entry, or otherwise. In the example shown, a query corresponding to level 8, page 32, key 27 is received. In response to the query, page 32 of level 8 is accessed. If the key 27 is found within page 32 (a hit), then the corresponding result is returned (e.g., the pointer xF3209B24 in the example shown). If the key 27 is not found within page 32, then a miss indication is returned. The physical pointer value may be output from the mapping table 340 to service the storage access request corresponding to the key value "27".
In one embodiment, the mapping table 340 supports inline mappings. For example, a mapping detected to have a sufficiently small target may be represented without an actual physical sector storing user data within the storage devices 176a-176m. One example may be a repeating pattern within the user data. Rather than actually storing multiple copies of a repeating pattern (e.g., a series of zeroes) as user data within the storage devices 176a-176m, a corresponding mapping may have an indication marked in the status information, such as within one of the fields 0 through N in the mapping table, that indicates what data value is to be returned for a read request. However, there is no actual storage of this user data at a target location within the storage devices 176a-176m. Additionally, an indication may be stored within the status information of the primary index 310 and any additional indexes that may be used (not shown here).
In addition to the foregoing, in various embodiments, the storage system may simultaneously support multiple versions of data organization, data schemas, and so forth. For example, as system hardware and software evolve, new features may be incorporated or otherwise provided. The updated data, index, and (for example) mapping may take advantage of these new features. In the example of FIG. 4, the new level N may correspond to one version of the system, while the earlier level N-1 may correspond to a previous version. To accommodate these different versions, metadata may be stored in association with each tier that indicates which version, which features, compression schemes, etc. the tier has used. This metadata may be stored as part of the index, the page itself, or both. When accessed, the metadata then indicates how the data should be handled correctly. Furthermore, new schemes and features can be dynamically applied without the need to quiesce the system. In this way, the system is more flexible to upgrade and there is no need to reconstruct earlier data to reflect the new scheme and method.
Referring now to FIG. 5A, one embodiment of a method for servicing read access is shown. The components embodied in the network architecture 100 and the mapping table 340 as previously described may generally operate in accordance with the method 500. For purposes of discussion, the various steps in this embodiment are shown in sequential order. In other embodiments, however, some steps may occur in different orders than shown, some steps may be performed simultaneously, some steps may be combined with other steps, and some steps may be absent.
Read and store (write) requests may be conveyed from one of the clients 110a-110c to one of the data storage arrays 120a-120b. In the example shown, a read request 500 is received, and in block 502 a corresponding query key value may be generated. In some embodiments, the request itself may already include the key which is used to access the index, and the key need not be "generated" (block 502). As described earlier, the query key value may be a virtual address index comprising a volume ID, a logical or virtual address associated with the received request, a snapshot ID, a sector number, and so forth. In embodiments which are used for deduplication, the query key value may be generated using a hash function or other function. Other values are possible and contemplated for the query key value used to access the mapping table.
In block 504, the query key value may be used to access one or more cached indexes to identify one or more portions of the mapping table that may store a mapping corresponding to the key value. Additionally, recently used mappings which have been cached may be searched as well. If a hit on the cached mappings is detected (block 505), the cached mapping may be used to perform the requested access (block 512). If there is no hit on the cached mappings, a determination may be made as to whether or not there is a hit on the cached index (block 506). If so, a result corresponding to the hit is used to identify and access the mapping table (block 508). For example, with the primary index 310, an entry storing the query key value may also store a unique virtual page ID that identifies a single particular page within the mapping table. This single particular page may store both the query key value and an associated physical pointer value. In block 508, the identified portion of the mapping table may be accessed and a search performed using the query key value. The mapping table result may then be returned (block 510) and used to perform a storage access (block 512) that corresponds to the target location of the original read request.
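The lookup cascade of FIG. 5A (cached mappings, then cached index, then stored index) can be sketched as follows. All of the data structures here are illustrative stand-ins; the block numbers in the comments refer to the figure as described in the text.

```python
# Sketch of the read path in FIG. 5A: check the cache of recent
# mappings first, then the cached index, then fall back to the stored
# index; a miss everywhere corresponds to the error/exception case.

cached_mappings = {}           # key -> physical pointer (temporal locality)
cached_index = {22: (3, 28)}   # key -> (level, page); may be partial
stored_index = {22: (3, 28), 40: (1, 2)}
pages = {(3, 28): {22: "0xA1"}, (1, 2): {40: "0xB2"}}

def read_lookup(key):
    if key in cached_mappings:                # block 505: cached-mapping hit
        return cached_mappings[key]
    hit = cached_index.get(key)               # block 506: cached-index check
    if hit is None:
        hit = stored_index.get(key)           # block 520: stored-index access
        if hit is None:
            raise KeyError(key)               # block 524: error condition
    ptr = pages[hit][key]                     # block 508: access mapping table
    cached_mappings[key] = ptr                # cache the recent mapping
    return ptr
```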
In some embodiments, an index lookup in response to a read request may result in a miss. Such a miss may be due to only a portion of the index being cached, or due to an error condition (e.g., a read access to a non-existent location, an address corruption, etc.). In this case, access to the stored index may be performed. If the access to the stored index results in a hit (block 520), a result may be returned (block 522), which is used to access the mapping table (block 508). On the other hand, if an access to the stored index results in a miss, an error condition may be detected. The error condition may be addressed in any of a number of desired ways. In one embodiment, an exception may be generated (block 524) and then dealt with in a desired manner. In one embodiment, a portion of the mapping table is returned in block 510. In various embodiments, the portion is a page, which may be a 4KB page or the like. As previously discussed, the records within a page may be ordered to facilitate a faster search for the content included therein.
In one embodiment, the mapping table utilizes traditional database system methods for storing information in each page. For example, each record (or row or entry) within the mapping table is stored one right after the other. This approach may be used in row-oriented or row-store databases, and additionally with relational databases. These types of databases utilize a value-based storage (VBS) architecture, in which a unique data value is stored only once and an automatically generated indexing system maintains the context for all values. In various embodiments, data may be stored by row, and compression may be used on the fields (columns) within a row. In some embodiments, the techniques used may include storing a base value with a smaller field size for an offset, and/or having a set of base values, where a column of a row consists of a base selector and an offset from that base. In both cases, the compression information may be stored within the partition (e.g., at the beginning of the partition).
In some embodiments, the mapping table utilizes approaches associated with column-oriented database systems (column stores) for information storage in each page. A column store stores each column of a database table separately. In addition, attribute values belonging to a same column may be stored adjacently, compressed, and densely packed. Accordingly, reading a subset of the columns within, for example, a page may be performed relatively quickly. Column data may be of uniform type, which may allow storage size optimizations that are not available in row-oriented data. Some compression schemes, such as Lempel-Ziv-Welch (LZW) and run-length encoding (RLE), take advantage of the similarity of adjacent data to compress. In addition, as described more fully above, other compression schemes may encode a value as a difference from a base value, requiring fewer bits to represent the difference than would be required to represent the full value. A compression algorithm may be chosen that allows individual records within a page to be identified and indexed. Compressing the records within the mapping table may enable fine-grained mappings. In various embodiments, the type of compression used for a particular portion of data may be stored in association with the data. For example, the compression type may be stored in an index, as part of the same page as the compressed data (e.g., in a header of some type), or otherwise. In this manner, multiple compression techniques and algorithms may be used side by side within the storage system. In addition, in various embodiments, the type of compression used for storing page data may be determined dynamically at the time the data is stored. In one embodiment, one of a variety of compression techniques may be chosen based at least in part on the nature and type of the data being compressed and/or the expected resource requirements for a compression technique and the resources currently available in the system.
In some embodiments, multiple compression techniques will be applied, and then the one exhibiting the best compression will be selected for compressing the data. Many such methods may be envisaged.
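Of the compression schemes named above, run-length encoding is the simplest to illustrate: adjacent equal values in a column collapse into (value, run length) pairs. A minimal sketch, not tied to any particular page layout:

```python
# Sketch of run-length encoding (RLE), one of the column-store
# compression schemes mentioned above. Columns with long runs of
# repeated values (e.g., a status column) compress well.

def rle_encode(values):
    """Collapse adjacent equal values into [value, run_length] pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

def rle_decode(pairs):
    """Expand [value, run_length] pairs back into the original column."""
    return [v for v, n in pairs for _ in range(n)]
```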
If a match is found for the query key value 306 in any level of the mapping table (block 508), then in block 510, one or more indications of the hit may be communicated to the merge logic 350. For example, as shown in FIG. 4, one or more hit indications may be communicated from level "1" to "J". Merge logic 350 may select the highest level, which may also be the latest level, among levels "1" through "J" that convey a hit indication. As a result of the access, the selected hierarchy may provide information stored in the corresponding record.
In block 512, one or more corresponding fields within the matching record for the selected page may be read to process the corresponding request. In one embodiment, when the data within the page is stored in a compressed format, the page is decompressed and the corresponding physical pointer values are read. In another embodiment, only the matching records are decompressed and the corresponding physical pointer values are read. In one embodiment, a complete physical pointer value may be resolved between the mapping table and the corresponding target physical location. Multiple physical locations where user data is stored may be accessed in order to complete a data storage access request.
Referring now to FIG. 5B, one embodiment of a method corresponding to a received write request is shown. In response to a received write request (block 530), a new mapping table entry corresponding to the request may be created (block 532). In one embodiment, a new virtual-to-physical address mapping may be added (block 534) to the mapping table that pairs the virtual address of the write request with the physical location where the corresponding data component is stored. In various embodiments, the new mapping may be cached with other new mappings and added to a new highest level of the mapping table entries. The write operation to persistent storage may then be performed (block 536). In various embodiments, writing the new mapping table entry to the mapping table in persistent storage may not be performed until a later point in time (block 538), which may be deemed more efficient. As previously discussed, in a storage system utilizing solid-state storage devices, writes to the storage devices are much slower than reads from the storage devices. Accordingly, writes to the storage devices are scheduled in such a way that they minimize the impact on overall system performance. In some embodiments, the insertion of new records into the mapping table may be combined with other larger data updates. Combining the updates in this manner may provide for more efficient write operations. It is noted that in the method of FIG. 5B, as with each of the methods described herein, operations are described as occurring in a particular order for ease of discussion. However, the operations may in fact occur in a different order, and in some cases various operations may occur concurrently. All such embodiments are contemplated.
In addition to the foregoing, a deduplication mechanism may be used in some embodiments. FIG. 5B depicts operations 550 that may generally correspond to the deduplication systems and methods. In the illustrated example, a hash corresponding to the received write request may be generated (block 540), which is used to access a deduplication table (block 542). If there is a hit in the deduplication table (block 544) (i.e., a copy of the data already exists within the system), a new entry may be added to the deduplication table (block 548) to reflect the new write. In this case, the data itself need not be written to the storage device, and the received write data may be discarded. Alternatively, if there is a miss in the deduplication table, a new entry corresponding to the new data is created and stored in the deduplication table (block 546). In addition, a data write to the storage device is performed (block 536). In addition, a new entry may be created in the index to reflect the new data (block 538). In some embodiments, if a miss occurs during an online deduplication operation, no insertion is performed in the deduplication table at this time. In contrast, during an online deduplication operation, a query with a hash value may occur for only a portion of the entire deduplication table (e.g., a cached portion of the deduplication table). If a miss occurs, a new entry may be created and stored in the cache. Subsequently, during post-processing deduplication operations (such as those occurring during garbage collection), a query with a hash value may occur for the entire deduplication table. A miss may indicate that the hash value is a unique hash value. Thus, a new entry, such as a hash to physical pointer mapping, may be inserted into the deduplication table. Alternatively, if a hit is detected (i.e., a duplicate is detected) during post-processing deduplication, then deduplication may be performed to eliminate one or more of the detected copies.
As previously described, various compression schemes may be used to encode the data associated with the mapping table in order to reduce the amount of storage required. Referring now to FIG. 5C, one embodiment of a method for compressing a set of tuples is shown. This method may be used to write entries to a mapping table or other table. First, a target size corresponding to the set of encoded tuples to be stored may be selected (block 560) and a default encoding algorithm may be selected (block 561). Tuples for encoding and storage in the table are then selected based on the selected size and algorithm (block 562). In such an embodiment, the encoded size of each tuple is calculated using the currently selected encoding method. If the tuple being added would cause the tuples accumulated so far in the set to exceed the target size (conditional block 564), the system may attempt to find a better encoding algorithm for all tuples accumulated so far, in order to reduce the total space required for the encoded tuples (block 565). If no smaller encoding is found (block 565), the most recently added tuple is ignored and the remaining tuples are written using the current encoding method (block 567). If a smaller encoding is found (block 565), a determination is made as to whether the new, smaller encoding falls within the target size (block 566). If the new encoding does not fall within the target size, the most recently added tuple is ignored and the remaining tuples are encoded and written using the current encoding method (block 567). If the tuple under consideration does not cause the accumulated tuples in the set to exceed the target size (conditional block 564), an attempt may be made to add another tuple (block 562). Similarly, if a new encoding that satisfies the target size is found (conditional block 566), an attempt may be made to add another tuple (block 562).
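The accumulation loop just described can be sketched as follows. The `encoders` mapping is a stand-in assumption: each entry is simply a function reporting the encoded size in bits of a list of tuples, in place of real encoding algorithms.

```python
def pack_tuples(tuples, target_size, encoders, default="wide"):
    """Accumulate tuples until the target size would be exceeded (block
    564); then look for a smaller encoding over all tuples so far (blocks
    565-566) before giving up on the newest tuple (block 567)."""
    chosen, current = [], default
    for t in tuples:
        candidate = chosen + [t]
        if encoders[current](candidate) <= target_size:
            chosen = candidate                      # fits: try another tuple
            continue
        # Overflow: search for the smallest encoding over all tuples so far.
        best = min(encoders, key=lambda name: encoders[name](candidate))
        if encoders[best](candidate) <= target_size:
            current, chosen = best, candidate       # smaller encoding fits
        else:
            break                                   # ignore the newest tuple
    return chosen, current

# Two stand-in "encoding algorithms" that just report a size in bits.
encoders = {
    "wide": lambda ts: sum(32 * len(t) for t in ts),    # 32 bits per field
    "narrow": lambda ts: sum(8 * len(t) for t in ts),   # 8 bits per field
}
packed, algorithm = pack_tuples([(1, 2), (3, 4), (5, 6)], 128, encoders)
```

Here the third tuple overflows the 128-bit target under the default encoding, so the loop switches to the smaller encoding and keeps all three tuples.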
FIG. 5D illustrates one embodiment of a method for encoding tuples. The original unencoded tuples 584 are depicted in this example, along with the corresponding encoded tuples 580 within an encoded page 568. Generally, the illustrated example represents each field in the table with one or two values. The first value is a base selector used to select a base value, and the second value is an offset from the selected base value. In one embodiment, the base selector comprises b bits and the offset comprises k bits, where b and k are integers. The values b and k may be chosen separately for each field, and either or both of b and k may be zero. The values of b and k may be stored for each encoded field, along with up to 2^b bases, each of which may have as many bits as are needed to represent the base value. If b is zero, only one base is stored. Each field encoded in this way then requires at most b+k bits. The encoder may consider different values for b and k in order to minimize the total encoded size for the field, where a larger value of b generally permits a smaller value of k.
FIG. 5D shows one example of the unencoded tuples 584 and the resulting encoded page 568. The page includes a header 570 in which the first two values contain the number of fields in each tuple (572) and the number of tuples in the page (574). The header 570 then has a table or set of values corresponding to each field. Each such table first lists the number of bases corresponding to a given field, followed by the number of bits k used to encode the offset from the bases. The page then stores each tuple encoded with the information in the header. For example, the first value (572) in the header 570 indicates that there are 3 fields in each tuple. The second value (574) indicates that there are 84 tuples in page 568. Three tables 576A-576C then provide base values and encoding information corresponding to each of the three fields. Table 576A indicates that the first field has 1 base, with 4 bits used to encode the offset. The only base for the first field is 12 (i.e., b is zero). The second table 576B indicates that there are 3 bases for the second field and that 3 bits will be used to encode the offset. The three bases corresponding to the second field are 5, 113, and 203. Finally, the third table 576C indicates that the third field has 2 bases and that 0 bits are used to encode the offset.
Looking at the encoded tuples 580, the various numerical values may be determined. In the example shown, the values in a given row/column of the encoded tuples 580 correspond to the values in the same row/column of the original tuples. It will be appreciated that the ordering and positioning of the various numerical values in this figure is merely exemplary; the actual ordering of the individual values and the corresponding encoded values may vary widely from that depicted. The first field in the first tuple 582 is encoded as 3 because the unencoded value 15 can be represented as an offset of 3 from the base 12 (i.e., 15 − 12 = 3). It should be noted that in this example there is only one base and b is zero, so no bits are used to encode the base selector value corresponding to that field. The offset value 3 is encoded with 4 bits, a significant reduction compared to a typical encoding that may require 8, 32, or 64 bits. The second value in the first tuple 582A is encoded as 1, 3. The 1 indicates that base 1 in table 576B is selected (i.e., base 113 is selected), and the 3 indicates an offset of 3 from base 113. The value 1 is encoded in 2 bits (2^2 = 4 being the smallest power of 2 greater than or equal to the number of bases, 3), and the value 3 is encoded in 3 bits, for a total of 5 bits. This is also much smaller than a naïve encoding of the field. Finally, the last field is encoded simply as an index indicating which base should be used; in this example, no bits are used to represent the offset. The first tuple here has a 0 because the stored value is 4927, which is entry (base) 0 in the table corresponding to field 576C in header 570. Thus, the total encoded space corresponding to each tuple is (0+4) + (2+3) + (1+0) = 10 bits, which is greatly reduced compared to the uncoded space required.
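The worked example above can be reproduced with a short base-plus-offset sketch. The per-field parameters follow the header tables 576A-576C; the second base of the third field (5160 here) is a made-up placeholder, since the text only names base 0 (4927).

```python
import math

def sel_bits(bases):
    # Bits for the base selector; zero when there is only one base (b = 0).
    return math.ceil(math.log2(len(bases))) if len(bases) > 1 else 0

def encode_field(value, bases, k):
    # Find a base whose offset to `value` fits in k bits.
    for sel, base in enumerate(bases):
        off = value - base
        if 0 <= off < 2 ** k:
            return sel, off, sel_bits(bases) + k
    raise ValueError("value not representable with these bases and k")

def decode_field(sel, off, bases):
    return bases[sel] + off

# Per-field (bases, k) from the header tables 576A-576C.
fields = [([12], 4), ([5, 113, 203], 3), ([4927, 5160], 0)]

tuple_582 = (15, 116, 4927)
encoded = [encode_field(v, bases, k) for v, (bases, k) in zip(tuple_582, fields)]
total_bits = sum(bits for _, _, bits in encoded)  # (0+4) + (2+3) + (1+0)
decoded = tuple(decode_field(s, o, bases)
                for (s, o, _), (bases, _) in zip(encoded, fields))
```

Encoding the tuple (15, 116, 4927) yields selector/offset pairs (0, 3), (1, 3), and (0, 0) for a total of 10 bits, and decoding recovers the original values exactly.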
In various embodiments, if the maximum size of a field is increased, as may be done to accommodate a larger virtual address or LUN identifier, there is no need to re-encode the page. In the worst case, the header may need to be modified slightly to accommodate a larger base value, but this requires minimal effort. It is furthermore possible to modify a number of values by a fixed amount by simply modifying the bases, as might be done when copying a range of blocks to a new location, without the need to decompress and then re-encode each affected tuple.
It should be noted that there are several different methods to find optimal or otherwise desirable values of b and k for a particular field. FIG. 5E illustrates one embodiment of a method for evaluating and selecting an encoding scheme from among multiple possibilities. In the illustrated method, each unique value to be recorded in a field of a page is recorded in a list (block 585). To find a more efficient encoding, the method starts with a representation in which b is zero (one base) and k is just large enough (the minimum number of necessary bits) to encode the largest value in the list as an offset from the smallest value in the list (block 586). The encoder then tries successively smaller values of k, resulting in larger values of b (more bases). As each combination of b and k is evaluated, those combinations that produce encodings deemed better (e.g., smaller) are retained for comparison with later candidate encodings. The algorithm may then select the encoding that results in the smallest overall size, including both the table in the header and the total space required for the encoded fields in the tuples. For example, starting with the minimum value as a base (block 587), the smallest value in the list that is at least 2^k greater than the current base is found (block 588). If such a value exists (conditional block 589), the value is selected as the next base (block 594). If such a value does not exist (conditional block 589), the total encoded size corresponding to the header and each encoded field is determined using the currently selected bases and the value of k. If this encoding is desirable (e.g., the smallest so far) (conditional block 591), the encoding is retained (block 592). Regardless of whether the encoding is retained, the value of k is then decremented by 1 (block 593), and if k is greater than or equal to zero (conditional block 595), the process is repeated by returning to block 587. If decrementing k causes k to fall below zero, the process ends and the best encoding found so far is selected (block 596).
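The search over values of b and k can be sketched as follows. The cost model is an assumption: `base_bits` stands in for the header storage consumed per base, and the total is header cost plus (b+k) bits per field occurrence.

```python
import math

def choose_encoding(values, base_bits=32):
    """Search for good (bases, k) per FIG. 5E. Starts with one base and the
    largest needed k (block 586), then tries successively smaller k,
    deriving the bases greedily (blocks 587-589, 594) and keeping the
    smallest total size of header plus encoded fields (blocks 591-593,
    596)."""
    vals = sorted(set(values))
    k = (vals[-1] - vals[0]).bit_length()       # offsets fit in k bits
    best = None
    while k >= 0:
        bases = [vals[0]]                       # block 587: smallest value
        for v in vals:
            if v >= bases[-1] + 2 ** k:         # block 588: next base
                bases.append(v)
        b = math.ceil(math.log2(len(bases))) if len(bases) > 1 else 0
        total = len(bases) * base_bits + len(values) * (b + k)
        if best is None or total < best[0]:     # blocks 591-592: keep best
            best = (total, bases, k)
        k -= 1                                  # block 593: shrink k
    return best

total, bases, k = choose_encoding([15, 16, 20])
```

For the values 15, 16, and 20, a single base of 15 with k = 3 wins under this cost model: one 32-bit base plus three fields of (0+3) bits gives 41 bits, smaller than any multi-base alternative.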
Referring now to FIG. 6, a generalized block diagram of one embodiment of a multi-node network with a shared mapping table is shown. In the illustrated example, three nodes 360a-360c are used to form a cluster of mapping nodes. In one embodiment, each node 360a-360c may be responsible for one or more Logical Unit Numbers (LUNs). A number of mapping table levels, levels 1-N, are shown in the depicted embodiment. Level 1 may correspond to the earliest level and level N may correspond to the latest level. For mapping table entries of LUNs managed by a particular node, that node itself may store the updated entries. For example, node 360a is shown storing mapping subtables 362a and 364a. These subtables 362a and 364a may correspond to LUNs for which node 360a is generally responsible. Similarly, node 360b includes subtables 362b and 364b, which may correspond to LUNs managed by that node, and node 360c includes subtables 362c and 364c, which may correspond to LUNs managed by that node. In such embodiments, these "updated" level mapping table entries are maintained only by their respective managing nodes and are not typically found on other nodes.
Unlike the relatively newer levels discussed above, the earlier levels (i.e., level N-2 down to level 1) represent mapping table entries that may be shared by all nodes 360a-360c, in the sense that any of the nodes may store a copy of these entries. In the illustrated example, these earlier levels 370, 372, and 374 are collectively identified as shared tables 380. Further, as discussed previously, in various embodiments these earlier levels are static, apart from merge or similar operations discussed later. Generally speaking, a static level is one that is not modified (i.e., it is "fixed"). Given that such levels are fixed in this sense, any copy of these lower levels may be accessed without concern as to whether another copy has been or is being modified. Thus, any of the nodes may safely store a copy of the shared tables 380 and service requests to those tables with confidence that the requests can be serviced properly. Storing copies of the shared tables 380 on multiple nodes 360 may allow various load balancing schemes to be used when performing lookups or otherwise servicing requests.
In addition to the foregoing, in various embodiments, the levels 380 that may be shared may be organized in a manner that reflects the nodes 360 themselves. For example, node 360a may be responsible for LUNs 1 and 2, node 360b may be responsible for LUNs 3 and 4, and node 360c may be responsible for LUNs 5 and 6. In various embodiments, a mapping table entry may include a tuple that itself identifies the corresponding LUN. In such embodiments, the shared mapping tables 380 may be sorted according to key value, absolute width or amount of storage space, or by other means. If the sorting of the mapping table entries in levels 380 is based at least in part on LUN, entries 370a may correspond to LUNs 1 and 2, entries 370b may correspond to LUNs 3 and 4, and entries 370c may correspond to LUNs 5 and 6. Such an organization may speed up a given node's lookup for a request targeting a particular LUN by effectively reducing the amount of data that needs to be searched, allowing a coordinator to directly select the node responsible for the particular LUN targeted by the request. These and other organization and sorting schemes are contemplated. Furthermore, if it is desired to move responsibility for a given LUN from one node to another, the mappings of the original node corresponding to that LUN may be flushed to the shared levels (e.g., and merged). Responsibility for the LUN is then transferred to the new node, which then begins servicing that LUN.
Referring now to FIG. 7, a generalized block diagram of one embodiment of a secondary index used to access a mapping table is shown. As previously described, requester data input 302 may be received by a key generator 304, which generates a query key value 306. The key value 306 is used to access the mapping table. In some embodiments, the primary index shown in FIG. 3 may be too large (or larger than desired) for storage in RAM 172 or storage medium 130. For example, the earlier index levels may become very large due to the merge and flattening operations described later in FIGS. 10 and 11. Thus, a secondary index 320 may be cached in place of at least a corresponding portion of the primary index 310. The secondary index 320 may provide a coarser level of location identification for data stored in the storage devices 176a-176m. Thus, the secondary index 320 may be smaller than the portion of the primary index 310 to which it corresponds. Accordingly, the secondary index 320 may be stored in RAM 172 or storage medium 130.
In one embodiment, the secondary index 320 is divided into partitions, such as partitions 322a-322b. Further, the secondary index may be organized according to level, with the most recent levels appearing first. In one embodiment, earlier levels have lower numbers and later levels have higher numbers (e.g., the level ID may be incremented with each new level). Each entry of the secondary index 320 may identify a range of key values. For example, the first entry shown in this example may represent a range of key values from 0 to 12 in level 22. These key values may correspond to the key values associated with the first and last records within a given page of the primary index 310. In other words, the entry in the secondary index may simply store an identification of key 0 and an identification of key 12 to indicate that the corresponding page contains entries within that range. Referring again to FIG. 3A, partition 312a may be a page whose first and last records have key values of 0 and 12, respectively. Thus, as shown in FIG. 7, one entry within the secondary index 320 stores the range 0 to 12. Since mappings are maintained in various levels within the mapping table, a range of key values may correspond to multiple pages and associated levels. The various fields within the secondary index 320 may store this information, as shown in FIG. 7. Each entry may store one or more corresponding unique virtual page identifiers (IDs) and associated level IDs corresponding to the key range. Each entry may also store corresponding status information, e.g., validity information. The maintained list of level IDs associated with a page ID may indicate where a given query key value may be stored, but does not confirm that the key value is actually present in that page and level.
The secondary index 320 is smaller than the primary index 310, but provides only a coarser-grained level of location identification for the data stored in the storage devices 176a-176m. The secondary index 320 may be small enough to be stored in RAM 172 or storage medium 130.
When the secondary index 320 is accessed with the query key value 306, it may convey one or more corresponding page IDs and associated level IDs. These results are then used to access and retrieve portions of the stored primary index. The one or more identified pages may then be searched using the query key value to find a physical pointer value. In one embodiment, the level IDs may be used to determine the latest level, among the one or more identified levels, that also stores the query key value 306. A record within the corresponding page may then be retrieved and the physical pointer value read for use in processing the storage access request. In the example shown, the query key value 27 falls within the key range 16 to 31, and the page IDs and level IDs stored in the corresponding entry are conveyed for use in accessing the mapping table.
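The range lookup just described can be sketched as follows. The data layout (ranges mapped to lists of page/level pairs) and the identifier names are illustrative assumptions loosely following the FIG. 7 example.

```python
def lookup_secondary(secondary_index, key):
    """Return candidate (page ID, level ID) pairs for a query key, newest
    level first. The index narrows the search but cannot confirm that the
    key is actually present in a candidate page and level."""
    candidates = []
    for (lo, hi), pages in secondary_index:
        if lo <= key <= hi:
            candidates.extend(pages)
    # Consult the most recent level first (a higher level ID is newer).
    return sorted(candidates, key=lambda entry: entry[1], reverse=True)

# Each entry: key range -> list of (page ID, level ID).
secondary_index = [
    ((0, 12), [("page_28", 22)]),
    ((16, 31), [("page_17", 18), ("page_32", 22)]),
]
result = lookup_secondary(secondary_index, 27)
```

For the query key 27, the entry covering 16 to 31 matches and both candidate pages are returned, with the page from the later level (22) ordered first so the latest mapping is found first.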
Referring now to FIG. 8, a generalized block diagram of one embodiment of a tertiary index used to access a mapping table is shown. Circuit and logic portions corresponding to those of FIG. 4 are numbered identically. As previously described, the primary index 310 of FIG. 3 may be too large for storage in RAM 172 or storage medium 130. Furthermore, as the mapping table 340 grows, the secondary index 320 may also become too large for storage in these memories. Thus, a tertiary index 330 may be accessed before the secondary index 320, which may still be faster than accessing the primary index 310.
The tertiary index 330 may provide a coarser granularity of location identification for data stored in the storage devices 176a-176m than the secondary index 320. Thus, the tertiary index 330 may be smaller than the portion of the secondary index 320 to which it corresponds. It should be noted that each of the primary index 310, the secondary index 320, the tertiary index 330, and so on, may be stored in a compressed format. The selected compressed format may be the same compressed format used to store information within the mapping table 340.
In one embodiment, the tertiary index 330 may include multiple partitions, such as partitions 332a, 332b, and so on. The tertiary index 330 may be accessed with the query key value 306. In the illustrated example, the query key value 306 of "27" is found to fall within the key range 0 to 78. The first entry in the tertiary index 330 corresponds to this key range. A column in the tertiary index 330 may indicate which partition within the secondary index 320 is to be accessed. In the example shown, the key range 0 to 78 corresponds to partition 0 within the secondary index 320.
It should also be noted that a filter (not shown) may be accessed to determine whether a query key value is not within any of the indexes 310-330. The filter may be a probabilistic data structure that determines whether an element is a member of a set. False positives may be possible, but false negatives may not be possible. One example of such a filter is a Bloom filter. If an access to such a filter determines that a particular value is not in the full index 142, then no query is sent to storage. If an access to the filter determines that the query key value is in a corresponding index, it may not be known whether the corresponding physical pointer value is stored in the storage devices 176a-176m.
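A minimal Bloom filter illustrating the "no false negatives" property reads as follows. The parameters (a 1024-bit array, 3 hash positions derived from SHA-256) are arbitrary choices for this sketch.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: membership tests may yield false positives
    but never false negatives, matching the filter described above."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0

    def _positions(self, key):
        # Derive k bit positions from k independent hashes of the key.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False here proves absence; True only suggests presence.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add(27)
present = bf.might_contain(27)   # an added key is never reported absent
```

A negative answer from `might_contain` allows the query to skip the indexes and storage entirely; a positive answer still requires consulting the index, since it may be a false positive.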
In addition to the foregoing, in various embodiments, one or more overlay tables may be used to modify or cancel tuples provided by the mapping table in response to a query. Such overlay tables may be used to apply filtering conditions in response to accesses to the mapping table, or during a flattening operation when a new level is created. In some embodiments, the overlay table may be organized into time-ordered levels in a manner similar to the mapping table described above. In other embodiments, it may be organized differently. The keys of the overlay table need not match the keys of the underlying mapping table. For example, the overlay table may contain a single entry stating that a particular volume has been deleted or is otherwise inaccessible (e.g., there is no natural access path by which to query the tuple), and that a response to a query corresponding to a tuple referring to that volume identifier is invalid. In another example, an entry in the overlay table may indicate that a storage location has been freed and that any tuple referring to that storage location is invalid, thereby invalidating the result of the lookup based on the returned value rather than on the key used to query the mapping table. In some embodiments, the overlay table may modify fields in responses to queries against the underlying mapping table. In some embodiments, a key range (a range of key values) may be used to efficiently identify multiple values to which the same operation (cancellation or modification) is to be applied. In this manner, tuples may be (effectively) "deleted" from the mapping table by creating "cancel" entries in the overlay table, without modifying the mapping table itself. In this case, the overlay table may contain keys with no associated non-key data fields.
Referring now to FIG. 9, one embodiment of a method for processing a read request in a system including mapping and overlay tables is shown. In response to receiving a read request (block 900), a mapping table key corresponding to the request is generated (block 908) and a first overlay table key is generated (block 902). In this example, accesses to the overlay and mapping tables are shown as occurring concurrently. In other embodiments, however, the tables may be accessed non-concurrently and in any desired order (e.g., sequentially or otherwise separated in time). Using the key generated for the mapping table, a corresponding tuple may be retrieved from the mapping table (block 910). If the first overlay table contains a "cancel" entry corresponding to the overlay table key (conditional block 906), any tuple found in the mapping table is deemed invalid and an indication to this effect may be returned to the requester. On the other hand, if the first overlay table contains a "modify" entry corresponding to the overlay table key (conditional block 912), the values in the first overlay table entry may be used to modify one or more fields in the tuple retrieved from the mapping table (block 922). Once this is done, the tuple from the mapping table (whether modified or not) is used to generate a second overlay table key (block 914) and a second lookup is made in a second overlay table (block 916), which may or may not be the same table as the first overlay table. If a "cancel" entry is found in the second overlay table (conditional block 920), the tuple from the mapping table is deemed invalid (block 918). If a "modify" entry is found in the second overlay table (conditional block 924), one or more fields of the tuple from the mapping table may be modified (block 926). Such modification may include dropping the tuple, normalizing the tuple, and so on. The modified tuple may then be returned to the requester.
If the second overlay table does not contain a "modify" entry (conditional block 924), the tuple is returned to the requester unmodified. In some embodiments, at least portions of the overlay table(s) may be cached to provide faster access to their contents. In various embodiments, a "cancel" entry detected in the first overlay table may serve to short-circuit any other corresponding lookups (e.g., blocks 914, 916, and so on). In other embodiments, the accesses may be performed in parallel and the results reconciled. Numerous such embodiments are possible and are contemplated.
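The read path through the overlay tables can be sketched as follows. For simplicity this sketch keys the overlay tables by the mapping key itself, although, as noted above, the overlay keys need not match the mapping keys; the dictionary shapes are assumptions for illustration.

```python
def read_with_overlays(key, mapping, overlays):
    """Sketch of the FIG. 9 read path: retrieve a tuple from the mapping
    table, then apply "cancel"/"modify" entries from one or more overlay
    tables in order."""
    tuple_ = mapping.get(key)                 # block 910
    for overlay in overlays:                  # first and second lookups
        entry = overlay.get(key)
        if entry is None:
            continue
        kind, fields = entry
        if kind == "cancel":                  # conditional blocks 906/920
            return None                       # tuple is deemed invalid
        if kind == "modify":                  # conditional blocks 912/924
            tuple_ = {**tuple_, **fields}     # blocks 922/926
    return tuple_

mapping = {27: {"ptr": 512}}
ov1 = {27: ("modify", {"ptr": 640})}          # first overlay modifies key 27
ov2 = {13: ("cancel", None)}                  # second overlay cancels key 13
modified = read_with_overlays(27, mapping, [ov1, ov2])
cancelled = read_with_overlays(13, mapping, [ov1, ov2])
```

A "cancel" entry short-circuits the remaining lookups and invalidates the result, while a "modify" entry rewrites fields of the tuple before it is returned.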
Referring now to FIG. 10, a generalized block diagram of one embodiment of a flattening operation corresponding to various levels within a mapping table is shown. In various embodiments, a flattening operation may be performed in response to detecting one or more conditions. For example, as the mapping table 340 grows over time due to the insertion of new records and accumulates levels, the cost of searching more levels for a query key value may become undesirably high. To constrain the number of levels to be searched, multiple levels may be flattened into a single new level. For example, two or more levels that are logically adjacent or contiguous in time may be selected for a flattening operation. If two or more records correspond to the same key value, the latest record may be retained while the other records are not included in the new "flattened" level. In such embodiments, for a search of a given key value, the newly flattened level will return the same result that a search of the corresponding multiple levels would otherwise have provided. Since the search results in the new flattened level do not change compared to the two or more levels it replaces, the flattening operation need not be synchronized with update operations on the mapping table. In other words, flattening of a table may be performed asynchronously with respect to updates to the table.
As previously mentioned, earlier levels are fixed in the sense that their mappings are not modified (i.e., a mapping from A to B remains unchanged). Thus, no modifications (e.g., due to user writes) are made to the levels being flattened, and no synchronization locks on those levels are required. Further, in a node-based cluster environment, each node may store a copy of the earlier levels of the index (e.g., as discussed with respect to FIG. 6), so that a flattening operation may be performed on one node without locking the corresponding levels in other nodes. Thus, processing can continue in all nodes while flattening occurs in an asynchronous manner on any node. At some later point in time, the other nodes may flatten the levels themselves or use a level that has already been flattened. In one embodiment, the two or more levels used to form a flattened level may be retained for error recovery, mirroring, or other purposes. In addition to the foregoing, in various embodiments, records that have been cancelled are not re-inserted into the new level. The flattening described above may be performed, for example, in response to detecting that the number of levels in the mapping table has reached a given threshold, or in response to detecting that the size of one or more levels has exceeded a threshold. Another condition that may be considered is the load on the system. In addition to considering these conditions individually, the decision as to whether to flatten levels may consider a combination of these conditions, and may take into account both current values of the conditions and predicted future values. Other conditions under which flattening may be performed can be envisaged.
In the illustrated example, the various records are simply shown as key and pointer pairs. For ease of illustration, each page is shown as including four records. A level "F" and its next logically adjacent level "F-1" may be considered for a flattening operation, where level "F" may be later than level "F-1". Although flattening of two levels is shown here, it is contemplated that three or more levels may be selected for flattening. In the example shown, level "F-1" may have records that store the same key values found in level "F". Double-headed arrows are used to identify records that store the same key value across the two contiguous levels.
The new level "New F" includes keys corresponding to the duplicate key values found in level "F" and level "F-1". In addition, the new level "New F" includes the pointer value corresponding to the latest (or, in this example, later) record among the records storing a duplicate key value. For example, each of level "F" and level "F-1" includes a record storing the key value 4. The later record is in level "F", and that record stores the pointer value 512. Accordingly, the new level "New F" includes a record storing the key value 4 and the pointer value 512, rather than the pointer value 656 found in the earlier level "F-1". In addition, the new level "New F" includes the records with unique key values found in level "F" and level "F-1". For example, the new level "New F" includes the key and pointer pair of 6 and 246 found in level "F", and the key and pointer pair of 2 and 398 found in level "F-1". As shown, each page within the various levels is sorted by key value.
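The merge just described can be sketched with the example values from FIG. 10. Modeling each level as a dict of key-to-pointer records is an assumption of this sketch, not the on-disk page format.

```python
def flatten_levels(later_level, earlier_level):
    """Merge two time-adjacent levels into one new level. Where both
    levels store the same key, the record from the later level is
    retained; unique keys from both levels are carried over."""
    merged = dict(earlier_level)         # start from the earlier records
    merged.update(later_level)           # later records win on duplicates
    return dict(sorted(merged.items()))  # pages remain sorted by key

level_f = {4: 512, 6: 246}               # later level "F"
level_f_minus_1 = {2: 398, 4: 656}       # earlier level "F-1"
new_f = flatten_levels(level_f, level_f_minus_1)
```

The duplicate key 4 keeps the later pointer 512 (not 656), and the unique pairs (6, 246) and (2, 398) are both carried into the new level, matching the FIG. 10 example.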
As previously described, in various embodiments, an overlay table may be used to modify or cancel tuples corresponding to key values in the underlying mapping table. Such overlay table(s) may be managed in a manner similar to the mapping table; for example, the overlay table may be flattened and adjacent entries merged together to save space. Alternatively, the overlay table may be managed in a manner different from that used to manage the mapping table. In some embodiments, the overlay table may contain a single entry that refers to a range of overlay table keys. In this way, the size of the overlay table can be limited. For example, if the mapping table includes k valid entries, the overlay table (after flattening) need include no more than k+1 entries marking as invalid the ranges corresponding to the gaps between valid entries in the mapping table. Accordingly, the overlay table may be used to identify tuples that may be dropped from the mapping table in a relatively efficient manner. In addition to the foregoing, while the above discussion describes using the overlay table to cancel responses to requests against the mapping table(s), the overlay table may also be used to cancel or modify values during a flattening operation on the mapping table. Accordingly, when a new level is created during a flattening operation, a key value that would otherwise be inserted into the new level may be eliminated, or a value may be modified before insertion into the new level. Such modification may result in (in the new level) a single record corresponding to a given key range in the mapping table being replaced with multiple records, each corresponding to a sub-range of the original record. Additionally, a record may be replaced with a new record corresponding to a smaller range, or multiple records may be replaced with a single record whose range covers all of the ranges of the original records.
All such embodiments are contemplated.
Referring now to FIG. 11, a generalized block diagram of one embodiment of a flattening operation corresponding to various levels within a mapping table is shown. As previously described, the levels may be time ordered. In the illustrated example, level "F", which includes one or more indexes and corresponding mappings, is logically above the earlier level "F-1", and level "F" is logically below the later level "F+1". Similarly, level "F-1" is logically above the earlier level "F-2", and the later level "F+2" is logically above level "F+1". In one embodiment, levels "F" and "F-1" may be considered for a flattening operation. Double-headed arrows are used to illustrate that there are records storing the same key values across the two contiguous levels.
As previously described, the new level "New F" includes the key values corresponding to duplicate key values found in level "F" and level "F-1". In addition, the new level "New F" includes the pointer value corresponding to the latest (or, in this example, later) record among the records storing a duplicate key value. When the flattening operation completes, levels "F" and "F-1" may not yet have been removed from the mapping table. In a node-based cluster, each node may verify that it is ready to utilize the new single level, such as level "New F", and no longer use the two or more levels it replaces (such as levels "F" and "F-1"). This verification may be performed before the new level is used as a replacement. In one embodiment, the two or more replaced levels (such as levels "F" and "F-1") may be kept in storage for error recovery, mirroring, or other purposes. To maintain the time ordering of the levels and their mappings, the new flattened level "New F" is logically placed below the later levels (e.g., level "F+1") and above the original levels it replaces (e.g., level "F" and level "F-1").
Referring now to FIG. 12, one embodiment of a method 1000 for flattening levels within a mapping table is illustrated. Various components embodied in the network architecture 100 and mapping table 340 as previously described may operate generally in accordance with the method 1000. For purposes of discussion, the steps in this embodiment are shown in sequential order. In other embodiments, however, some steps may occur in a different order than shown, some steps may be performed concurrently, some steps may be combined with other steps, and some steps may be absent.
In block 1002, storage space is allocated for the mapping table and corresponding indexes. In block 1004, one or more conditions are determined for flattening two or more levels within the mapping table. For example, the cost of searching the current number of levels within the mapping table may be greater than the cost of performing the flattening operation. Further, the cost may be based on at least one of: the current (or predicted) number of levels in the structure to be flattened, the number of entries in one or more levels, the number of mapping entries that would be elided or modified, and the load on the system. The cost may also include: the time to perform the corresponding operations, the occupancy of one or more buses, the storage space used during the corresponding operations, the number of duplicate entries in a set of levels having reached a given threshold, and so on. In addition, a count of the number of records in each level may be used to estimate when a flattening operation performed on two adjacent levels would produce a new single level with a number of records equal to twice the number of records in the next previous level. These conditions, and others, may be considered individually or in any combination.
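One possible reading of these conditions is the following heuristic sketch; the function name, the specific thresholds, and the choice of inputs are all invented for illustration, and a real controller could additionally weigh bus occupancy, load, and operation time:

```python
def should_flatten(num_levels, level_sizes, duplicate_count,
                   max_levels=8, duplicate_threshold=1024):
    """Decide whether a flattening operation is worthwhile.
    `level_sizes` lists record counts per level, newest level first."""
    # Condition 1: too many levels makes every lookup walk a long chain,
    # so searching becomes costlier than merging.
    if num_levels > max_levels:
        return True
    # Condition 2: duplicate entries across levels reached a threshold.
    if duplicate_count >= duplicate_threshold:
        return True
    # Condition 3: estimate, from per-level record counts, when merging
    # two adjacent levels would yield a new single level with about
    # twice the number of records of the next previous level.
    for newer, older in zip(level_sizes, level_sizes[1:]):
        if newer + older >= 2 * older:
            return True
    return False
```

For example, `should_flatten(9, [1], 0)` triggers on the level count alone, while `should_flatten(2, [2, 100], 0)` does not, since no condition is met.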
In block 1006, the index and mapping table are accessed and updated as data is stored and new mappings are found. As new records are inserted into the mapping table, the number of levels within the mapping table increases. If conditions for flattening two or more levels within the mapping table are detected (conditional block 1008), then in block 1010 one or more groups of levels are identified for flattening. A group may include two or more levels. In one embodiment, the two or more levels are contiguous levels. While the lowest (or earliest) levels may be the best candidates for flattening, a later group may also be selected.
In block 1012, a new single level is generated for each group, including the most recent records within the corresponding group. In the previous example, the new single level "New F" includes the latest records among level "F" and level "F-1". In block 1014, in a node-based cluster, an acknowledgement may be requested from each node within the cluster to indicate that the corresponding node is ready to utilize the new levels produced by the flattening operation. When each node confirms that it can utilize a new level, in block 1016 the new level is used to replace the current levels within the identified group. In other embodiments, synchronization across nodes is not required. In such embodiments, some nodes may begin using a new level before other nodes do. Furthermore, some nodes may continue to use the original levels even after the newly flattened level is available. For example, a particular node may have the original level data cached and prefer to use that data rather than the uncached data of the newly flattened level. Numerous such embodiments are possible.
Referring now to FIG. 13, one embodiment of a method 1100 for efficiently processing bulk array tasks within a mapping table is illustrated. Like the other methods described, the various components embodied in the network architecture 100 and mapping table 340 as previously described may operate generally in accordance with the method 1100. In addition, the steps in this embodiment are shown in sequential order. In other embodiments, however, some steps may occur in a different order than shown, some steps may be performed concurrently, some steps may be combined with other steps, and some steps may be absent.
Fine-grained mapping may be achieved by storing information in the mapping table in a compressed format, thereby allowing mapping information in the mapping table to be manipulated directly as an alternative to common bulk array tasks. Direct mapping manipulation may reduce I/O network and bus traffic. As previously described, flash memory has a low "seek time," allowing a number of dependent read operations to occur in less time than a single operation from a rotating disk. These dependent reads may be used to perform online fine-grained mapping to integrate space-saving features such as compression and deduplication. In addition, these dependent read operations may allow the storage controller 174 to perform bulk array tasks entirely within the mapping table, rather than accessing (reading and writing) user data stored within the storage devices 176a-176m.
A large or bulk array task is received in block 1102. For example, a bulk copy or move request may correspond to a backup of tens or hundreds of virtual machines, in addition to enterprise application data being executed and updated by the virtual machines. The amount of data associated with a received request for such a move, branch, clone, or copy of data may be as large as 16 gigabytes (GB) or more. If user data were accessed to process the request, a significant amount of processing time might be spent on the request and system performance would degrade. Furthermore, the total input/output (I/O) resources of a virtualized environment are typically less than those of a physical environment.
In block 1104, the storage controller 174 may store an indication corresponding to the received request that associates a range of new keys with a range of old keys, where both ranges of keys correspond to the received request. For example, if the received request is to copy 16GB of data, a start key value and an end key value corresponding to the 16GB of data may be stored. Each of the start and end key values may include a volume ID, a logical or virtual address within the received request, a snapshot ID, a sector number, and so forth. In one embodiment, this information may be stored separately from the information stored in indexes such as the primary index 310, the secondary index 320, the tertiary index 330, and so forth, but this information may be accessed when the indexes are accessed during the processing of subsequent requests.
In block 1106, the storage controller 174 may convey a response to the corresponding client of the client computer systems 110a-110c indicating completion of the received request without prior access to the user data. Thus, the storage controller 174 may process the received request with little or no delay and with little or no load on processor 112.
In block 1108, the storage controller 174 may set a condition, an indication, or a flag, or buffer update operations, for updating one or more records in the mapping table to replace the old keys in the mapping table with the new keys. For both move requests and copy requests, one or more new records corresponding to the new keys may be inserted into the mapping table. As previously described, the keys may be inserted into a newly created highest level. For a move request, one or more old records may be removed from the mapping table after the corresponding new records have been inserted. The records in the mapping table may be updated immediately or at some later time.
For a zeroing or erase request, an indication may be stored that a range of key values now corresponds to a series of binary zeros. Further, as discussed previously, an overlay table may be used to identify key values that are not (or are no longer) valid. The user data need not be overwritten. For an erase request, the user data may be overwritten at some later time, when the "freed" storage locations are allocated to new data corresponding to a subsequent store (write) request. For an externally directed defragmentation request, contiguous addresses may be selected for sector reorganization, which may benefit applications executing on the clients of the client computer systems 110a-110c.
If the storage controller 174 receives a data storage access request corresponding to one of the new keys (conditional block 1110) and the new key has been inserted into the mapping table (conditional block 1112), then in block 1114, the index and mapping table may be accessed using the new key. For example, the new key may be utilized to access the primary index 310, the secondary index 320, or the tertiary index 330. When one or more pages of the mapping table are identified by the index, the identified pages may then be accessed. In block 1116, the storage access request may be serviced with the physical pointer value associated with the new key found in the mapping table.
If the storage controller 174 receives a data storage access request corresponding to one of the new keys (conditional block 1110) and the new key has not yet been inserted into the mapping table (conditional block 1112), then in block 1118, the index and mapping table may be accessed using the corresponding old key. The storage holding the association between the old key range and the new key range may be accessed to determine the corresponding old key value. When one or more pages of the mapping table are identified by the index, the identified pages may then be accessed. In block 1120, the storage access request may be serviced with a physical pointer value associated with the old key found in the mapping table.
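Blocks 1104-1120 can be sketched as follows. The class, its field names, and the flat dictionary standing in for the index and mapping table are assumptions made for illustration, not the patented implementation:

```python
class RangeRedirectTable:
    """Sketch of the bulk-copy shortcut: record that a range of new keys
    aliases a range of old keys, and answer lookups for not-yet-inserted
    new keys by translating them to old keys."""

    def __init__(self, mapping_table):
        self.mapping_table = mapping_table   # key -> physical pointer value
        self.redirects = []                  # (new_start, new_end, old_start)

    def record_copy(self, new_start, new_end, old_start):
        # Acknowledge a bulk copy by storing a few numbers (block 1104)
        # instead of rewriting the mapping for every key in the range.
        self.redirects.append((new_start, new_end, old_start))

    def lookup(self, key):
        if key in self.mapping_table:        # new key already inserted (1114)
            return self.mapping_table[key]
        for new_start, new_end, old_start in self.redirects:
            if new_start <= key < new_end:   # translate new -> old key (1118)
                return self.mapping_table[old_start + (key - new_start)]
        raise KeyError(key)

table = {0: "p0", 1: "p1", 2: "p2"}
rt = RangeRedirectTable(table)
rt.record_copy(100, 103, 0)    # "copy" keys 0..2 to keys 100..102
print(rt.lookup(101))          # p1 -- serviced via the old key
```

The copy completes as soon as `record_copy` returns; no user data moves, and lookups on the new keys succeed immediately via the old-key translation.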
Referring now to FIG. 14, a generalized block diagram of one embodiment of a data layout architecture within a storage device is shown. In one embodiment, the data storage locations within storage devices 176a-176m may be arranged as a Redundant Array of Independent Devices (RAID) array. As shown, different types of data may be stored in the storage devices 176a-176k according to a data layout architecture. In one embodiment, each storage device 176a-176k is an SSD. One allocation unit within an SSD may include one or more erase blocks within the SSD.
User data 1230 may be stored in one or more pages included in one or more of the storage devices 176a-176k. Within each intersection of a RAID stripe with one of the storage devices 176a-176k, the stored information may be formatted as a series of logical pages. Each logical page, in turn, may include a header and a checksum for the data in the page. When a read is issued, it may be for one or more logical pages, and the data in each page may be validated with the checksum. Since each logical page may include a page header containing a checksum for the page (which may be referred to as a "media" checksum), the actual page size available for data may be smaller than one logical page. In some embodiments, the page header may be smaller for pages of the inter-storage-device recovery data 1250 (such as RAID parity information), so that the parity pages protect the page checksums in the data pages. In other embodiments, the checksum in a parity page of the inter-storage-device recovery data 1250 may be calculated so that the checksum of the data page checksums is the same as the checksum of the parity page covering the corresponding data pages. In such embodiments, the header corresponding to a parity page need not be smaller than the header corresponding to a data page.
Inter-device ECC data 1250 may be parity information generated from one or more pages on other storage devices that hold user data. For example, the inter-device ECC data 1250 may be parity information used in a RAID data layout architecture. Although the stored information is shown as contiguous logical pages in storage devices 176a-176k, it is known in the art that the individual logical pages may be arranged in a random order, where each storage device 176a-176k is an SSD.
The intra-device ECC data 1240 may include information used by an intra-device redundancy scheme. The intra-device redundancy scheme utilizes ECC information, such as parity information, within a given storage device. The intra-device redundancy scheme and its ECC information correspond to a given device and may be maintained within the given device, but is different from an ECC that may be generated and maintained internally by the device itself. Typically, the ECC generated and maintained internally to the device is not visible to the system in which the device is included.
The intra-device ECC data 1240 may also be referred to as intra-device error recovery data 1240. The intra-device error recovery data 1240 may be used to protect a given storage device from latent sector errors (LSEs). An LSE is an error that is not detected until a given sector is accessed. Thus, any data previously stored in the given sector may be lost. A single LSE may cause data loss when encountered during a RAID rebuild following a storage device failure. The term "sector" generally refers to a basic storage unit on an HDD, such as a segment within a given track on the disk. Herein, the term "sector" may also refer to a basic allocation unit on an SSD. A latent sector error (LSE) may occur when a given sector or other unit of storage within a storage device is inaccessible; a read or write operation may not complete for the given sector. In addition, an uncorrectable error correction code (ECC) error may occur.
The intra-device error recovery data 1240 included within a given storage device may be used to improve data storage reliability within the given storage device. The intra-device error recovery data 1240 is in addition to other ECC information that may be included in another storage device, such as parity information utilized in a RAID data layout architecture.
Within each storage device, the intra-device error recovery data 1240 may be stored in one or more pages. As is well known to those skilled in the art, the intra-device error recovery data 1240 may be obtained by performing a function on selected bits of information within the user data 1230. An XOR (exclusive OR) based operation may be used to derive parity information for storage in the intra-device error recovery data 1240. Other examples of intra-device redundancy schemes include single parity check (SPC), maximum distance separable (MDS) erasure codes, interleaved parity check codes (IPC), hybrid SPC and MDS codes (MDS+SPC), and column diagonal parity (CDP). The schemes differ in the reliability delivered and the overhead incurred, depending on the manner in which the data 1240 is computed.
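For instance, a minimal XOR-based parity derivation might look like the following sketch (pages are modeled as byte strings; this illustrates the XOR operation only, not the patented scheme):

```python
def xor_parity(pages):
    """Derive a parity page as the byte-wise XOR of the data pages, so
    any single lost page can be rebuilt from the parity page and the
    surviving pages."""
    parity = bytearray(len(pages[0]))
    for page in pages:
        for i, b in enumerate(page):
            parity[i] ^= b
    return bytes(parity)

pages = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
parity = xor_parity(pages)
# Rebuild page 1 from the parity page and the other data pages:
recovered = xor_parity([parity, pages[0], pages[2]])
assert recovered == pages[1]
```

XOR-based parity is the simplest of the schemes listed above; SPC, MDS, IPC, and CDP trade additional computation for protection against more simultaneous failures.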
In addition to the error recovery information described above, the system may be configured to calculate a checksum value corresponding to a zone on the device. For example, a checksum may be calculated when writing information to the device. The checksum is stored by the system. When reading back information from the device, the system may again calculate a checksum and compare it to the originally stored value. If the two checksums are different, the information is not read correctly and the system may use other schemes to recover the data. Examples of checksum functions include Cyclic Redundancy Check (CRC), MD5, and SHA-1.
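A minimal sketch of this write-time/read-time checksum protocol, using CRC-32 from Python's standard zlib module (the dictionary standing in for the device and the function names are invented for illustration):

```python
import zlib

def write_with_checksum(device, addr, data):
    """Compute and store a CRC-32 alongside the data at write time."""
    device[addr] = (data, zlib.crc32(data))

def read_and_verify(device, addr):
    """Recompute the checksum on read and compare it to the stored value;
    a mismatch means the data must be recovered by another scheme."""
    data, stored = device[addr]
    if zlib.crc32(data) != stored:
        raise IOError("checksum mismatch: recover via parity/ECC")
    return data

device = {}
write_with_checksum(device, 0, b"user data")
assert read_and_verify(device, 0) == b"user data"
device[0] = (b"corrupted", device[0][1])   # simulate a silently bad read
try:
    read_and_verify(device, 0)
except IOError:
    print("corruption detected")
```

A production system would typically prefer a stronger function such as SHA-1 when checksum collisions matter, at the cost of more computation per page.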
An erase block within an SSD may include several pages. One page may comprise 4KB of data storage space. An erase block may include 64 pages or 256 KB. In other embodiments, an erase block may be as large as 1 Megabyte (MB) and include 256 pages. The allocation unit size may be selected in a manner that provides both sufficiently large size units and a relatively low number of units in order to reduce the overhead of tracking allocation units. In one embodiment, one or more state tables may maintain a status of an allocation unit (allocated, free, erased, error), a wear-out level, and a count (correctable and/or uncorrectable) of the number of errors that have occurred within the allocation unit. In one embodiment, one allocation unit is relatively small compared to the total storage capacity of the SSD. Other amounts of data storage space corresponding to pages, erase blocks, and other unit arrangements are contemplated.
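A state table of the kind described can be sketched as follows; the field names and table layout are assumptions made for illustration, while the 64-page/4KB figures mirror the example above:

```python
from dataclasses import dataclass

@dataclass
class AllocationUnitState:
    """One row of a hypothetical allocation-unit state table, with the
    fields described above: status, wear level, and error counts."""
    status: str = "free"            # allocated / free / erased / error
    wear_level: int = 0
    correctable_errors: int = 0
    uncorrectable_errors: int = 0

# An erase block of 64 pages x 4 KB is a 256 KB allocation unit:
PAGE_SIZE = 4 * 1024
PAGES_PER_ERASE_BLOCK = 64
assert PAGES_PER_ERASE_BLOCK * PAGE_SIZE == 256 * 1024

state_table = [AllocationUnitState() for _ in range(8)]
state_table[3].status = "allocated"
state_table[3].wear_level += 1
```

Because one allocation unit is large relative to a page yet small relative to the SSD's total capacity, such a table stays compact while still tracking wear and errors at a useful granularity.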
Metadata 1260 may include page header information, RAID stripe identification information, log data corresponding to one or more RAID stripes, and so on. In various embodiments, a single metadata page at the beginning of each stripe may be reconstructed from the other stripe headers. Alternatively, the page may be at a different offset in the parity shard, so that the data may be protected by inter-device parity. In one embodiment, the metadata 1260 may store or be associated with a particular flag value indicating that the data should not be copied.
In addition to inter-device parity protection and intra-device parity protection, each page of the storage devices 176a-176k may include additional protection, such as a checksum stored within the page. The checksum (8 bytes, 4 bytes, or another size) may be placed inside the page after the header and before the corresponding data, which may be compressed. For yet another level of protection, data location information may be included in the checksum value, and the data in each page may include this information. The information may include both a virtual address and a physical address. Sector numbers, data chunk and offset numbers, track numbers, plane numbers, and so on, may also be included. This mapping information may also be used to rebuild the address translation mapping table if the contents of the table are ever lost.
In one embodiment, each page in storage devices 176a-176k stores data of a particular type, such as the data types 1230-1260. Alternatively, a page may store more than one type of data. The page header may store information identifying the data type corresponding to the page. In one embodiment, an intra-device redundancy scheme divides a device into groups of locations for storing user data. For example, a group may be a set of locations within a device that correspond to a stripe within a RAID layout. In the example shown, only two stripes, 1270a and 1270b, are shown for ease of illustration.
In one embodiment, a RAID engine within the storage controller 174 may determine the level of protection to use for the storage devices 176a-176k. For example, the RAID engine may determine to utilize RAID double parity for the storage devices 176a-176k. The inter-device redundancy data 1250 may represent the RAID double parity values generated from the corresponding user data. In one embodiment, storage devices 176j and 176k may store the double parity information. It should be understood that other levels of RAID parity protection are possible. Further, in other embodiments, the storage of the double parity information may rotate among the storage devices, rather than being stored within storage devices 176j and 176k for each RAID stripe. For ease of illustration and description, the double parity information is shown as being stored in storage devices 176j and 176k. Although each storage device 176a-176k includes multiple pages, only page 1212 and page 1220 are labeled for ease of illustration.
It should be noted that the previously described embodiments may include software. In such embodiments, program instructions that implement the various methods and/or mechanisms may be conveyed over or stored on a computer-readable medium. Numerous types of media configured to store program instructions are available, including hard disks, floppy disks, CD-ROMs, DVDs, flash memory, programmable ROMs (PROMs), random access memories (RAMs), and various other forms of volatile or non-volatile storage.
In various embodiments, one or more portions of the methods and mechanisms described herein may form part of a cloud computing environment. In such embodiments, resources may be provided as services over the Internet according to one or more models. Such models may include infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). In IaaS, the computer infrastructure is delivered as a service. In such a case, the computing equipment is generally owned and operated by the service provider. In the PaaS model, software tools and the underlying equipment used by developers to develop software solutions may be provided as a service and hosted by the service provider. SaaS typically includes a service provider licensing software as a service on demand. The service provider may host the software, or may deploy the software to a customer for a given period of time. Numerous combinations of the above models are possible.
Although various embodiments have been described in considerable detail above, those skilled in the art will recognize many variations and modifications once the foregoing disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.
Claims (22)
1. A computer system, comprising:
a data storage medium;
a data storage controller coupled to the data storage medium; and
a mapping table comprising a plurality of entries, each mapping table entry comprising a tuple comprising a key;
wherein the data storage controller is configured to encode each tuple in the mapping table with a variable length code.
2. The computer system of claim 1, wherein the mapping table is organized into a plurality of time ordered levels, each level including one or more mapping table entries.
3. The computer system of claim 1, wherein a particular encoding of the plurality of encodings for the given tuple is selected based at least in part on a size of the unencoded given tuple, a size of the encoded given tuple, and a time to encode the given tuple.
4. The computer system of claim 1, wherein the data storage controller is configured to perform a plurality of encodings on a given tuple and to select one of the plurality of encodings as a final encoding.
5. The computer system of claim 1, wherein the data storage controller is configured to check a given mapping table entry to see if it satisfies a query without first decoding the entry.
6. The computer system of claim 1, wherein the mapping table stores entries whose keys correspond to virtual blocks in the system.
7. The computer system of claim 6, wherein an entry for a given virtual block range stores information that facilitates locating data comprising the range of blocks.
8. The computer system of claim 6, wherein an entry for a given virtual block range stores a hash value computed over data within the given virtual block range.
9. The computer system of claim 6, wherein an entry for a given virtual block range stores a data pattern that can be used to reconstruct the virtual block range without further access to storage.
10. The computer system of claim 1, wherein the mapping table includes data stored as pages in a data storage medium, and the data storage controller is configured to use different data encodings for at least some of the pages.
11. The computer system of claim 10, wherein a ratio between a size of an encoded representation of a tuple and a size of an unencoded representation of the tuple can be different for each page.
12. The computer system of claim 1, wherein a maximum number of bits used to represent an unencoded tuple can be altered without re-encoding the tuple.
13. The computer system of claim 1, wherein a size of the mapping table is proportional to an amount of space for which a valid mapping exists.
14. The computer system of claim 1, wherein the variable length code comprises a base and an offset.
15. A method for use in a storage system, the method comprising:
storing a mapping table organized into a plurality of levels, each level of the plurality of levels comprising one or more mapping table entries, wherein each of the plurality of entries comprises a tuple comprising a key; and
encoding each tuple in the mapping table with a variable length code.
16. The method as recited in claim 15, wherein the mapping table is organized into a plurality of time ordered levels, each level including one or more mapping table entries.
17. The method as recited in claim 15, further comprising selecting a particular encoding of a plurality of encodings for a given tuple based at least in part on a size of the unencoded given tuple, a size of the encoded given tuple, and a time to encode the given tuple.
18. The method as recited in claim 15, further comprising performing a plurality of encodings on a given tuple and selecting one of the plurality of encodings as a final encoding.
19. The method as recited in claim 15, further comprising using different data encodings for at least some of the pages.
20. The method of claim 15, wherein the maximum number of bits used to represent an unencoded tuple can be altered without re-encoding the tuple.
21. The method as recited in claim 15, wherein the size of the mapping table is proportional to the amount of space for which there is a valid mapping.
22. A non-transitory computer readable storage medium storing program instructions executable by a processor to:
storing a mapping table organized into a plurality of levels, each level of the plurality of levels comprising one or more mapping table entries, wherein each of the plurality of entries comprises a tuple comprising a key; and
encoding each tuple in the mapping table with a variable length code.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US13/250,579 | 2011-09-30 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| HK1198065A true HK1198065A (en) | 2015-03-06 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| USRE49011E1 (en) | Mapping in a storage system | |
| JP6124902B2 (en) | Variable length coding in storage systems | |
| AU2012294218B2 (en) | Logical sector mapping in a flash storage array | |
| US8886691B2 (en) | Garbage collection in a storage system | |
| HK1198065A (en) | Variable length encoding in a storage system | |
| HK1196887A (en) | Logical sector mapping in a flash storage array | |
| HK1196683A (en) | Mapping in a storage system |