
GB2442285A - A distributed file system - Google Patents


Info

Publication number
GB2442285A
GB2442285A GB0624664A GB0624664A GB2442285A GB 2442285 A GB2442285 A GB 2442285A GB 0624664 A GB0624664 A GB 0624664A GB 0624664 A GB0624664 A GB 0624664A GB 2442285 A GB2442285 A GB 2442285A
Authority
GB
United Kingdom
Prior art keywords
metadata
file system
data
node
distributed file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0624664A
Other versions
GB0624664D0 (en)
Inventor
Eric Zigmund Sandler
Konstantin Belousov
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FUJIN TECHNOLOGY PLC
XPLOITE PLC
Original Assignee
FUJIN TECHNOLOGY PLC
XPLOITE PLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FUJIN TECHNOLOGY PLC, XPLOITE PLC filed Critical FUJIN TECHNOLOGY PLC
Publication of GB0624664D0 publication Critical patent/GB0624664D0/en
Priority to PCT/GB2007/003363 priority Critical patent/WO2008029146A1/en
Publication of GB2442285A publication Critical patent/GB2442285A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A distributed file system for managing access to data comprises: a plurality of storage nodes 2-4, each node including a memory arranged for storing data items, e.g. web pages, and a memory management unit arranged for execution on one of a plurality of different operating systems (UNIX, LINUX, WINDOWS NT) and arranged for providing access to a data item to a client node 12, 13; a master node 7, 8 arranged for storing a metadata database including a plurality of metadata, each metadata defining a mapping between a data item and one or more storage nodes 2-4 containing the data item; and a client node 12, 13 arranged for requesting metadata related to a data item from the master node 7, 8 and accessing the data item from one or more storage nodes 2-4 containing the item using the requested metadata. A client management unit 15 may provide a standard interface (Application Programming Interface, API) for the distributed file system such that applications executing in user space can utilise the client management unit in the same or a similar way to the local file system of the operating system.

Description

A DISTRIBUTED FILE SYSTEM
Field of Invention
The present invention relates to a distributed file system, particularly a file system distributed over a heterogeneous operating system environment.
Background
Data is stored by computer systems within secondary memory, such as on hard disk, using a file system. The file system manages access to the data, for example, for read, write, and delete purposes.
Access to the data can be managed for multiple users by storing the data on a file server which is accessible to the users over a network.
There is a requirement for some organisations or for some purposes for the storage of very large quantities of data (for example, archiving the internet).
One way of storing very large quantities of data is by use of a specialised computer system with multiple large hard disks. However, specialised hardware is very expensive.
There are several alternative methods for providing access to very large quantities of data using a distributed model.
One method is implemented by Google and is known as the Google File System (GFS). GFS stores data within chunks across multiple storage nodes.
All the metadata related to the data, such as on which storage node the data is stored, is stored on a single master node. Client nodes request access to the data by obtaining metadata from the master node and using the metadata to access the data directly from the storage nodes. GFS includes the use of a customised Linux kernel for the storage nodes.
Another distributed file system is a POSIX-compliant open-source system called Lustre.
Each file stored on a Lustre file system is considered as an object. Lustre provides clients with concurrent read and write access to the objects. A Lustre file system has four functional units: a metadata server (MDS) for storing the metadata; an object storage target (OST) for storing the actual data; an object storage server (OSS) for managing the OSTs; and clients for accessing and using the data. OSTs are block-based devices.
An MDS, OSS, and an OST can be on the same node or on different nodes.
Lustre is deployable only on systems running variants of the Linux operating system.
The disadvantage with GFS and Lustre is that neither system provides the capability for implementation across systems with heterogeneous operating system deployment.
Within organisations there is generally an established hardware and software infrastructure. Within this infrastructure a large portion of processing cycles and storage space goes unused, particularly the processing cycles and storage space of user machines.
There is a desire to utilise the unused resources of an organisation's existing infrastructure for storing very large quantities of data.
However, user machines are unreliable in that a user may dominate processing cycles or the user machine may be temporarily unavailable due to user-caused reboots.
In addition, in most infrastructures there is heterogeneity in hardware and operating systems across the organisation.
It is an object of the present invention to provide a reliable distributed file system for providing access to very large quantities of data stored across a heterogeneous infrastructure which overcomes the above disadvantages, or to at least provide a useful alternative.
Summary of the Invention
According to a first aspect of the invention there is provided a distributed file system for managing access to data, including: a plurality of storage nodes, each node including a memory arranged for storing data items and a memory management unit arranged for execution on one of a plurality of different operating systems and arranged for providing access to a data item to a client node; a master node arranged for storing a metadata database including a plurality of metadata, each metadata defining a mapping between a data item and one or more storage nodes containing the data item; and a client node arranged for requesting metadata related to a data item from the master node and accessing the data item from one or more storage nodes containing the item using the requested metadata.
The metadata may include a universal identifier.
It is preferred that the memory of each storage node is secondary memory.
It is also preferred that the metadata database is stored within primary memory of the master node.
The data may be web content, such as web pages.
The client node may be further arranged for requesting the metadata by transmitting a URL for the web content to a master node.
Preferably, the client node is further arranged for requesting metadata for data to be stored in the file system. The master node is preferably further arranged, in response to the request for metadata for data to be stored, for providing a metadata corresponding to an empty block on a storage node. Alternatively, the master node may be further arranged, in response to the request for metadata for data to be stored, for providing a plurality of metadata, the metadata corresponding to empty blocks on a plurality of storage nodes.
The client node may be further arranged for writing the data to each of the storage nodes.
The metadata may include checksums related to the corresponding data item.
The client node may be further arranged for calculating checksums for a data item and comparing the calculated checksums to the checksums within the metadata for the data item.
It is preferred that the memory management unit is executing within user space of the operating system.
At least some of the nodes may exist on the same computing device.
Preferably, the client node includes a file management unit and the file management unit is arranged for requesting metadata related to a data item from the master node and accessing the data item from one or more storage nodes containing the item using the requested metadata. The file management unit is preferably arranged for execution within kernel space of an operating system.
Brief Description of the Drawings
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which: Figure 1 shows a schematic diagram illustrating a distributed file system in accordance with an embodiment of the invention.
Figure 2 shows a schematic diagram illustrating a storage node for use within the distributed file system.
Detailed Description of the Preferred Embodiments
The present invention provides a file system distributed over a heterogeneous operating system environment. The system comprises a number of storage nodes executing on different computer platforms containing data for the file system, a master node containing metadata for the file system and a client node which accesses data on the storage nodes by first querying the master node for the relevant metadata.
Figure 1 shows storage nodes 1, 2, and 3.
Each storage node includes a storage management unit 4 connected with a memory 5. The memory 5 is secondary non-volatile non-removable memory such as a hard disk. The storage management unit may be connected with the memory via a hard disk interface such as IDE or SCSI.
In other embodiments the memory 5 may be removable or volatile.
The storage management unit 4 may be connected through a network 6 or inter-network to master nodes and client nodes.
Master nodes 7 and 8 are shown in Figure 1.
Each master node includes a master management unit 9 connected with a memory 10. The memory 10 is primary memory such as RAM. The master management unit 9 may be connected with the memory 10 via a bus 11.
In other embodiments the memory 10 may be secondary memory, such as a hard disk or flash RAM.
The master management unit 9 may be connected through the network 6 to client nodes and storage nodes.
Client nodes 12 and 13 are also shown.
Each client node includes an application 14 and a client management unit 15.
The application 14 is executing within user space 16 on the operating system of the client node 13.
The client management unit 15 may be executing within kernel space 17 of the operating system of the client node 13. A request by the application 14 for data stored within the distributed file system is sent to the client management unit 15 by the operating system.
The client management unit 15 may provide a standard interface (Application Programming Interface, API) for the distributed file system such that applications executing in user space can utilise the client management unit in the same or a similar way to the local file system of the operating system.
The client management unit 15 may be connected through the network 6 to master nodes and storage nodes.
An embodiment of the invention will now be described.
The client management unit 15 receives a request for access to data from an application 14 executing on the client node 13. The access can be for read, write, or delete access.
The request includes a universal identifier for the data such as a fixed length key. A 64-bit key may be used.
In alternative embodiments other identifiers may be used such as path and file name identifiers.
The client management unit 15 of the client node 13 transmits the request to the master node 7 over the network 6.
The master management unit 9 of the master node 7 receives the request for access to data from the client node 13.
The master management unit 9 includes a database for storing metadata about the data within the distributed file system, such as Berkeley DB or a relational database. The database is stored within memory 10.
The metadata database maps identifiers for the data to the metadata for the data using an index. The database can use any indexing system, such as hashing or tries.
The data may be stored as files. The metadata includes attributes about the file such as inodes. The inodes record which blocks comprise the file, on which storage nodes the blocks are located, and checksums for the blocks.
The master management unit 9 includes a free extents list: a list of free blocks across the storage nodes.
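The free extents list can be sketched as a per-node pool of unallocated block identifiers from which the master hands out blocks for new writes. The following is a minimal illustration only; the class and method names, and the policy of preferring the node with the most free blocks, are assumptions not taken from the patent.

```python
from collections import defaultdict

class FreeExtentsList:
    """Hypothetical sketch of the master's free extents list: it tracks
    which block identifiers are unallocated on each storage node."""

    def __init__(self):
        self.free = defaultdict(set)  # storage node -> set of free block ids

    def add_free(self, node, block_id):
        self.free[node].add(block_id)

    def allocate(self, count):
        """Hand out `count` free blocks, preferring nodes with most free blocks
        (an illustrative policy; the patent does not specify one)."""
        allocated = []
        for node in sorted(self.free, key=lambda n: len(self.free[n]), reverse=True):
            while self.free[node] and len(allocated) < count:
                allocated.append((node, self.free[node].pop()))
            if len(allocated) == count:
                break
        return allocated

fel = FreeExtentsList()
for b in range(3):
    fel.add_free("node-1", b)
fel.add_free("node-2", 100)
blocks = fel.allocate(2)
print(blocks)  # two (node, block_id) pairs, both drawn from node-1
```

When a client deletes data, the storage node would notify the master, which returns the freed block identifiers to this list.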
The master management unit 9 may include a locking system to ensure that multiple write accesses are not given on the same file to different client nodes, and write accesses are not given when the file is open for read access. The locking system may be a lease-based locking where client nodes obtain leases for specific access to specific data from the master node.
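The lease-based scheme described above can be sketched as follows: a file may carry many live read leases or a single live write lease, never both, and a lease silently expires after a fixed interval. All names, the lease duration, and the expiry mechanics are illustrative assumptions, not details from the patent.

```python
import time

class LeaseManager:
    """Hypothetical sketch of the master's lease-based locking: many readers
    or one writer per file, with time-limited leases."""

    def __init__(self, lease_seconds=30):
        self.lease_seconds = lease_seconds
        self.leases = {}  # file id -> list of (client, mode, expiry)

    def _live(self, file_id, now):
        # Drop expired leases so a crashed client cannot hold a file forever.
        live = [l for l in self.leases.get(file_id, []) if l[2] > now]
        self.leases[file_id] = live
        return live

    def acquire(self, file_id, client, mode, now=None):
        now = time.monotonic() if now is None else now
        live = self._live(file_id, now)
        if mode == "write" and live:
            return False  # no write lease while any lease is live
        if mode == "read" and any(m == "write" for _, m, _ in live):
            return False  # no read lease while a writer holds the file
        live.append((client, mode, now + self.lease_seconds))
        return True

mgr = LeaseManager()
assert mgr.acquire("f1", "client-A", "read", now=0.0)
assert not mgr.acquire("f1", "client-B", "write", now=1.0)  # readers block writers
assert mgr.acquire("f1", "client-B", "write", now=31.0)     # read lease expired
```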
The master management unit 9 uses the index within the metadata database to map the identifier to the metadata and then sends the metadata to the client node 13.
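The lookup step can be illustrated with an in-memory mapping from a 64-bit identifier to a metadata record listing the file's blocks, their storage nodes, and their checksums. The record layout and all names here are assumptions for illustration; the patent only requires that the database map identifiers to metadata via some index.

```python
# Hypothetical in-memory metadata database held by the master node.
metadata_db = {
    0x1A2B3C4D5E6F7081: {  # 64-bit universal identifier
        "blocks": [
            {"block_id": 17, "node": "storage-1", "checksum": "9e107d9d"},
            {"block_id": 18, "node": "storage-3", "checksum": "e4d909c2"},
        ],
    },
}

def lookup(uid):
    """Map an identifier to its metadata, as the master management unit does,
    so the client can contact the listed storage nodes directly."""
    record = metadata_db.get(uid)
    if record is None:
        raise FileNotFoundError(f"no metadata for {uid:#x}")
    return record

meta = lookup(0x1A2B3C4D5E6F7081)
print([b["node"] for b in meta["blocks"]])  # ['storage-1', 'storage-3']
```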
The client management unit 15 uses the received metadata to access the storage node 1 which contains the data. The unit forms requests for access to the data held by the storage node 1 and transmits this to the storage node 1 over the network 6.
It will be appreciated that in some embodiments of the invention the blocks for the data may all be stored on more than one storage node and requests for these blocks will go to all the relevant storage nodes.
The request may include identifiers of the blocks held by the storage node and the type of access that is required.
The storage node 1 receives the request and the storage management unit 4 locates the data or portion of data that is stored at the storage node. The data may be stored within blocks in the secondary memory 5 of the storage node 1.
In accordance with the access that is required the storage management unit 4 receives data from the client node 13 to write to blocks on the storage node 1 (write access), transmits blocks of the data from secondary memory 5 to the client node 13 (read access), or marks the data blocks as free and notifies the master node 7 (delete access).
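The three access types handled by the storage management unit can be sketched as a small request dispatcher, with a dictionary standing in for blocks on disk. The request format and class name are hypothetical; the delete path would additionally notify the master node, which is only noted in a comment here.

```python
class StorageManagementUnit:
    """Sketch of a storage management unit serving block requests; a dict
    stands in for blocks held in the node's secondary memory."""

    def __init__(self):
        self.blocks = {}

    def handle(self, request):
        op, block_id = request["op"], request["block_id"]
        if op == "write":
            self.blocks[block_id] = request["data"]
            return {"ok": True}
        if op == "read":
            return {"ok": True, "data": self.blocks[block_id]}
        if op == "delete":
            self.blocks.pop(block_id, None)  # mark the block free
            return {"ok": True}              # ...and notify the master node
        return {"ok": False}

smu = StorageManagementUnit()
smu.handle({"op": "write", "block_id": 7, "data": b"page"})
print(smu.handle({"op": "read", "block_id": 7})["data"])  # b'page'
```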
Where blocks of data are transmitted to the client node 13, the client management unit 15 on the client node 13 may use the checksums to verify the data received from the storage node 1. This helps to ensure the integrity of the data, as it enables the detection of corrupted blocks.
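The verification step can be sketched as comparing each received block's checksum against the checksum carried in the metadata and flagging mismatches for re-fetching. The patent does not fix a checksum algorithm, so SHA-256 is assumed here purely for illustration.

```python
import hashlib

def block_checksum(data: bytes) -> str:
    """Checksum for one block; SHA-256 is an assumed choice of algorithm."""
    return hashlib.sha256(data).hexdigest()

def verify_blocks(blocks, expected_checksums):
    """Return indices of blocks whose checksum does not match the metadata,
    i.e. blocks the client should re-fetch (possibly from a replica)."""
    return [i for i, (data, want) in enumerate(zip(blocks, expected_checksums))
            if block_checksum(data) != want]

meta_sums = [block_checksum(b"hello"), block_checksum(b"world")]
received = [b"hello", b"w0rld"]  # second block corrupted in transit
print(verify_blocks(received, meta_sums))  # [1]
```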
Where blocks of data are written by the client node 13 to the storage node 1, the storage node 1 may notify the client node 13 of a successful action when the process of writing the data is completed.
The storage nodes 1, 2, 3 are executing on more than one type of operating system.
Referring to Figure 2, one embodiment of a storage node 20 in accordance with present invention will be further described.
The storage management unit 21 executes within user space 22 of the operating system. The unit 21 connects with the network 23 or inter-network to communicate with client and master nodes.
The unit 21 is connected to the local file system 24 of the operating system.
The storage management unit 21 uses the local file system 24 to store and retrieve data in a secondary memory 25.
It will be appreciated that the local file system 24 may store data within a primary memory buffer.
User applications 26 may also be executing within user space 22 on the operating system. The user applications 26 can access other data stored within the secondary memory 25 for their own purposes through the local file system 24 using known methods.
In alternative embodiments the storage node 20 may also be a client node and include a client management unit. The user applications 26 may be connected with the client management unit to access data stored within the distributed file system, which may or may not be stored in the secondary memory 25 of the storage node 20. In this embodiment the client management unit may be connected with the network 23 or inter-network and the storage management unit 21.
An example of a distributed file system providing access to data for a web crawler client application in accordance with the invention will be described.
The web crawler application is executing on a client node and accessing the internet to download web pages. Also on the client node is a database, such as a Berkeley DB. The database converts each URL for the web page into a 64-bit unique identifier (UUID) using a hashing technique.
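The URL-to-identifier step can be sketched by hashing the URL and keeping 64 bits of the digest. The patent names only "a hashing technique", so truncating SHA-256 to 64 bits is an illustrative assumption.

```python
import hashlib

def url_to_uuid(url: str) -> int:
    """Map a URL to a 64-bit identifier by hashing, as the crawler's local
    database does; truncated SHA-256 is an assumed choice of hash."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

uid = url_to_uuid("http://example.com/index.html")
assert 0 <= uid < 2**64
print(f"{uid:#018x}")  # a stable 64-bit hex identifier for this URL
```

The same URL always maps to the same identifier, so the database can look up previously stored pages without contacting the master node for name resolution.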
The database then requests free blocks to store the downloaded web page from a client management unit on the client node. The management unit sends the request to the master nodes in the distributed file system. The master node provides free data block identifiers and responsible storage nodes using its free extents list. Each identifier is a unique block identifier.
The unit provides the identifiers to the database which stores the block identifiers for each web page with its associated UUID. The database uses the unit which maps the identifiers to the storage nodes to access the blocks stored on the storage nodes to store the web page data.
A web page analyzer is also executing on the client node. The web analyzer uses the database to access a specific web page by providing the URL for the web page. The database converts the URL for the web page into a UUID and maps this to the associated block identifiers. The block identifiers are provided to the web page analyzer.
The web page analyzer requests the blocks from the client management unit.
The unit sends the request to the master nodes which return the storage nodes responsible for the blocks.
The unit maps access to the blocks for the web page analyzer to the correct storage node.
To provide for replication of data, the master nodes may provide multiple block identifiers for a new block request or for write access. The client management unit stores the same data for a block within multiple blocks corresponding to the multiple identifiers. When the client node receives notification from all storage nodes that the block containing the data has been written, the client management unit can treat the writing action as successful. Replicating the data, and reporting success only when every write has completed, provides a robust redundancy method that protects against failure of storage nodes and permits replacement of corrupted data.
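The all-acknowledgements rule above can be sketched as follows: a write succeeds only if every replica location acknowledges it. The function names are hypothetical, and `write_to_node` stands in for the network call to a storage node.

```python
def replicated_write(block_data, replicas, write_to_node):
    """Write one block to every replica location; per the text, the write is
    treated as successful only when every storage node acknowledges it.
    `write_to_node` is a stand-in for the RPC to a storage node."""
    acks = [write_to_node(node, block_id, block_data)
            for node, block_id in replicas]
    return all(acks)

# Simulated storage nodes: the node named "down" never acknowledges.
def fake_write(node, block_id, data):
    return node != "down"

assert replicated_write(b"data", [("n1", 1), ("n2", 9)], fake_write)
assert not replicated_write(b"data", [("n1", 1), ("down", 9)], fake_write)
```

A stricter variant would retry or re-replicate failed blocks before reporting success; the sketch only shows the success criterion.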
Potential advantages of embodiments of the present invention include the ability to store very large amounts of data efficiently, the ability to deploy the system on existing infrastructure utilising already-deployed computers with a variety of operating systems, the ability to satisfy the most frequent access pattern (random read/write requests) efficiently, a high degree of distribution, and fault tolerance.
Utilising existing network and workstation infrastructure makes the system cheaper and easier to deploy. In addition, the storage capacity of the system can be expanded by adding computers to the system configuration.
While the present invention has been illustrated by the description of the embodiments thereof, and while the embodiments have been described in considerable detail, it is not the intention of the applicant to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art.
Therefore, the invention in its broader aspects is not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. Accordingly, departures may be made from such details without departing from the spirit or scope of the applicant's general inventive concept.

Claims (22)

  1. A distributed file system for managing access to data,
    including: a plurality of storage nodes, each node including a memory arranged for storing data items and a memory management unit arranged for execution on one of a plurality of different operating systems and arranged for providing access to a data item to a client node; a master node arranged for storing a metadata database including a plurality of metadata, each metadata defining a mapping between a data item and one or more storage nodes containing the data item; and a client node arranged for requesting metadata related to a data item from the master node and accessing the data item from one or more storage nodes containing the item using the requested metadata.
  2. A distributed file system as claimed in claim 1 wherein the metadata includes a universal identifier.
  3. A distributed file system as claimed in one of the preceding claims wherein the memory of each storage node is secondary memory.
  4. A distributed file system as claimed in one of the preceding claims wherein the metadata database is stored within primary memory of the master node.
  5. A distributed file system as claimed in one of the preceding claims wherein the data is web content, such as web pages.
  6. A distributed file system as claimed in one of the preceding claims wherein the client node is further arranged for mapping a URL for the web content to a universal identifier and requesting the metadata by transmitting the universal identifier to the master node.
  7. A distributed file system as claimed in one of the preceding claims wherein the client node is further arranged for requesting metadata for data to be stored in the file system.
  8. A distributed file system as claimed in claim 7 wherein the master node is further arranged, in response to the request for metadata for data to be stored, for providing a metadata corresponding to an empty block on a storage node.
  9. A distributed file system as claimed in claim 7 wherein the master node is further arranged, in response to the request for metadata for data to be stored, for providing a plurality of metadata, the metadata corresponding to empty blocks on a plurality of storage nodes.
  10. A distributed file system as claimed in claim 9 wherein the client node is further arranged for writing the data to each of the storage nodes.
  11. A distributed file system as claimed in one of the preceding claims wherein the metadata includes checksums related to the corresponding data item.
  12. A distributed file system as claimed in claim 11 wherein the client node is further arranged for calculating checksums for a data item and comparing the calculated checksums to the checksums within the metadata for the data item.
  13. A distributed file system as claimed in one of the preceding claims wherein the memory management unit is executing within user space of the operating system.
  14. A distributed file system as claimed in one of the preceding claims wherein at least some of the nodes exist on the same computing device.
  15. A distributed file system as claimed in one of the preceding claims wherein the client node includes a file management unit, the file management unit being arranged for requesting metadata related to a data item from the master node and accessing the data item from one or more storage nodes containing the item using the requested metadata.
  16. A distributed file system as claimed in claim 15 wherein the file management unit is arranged for execution within kernel space of an operating system.
  17. A storage node for a distributed file system as claimed in any one of the preceding claims including: a memory arranged for storing data items; and a memory management unit arranged for execution on one of a plurality of different operating systems and arranged for providing access to a data item to a client node.
  18. A master node for a distributed file system as claimed in any one of claims 1 to 16 including: a memory arranged for storing a metadata database including a plurality of metadata, each metadata defining a mapping between a data item and one or more storage nodes containing the data item.
  19. A client node for a distributed file system as claimed in any one of claims 1 to 16 including: a file management unit arranged for requesting metadata related to a data item from a master node and accessing the data item from one or more storage nodes containing the item using the requested metadata.
  20. A computer program arranged for effecting the system as claimed in any one of claims 1 to 16.
  21. A computer program arranged for effecting the node as claimed in any one of claims 18 to 19.
  22. Storage media arranged for storing a computer program as claimed in any one of claims 20 to 21.
GB0624664A 2006-09-07 2006-12-11 A distributed file system Withdrawn GB2442285A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/GB2007/003363 WO2008029146A1 (en) 2006-09-07 2007-09-07 A distributed file system operable with a plurality of different operating systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
UA200609645 2006-09-07

Publications (2)

Publication Number Publication Date
GB0624664D0 GB0624664D0 (en) 2007-01-17
GB2442285A true GB2442285A (en) 2008-04-02

Family

ID=37711887

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0624664A Withdrawn GB2442285A (en) 2006-09-07 2006-12-11 A distributed file system

Country Status (1)

Country Link
GB (1) GB2442285A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749178A (en) * 2019-10-31 2021-05-04 华为技术有限公司 Method for ensuring data consistency and related equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6119151A (en) * 1994-03-07 2000-09-12 International Business Machines Corp. System and method for efficient cache management in a distributed file system
US20050251522A1 (en) * 2004-05-07 2005-11-10 Clark Thomas K File system architecture requiring no direct access to user data from a metadata manager

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110055273A1 (en) * 2009-08-27 2011-03-03 Cleversafe, Inc. Dispersed storage processing unit and methods with operating system diversity for use in a dispersed storage system
US20110055474A1 (en) * 2009-08-27 2011-03-03 Cleversafe, Inc. Dispersed storage processing unit and methods with geographical diversity for use in a dispersed storage system
US9690513B2 (en) * 2009-08-27 2017-06-27 International Business Machines Corporation Dispersed storage processing unit and methods with operating system diversity for use in a dispersed storage system
US9772791B2 (en) * 2009-08-27 2017-09-26 International Business Machines Corporation Dispersed storage processing unit and methods with geographical diversity for use in a dispersed storage system
EP4471611A4 (en) * 2022-08-26 2025-06-04 Huawei Technologies Co., Ltd. DATA MANAGEMENT PROCEDURE AND CORRESPONDING DEVICE
US12530316B2 (en) 2022-08-26 2026-01-20 Huawei Technologies Co., Ltd. Data management method and corresponding device

Also Published As

Publication number Publication date
GB0624664D0 (en) 2007-01-17

Similar Documents

Publication Publication Date Title
US10216757B1 (en) Managing deletion of replicas of files
US10430392B2 (en) Computer file system with path lookup tables
US10809932B1 (en) Managing data relocations in storage systems
US10346360B1 (en) Managing prefetching of data in storage systems
US8943282B1 (en) Managing snapshots in cache-based storage systems
US8285967B1 (en) Method for on-demand block map generation for direct mapped LUN
US9760574B1 (en) Managing I/O requests in file systems
US9891860B1 (en) Managing copying of data in storage systems
US9442955B1 (en) Managing delete operations in files of file systems
US10387369B1 (en) Managing file deletions of files and versions of files in storage systems
US9842117B1 (en) Managing replication of file systems
US8996490B1 (en) Managing logical views of directories
US9311333B1 (en) Managing files of file systems
US8095678B2 (en) Data processing
US10409768B2 (en) Managing data inconsistencies in files of file systems
US11176165B2 (en) Search and analytics for storage systems
US10409687B1 (en) Managing backing up of file systems
US10242012B1 (en) Managing truncation of files of file systems
US10261944B1 (en) Managing file deletions in storage systems
Liu et al. Cfs: A distributed file system for large scale container platforms
US10719554B1 (en) Selective maintenance of a spatial index
US8090925B2 (en) Storing data streams in memory based on upper and lower stream size thresholds
US9430492B1 (en) Efficient scavenging of data and metadata file system blocks
US10872073B1 (en) Lock-free updates to a data retention index
US12229017B2 (en) Mapping service level agreements for backup systems to multiple storage tiers in a clustered network

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)