CN1279455C - Fiber Channel - Logical Unit Number Caching Method for Storage Area Network Systems - Google Patents
- Publication number: CN1279455C (application CN200310113532A)
- Authority
- CN
- China
- Prior art keywords
- read
- data
- lun
- buffer
- lock
- Prior art date
- Legal status
- Expired - Fee Related
Landscapes
- Memory System Of A Hierarchy Structure (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical Field
The fiber channel-storage area network (FC-SAN) logical unit number (LUN) caching method belongs to the field of storage technology in fiber channel storage area networks.
Background Art
In the field of data storage, while CPU performance and memory capacity improve at the pace of Moore's Law, the hard disk, the principal external storage device, has seen only very limited performance improvement because its operation is inherently mechanical. At the same time, disk storage density has increased by about 60% per year, a rate even higher than that of CPU and memory. Because capacity keeps expanding while speed grows only slightly, disks today all carry MB-scale cache subsystems to improve disk performance. Studies have shown that an on-disk cache can greatly improve the read and write performance of the disk.
The LUN CACHE method we propose differs from earlier techniques: instead of handing a command issued by the user directly to the disk for execution, it caches the command on the I/O node of the FC-SAN and executes it in a divide-and-conquer fashion according to the buffer policy.
Summary of the Invention
The purpose of the invention is to improve the response speed of the FC-SAN storage system without reducing data consistency or reliability. The core of the method is: an improved cache replacement algorithm raises cache utilization, so that read and write operations can complete in the cache instead of being sent to a specific SCSI device; a delayed-write technique lets data update operations complete in the background, greatly improving user response time; a read-write lock mechanism guarantees the consistency of data shared among multiple users; and a clean interface design makes the LUN CACHE completely transparent to users.
The present invention is characterized in that it is a logical unit number (LUN) based cache (CACHE) method running on an I/O processing node in a fiber channel-storage area network (FC-SAN) environment. It is implemented by a LUN CACHE module running on the embedded operating system of an I/O processing node in the FC-SAN. The structure of the LUN CACHE module is as follows:
Interface submodule: it emulates the small computer system interface (SCSI), serving as the interface both to the SCSI target mid-level (STML) and to the I/O subsystem. It provides:
(1) The interface to the SCSI target mid-level (STML) module, implemented by registering the data structure Scsi_Target_Template defined by the STML module.
(2) The interface to the I/O subsystem, implemented by calling the scsi_do_req function of the I/O subsystem.
Shared-lock submodule: it provides the following two kinds of LUN read-write locks:
(1) A public READER-WRITER LOCK, the BUFFER LOCK, i.e. the LUN read-write lock. When the interface submodule receives a SCSI command from the STML module, it must first obtain this LUN read-write lock from the shared-lock submodule before forwarding the command to the buffer management submodule; otherwise it enters a waiting state.
(2) A READER-WRITER LOCK set for each fixed-size data block, i.e. the I/O read-write lock. Before the buffer management submodule sends a SCSI command to the I/O subsystem for actual physical disk access, it must first obtain this I/O read-write lock from the shared-lock submodule; otherwise it, too, enters a waiting state.
Read-write locks belonging to different LUNs are distinguished by a LUN id; a read-write lock is acquired using the read-write lock algorithm.
Buffer management submodule: it builds one hash table for each LUN and keeps the data involved in all SCSI commands dispatched by the STML module in the LUN CACHE buffer. It accepts SCSI commands from the STML module and reads and writes data in the buffer or through the I/O subsystem.
Data synchronization submodule: it updates the contents of the buffer according to the frequency of user accesses, and writes the data in the buffer back to the disk system.
The buffer data is updated with a least-recently-used algorithm after each actual I/O read or write operation performed by the LUN CACHE.
The disk system data is updated by periodically flushing the buffer data to the disk system in the background after each buffer read or write operation performed by the LUN CACHE.
Initialization and exit submodule: it performs startup work when the LUN CACHE is loaded and exit work when it is unloaded, and it allocates and initializes the in-memory data structures.
The LUN CACHE module containing the above submodules runs on the embedded operating system of the I/O processing node through the following steps in sequence:
(1) The SCSI target mid-level (STML) issues a SCSI operation to the interface submodule.
(2) After the interface submodule obtains the SCSI command from the STML, and before forwarding it to the buffer management submodule, it requests the LUN read-write lock from the shared-lock submodule. If it obtains the lock, it sends the SCSI command to the buffer management submodule; otherwise it enters a waiting state.
(3) When the buffer management submodule receives the SCSI command from the STML, it first extracts from the command the address of the data to be read or written, then looks that address up in the buffer. If the data required by the command is in the buffer, the data is read or written directly in the buffer; when the read or write completes, the result is returned to the SCSI target mid-level (STML). Meanwhile, the data synchronization submodule flushes the data in the buffer to the disk system at the specified time interval. If the required data cannot be found in the buffer, the following steps are executed.
(4) The buffer management submodule requests the I/O read-write lock from the shared-lock submodule. If the lock cannot be obtained, it enters a waiting state; otherwise the following steps are executed.
(5) Having obtained the I/O read-write lock, the buffer management submodule passes the SCSI command to the I/O subsystem through the interface submodule and obtains the data. When the read or write completes, the result data is returned to the SCSI target mid-level (STML); at the same time, the data synchronization submodule updates the data in the buffer using the least-recently-used algorithm.
(6) End.
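The six steps above can be sketched in heavily simplified user-space C as follows. All names here (buf_lookup, disk_read, lun_cache_read) and the tiny fixed-size cache are assumptions made for illustration; the locking of steps (2) and (4) is elided to comments, and the real module operates on SCSI commands inside the kernel.

```c
/* Minimal sketch of the six-step LUN CACHE command flow. Hypothetical
 * names and sizes; not the patent's actual kernel code. */
#include <stdio.h>
#include <string.h>

enum { CACHE_BLOCKS = 4, BLOCK_SIZE = 8 };

struct block { unsigned addr; int valid; char data[BLOCK_SIZE]; };
static struct block cache[CACHE_BLOCKS];

/* Step (3): buffer lookup by the data address taken from the command. */
static struct block *buf_lookup(unsigned addr) {
    for (int i = 0; i < CACHE_BLOCKS; i++)
        if (cache[i].valid && cache[i].addr == addr) return &cache[i];
    return NULL;
}

/* Steps (4)-(5): stand-in for the real disk access behind the I/O lock. */
static void disk_read(unsigned addr, char *out) {
    snprintf(out, BLOCK_SIZE, "blk%u", addr);
}

/* Returns 1 on a cache hit, 0 on a miss (after filling the cache). */
int lun_cache_read(unsigned addr, char *out) {
    /* (2) acquire the LUN read-write lock -- elided in this sketch */
    struct block *b = buf_lookup(addr);
    if (b) {                                  /* (3) hit: serve from buffer */
        memcpy(out, b->data, BLOCK_SIZE);
        return 1;
    }
    /* (4)-(5) acquire the I/O lock, go to the I/O subsystem, refill buffer */
    disk_read(addr, out);
    cache[0].addr = addr;
    cache[0].valid = 1;
    memcpy(cache[0].data, out, BLOCK_SIZE);
    return 0;                                 /* (6) end */
}
```

A first access misses and goes to the "disk"; a repeated access is then served directly from the buffer.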
The read-write lock algorithm is implemented with READER-WRITER locks of different granularities. The data structure of the READER-WRITER lock is:
struct RWLock {
    struct semaphore rw_sem;     // semaphore
    spinlock_t rw_spin_lock;     // shared lock
    int flag;                    // LUN read-write lock or I/O read-write lock flag
    Target_Scsi_Cmnd *cmnd;      // SCSI command
    unsigned int bgblock;        // start of the data segment
    unsigned int length;         // length of the data segment
};
The least-recently-used algorithm maintains an access count and an access-time stamp for each data block in the buffer and for each logical unit number (LUN); each data operation updates the corresponding access count and time stamp. When the buffer needs to be updated, the algorithm selects from the buffer a data block that has been accessed relatively rarely over the recent period and whose LUN has been accessed least, and deletes it.
Tests show that write performance improves by roughly a factor of 4, and that read performance improves as the load increases; beyond a certain point it approaches that of a system without the CACHE.
Brief Description of the Drawings
Figure 1. Hardware subsystem architecture of the massive network storage device; Myrinet is a high-speed interconnect technology.
Figure 2. Module structure of the I/O node hardware subsystem.
Figure 3. Schematic module structure of the mass storage system. STML is a software-based SCSI command and task processing module whose performance can be adjusted dynamically.
Figure 4. Schematic module structure of the LUN CACHE.
Figure 5. Program flow chart of the LUN CACHE system.
Figure 6. Read performance with and without the LUN CACHE.
Figure 7. Write performance with and without the LUN CACHE.
Detailed Description of the Embodiments
The hardware environment of the present invention is the Tsinghua University massive network storage machine (TH-MNSM for short); the method runs on the I/O processing nodes of the TH-MNSM. The hardware structure of the machine is shown in Figure 1.
It mainly comprises the following components:
(1) Host node (HNODE)
The control center, which monitors, backs up, and otherwise manages the system. The host node's hardware subsystem includes an INTEL CPU, a standard PCI bus, a SCSI interface card, a standard Ethernet interface card (HBA), a Myrinet interface card, and a hard disk. The host node can run WINDOWS 2000 and other operating systems, as well as WEB server software.
(2) I/O processing node (INODE)
The most basic and important part of the massive network storage device. Its main functions include data storage, fiber channel protocol processing, and SCSI protocol processing. An I/O processing node consists of 2-4 INTEL XEON processors, 512 MB of memory, a PCI-bus motherboard, a Myrinet interface card, a FLASH DISK, a fiber channel interface card, and a SCSI interface card, and offers relatively high performance.
(3) High-density disk array (HARRAY)
Each I/O node connects directly to commercially available disk arrays through a standard SCSI card; each I/O node can connect up to 8 disk arrays and supports a maximum capacity of 150 TB.
(4) Myrinet-based interconnection network
The system interconnect that links the I/O processing nodes and the host node; the Myrinet-based interconnection network replaces the high-speed backplane of traditional designs. This approach is economical and reliable.
(5) Power subsystem
A commercial power supply in N+1 configuration.
The host node is a commercial PC such as the Lenovo Tianrui 3130, so its structure is the same as that of a PC. The module structure of the hardware subsystem of each I/O processing node is shown in Figure 2.
The motherboard of each I/O node is a commercial server motherboard such as the Supermicro X5DA8 or X5DAE, and all CPUs are INTEL XEON series CPUs. Each I/O node includes 2 commercial fiber channel HBAs, such as the QLOGIC QLA2310F series, which between them can provide fault-tolerant backup or trunking. Each I/O node includes 2-3 commercial SCSI interface cards, such as the ADAPTEC 7XXX series, which connect to high-density disk array subsystems such as the ISD PinnacleRAID 500. The power subsystem uses a standard commercial N+1 power supply such as the Santak 3C3 series; the FLASH DISK, such as the M-SYSTEMS DOC2000 series, stores the various software. The Myrinet interface card is a Myricom LANai9 series card.
The LUN CACHE runs on the embedded operating system of the I/O processing nodes of the massive network storage device. The software structure of the device and the logical I/O position of the LUN CACHE are shown in Figure 3.
Based on the position of the LUN CACHE in the software structure of the massive network storage device, and on the system-structure characteristics of its embedded operating system, we designed the software structure of the LUN CACHE. As shown in Figure 4, it comprises: the buffer management submodule, the shared-lock submodule, the data synchronization submodule, the interface submodule, and the initialization and exit submodule. The structure and role of each module are detailed below:
The interface submodule is divided into two parts:
(1) The interface to the SCSI target mid-level.
(2) The interface to the I/O subsystem.
The purpose of the interface submodule is to guarantee that the LUN CACHE system does not disturb the I/O path of the FC-SAN as a whole, making the LUN CACHE transparent to users and leaving the normal operation of user programs unaffected. In an FC-SAN without the LUN CACHE, the SCSI target mid-level interacts directly with the I/O subsystem, so the main function of the interface submodule in the LUN CACHE is to emulate the interface between the SCSI target mid-level and the I/O subsystem.
The interface to the SCSI target is implemented by registering the data structure Scsi_Target_Template defined by the SCSI target:
#define ISP_TARGET {                        \
    name:              "ISP",               \
    detect:            isp_detect,          \
    release:           isp_release,         \
    xmit_response:     isp_xmit_response,   \
    rdy_to_xfer:       isp_rdy_to_xfer,     \
    task_mgmt_fn_done: isp_task_mgmt_done,  \
    report_aen:        isp_report_aen       \
}
Scsi_Target_Template my_template = ISP_TARGET;
The interface to the I/O subsystem is implemented by calling the scsi_do_req function, which the I/O subsystem provides:
void scsi_do_req(Scsi_Request *SRpnt, const void *cmnd, void *buffer, unsigned bufflen, void (*done)(Scsi_Cmnd *), int timeout, int retries)
Here Scsi_Request is a data structure defined by Linux for storing a SCSI command; buffer is the data area to be filled with the returned value; bufflen is the length of the valid data in that area; and done is the completion callback: once the I/O subsystem has finished processing the SCSI command, it invokes done to return the result to the LUN CACHE system.
The FC-SAN is a storage device shared by many users. Once the LUN CACHE is added, users share not only the disk system but also the cache system, so the consistency of user data must be guaranteed for the system to work correctly. For this purpose we propose a read-write lock mechanism. Unlike other shared-lock mechanisms, such as the DLOCK mechanism in Sistina's GFS, our read-write locks do not require users to add new SCSI commands; and because they use a blocking mechanism, they do not produce the SPIN-LOCK problem that DLOCK does, while scaling well.
When the interface submodule receives a SCSI command from the STML, it must obtain the LUN read-write lock from the shared-lock submodule before forwarding the command to the buffer management submodule; otherwise it enters a waiting state. Likewise, before the buffer management submodule sends a SCSI command to the I/O subsystem for actual physical disk access, it must first obtain the I/O read-write lock from the shared-lock submodule, or it, too, enters a waiting state. The procedure for acquiring a read-write lock is Algorithm A, described in detail below.
Because the buffer operations of the buffer management submodule are mainly memory operations and therefore fast, while actual physical disk accesses are comparatively slow, we apply shared-lock control of different granularities to the LUN read-write lock and the I/O read-write locks. For the LUN read-write lock we use a single public READER-WRITER LOCK, the BUFFER LOCK; for the I/O read-write locks we assign one READER-WRITER LOCK to each fixed-size data block (16 MB by default).
Because the disk spaces of the individual LUNs in an FC-SAN are independent, no read or write dependence arises between LUNs. Exploiting this, we refine the read-write locks further by attaching a LUN id attribute to each lock, which speeds up the test for lock conflicts.
Algorithm A: the read-write lock algorithm
The core of this algorithm is to provide a mutual exclusion mechanism for buffer operations and I/O operations through READER-WRITER locks of different granularities, and to use the LUN when judging whether two locks conflict.
Because buffer operations are fast, a single READER-WRITER lock is used for them, reducing the time spent accessing locks; I/O operations are comparatively slow, so the lock-access time is negligible against them, and one READER-WRITER lock is created per data block.
READER-WRITER locks obey the following principles: any one lock allows multiple SCSI commands to read simultaneously, but at most one SCSI command may be writing under a lock at any time. Each lock allows multiple read and write SCSI commands to wait for execution; if several SCSI commands are waiting, a write command takes priority over any read command.
Because the disk spaces of the LUNs in an FC-SAN are independent, the READER-WRITER locks of different LUNs are also completely independent: different LUNs correspond to different lock spaces, with no interaction between them.
In the concrete implementation, the algorithm maintains an array of READER-WRITER locks and decides, according to the principles above, whether a SCSI command is granted the corresponding read-write lock; if not, the command is placed in a waiting thread until the lock is obtained, and only then may the next operation proceed.
In actual operation this algorithm guarantees the consistency of user data, while the READER-WRITER locks let read operations execute in parallel as far as possible, improving system performance.
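A minimal user-space sketch of such a READER-WRITER lock with writer priority, using POSIX threads, might look as follows. This is an assumed illustration of the stated policy (many concurrent readers, a single writer, waiting writers preferred over waiting readers), not the patent's kernel implementation.

```c
/* Writer-priority reader-writer lock: illustrative sketch only. */
#include <pthread.h>

struct rwlock {
    pthread_mutex_t m;
    pthread_cond_t  ok_read, ok_write;
    int readers;          /* readers currently holding the lock */
    int writing;          /* 1 while a writer holds the lock */
    int waiting_writers;  /* queued writers; readers must yield to them */
};

void rw_init(struct rwlock *l) {
    pthread_mutex_init(&l->m, NULL);
    pthread_cond_init(&l->ok_read, NULL);
    pthread_cond_init(&l->ok_write, NULL);
    l->readers = l->writing = l->waiting_writers = 0;
}

void rw_read_lock(struct rwlock *l) {
    pthread_mutex_lock(&l->m);
    /* Writer priority: readers wait while a writer is active OR queued. */
    while (l->writing || l->waiting_writers)
        pthread_cond_wait(&l->ok_read, &l->m);
    l->readers++;
    pthread_mutex_unlock(&l->m);
}

void rw_read_unlock(struct rwlock *l) {
    pthread_mutex_lock(&l->m);
    if (--l->readers == 0)
        pthread_cond_signal(&l->ok_write);
    pthread_mutex_unlock(&l->m);
}

void rw_write_lock(struct rwlock *l) {
    pthread_mutex_lock(&l->m);
    l->waiting_writers++;
    while (l->writing || l->readers > 0)
        pthread_cond_wait(&l->ok_write, &l->m);
    l->waiting_writers--;
    l->writing = 1;
    pthread_mutex_unlock(&l->m);
}

void rw_write_unlock(struct rwlock *l) {
    pthread_mutex_lock(&l->m);
    l->writing = 0;
    pthread_cond_signal(&l->ok_write);    /* wake a queued writer first */
    pthread_cond_broadcast(&l->ok_read);  /* then any waiting readers */
    pthread_mutex_unlock(&l->m);
}
```

Blocking on the condition variables, rather than spinning, corresponds to the "waiting state" the patent describes.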
The buffer management submodule is the core of the entire LUN CACHE: the data involved in every SCSI command dispatched by the STML module can be kept in the LUN CACHE buffer according to a specific management policy.
The buffer management submodule accepts SCSI commands from the interface submodule. It first extracts from the SCSI command the address of the data to be read or written, then looks that address up in the buffer. If the data the command needs is in the buffer, the data is read or written directly there; otherwise the SCSI command is forwarded to the I/O subsystem.
To speed up buffer lookup, the buffer is organized with hash tables; and since the disk spaces of different LUNs in an FC-SAN are mutually independent, one hash table is built for each LUN, which greatly increases the speed of searching the buffer.
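The per-LUN hash-table organization can be sketched as below. The table sizes, names (lun_table, buf_find, buf_insert), and the collision-chaining scheme are assumptions made for illustration, not the patent's actual structures.

```c
/* Per-LUN hash tables for the buffer: one table per LUN, keyed by
 * block address. Illustrative sketch only. */
#include <stdlib.h>

#define NLUN    4
#define NBUCKET 64

struct cache_block {
    unsigned addr;              /* block address within the LUN */
    struct cache_block *next;   /* chaining on hash collision */
    /* ... data, access count, time stamp ... */
};

static struct cache_block *lun_table[NLUN][NBUCKET];

static unsigned hash(unsigned addr) { return addr % NBUCKET; }

/* Lookup scans only the table of the target LUN: LUN disk spaces are
 * independent, so the search space shrinks accordingly. */
struct cache_block *buf_find(int lun, unsigned addr) {
    struct cache_block *b = lun_table[lun][hash(addr)];
    for (; b; b = b->next)
        if (b->addr == addr) return b;
    return NULL;
}

struct cache_block *buf_insert(int lun, unsigned addr) {
    struct cache_block *b = calloc(1, sizeof *b);
    b->addr = addr;
    b->next = lun_table[lun][hash(addr)];
    lun_table[lun][hash(addr)] = b;
    return b;
}
```

Because each LUN owns its table, a block cached for one LUN is invisible to lookups on another, matching the independence of LUN disk spaces.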
The purpose of the data synchronization submodule is to update the contents of the buffer according to the frequency of user accesses, so that the LUN CACHE maintains a high cache hit rate, and at the same time to flush the data in the buffer to the disk system periodically, guaranteeing the consistency of user data.
After every actual I/O read or write operation, the LUN CACHE updates the buffer according to the least-recently-used algorithm; the algorithm is described in detail as Algorithm B below.
Every buffer read or write performed by the LUN CACHE leaves the data in the buffer inconsistent with the data on the actual disks, so the data in the buffer must be flushed to the disk system periodically in the background. When the fixed time interval expires, the disk write is performed regardless of how much data is involved; this maintains data consistency well. Because the update runs as a background thread, it does not affect the performance of the I/O node machine.
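The fixed-interval write-back policy can be sketched as follows. Timing is reduced to an explicit "now" argument for clarity; the real module uses a background kernel thread, and the interval value here is an assumption.

```c
/* Fixed-interval write-back: once the interval has elapsed, every
 * dirty block is flushed, however little dirty data there is.
 * Illustrative sketch only. */
#define FLUSH_INTERVAL 5   /* seconds; assumed value for this sketch */

static int  dirty_blocks;  /* blocks modified since the last flush */
static long last_flush;    /* time of the last write-back */

void mark_dirty(void) { dirty_blocks++; }

/* Returns the number of blocks written back (0 if the interval is not due). */
int flush_if_due(long now) {
    if (now - last_flush < FLUSH_INTERVAL)
        return 0;              /* interval has not elapsed yet */
    int n = dirty_blocks;      /* flush everything, even if small */
    dirty_blocks = 0;
    last_flush = now;
    return n;
}
```

Flushing unconditionally at every tick, rather than waiting for a data-volume threshold, is what bounds how stale the on-disk copy can become.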
Algorithm B: the least-recently-used algorithm
The core of this algorithm is to decide whether a data block remains in the buffer by counting the accesses to that block and combining the count with the access frequency of the LUN to which the block belongs.
Statistically, most ordinary user I/O consists of sequential reads and writes, and the probability that a data block is touched by nearby I/O operations is relatively high. Therefore, if a data block has not been accessed for a long time, it can be removed from the buffer, improving buffer utilization, and replaced by a recently accessed block.
In an FC-SAN the disk spaces of different LUNs are independent. Because user data accesses are sequential, consecutive accesses are more likely to hit the same LUN than different LUNs. We therefore take the LUN information into account when deciding whether to delete a data block: if a LUN has appeared frequently in recent accesses, the data blocks belonging to that LUN are retained.
In the concrete implementation, the algorithm keeps an access count and a time stamp for every data block in the buffer and for every LUN, and every data operation updates the corresponding access count and time stamp. When the buffer needs to be updated, the algorithm selects from the buffer a data block that has been accessed relatively rarely over the recent period and whose LUN has been accessed least, and deletes it.
This algorithm maintains a good hit rate for the data in the buffer, so that most user operations can execute in the buffer, effectively reducing the response time of the system.
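One plausible reading of this victim-selection rule is sketched below: every block and every LUN carries an access count and time stamp, and the evicted block is one that is cold itself and belongs to the least-accessed LUN. The exact tie-breaking order is an assumption, not the patent's precise rule.

```c
/* Victim selection combining per-block and per-LUN access statistics.
 * Illustrative sketch only; sizes and names are assumptions. */
#define NBLOCKS 4

struct blk { int lun; long hits; long last_access; };

static struct blk blocks[NBLOCKS];
static long lun_hits[8];   /* per-LUN access counters */

/* Every data operation updates both the block's and its LUN's stats. */
void touch(int i, long now) {
    blocks[i].hits++;
    blocks[i].last_access = now;
    lun_hits[blocks[i].lun]++;
}

/* Victim: fewest block hits; ties broken by the colder LUN, then by
 * the oldest access time. */
int pick_victim(void) {
    int v = 0;
    for (int i = 1; i < NBLOCKS; i++) {
        struct blk *a = &blocks[i], *b = &blocks[v];
        if (a->hits != b->hits) {
            if (a->hits < b->hits) v = i;
        } else if (lun_hits[a->lun] != lun_hits[b->lun]) {
            if (lun_hits[a->lun] < lun_hits[b->lun]) v = i;
        } else if (a->last_access < b->last_access) {
            v = i;
        }
    }
    return v;
}
```

With this scheme a rarely used block on a hot LUN can outlive an equally rare block on a cold LUN, which is the behavior the paragraph above describes.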
The main tasks of the initialization and exit submodule are the startup work when the LUN CACHE is loaded and the exit work when it is unloaded. The specific tasks are:
(1) Executing the functions the embedded system requires when loading/unloading a module.
(2) Allocating and initializing the in-memory data structures.
(3) Executing the functions needed to interface the LUN CACHE module with the STML module and the I/O subsystem.
The program flow chart of the LUN CACHE method is shown in Figure 5.
The system has been implemented in the C programming language and runs on the Linux operating system on a PC. A running example is given below to illustrate the execution process of the system.
Example 1:
First the system boots and the LUN CACHE system is loaded.
1. The user issues a read command M1. After the STML module receives the read command, it passes the command M1 to the system through the interface submodule.
2. The system requests the LUN read-write lock, obtains it successfully, and enters the buffer management submodule.
3. The buffer management submodule searches the buffer and finds that it does not hold the data M1 needs.
4. The LUN read-write lock is released and the I/O read-write lock is requested.
5. The request for the I/O read-write lock fails, and M1 enters the waiting queue.
6. After waiting, M1 obtains the I/O read-write lock, and M1 is passed through the interface function to the underlying I/O subsystem for the actual I/O read.
7. The underlying I/O subsystem returns the result data of M1 to the LUN CACHE system through the interface function.
8. The I/O read-write lock is released, and the result data of M1 is returned to the STML module through the interface submodule.
9. The data synchronization submodule periodically updates the buffer according to the least-recently-used algorithm.
Example 2:
The system starts and the LUN CACHE module is loaded.
1. The user issues write command M2. The STML module receives the write command and passes M2 to the system through the interface submodule.
2. The system requests the LUN read-write lock; the request fails, so M2 waits.
3. The shared-lock submodule notifies M2 that it has obtained the LUN read-write lock.
4. The buffer management submodule searches the buffer and finds that the data M2 targets is cached (a hit).
5. M2's data is written directly into the buffer (command M2 is executed).
6. The LUN read-write lock is released and M2's result is returned to the STML module through the interface submodule.
7. The data synchronization submodule periodically updates the buffer according to the least-recently-used (LRU) algorithm.
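Both examples end with the data synchronization submodule refreshing the buffer by LRU. A minimal sketch of that replacement policy, with a hypothetical fixed slot array and a logical clock (names are illustrative, not from the patent), could look like:

```c
#define NSLOTS 4

/* Each slot remembers which block it holds and when it was last touched. */
struct slot { long lba; unsigned long last_used; int valid; };

static struct slot slots[NSLOTS];
static unsigned long clock_tick;   /* logical clock, bumped on every access */

/* Returns 1 on a hit (timestamp refreshed); on a miss the least
   recently used slot is evicted, reloaded with lba, and 0 is returned. */
int lru_access(long lba) {
    int i, victim = 0;
    for (i = 0; i < NSLOTS; i++)
        if (slots[i].valid && slots[i].lba == lba) {
            slots[i].last_used = ++clock_tick;   /* hit: refresh the stamp */
            return 1;
        }
    for (i = 1; i < NSLOTS; i++)                 /* miss: pick the oldest stamp */
        if (slots[i].last_used < slots[victim].last_used)
            victim = i;
    slots[victim].lba = lba;                     /* evict and reload */
    slots[victim].valid = 1;
    slots[victim].last_used = ++clock_tick;
    return 0;
}
```

Empty slots carry a zero timestamp, so they are filled before any valid entry is evicted; a real buffer would of course track many more slots and hold the block data alongside each stamp.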
We tested read and write performance on the Tsinghua University mass storage system. Test environment: the client is a 32-bit dual-CPU Itanium server with a 2.4 GHz clock, running Linux (kernel 2.4.18-3) and attached to the TH-MNSM mass storage system.
Read performance
The test results are shown in Figure 6: the vertical axis is response time in ms, the horizontal axis is I/O request size. The tests were run under an intensive load and compare the read performance of the TH-MNSM system with and without LUN CACHE. As the figure shows, read performance improves as the load grows; beyond a certain load, however, the cache's contribution diminishes and read performance converges to that of the cache-less system.
Write performance
The test results are shown in Figure 7: the vertical axis is response time in ms, the horizontal axis is I/O request size in KB. As the figure shows, LUN CACHE greatly improves write performance, because it buffers write operations in the cache whenever possible and updates the data later through a background thread. The measured improvement is roughly fourfold.
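The write-back behavior credited with that gain, completing the write in the buffer and flushing it later from a background thread, can be sketched as follows. The one-block buffer and the names `cached_write` and `flush_dirty` are illustrative assumptions, not the patent's interfaces.

```c
#include <pthread.h>
#include <string.h>

#define BLK 512

static char buffer[BLK];   /* the write cache */
static char disk[BLK];     /* stand-in for the physical LUN */
static int  dirty;         /* set whenever buffer and disk disagree */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Foreground path: the user's write completes without touching the disk. */
void cached_write(const char *data) {
    pthread_mutex_lock(&m);
    memcpy(buffer, data, BLK);
    dirty = 1;
    pthread_mutex_unlock(&m);
}

/* Background path: normally invoked periodically from a flusher thread. */
void flush_dirty(void) {
    pthread_mutex_lock(&m);
    if (dirty) {
        memcpy(disk, buffer, BLK);  /* the only real I/O */
        dirty = 0;
    }
    pthread_mutex_unlock(&m);
}
```

The foreground latency is a memory copy under a short critical section, which is why deferring the disk write to the flusher can shrink observed write response times by a large constant factor.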
The main features of the invention are as follows:
(1) Building on the architectural characteristics of the mass storage system, an improved cache replacement policy raises the effective utilization of the cache and reduces the number of physical I/O operations.
(2) Read-write locks of different granularities, together with a lock-acquisition algorithm, guarantee user data consistency while substantially raising I/O concurrency and maximizing read/write performance.
(3) A clean interface design makes the whole system completely transparent to users.
(4) The method runs on TH-MNSM without adding any new hardware, saving cost.
Claims (3)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN 200310113532 CN1279455C (en) | 2003-11-14 | 2003-11-14 | Fiber Channel - Logical Unit Number Caching Method for Storage Area Network Systems |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN1545033A CN1545033A (en) | 2004-11-10 |
| CN1279455C true CN1279455C (en) | 2006-10-11 |
Family
ID=34336904
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN 200310113532 Expired - Fee Related CN1279455C (en) | 2003-11-14 | 2003-11-14 | Fiber Channel - Logical Unit Number Caching Method for Storage Area Network Systems |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN1279455C (en) |
Families Citing this family (19)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7290094B2 (en) * | 2005-05-17 | 2007-10-30 | International Business Machines Corporation | Processor, data processing system, and method for initializing a memory block to an initialization value without a cache first obtaining a data valid copy |
| CN100353307C (en) * | 2006-02-16 | 2007-12-05 | 杭州华三通信技术有限公司 | Storage system and method of storaging data and method of reading data |
| KR101023877B1 (en) * | 2009-04-17 | 2011-03-22 | (주)인디링스 | Cache and Disk Management Method and Controller Using the Method |
| CN102033796B (en) * | 2009-09-25 | 2013-01-16 | 中国移动通信集团公司 | Testing system and method |
| CN103309818B (en) * | 2012-03-09 | 2015-07-29 | 腾讯科技(深圳)有限公司 | Store method and the device of data |
| CN102681892B (en) * | 2012-05-15 | 2014-08-20 | 西安热工研究院有限公司 | Key-Value type write-once read-many lock pool software module and running method thereof |
| CN102843183A (en) * | 2012-08-27 | 2012-12-26 | 成都成电光信科技有限责任公司 | Network data monitoring device for optical fiber channel |
| WO2014094306A1 (en) * | 2012-12-21 | 2014-06-26 | 华为技术有限公司 | Method and device for setting working mode of cache |
| CN103106048A (en) * | 2013-01-30 | 2013-05-15 | 浪潮电子信息产业股份有限公司 | Multi-control multi-activity storage system |
| CN103327074A (en) * | 2013-05-24 | 2013-09-25 | 浪潮电子信息产业股份有限公司 | Designing method of global-cache-sharing tight coupling multi-control multi-active storage system |
| CN103441948B (en) * | 2013-07-03 | 2017-09-05 | 华为技术有限公司 | A kind of data access method, network interface card and storage system |
| CN104102515A (en) * | 2014-07-18 | 2014-10-15 | 浪潮(北京)电子信息产业有限公司 | Method and server for processing logical unit number of plug-in storage equipment |
| CN104360966B (en) * | 2014-11-21 | 2017-12-12 | 浪潮(北京)电子信息产业有限公司 | To block number according to the method and apparatus for carrying out input-output operation |
| US9983995B2 (en) * | 2016-04-18 | 2018-05-29 | Futurewei Technologies, Inc. | Delayed write through cache (DWTC) and method for operating the DWTC |
| CN108170544B (en) * | 2017-12-29 | 2020-08-28 | 中国人民解放军国防科技大学 | Shared data dynamic updating method for data conflict-free program |
| CN111506436B (en) * | 2020-03-25 | 2024-05-14 | 炬星科技(深圳)有限公司 | Method for realizing memory sharing, electronic equipment and shared memory data management library |
| CN111459849B (en) | 2020-04-20 | 2021-05-11 | 网易(杭州)网络有限公司 | Memory setting method and device, electronic equipment and storage medium |
| CN112636908B (en) * | 2020-12-21 | 2022-08-05 | 武汉船舶通信研究所(中国船舶重工集团公司第七二二研究所) | Key query method and device, encryption device and storage medium |
| CN113643326B (en) * | 2021-06-07 | 2022-04-08 | 深圳市智绘科技有限公司 | KNN calculating device and path planning system based on SoC |
2003-11-14: CN 200310113532 filed; granted as CN1279455C (en); status: not in force (Expired - Fee Related)
Similar Documents
| Publication | Title |
|---|---|
| CN1279455C (en) | Fiber Channel - Logical Unit Number Caching Method for Storage Area Network Systems |
| CN1304961C (en) | Memory virtualized management method based on metadata server |
| CN101079902A (en) | A great magnitude of data hierarchical storage method |
| Arif et al. | Exploiting cxl-based memory for distributed deep learning |
| US10572379B2 (en) | Data accessing method and data accessing apparatus |
| CN100338582C (en) | Storage system |
| CN1299207C (en) | Large scale resource memory managing method based on network under SAN environment |
| US9182912B2 (en) | Method to allow storage cache acceleration when the slow tier is on independent controller |
| CN1940849A (en) | RAID system and rebuild/copy back processing method thereof |
| CN1866163A (en) | Multi-disk drive system with high power and low power disk drive |
| CN1545030A (en) | Method of Dynamic Mapping of Data Distribution Based on Disk Characteristics |
| KR20170008153A (en) | A heuristic interface for enabling a computer device to utilize data property-based data placement inside a nonvolatile memory device |
| CN1875348A (en) | Information system, load control method, load control program, and recording medium |
| CN114780025B (en) | Software RAID request processing method, controller and RAID storage system |
| US20120290789A1 (en) | Preferentially accelerating applications in a multi-tenant storage system via utility driven data caching |
| CN1862475A (en) | Method for managing magnetic disk array buffer storage |
| CN1945537A (en) | Method for realizing high speed solid storage device based on storage region network |
| CN1862476A (en) | Super large capacity virtual magnetic disk storage system |
| US10705733B1 (en) | System and method of improving deduplicated storage tier management for primary storage arrays by including workload aggregation statistics |
| TW202205106A (en) | Content provider system and method for content provider system |
| US20240086332A1 (en) | Data processing method and system, device, and medium |
| US10572464B2 (en) | Predictable allocation latency in fragmented log structured file systems |
| Guo et al. | HP-mapper: A high performance storage driver for docker containers |
| CN1694081A (en) | Implementing method of virtual intelligent controller in SAN system |
| CN1955939A (en) | Backup and recovery method based on virtual flash disk |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| C14 | Grant of patent or utility model | ||
| GR01 | Patent grant | ||
| C19 | Lapse of patent right due to non-payment of the annual fee | ||
| CF01 | Termination of patent right due to non-payment of annual fee |