CN104573082B - Space small documents distributed data storage method and system based on access log information - Google Patents
Space small documents distributed data storage method and system based on access log information Download PDFInfo
- Publication number
- CN104573082B CN104573082B CN201510042456.9A CN201510042456A CN104573082B CN 104573082 B CN104573082 B CN 104573082B CN 201510042456 A CN201510042456 A CN 201510042456A CN 104573082 B CN104573082 B CN 104573082B
- Authority
- CN
- China
- Prior art keywords
- file data
- msub
- small
- matrix
- frequently accessed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
本发明提供基于访问日志信息的空间小文件数据分布存储方法及系统,包括将空间小文件数据集分成频繁访问的子集和非频繁访问的子集,提取频繁访问的空间小文件数据子集的访问序列,分段计算各频繁访问的空间小文件数据的关联度,并将各频繁访问的空间小文件数据相互之间的关联度数值组成关联矩阵;对关联矩阵中各元素数值进行大小转换后利用RCM排序算法重排后输出,对重排后的关联矩阵利用局部逼近搜索法寻找最佳组合,利用最佳组合对频繁访问的空间小文件数据进行分布存储,以及对非频繁访问的空间小文件数据根据空间位置相邻关系分开存储。本发明提高了空间小文件数据的并行访问性能。
The present invention provides a small space file data distributed storage method and system based on access log information, including dividing the small space file data set into frequently accessed subsets and non-frequently accessed subsets, and extracting frequently accessed small space file data subsets Access sequence, calculate the correlation degree of each frequently accessed small space file data in sections, and form an association matrix with the correlation degree values of each frequently accessed small space file data; after the size conversion of each element value in the association matrix Use the RCM sorting algorithm to rearrange the output, use the local approximation search method to find the best combination for the rearranged correlation matrix, use the best combination to store the data of frequently accessed small files in a distributed manner, and use the best combination to store the data of frequently accessed small files File data is stored separately according to the adjacent relationship of spatial position. The invention improves the parallel access performance of small space file data.
Description
技术领域technical field
本发明属于空间小文件数据的分布存储技术领域,特别是涉及一种新的基于访问日志信息的空间小文件数据分布存储方法及系统。The invention belongs to the technical field of distributed storage of small spatial file data, and in particular relates to a new method and system for distributed storage of small spatial file data based on access log information.
背景技术Background technique
海量空间信息的存储和快速访问一直是空间信息服务系统试图解决的重要问题,常用的空间信息服务系统如NASA地球观察系统每天采集的数据量达到了2TB,对这些数据的合理分布存储以便获得并行快速访问成为关键,其中一类重要的解决方案是通过对数据进行分布存储以实现对数据的并行访问来提高数据访问效率。The storage and fast access of massive spatial information has always been an important problem that the spatial information service system tries to solve. Commonly used spatial information service systems such as the NASA Earth Observation System collect up to 2TB of data per day. The reasonable distribution and storage of these data is to obtain parallelism. Fast access becomes the key, and one of the important solutions is to improve data access efficiency by distributing data storage to achieve parallel access to data.
目前比较典型的分布式文件存储系统主要包括如GFS(Google file system)、HDFS(Hadoop distributed file system)以及Lustre等。但这些系统在存储性能上的改善主要体现在对大文件的存储处理上。如GFS,其存储策略主要是,将大文件分成固定长度的块(如64MB),然后将所有的块分别存储在不同的存储器上以提高数据的并行访问率(参考文献Ghemawat S,Gobioff H,Shun-Tak L.The Google file system.In:Proceedings ofthe Nineteenth ACM Symposium on Operating Systems Principles(SOSP’03).BoltonLanding,New York:IEEE,2003.1–15)。另一类典型的存储技术如RAID(Redundant Arrayof Independent Disks),也是将每一个大的数据文件分成几个数据块后分别存储在不同的磁盘以提高对该文件的并行访问。At present, typical distributed file storage systems mainly include GFS (Google file system), HDFS (Hadoop distributed file system) and Luster. However, the improvement in storage performance of these systems is mainly reflected in the storage and processing of large files. Such as GFS, its storage strategy is mainly to divide large files into fixed-length blocks (such as 64MB), and then store all blocks in different memories to improve the parallel access rate of data (references Ghemawat S, Gobioff H, Shun-Tak L. The Google file system. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP'03). Bolton Landing, New York: IEEE, 2003.1–15). Another typical storage technology, such as RAID (Redundant Array of Independent Disks), also divides each large data file into several data blocks and stores them on different disks to improve parallel access to the file.
以上分布存储方法虽然对大文件数据有效,但针对小文件数据,由于无法继续进行分块,通过分块存储的方法适应性不足,目前通用的方法只是简单的将单个文件存储在单个存储服务器上,因而难以实现对多个小文件数据的并行访问,I/O效率不高。Although the above distributed storage methods are effective for large file data, for small file data, the method of block storage is not adaptable because it cannot continue to be divided into blocks. The current general method is simply to store a single file on a single storage server. , so it is difficult to achieve parallel access to multiple small file data, and the I/O efficiency is not high.
研究表明,目前大部分系统都存在大量的小文件数据,如美国国家能源研究科学计算中心的1300万个文件中有99%的文件小于64M,小于64K的文件更是占到了44%(参考文献Carns P,Lang S,Ross R,et al..Small-file access in parallel file systems[C].Parallel&Distributed Processing,2009.IPDPS 2009.IEEE InternationalSymposium On.IEEE,2009:1-11)。Studies have shown that there are a large number of small file data in most systems at present. For example, 99% of the 13 million files of the National Energy Research Scientific Computing Center in the United States are smaller than 64M, and 44% of the files are smaller than 64K (reference Carns P, Lang S, Ross R, et al.. Small-file access in parallel file systems [C]. Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium On. IEEE, 2009: 1-11).
事实上,基于金字塔模型的空间信息服务系统,如Google Earth、World Wind等同样是以小文件的形式存储空间数据。World Wind根据金字塔模型将地球分成不同分辨率的瓦片数据,每个瓦片数据保存为一个文件,每个瓦片数据的大小固定为512×512像素,每个瓦片文件大小不超过1MB(参考文献Boschetti L,Roy D P,Justice C O.Using NASA’sWorld Wind virtual globe for interactive internet visualization of the globalMODIS burned area product.Int J Remote Sens,2008,29(11):3067–3072);GoogleEarth同样采用多分辨率模型存储空间数据,每个数据文件的大小也不超过64MB(参考文献Sample J T,Loup E.Tile-base geospatial information system:principle andpractices.New York:Springer,2010.23–200)。In fact, spatial information service systems based on the pyramid model, such as Google Earth, World Wind, etc., also store spatial data in the form of small files. World Wind divides the earth into tile data with different resolutions according to the pyramid model. Each tile data is saved as a file. The size of each tile data is fixed at 512×512 pixels, and the size of each tile file does not exceed 1MB ( References Boschetti L, Roy D P, Justice C O.Using NASA'sWorld Wind virtual globe for interactive internet visualization of the global MODIS burned area product.Int J Remote Sens,2008,29(11):3067–3072); GoogleEarth also uses The multi-resolution model stores spatial data, and the size of each data file does not exceed 64MB (reference Sample J T, Loup E. Tile-base geospatial information system: principles and practices. New York: Springer, 2010.23–200).
总之,目前针对大文件数据的分布存储方法难以应用到小文件数据的存储,而针对小文件数据的优化又集中在数据的访问优化(非存储优化,访问优化面向客户端,而存储优化面向服务端),如减少数据密集型应用程序的执行时间(参考文献J.Kim,A.Chandra,and J.B.Weissma.Using Data Accessibility for Resource Selection in Large-Scale Distributed Systems.IEEE Trans.Parallel Distributed Systems,vol.20,no.6,pp.788-801,June 2009),或降低小文件索引信息的开销(参考文献A.L.Chervenak,R.Schuler,M.Ripeanu,M.A.Amer,S.Bharathi,I.Foster,A.Iamnitchi,andC.Kesselman.The Globus Replica Location Service:Design and Experience.IEEETrans.Parallel Distributed Systems,vol.20,no.9,pp.1260-1272,Sept.2009)等。但在分布式系统中,访问延迟时间的性能不仅与存取方法有关,而且与数据的分布存储模式有关。因此对小文件数据的优化问题尚未得到根本解决。In short, the current distributed storage method for large file data is difficult to apply to the storage of small file data, and the optimization for small file data focuses on data access optimization (non-storage optimization, access optimization is client-oriented, and storage optimization is service-oriented) end), such as reducing the execution time of data-intensive applications (references J. Kim, A. Chandra, and J. B. Weissma. Using Data Accessibility for Resource Selection in Large-Scale Distributed Systems. IEEE Trans. Parallel Distributed Systems, vol. 20, no.6, pp.788-801, June 2009), or reduce the overhead of small file index information (references A.L.Chervenak, R.Schuler, M.Ripeanu, M.A.Amer, S.Bharathi, I.Foster, A Iamnitchi, and C. Kesselman. The Globus Replica Location Service: Design and Experience. IEEETrans. Parallel Distributed Systems, vol.20, no.9, pp.1260-1272, Sept.2009) etc. But in a distributed system, the performance of access latency is not only related to the access method, but also related to the data distribution storage mode. Therefore, the optimization problem of small file data has not been fundamentally resolved.
发明内容Contents of the invention
针对以上问题,本发明提供一种基于访问日志信息的空间小文件数据分布存储方法及系统,利用空间小文件数据的访问日志信息,分析各空间小文件数据之间的相互关系,并据此对空间小文件数据进行分布存储,以提高对空间小文件数据的并行访问率。In view of the above problems, the present invention provides a method and system for distributed storage of small space file data based on access log information, which uses the access log information of small space file data to analyze the relationship between each small space file data, and based on this Distributed storage of small file data in order to improve the parallel access rate of small file data.
本发明所述的一种基于访问日志信息的空间小文件数据分布存储方法及系统,所采用的技术方案是:A method and system for distributed storage of small space file data based on access log information according to the present invention, the technical solution adopted is:
一种基于访问日志信息的空间小文件数据分布存储方法,对任一种空间小文件数据类型,执行包括以下步骤:A distributed storage method for small space file data based on access log information, for any type of small space file data, the execution includes the following steps:
步骤1,将空间小文件数据集,按照访问频率不同分成频繁访问的子集和非频繁访问的子集;包括以下子步骤,Step 1, divide the data set of small spatial files into frequently accessed subsets and non-frequently accessed subsets according to different access frequencies; including the following sub-steps,
步骤1.1,获取各空间小文件数据访问热度,实现如下,Step 1.1, obtain the data access heat of small files in each space, the implementation is as follows,
设空间小文件数据集为F={f1,f2,...,fN},包含空间小文件数据f1,f2,...,fN,其中N为空间小文件数据的总个数,第i个空间小文件数据标记为fi,i=1,2,…,N;Let the small space file data set be F={f 1 , f 2 ,...,f N }, including the small space file data f 1 , f 2 ,...,f N , where N is the number of small space file data The total number, the i-th space small file data is marked as f i , i=1,2,...,N;
设访问日志信息中记录依次访问了空间小文件数据空间小文件数据的访问日志序列为A=(a1,a2,…,aM)为空间小文件数据访问序列向量,at∈[1,N],访问序号t=1,2,…,M,其中M为对F中所有空间小文件数据的访问总次数;Set the access log information to record that the space small file data is accessed in sequence The access log sequence of small file data is A=(a 1 ,a 2 ,…,a M ) is the data access sequence vector of small spatial files, a t ∈[1,N], access sequence number t=1,2,…,M, where M is the The total number of access times of all space small file data;
统计每个空间小文件数据fi在访问日志序列R中出现的次数λi,以λi为该空间小文件数据fi的访问热度;Count the number of times λ i of each small file data f i in the access log sequence R, and take λ i as the access heat of the small file data f i in this space;
步骤1.2,根据空间小文件数据访问热度提取被频繁访问的空间小文件数据,实现如下,Step 1.2, extract frequently accessed small space file data according to the access heat of small space file data, the implementation is as follows,
输入预设判别参数λ,Input the preset discriminant parameter λ,
若空间小文件数据集F中空间小文件数据fi的访问热度λi>λ,则空间小文件数据fi为频繁访问的空间小文件数据,否则fi属于非频繁访问的空间小文件数据;If the access heat of small spatial file data f i in the small spatial file data set F is λ i > λ, then the small spatial file data f i is frequently accessed small spatial file data, otherwise f i belongs to infrequently accessed small spatial file data ;
步骤1.3,根据步骤1.2所得频繁访问的空间小文件数据构成空间小文件数据集的子集,实现如下,Step 1.3, according to the frequently accessed small space file data obtained in step 1.2, constitutes a subset of the small space file data set, which is implemented as follows,
设所有频繁访问的空间小文件数据所构成子集为其中N1为频繁访问的空间小文件数据总个数,第i1、j1个频繁访问的空间小文件数据分别标记为和i1,j1∈[1,N1];Suppose the subset of all frequently accessed small file data is Among them, N 1 is the total number of frequently accessed small space file data, and the i 1 and j 1th frequently accessed small space file data are respectively marked as with i 1 ,j 1 ∈[1,N 1 ];
步骤2,从访问日志信息中提取频繁访问的空间小文件数据子集的访问序列,包括按照时间先后顺序形成访问序列为频繁访问空间小文件数据访问序列向量,访问序号t1=(11,21,…,M1),其中M1为对F1中所有频繁访问空间小文件数据的访问总次数;Step 2, extract the access sequence of frequently accessed small space file data subsets from the access log information, including forming the access sequence in chronological order Access sequence vectors for frequently accessed space small file data, Access sequence number t 1 =(1 1 ,2 1 ,...,M 1 ), where M 1 is the total number of visits to all frequently accessed space small file data in F 1 ;
步骤3,利用频繁访问的空间小文件数据子集的访问序列分段计算各频繁访问的空间小文件数据的关联度,并将各频繁访问的空间小文件数据相互之间的关联度数值组成关联矩阵;包括以下子步骤,Step 3: Use the access sequences of frequently accessed small space file data subsets to calculate the correlation degree of each frequently accessed small space file data, and form an association with the correlation degree values of each frequently accessed small space file data matrix; includes the following sub-steps,
步骤3.1,根据存储服务器数量m、频繁访问空间小文件数据子集长度N1计算频繁访问序列分段长度n=N1/m;Step 3.1, according to the number m of storage servers and the length N1 of data subsets of frequently accessed space small files, calculate the segment length n = N1/m of frequent access sequences;
步骤3.2,根据访问序列分段长度对频繁访问序列进行分段,实现如下,Step 3.2, segment the frequent access sequence according to the segment length of the access sequence, the implementation is as follows,
按照访问顺序,将频繁访问空间小文件数据访问序列向量A1以n个元素一组分割为若干子向量,表示为A1=(S1,S2,…,Sl),其中子向量Sk=(ak1,ak2,…,akn),akj∈[1,N1],1≤k≤l,1≤j≤n;将A1中所有子向量集合记为S,S={Sk:k∈[1,l]};According to the access sequence, the frequently accessed space small file data access sequence vector A 1 is divided into several sub-vectors in groups of n elements, expressed as A 1 =(S 1 ,S 2 ,…,S l ), where the sub-vector S k =(a k1 ,a k2 ,…,a kn ), a kj ∈[1,N 1 ], 1≤k≤l, 1≤j≤n; record all sub-vector sets in A 1 as S, S ={S k :k∈[1,l]};
步骤3.3,计算频繁访问的空间小文件数据相互之间的关联度数值,实现如下,Step 3.3, calculate the correlation degree value between frequently accessed small space file data, the implementation is as follows,
定义函数 define function
其中为Sk中的所有元素组成的集合;函数表示在长度为n的访问周期内频繁访问的空间小文件数据和是否具有关联性;in is the set of all elements in S k ; the function Represents small file data in space that is frequently accessed within an access cycle of length n with Is it relevant;
定义函数RS(i1,j1),Define the function R S (i 1 , j 1 ),
其中RS(i1,j1)表示S对和的总关联度;where R S (i 1 , j 1 ) represents the pair of S with the total correlation degree;
步骤3.4,将频繁访问的空间小文件数据相互之间的关联度数值组成关联矩阵RS,Step 3.4, form the correlation matrix R S with the correlation degree values between frequently accessed small space file data,
步骤4,对关联矩阵中各元素数值进行大小转换后利用RCM排序算法重排后输出;Step 4, after performing size conversion on the value of each element in the correlation matrix, use the RCM sorting algorithm to rearrange and output;
步骤5,对重排后的关联矩阵利用局部逼近搜索法依次循环寻找m个最佳组合后输出,方法如下,Step 5, using the local approximation search method for the rearranged correlation matrix to find m best combinations in turn and then output them, the method is as follows,
步骤6,利用步骤5所得最佳组合对频繁访问的空间小文件数据进行分布存储,以及对非频繁访问的空间小文件数据根据空间位置相邻关系分开存储。Step 6, use the optimal combination obtained in step 5 to store the frequently accessed small space file data in a distributed manner, and store the infrequently accessed small space file data separately according to the adjacent relationship of spatial positions.
而且,步骤4包括以下子步骤,Moreover, step 4 includes the following sub-steps,
步骤4.1,获取关联矩阵中元素最大值,包括遍历关联矩阵所有元素值,并获取最大值Rmax;Step 4.1, obtaining the maximum value of elements in the incidence matrix, including traversing all element values of the incidence matrix, and obtaining the maximum value R max ;
步骤4.2,对关联矩阵元素数值进行大小转换,包括遍历关联矩阵所有元素值,并执行操作RS(i1,j1)=Rmax-RS(i1,j1);Step 4.2, perform size conversion on the element values of the incidence matrix, including traversing all element values of the incidence matrix, and performing the operation R S (i 1 ,j 1 )=R max -R S (i 1 ,j 1 );
步骤4.3,利用标准RCM排序算法对关联矩阵进行重排。Step 4.3, using the standard RCM sorting algorithm to rearrange the incidence matrix.
而且,步骤5包括以下子步骤,Moreover, step 5 includes the following sub-steps,
步骤5.1,初始化当前迭代次数d=1;Step 5.1, initialize the current number of iterations d=1;
步骤5.2,采用局部逼近搜索法寻找一个最佳组合,包括在当前的矩阵中寻找一个n×n的块,使得该矩阵中n×n块内对应的矩阵元素值最大,相应的n个文件构成一个最佳组合;第一次执行步骤5.2时,当前的矩阵为步骤4所得重排后的关联矩阵;后续执行步骤5.2时,当前的矩阵为前一次迭代所得的矩阵;Step 5.2, using the local approximation search method to find an optimal combination, including finding an n×n block in the current matrix, so that the value of the corresponding matrix element in the n×n block in the matrix is the largest, and the corresponding n files constitute An optimal combination; when step 5.2 is executed for the first time, the current matrix is the rearranged correlation matrix obtained in step 4; when step 5.2 is subsequently executed, the current matrix is the matrix obtained in the previous iteration;
步骤5.3,在本次迭代执行步骤5.2搜索得到一个由n个文件组成的最佳组合后,将关联矩阵中对应n个文件的关联矩阵元素删除,得到(N1-dn)×(N1-dn)的矩阵;Step 5.3, after executing step 5.2 in this iteration to search for an optimal combination consisting of n files, delete the elements of the correlation matrix corresponding to n files in the correlation matrix to obtain (N 1 -dn)×(N 1 - dn) matrix;
步骤5.4,判断是否d=m-1,否则令d=d+1,以本次迭代执行步骤5.3所得(N1-dn)×(N1-dn)的矩阵为当前的矩阵,返回步骤5.2进行下一次迭代继续搜索下一个最近组合,是则停止搜索,共得到m个最佳组合。Step 5.4, judge whether d=m-1, otherwise set d=d+1, take the matrix of (N 1 -dn)×(N 1 -dn) obtained by performing step 5.3 in this iteration as the current matrix, and return to step 5.2 Carry out the next iteration and continue to search for the next closest combination, if it is, stop the search, and get m best combinations in total.
本发明还相应提供一种基于访问日志信息的空间小文件数据分布存储系统,包括以下单元,The present invention also correspondingly provides a small space file data distribution storage system based on access log information, including the following units,
空间小文件数据集预处理单元(100),用于将任一种空间小文件数据类型的空间小文件数据集,按照访问频率不同分成频繁访问的子集和非频繁访问的子集;包括以下模块,空间小文件数据访问频率统计模块(101),用于获取各空间小文件数据访问热度,实现如下,The small spatial file data set preprocessing unit (100) is used to divide the small spatial file data set of any small spatial file data type into a frequently accessed subset and a non-frequently accessed subset according to different access frequencies; including the following Module, small space file data access frequency statistics module (101), used to obtain the data access heat of each space small file, the realization is as follows,
设空间小文件数据集为F={f1,f2,...,fN},包含空间小文件数据f1,f2,...,fN,其中N为空间小文件数据的总个数,第i个空间小文件数据标记为fi,i=1,2,…,N;Let the small space file data set be F={f 1 , f 2 ,...,f N }, including the small space file data f 1 , f 2 ,...,f N , where N is the number of small space file data The total number, the i-th space small file data is marked as f i , i=1,2,...,N;
设访问日志信息中记录依次访问了空间小文件数据空间小文件数据的访问日志序列为A=(a1,a2,…,aM)为空间小文件数据访问序列向量,at∈[1,N],访问序号t=1,2,…,M,其中M为对F中所有空间小文件数据的访问总次数;Set the access log information to record that the space small file data is accessed in sequence The access log sequence of small file data is A=(a 1 ,a 2 ,…,a M ) is the data access sequence vector of small spatial files, a t ∈[1,N], access sequence number t=1,2,…,M, where M is the The total number of access times of all space small file data;
统计每个空间小文件数据fi在访问日志序列R中出现的次数λi,以λi为该空间小文件数据fi的访问热度;Count the number of times λ i of each small file data f i in the access log sequence R, and take λ i as the access heat of the small file data f i in this space;
频繁访问空间小文件数据集提取模块(102),用于根据空间小文件数据访问热度提取被频繁访问的空间小文件数据,实现如下,Frequently accessed small space file data set extraction module (102), used for extracting frequently accessed small space file data according to the access heat of small space file data, realized as follows,
输入预设判别参数λ,Input the preset discriminant parameter λ,
若空间小文件数据集F中空间小文件数据fi的访问热度λi>λ,则空间小文件数据fi为频繁访问的空间小文件数据,否则fi属于非频繁访问的空间小文件数据;If the access heat of small spatial file data f i in the small spatial file data set F is λ i > λ, then the small spatial file data f i is frequently accessed small spatial file data, otherwise f i belongs to infrequently accessed small spatial file data ;
频繁访问空间小文件子集构建模块(103),用于根据频繁访问空间小文件数据集提取模块(102)所得频繁访问的空间小文件数据构成空间小文件数据集的子集,实现如下,Frequently accessed space small file subset construction module (103), used for forming a subset of space small file data set according to frequently accessed space small file data obtained by frequently accessed space small file data set extraction module (102), as follows,
设所有频繁访问的空间小文件数据所构成子集为其中N1为频繁访问的空间小文件数据总个数,第i1、j1个频繁访问的空间小文件数据分别标记为和i1,j1∈[1,N1];Suppose the subset of all frequently accessed small file data is Among them, N 1 is the total number of frequently accessed small space file data, and the i 1 and j 1th frequently accessed small space file data are respectively marked as with i 1 ,j 1 ∈[1,N 1 ];
空间小文件数据访问向量获取单元(200),用于从访问日志信息中提取频繁访问的空间小文件数据子集的访问序列,包括按照时间先后顺序形成访问序列 为频繁访问空间小文件数据访问序列向量,访问序号t1=(11,21,…,M1),其中M1为对F1中所有频繁访问空间小文件数据的访问总次数;Small space file data access vector acquisition unit (200), used to extract the access sequence of frequently accessed small space file data subsets from the access log information, including forming the access sequence in chronological order Access sequence vectors for frequently accessed space small file data, Access sequence number t 1 =(1 1 ,2 1 ,...,M 1 ), where M 1 is the total number of visits to all frequently accessed space small file data in F 1 ;
空间小文件数据访问关联矩阵计算单元(300),用于利用频繁访问的空间小文件数据子集的访问序列分段计算各频繁访问的空间小文件数据的关联度,并将各频繁访问的空间小文件数据相互之间的关联度数值组成关联矩阵;包括以下模块,频繁访问序列分段长度计算模块(301),用于根据存储服务器数量m、频繁访问空间小文件数据子集长度N1计算频繁访问序列分段长度n=N1/m;Spatial small file data access association matrix calculation unit (300), used for utilizing the access sequence segmentation of frequently accessed small spatial file data subsets to calculate the correlation degree of each frequently accessed small spatial file data, and each frequently accessed space The correlation degree values between the small file data form a correlation matrix; including the following modules, the frequent access sequence segmentation length calculation module (301), used for calculating according to the storage server quantity m , the frequent access space small file data subset length N1 Frequent access sequence segment length n=N 1 /m;
存储服务器数量参数m由外部输入。The parameter m of the number of storage servers is input from the outside.
频繁访问序列分段模块(302),用于根据访问序列分段长度对频繁访问序列进行分段,实现如下,The frequent access sequence segmentation module (302), is used for segmenting the frequent access sequence according to the access sequence segmentation length, as follows,
按照访问顺序,将频繁访问空间小文件数据访问序列向量A1以n个元素一组分割为若干子向量,表示为A1=(S1,S2,…,Sl),其中子向量Sk=(ak1,ak2,…,akn),akj∈[1,N1],1≤k≤l,1≤j≤n;将A1中所有子向量集合记为S,S={Sk:k∈[1,l]};According to the access sequence, the frequently accessed space small file data access sequence vector A 1 is divided into several sub-vectors in groups of n elements, expressed as A 1 =(S 1 ,S 2 ,…,S l ), where the sub-vector S k =(a k1 ,a k2 ,…,a kn ), a kj ∈[1,N 1 ], 1≤k≤l, 1≤j≤n; record all sub-vector sets in A 1 as S, S ={S k :k∈[1,l]};
空间小文件数据关联度计算模块(303),用于计算频繁访问的空间小文件数据相互之间的关联度数值,实现如下,Small space file data association degree calculation module (303), used for calculating the association degree value between frequently accessed small space file data, is realized as follows,
定义函数 define function
其中为Sk中的所有元素组成的集合;函数表示在长度为n的访问周期内频繁访问的空间小文件数据和是否具有关联性;in is the set of all elements in S k ; the function Represents small file data in space that is frequently accessed within an access cycle of length n with Is it relevant;
定义函数RS(i1,j1),Define the function R S (i 1 , j 1 ),
其中RS(i1,j1)表示S对和的总关联度;where R S (i 1 , j 1 ) represents the pair of S with the total correlation degree;
空间小文件数据关联矩阵生成模块(304),用于将频繁访问的空间小文件数据相互之间的关联度数值组成关联矩阵RS,Spatial small file data association matrix generating module (304), for forming the association matrix R S with the correlation degree values between frequently accessed small spatial file data,
关联矩阵转换重排单元(400),用于对关联矩阵中各元素数值进行大小转换后利用RCM排序算法重排后输出;An association matrix conversion and rearrangement unit (400), which is used to convert the value of each element in the association matrix and output it after rearranging using the RCM sorting algorithm;
关联矩阵最佳组合搜索单元(500),用于对重排后的关联矩阵利用局部逼近搜索法寻找最佳组合;Incidence matrix best combination search unit (500), used for utilizing local approximation search method to find the best combination for rearranged incidence matrix;
空间小文件数据分布存储单元(600),用于利用关联矩阵最佳组合搜索单元(500)所得最佳组合对频繁访问的空间小文件数据进行分布存储,以及对非频繁访问的空间小文件数据根据空间位置相邻关系分开存储。The small space file data distribution storage unit (600) is used to use the optimal combination obtained by the correlation matrix optimal combination search unit (500) to distribute and store the frequently accessed small space file data, and to store the infrequently accessed small space file data Stored separately according to the adjacent relationship of spatial position.
而且,关联矩阵转换重排单元(400)包括以下模块,And, the correlation matrix conversion and rearrangement unit (400) includes the following modules,
关联矩阵元素最大值获取模块(401),用于获取关联矩阵中元素最大值,包括遍历关联矩阵所有元素值,并获取最大值Rmax;The maximum element value acquisition module (401) of the correlation matrix is used to obtain the maximum value of the elements in the correlation matrix, including traversing all element values of the correlation matrix, and obtaining the maximum value Rmax ;
关联矩阵元素值大小转换模块(402),用于对关联矩阵元素数值进行大小转换,包括遍历关联矩阵所有元素值,并执行操作RS(i1,j1)=Rmax-RS(i1,j1);Incidence matrix element value size conversion module (402), used for performing size conversion on the incidence matrix element value, including traversing all element values of the incidence matrix, and performing the operation R S (i 1 , j 1 )=R max -R S (i 1 ,j 1 );
关联矩阵重排模块(403),用于利用标准RCM排序算法对关联矩阵进行重排。A correlation matrix rearrangement module (403), configured to use a standard RCM sorting algorithm to rearrange the correlation matrix.
而且,关联矩阵最佳组合搜索单元(500)包括以下模块,And, the correlation matrix optimal combination search unit (500) comprises the following modules,
初始化模块,用于初始化当前迭代次数d=1;The initialization module is used to initialize the current number of iterations d=1;
最佳组合搜索模块,用于采用局部逼近搜索法寻找一个最佳组合,包括在当前的矩阵中寻找一个n×n的块,使得该矩阵中n×n块内对应的矩阵元素值最大,相应的n个文件构成一个最佳组合;最佳组合搜索模第一次工作时,当前的矩阵为关联矩阵转换重排单元(400)所得重排后的关联矩阵;最佳组合搜索模后续工作时,当前的矩阵为前一次迭代所得的矩阵;The optimal combination search module is used to find an optimal combination by using the local approximation search method, including finding an n×n block in the current matrix, so that the value of the corresponding matrix element in the n×n block in the matrix is the largest, corresponding The n files constitute an optimal combination; when the optimal combination search module works for the first time, the current matrix is the correlation matrix after the rearrangement of the resulting rearrangement of the correlation matrix conversion rearrangement unit (400); , the current matrix is the matrix obtained from the previous iteration;
矩阵更新模块,用于在最佳组合搜索模块进行本次迭代工作搜索得到一个由n个文件组成的最佳组合后,将关联矩阵中对应n个文件的关联矩阵元素删除,得到(N1-dn)×(N1-dn)的矩阵;The matrix update module is used to delete the elements of the correlation matrix corresponding to n files in the correlation matrix after the optimal combination search module performs this iterative work search to obtain an optimal combination consisting of n files, and obtains (N 1 - dn)×(N 1 -dn) matrix;
判断输出模块,用于判断是否d=m-1,否则令d=d+1,以矩阵更新模块本次迭代工作所得(N1-dn)×(N1-dn)的矩阵为当前的矩阵,命令最佳组合搜索模块进行下一次迭代工作继续搜索下一个最近组合,是则停止搜索,共得到m个最佳组合。Judgment output module, used to judge whether d=m-1, otherwise let d=d+1, take the matrix of (N 1 -dn)×(N 1 -dn) obtained by the matrix update module this iteration work as the current matrix , command the best combination search module to carry out the next iteration and continue to search for the next closest combination, otherwise stop searching, and get m best combinations in total.
本发明具有的有益效果是:空间小文件数据由于数量巨大,但用户访问行为存在聚集性,大部分请求集中在少部分空间小文件数据,为此本发明通过对空间小文件数据进行访问热度分类后,对频繁访问的空间小文件数据利用访问日志信息计算其相互之间的关联度,并在组成关联矩阵后通过局部逼近搜索法寻找最佳分布存储组合方案,并对不同热度的空间小文件数据采用不同的方案分布存储,在有限的计算资源消耗下,实现海量空间小文件数据的最优分布存储,达到提高其并行访问性能,改善空间信息系统的服务能力的目的。因此,本发明能够降低服务器内部空间小文件数据访问时的并发率,从而最终获得服务器间高的空间小文件数据并行访问率,提高了空间小文件数据服务性能,且减少计算数据量,效率较高,具有较好的工程实践性,可应用于大规模分布式环境下空间小文件数据的分布式存储技术领域。The beneficial effect of the present invention is that: due to the huge amount of small space file data, but the aggregation of user access behaviors, most of the requests are concentrated in a small part of small space file data, so the present invention classifies the access popularity of small space file data Finally, use the access log information to calculate the degree of correlation between frequently accessed small space files, and use the local approximation search method to find the best distribution and storage combination scheme after forming the correlation matrix. The data is distributed and stored in different schemes, and under the limited consumption of computing resources, the optimal distributed storage of massive space and small file data is realized, so as to improve its parallel access performance and the service capability of the spatial information system. Therefore, the present invention can reduce the concurrency rate of small file data access inside the server, thereby finally obtaining a high parallel access rate of small file data between servers, improving the service performance of small file data, and reducing the amount of calculation data, and the efficiency is relatively high. High, with good engineering practice, it can be applied to the field of distributed storage technology for small spatial file data in a large-scale distributed environment.
附图说明Description of drawings
图1是本发明实施例中系统结构示意图。Fig. 1 is a schematic diagram of the system structure in the embodiment of the present invention.
图2是本发明实施例中空间小文件数据集预处理单元100结构示意图。Fig. 2 is a schematic structural diagram of the preprocessing unit 100 of the small spatial file data set in the embodiment of the present invention.
图3是本发明实施例中空间小文件数据访问关联矩阵计算单元300结构示意图。FIG. 3 is a schematic structural diagram of a small spatial file data access correlation matrix calculation unit 300 in an embodiment of the present invention.
图4是本发明实施例中关联矩阵转换重排单元400结构示意图。FIG. 4 is a schematic structural diagram of an association matrix transformation and rearrangement unit 400 in an embodiment of the present invention.
图5是本发明实施例中方法流程图。Fig. 5 is a flow chart of the method in the embodiment of the present invention.
具体实施方式detailed description
分布式环境下,对空间小文件数据的访问难以通过对数据的分块分布存储实现对其的并行访问,因此需要分析各个空间小文件数据之间的相互关系,以实现在对空间小文件数据进行访问时,尽可能的使所请求的空间小文件数据存储在不同的存储服务器中,以最大可能的实现对空间小文件数据的并行获取,从而提高空间信息服务系统的性能。In a distributed environment, access to small spatial file data is difficult to achieve parallel access through distributed storage of data in blocks. Therefore, it is necessary to analyze the relationship between the data of each small spatial file in order to realize the access to small spatial file data. When accessing, store the requested data of small spatial files in different storage servers as much as possible, so as to achieve parallel acquisition of data of small spatial files as much as possible, thereby improving the performance of the spatial information service system.
由于空间小文件数据数量巨大,大规模的空间小文件数据的存储组合优化计算复杂度高,搜素时间开销大,为此需要对空间小文件数据进行热度分类,并根据不同热度分别采用不同的方法获取最佳存储组合方案。Due to the huge amount of small space file data, the storage combination optimization of large-scale small space file data has high computational complexity, and the search time is high. Therefore, it is necessary to classify the popularity of small space file data, and use different method to obtain the optimal storage combination scheme.
以下对本发明技术方案的具体实施提供详细建议说明。The following provides detailed suggestion descriptions for the specific implementation of the technical solution of the present invention.
本发明所述空间小文件数据,包含空间数据类型及空间坐标位置,每个空间小文件数据较小,不适于将其继续分成多份并分别存储在不同的服务器上以提高其并行访问效率。所述访问日志信息是由对应的空间信息服务系统按照时间顺序记录的,各个客户端应用访问空间小文件数据的日志信息,包括所访问的空间小文件数据类型和坐标。所述的访问日志信息由空间信息服务系统在运行过程中记录,形式包括但不限于文件、数据库。The small spatial file data of the present invention includes spatial data type and spatial coordinate position, and the data of each small spatial file is small, so it is not suitable to divide it into multiple parts and store them on different servers to improve the efficiency of parallel access. The access log information is recorded by the corresponding spatial information service system in chronological order, and the log information of each client application accessing the small spatial file data includes the data type and coordinates of the small spatial file accessed. The access log information is recorded by the spatial information service system during operation, and the forms include but are not limited to files and databases.
所述空间小文件数据包含不同类型,包括但不限于SRTM30(the 30m of globalShuttle Radar Topography Mission terrain data files)、SRTM90。The small spatial file data includes different types, including but not limited to SRTM30 (the 30m of global Shuttle Radar Topography Mission terrain data files), SRTM90.
所述的一种基于问日志信息的空间小文件数据分布存储方法及系统,针对每种类型的空间小文件数据分别处理,所述方法和系统对不同类型的空间小文件数据处理过程相同。The method and system for distributed storage of small space file data based on interrogation log information processes each type of small space file data separately, and the method and system process the data of different types of small space files in the same way.
如图5所示,本发明的方法所采用的技术方案是:一种基于访问日志信息的空间小文件数据分布存储方法及系统,对任一种空间小文件数据类型,执行包括以下步骤:As shown in Figure 5, the technical solution adopted by the method of the present invention is: a method and system for distributed storage of small space file data based on access log information, and for any type of small space file data, the execution includes the following steps:
(1)频繁访问空间小文件数据子集提取:将空间小文件数据集,按照访问频率不同分成频繁访问子集和非频繁访问子集;包括以下子步骤,(1) Data subset extraction of frequently accessed small files in space: divide the data set of small files in space into frequent access subsets and infrequent access subsets according to different access frequencies; including the following sub-steps,
①获取各空间小文件数据访问热度。① Obtain the data access popularity of small files in each space.
设空间小文件数据集为F={f1,f2,...,fN},包含空间小文件数据f1,f2,...,fN,其中N为空间小文件数据的总个数,第i个空间小文件数据标记为fi,i=1,2,…,N。Let the small space file data set be F={f 1 , f 2 ,...,f N }, including the small space file data f 1 , f 2 ,...,f N , where N is the number of small space file data The total number, the i-th space small file data is marked as f i , i=1,2,...,N.
设访问日志信息中记录依次访问了空间小文件数据空间小文件数据的访问日志序列为对应的称A=(a1,a2,…,aM)为空间小文件数据访问序列向量,at∈[1,N](访问序号t=1,2,…,M),其中M为对F中所有空间小文件数据的访问总次数。Set the access log information to record that the space small file data is accessed in sequence The access log sequence of small file data is Correspondingly, A=(a 1 ,a 2 ,…,a M ) is the data access sequence vector of small spatial files, at t ∈ [1,N] (access sequence number t=1,2,…,M), where M It is the total number of visits to all small file data in F.
统计每个fi(fi∈F)在访问日志序列R中出现的次数λi,则λi为该空间小文件数据fi的访问热度。Count the number of occurrences λ i of each f i (f i ∈ F) in the access log sequence R, then λ i is the access heat of the small file data f i in this space.
②根据空间小文件数据访问热度提取被频繁访问的空间小文件数据。② Extract frequently accessed small space file data according to the data access heat of small space files.
输入频繁访问空间小文件数据的预设判别参数λ,Input the preset discriminant parameter λ for frequently accessed small file data,
若空间小文件数据集F中空间小文件数据fi的访问热度λi>λ,则空间小文件数据fi为频繁访问的空间小文件数据,否则,fi属于非频繁访问的空间小文件数据。If the access heat of small spatial file data f i in the small spatial file data set F is λ i > λ, then the small spatial file data f i is frequently accessed small spatial file data; otherwise, f i belongs to the infrequently accessed small spatial file data.
③根据②所获得的频繁访问空间小文件数据构成空间小文件数据集F的子集③According to ②, the frequently accessed small space file data constitutes a subset of the small space file data set F
若设所有频繁访问的空间小文件数据构成的子集为其中N1为频繁访问的空间小文件数据总个数,第i1、j1个频繁访问的空间小文件数据分别标记为和i1,j1∈[1,N1]。If the subset of all frequently accessed small file data is Among them, N 1 is the total number of frequently accessed small space file data, and the i 1 and j 1th frequently accessed small space file data are respectively marked as with i 1 ,j 1 ∈[1,N 1 ].
同样可设非频繁访问的空间小文件数据集为其中N2为非频繁访问的空间小文件数据总个数。其中N1+N2=N。Similarly, the infrequently accessed small file data set can be set as Among them, N 2 is the total number of infrequently accessed small file data. where N 1 +N 2 =N.
(2)频繁访问空间小文件数据子集访问序列提取:从访问日志信息中提取频繁访问的空间小文件数据子集的访问序列;(2) Access sequence extraction of frequently accessed small file data subsets: extract the access sequence of frequently accessed small file data subsets from the access log information;
访问日志信息记录了空间数据的坐标,不同的坐标代表了不同的数据。因此可从访问日志信息按照访问时间先后顺序提取所访问的空间小文件数据的坐标信息。具体实施时,具体信息提取方式可依据访问日志信息的记录格式确定。坐标信息为空间小文件数据的空间经纬度坐标。Access log information records the coordinates of spatial data, and different coordinates represent different data. Therefore, the coordinate information of the accessed small spatial file data can be extracted from the access log information in order of access time. During specific implementation, the specific information extraction method may be determined according to the recording format of the access log information. The coordinate information is the spatial longitude and latitude coordinates of the small spatial file data.
根据频繁访问空间小文件数据子集提取访问序列子集,实现如下,Extract the access sequence subset according to the frequently accessed space small file data subset, the implementation is as follows,
对访问日志信息中空间小文件数据按照访问时间先后顺序,取其中频繁访问的空间小文件数据,形成频繁访问的空间小文件数据子集的访问序列对应的称为频繁访问空间小文件数据访问序列向量,(访问序号t1=(11,21,…,M1)),其中M1为对F1中所有频繁访问空间小文件数据的访问总次数。In the access log information, the small space file data in the access time sequence is selected, and the frequently accessed small space file data is taken to form the access sequence of the frequently accessed small space file data subset Corresponding name Access sequence vectors for frequently accessed space small file data, (Access sequence number t 1 =(1 1 ,2 1 ,...,M 1 )), where M 1 is the total number of accesses to all frequently accessed space small file data in F 1 .
(3)关联度计算与关联矩阵获取:利用频繁访问的空间小文件数据子集的访问序列分段计算各频繁访问的空间小文件数据的关联度,并将各频繁访问的空间小文件数据相互之间的关联度数值组成关联矩阵;包括以下子步骤,(3) Correlation degree calculation and correlation matrix acquisition: use the access sequences of frequently accessed small spatial file data subsets to calculate the correlation degree of each frequently accessed small spatial file data, and compare the frequently accessed small spatial file data with each other The association degree values between form an association matrix; including the following sub-steps,
①根据存储服务器数量、频繁访问空间小文件数据子集长度N1计算频繁访问序列分段长度n。① Calculate the frequent access sequence segment length n according to the number of storage servers and the data subset length N of frequent access space small files.
存储服务器数量m可由外部输入,例如由系统配置文件输入。The number m of storage servers can be input from outside, for example, from a system configuration file.
通过公式n=N1/m计算得到频繁访问序列分段长度n。The frequent access sequence segment length n is calculated by the formula n=N 1 /m.
②根据访问序列分段长度对频繁访问序列进行分段。②Segment the frequent access sequence according to the segment length of the access sequence.
按照频繁访问空间小文件数据的访问顺序,将频繁访问空间小文件数据访问序列向量A1以n个元素一组分割为若干子向量,表示为:A1=(S1,S2,…,Sl),其中子向量Sk=(ak1,ak2,…,akn),akj∈[1,N1],1≤k≤l,1≤j≤n为A1中的长度为n的子向量。将A1中所有长度为n的访问向量集合记为S,即A1中所有子向量的集合S={Sk:k∈[1,l]}。According to the access sequence of frequently accessed small file data, the frequently accessed small file data access sequence vector A 1 is divided into several sub-vectors in groups of n elements, expressed as: A 1 =(S 1 ,S 2 ,…, S l ), where sub-vector S k = (a k1 ,a k2 ,…,a kn ), a kj ∈ [1,N 1 ], 1≤k≤l, 1≤j≤n is the length in A 1 is a subvector of n. Record all access vector sets with length n in A 1 as S, that is, the set of all sub-vectors in A 1 S={S k :k∈[1,l]}.
③计算频繁访问的空间小文件数据相互之间的关联度数值③Calculate the value of the correlation degree between frequently accessed small space file data
首先计算各分段内空间小文件数据相互关联程度,对定义函数:Firstly, calculate the interrelationship degree of small spatial file data in each segment. Define the function:
其中为Sk中的所有元素组成的集合。函数的意义在于,在长度为n的访问周期内频繁访问的空间小文件数据和是否具有关联性。in is the set of all elements in S k . The significance of the function is that the frequently accessed small file data in the access period of length n with Is it related.
在此基础上,定义函数:Based on this, define the function:
则RS(i1,j1)表示S对和的总关联度。Then R S (i 1 , j 1 ) represents the pair of S with total correlation.
④将频繁访问的空间小文件数据相互之间的关联度数值组成关联矩阵。④ Compose the correlation matrix between the correlation degree values of frequently accessed small spatial file data.
将所有N1个频繁访问的空间小文件数据相互之间的关联度用矩阵表示,即可得到如下的关联矩阵RS。The following correlation matrix R S can be obtained by expressing the correlation degrees among all N 1 frequently accessed small spatial file data in a matrix.
(4)关联矩阵转换和重排输出:对关联矩阵中各元素数值进行大小转换后利用RCM排序算法重排后输出;包括以下子步骤,(4) Correlation matrix conversion and rearrangement output: use the RCM sorting algorithm to rearrange the output after the size conversion of each element value in the correlation matrix; include the following sub-steps,
①获取关联矩阵中元素最大值。① Obtain the maximum value of the element in the incidence matrix.
遍历关联矩阵所有元素值,并获取最大值Rmax。Traverse all element values of the incidence matrix, and obtain the maximum value R max .
②对关联矩阵元素数值进行大小转换。②Convert the size of the element values of the incidence matrix.
遍历关联矩阵所有元素值,并执行操作RS(i1,j1)=Rmax-RS(i1,j1),对关联矩阵元素值大小进行转换。Traverse all element values of the incidence matrix, and execute the operation R S (i 1 , j 1 )=R max -R S (i 1 , j 1 ), to convert the size of the element values of the incidence matrix.
③利用标准RCM排序算法对关联矩阵进行重排。③ Use the standard RCM sorting algorithm to rearrange the correlation matrix.
采用标准RCM排序算法对关联矩阵进行重排,目标是将关联矩阵中非零元素集中在对角线附近。将重排后的新矩阵记为P。标准RCM排序算法为现有技术,具体实施时可参考文献Gibbs N E,Poole W G,Stockmeyer P K.An algorithm for reducing thebandwidth and profile of a sparse matrix.SIAM Journal on Numerical Analysis,1976,13(2):236-250。The standard RCM sorting algorithm is used to rearrange the incidence matrix, and the goal is to concentrate the non-zero elements in the incidence matrix near the diagonal. Denote the rearranged new matrix as P. The standard RCM sorting algorithm is a prior art. For specific implementation, please refer to Gibbs N E, Poole W G, Stockmeyer P K. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM Journal on Numerical Analysis, 1976, 13 (2): 236-250.
(5)最优存储分布组合搜索输出。(5) Optimal storage distribution combination search output.
对(4)所得重排后的关联矩阵利用局部逼近搜索法寻找最佳组合以便获得对该子集空间小文件数据的最高并行访问率。局部逼近搜索法为现有技术,具体实施时可参考文献XIA Kai,Wen-zhan.Adaptive Genetic Algorithm Based on Local Search MechanismQuickly Solving TSP.Journal of Zhejiang Institute of Science and Technology,2014,31(3)。Use the local approximation search method to find the best combination for the rearranged association matrix obtained in (4) in order to obtain the highest parallel access rate of the small file data in the subset space. The local approximation search method is an existing technology. For specific implementation, reference can be made to XIA Kai, Wen-zhan. Adaptive Genetic Algorithm Based on Local Search Mechanism Quickly Solving TSP. Journal of Zhejiang Institute of Science and Technology, 2014, 31(3).
根据(4)所得重排后的关联矩阵,迭代使用局部逼近搜索法,每执行局部逼近搜索一次得到一个包含n个文件的最佳组合,最后可得到m个组合,以便后续分别存储在m个存储服务器上。每个组合由n个文件组成,n个文件相互之间的关联度值在矩阵中对应了一个n×n的块;具体实现如下:According to the rearranged correlation matrix obtained in (4), the local approximation search method is used iteratively. Each time the local approximation search is performed, an optimal combination containing n files can be obtained, and finally m combinations can be obtained for subsequent storage in m stored on the server. Each combination consists of n files, and the correlation value between n files corresponds to an n×n block in the matrix; the specific implementation is as follows:
①初始化当前迭代次数d=1;①Initialize the current number of iterations d=1;
②采用局部逼近搜索法寻找一个最佳组合,包括在当前的矩阵中寻找一个n×n的块,使得该矩阵中n×n块内对应的矩阵元素值最大,相应的n个文件构成一个最佳组合;② Use the local approximation search method to find an optimal combination, including finding an n×n block in the current matrix, so that the value of the corresponding matrix element in the n×n block in the matrix is the largest, and the corresponding n files constitute an optimal combination. good combination;
第一次执行②时,当前的矩阵为(4)所得重排后的关联矩阵,矩阵大小为N1×N1;后续执行②时,当前的矩阵为前一次迭代所得(N1-(d-1)n)×(N1-(d-1)n)的矩阵;When ② is executed for the first time, the current matrix is the rearranged correlation matrix obtained in (4), and the size of the matrix is N 1 ×N 1 ; when ② is executed subsequently, the current matrix is obtained from the previous iteration (N 1 -(d -1)n)×(N 1 -(d-1)n) matrix;
③在本次迭代执行②搜索得到一个由n个文件组成的最佳组合后,将关联矩阵中对应n个文件的关联矩阵元素删除,得到(N1-dn)×(N1-dn)的矩阵,缩小继续搜索的关联矩阵大小,可以节省搜索时间;③ After performing ② search in this iteration to obtain an optimal combination consisting of n files, delete the elements of the correlation matrix corresponding to n files in the correlation matrix to obtain (N 1 -dn)×(N 1 -dn) Matrix, reducing the size of the correlation matrix to continue searching can save search time;
④判断是否d=m-1,否则令d=d+1,以本次迭代执行③所得(N1-dn)×(N1-dn)的矩阵(在d=d+1后即(N1-(d-1)n)×(N1-(d-1)n))为基础作为当前的矩阵,返回②进行下一次迭代继续搜索下一个最近组合,是则停止搜索,当前矩阵为n×n,即可直接得到最后一个由n个文件组成的最佳组合,和依次循环搜索m-1次得到的最佳组合一起,共得到m个最佳组合。④ judge whether d=m-1, otherwise let d=d+1, execute ③ the matrix (N 1 -dn)×(N 1 -dn) obtained by this iteration (after d=d+1, that is (N 1 -(d-1)n)×(N 1 -(d-1)n)) as the current matrix, return to ② for the next iteration and continue to search for the next closest combination, if yes, stop the search, the current matrix is n×n, the last best combination consisting of n files can be directly obtained, together with the best combination obtained by sequentially searching m-1 times, a total of m best combinations can be obtained.
(6)空间小文件数据分布存储:利用(5)最终得到的最佳组合对频繁访问的空间小文件数据进行分布存储,以及对非频繁访问的空间小文件数据根据其空间位置相邻关系分开存储。(6) Distributed storage of small space file data: Use the best combination finally obtained in (5) to store frequently accessed small space file data in a distributed manner, and to separate infrequently accessed small space file data according to their spatial position adjacency storage.
实施例按照获取的最佳组合对频繁访问的空间小文件数据进行分布存储以获取该空间小文件数据的最高并行访问率。The embodiment distributes and stores the frequently accessed small space file data according to the obtained optimal combination to obtain the highest parallel access rate of the small space file data.
通过步骤(5)获得的最佳分布存储组合的空间小文件数据,具有相互之间关联度低的特点(即经过矩阵元素值大小转换后在重排后的关联矩阵中对应元素值大),则可将一个最佳组合中的所有空间小文件数据存储在一个服务器中,以此获得相互之间低的并发访问要求(即实现了不同服务器之间高的并行访问率)。The space-small file data of the optimal distribution storage combination obtained through step (5) has the characteristics of low correlation between each other (that is, the corresponding element value in the rearranged correlation matrix after matrix element value conversion is large), Then all the small-space file data in an optimal combination can be stored in one server, so as to obtain low concurrent access requirements among each other (that is, realize a high parallel access rate among different servers).
根据空间小文件数据的坐标信息,实施例对非频繁访问空间小文件数据根据其空间位置相关性进行分开存储。According to the coordinate information of the small spatial file data, the embodiment separately stores the infrequently accessed small spatial file data according to their spatial location correlation.
针对步骤(1)的F2,按照位置相邻,则存入不同服务器的原则,将非频繁访问的空间小文件数据存储在服务器中。For F 2 in step (1), according to the principle that if the locations are adjacent, then they are stored in different servers, and the data of small files that are infrequently accessed are stored in the server.
根据空间数据访问特征,空间数据访问具有空间访问路劲的连续性,因此,相邻的空间数据具有较高的概率被同时被访问,因此,存储在不同的服务器可以减少并发,提高并行率。According to the characteristics of spatial data access, spatial data access has the continuity of spatial access path. Therefore, adjacent spatial data has a high probability of being accessed at the same time. Therefore, storing in different servers can reduce concurrency and improve parallelism.
具体实施时,所述的频繁访问空间小文件数据的判别参数、关联矩阵RCM排序算法参数、存储服务器数量可由外部输入或由本领域技术人员预先设定。During specific implementation, the discriminant parameters of frequently accessed small file data in space, the parameters of the correlation matrix RCM sorting algorithm, and the number of storage servers can be input externally or preset by those skilled in the art.
参见图1,本发明还相应提供一种基于访问日志信息的空间小文件数据分布存储系统,包括以下单元,Referring to FIG. 1 , the present invention also provides a corresponding access log information-based small space file data distribution storage system, including the following units,
空间小文件数据集预处理单元(100),用于将任一种空间小文件数据类型的空间小文件数据集,按照访问频率不同分成频繁访问的子集和非频繁访问的子集;参见图2,包括以下模块,空间小文件数据访问频率统计模块(101),用于获取各空间小文件数据访问热度,实现如下,The small spatial file data set preprocessing unit (100) is used to divide the small spatial file data set of any small spatial file data type into a frequently accessed subset and a non-frequently accessed subset according to the access frequency; see Fig. 2. Including the following modules, the small space file data access frequency statistics module (101), used to obtain the data access heat of each small space file, the implementation is as follows,
设空间小文件数据集为F={f1,f2,...,fN},包含空间小文件数据f1,f2,...,fN,其中N为空间小文件数据的总个数,第i个空间小文件数据标记为fi,i=1,2,…,N;Let the small space file data set be F={f 1 , f 2 ,...,f N }, including the small space file data f 1 , f 2 ,...,f N , where N is the number of small space file data The total number, the i-th space small file data is marked as f i , i=1,2,...,N;
设访问日志信息中记录依次访问了空间小文件数据空间小文件数据的访问日志序列为A=(a1,a2,…,aM)为空间小文件数据访问序列向量,at∈[1,N],访问序号t=1,2,…,M,其中M为对F中所有空间小文件数据的访问总次数;Set the access log information to record that the space small file data is accessed in sequence The access log sequence of small file data is A=(a 1 ,a 2 ,…,a M ) is the data access sequence vector of small spatial files, a t ∈[1,N], access sequence number t=1,2,…,M, where M is the The total number of access times of all space small file data;
统计每个空间小文件数据fi在访问日志序列R中出现的次数λi,以λi为该空间小文件数据fi的访问热度;Count the number of times λ i of each small file data f i in the access log sequence R, and take λ i as the access heat of the small file data f i in this space;
频繁访问空间小文件数据集提取模块(102),用于根据空间小文件数据访问热度提取被频繁访问的空间小文件数据,实现如下,Frequently accessed small space file data set extraction module (102), used for extracting frequently accessed small space file data according to the access heat of small space file data, realized as follows,
输入预设判别参数λ,Input the preset discriminant parameter λ,
若空间小文件数据集F中空间小文件数据fi的访问热度λi>λ,则空间小文件数据fi为频繁访问的空间小文件数据,否则fi属于非频繁访问的空间小文件数据;If the access heat of small spatial file data f i in the small spatial file data set F is λ i > λ, then the small spatial file data f i is frequently accessed small spatial file data, otherwise f i belongs to infrequently accessed small spatial file data ;
频繁访问空间小文件子集构建模块(103),用于根据频繁访问空间小文件数据集提取模块(102)所得频繁访问的空间小文件数据构成空间小文件数据集的子集,实现如下,Frequently accessed space small file subset construction module (103), used for forming a subset of space small file data set according to frequently accessed space small file data obtained by frequently accessed space small file data set extraction module (102), as follows,
设所有频繁访问的空间小文件数据所构成子集为其中N1为频繁访问的空间小文件数据总个数,第i1、j1个频繁访问的空间小文件数据分别标记为和i1,j1∈[1,N1];Suppose the subset of all frequently accessed small file data is Among them, N 1 is the total number of frequently accessed small space file data, and the i 1 and j 1th frequently accessed small space file data are respectively marked as with i 1 ,j 1 ∈[1,N 1 ];
空间小文件数据访问向量获取单元(200),用于从访问日志信息中提取频繁访问的空间小文件数据子集的访问序列,包括按照时间先后顺序形成访问序列 为频繁访问空间小文件数据访问序列向量,访问序号t1=(11,21,…,M1),其中M1为对F1中所有频繁访问空间小文件数据的访问总次数;Small space file data access vector acquisition unit (200), used to extract the access sequence of frequently accessed small space file data subsets from the access log information, including forming the access sequence in chronological order Access sequence vectors for frequently accessed space small file data, Access sequence number t 1 =(1 1 ,2 1 ,...,M 1 ), where M 1 is the total number of visits to all frequently accessed space small file data in F 1 ;
空间小文件数据访问关联矩阵计算单元(300),用于利用频繁访问的空间小文件数据子集的访问序列分段计算各频繁访问的空间小文件数据的关联度,并将各频繁访问的空间小文件数据相互之间的关联度数值组成关联矩阵;参见图3,包括以下模块,频繁访问序列分段长度计算模块(301),用于根据存储服务器数量m、频繁访问空间小文件数据子集长度N1计算频繁访问序列分段长度n=N1/m;Spatial small file data access association matrix calculation unit (300), used for utilizing the access sequence segmentation of frequently accessed small spatial file data subsets to calculate the correlation degree of each frequently accessed small spatial file data, and each frequently accessed space The correlation degree values between the small file data form a correlation matrix; referring to Fig. 3, including the following modules, the frequent access sequence segmentation length calculation module (301), used for the small file data subset according to the number m of storage servers and frequent access space Length N 1 calculates frequent access sequence segment length n=N 1 /m;
频繁访问序列分段模块(302),用于根据访问序列分段长度对频繁访问序列进行分段,实现如下,The frequent access sequence segmentation module (302), is used for segmenting the frequent access sequence according to the access sequence segmentation length, as follows,
按照访问顺序,将频繁访问空间小文件数据访问序列向量A1以n个元素一组分割为若干子向量,表示为A1=(S1,S2,…,Sl),其中子向量Sk=(ak1,ak2,…,akn),akj∈[1,N1],1≤k≤l,1≤j≤n;将A1中所有子向量集合记为S,S={Sk:k∈[1,l]};According to the access sequence, the frequently accessed space small file data access sequence vector A 1 is divided into several sub-vectors in groups of n elements, expressed as A 1 =(S 1 ,S 2 ,…,S l ), where the sub-vector S k =(a k1 ,a k2 ,…,a kn ), a kj ∈[1,N 1 ], 1≤k≤l, 1≤j≤n; record all sub-vector sets in A 1 as S, S ={S k :k∈[1,l]};
空间小文件数据关联度计算模块(303),用于计算频繁访问的空间小文件数据相互之间的关联度数值,实现如下,Small space file data association degree calculation module (303), used for calculating the association degree value between frequently accessed small space file data, is realized as follows,
定义函数 define function
其中为Sk中的所有元素组成的集合;函数表示在长度为n的访问周期内频繁访问的空间小文件数据和是否具有关联性;in is the set of all elements in S k ; the function Represents small file data in space that is frequently accessed within an access cycle of length n with Is it relevant;
定义函数RS(i1,j1),Define the function R S (i 1 , j 1 ),
其中RS(i1,j1)表示S对和的总关联度;where R S (i 1 , j 1 ) represents the pair of S with the total correlation degree;
空间小文件数据关联矩阵生成模块(304),用于将频繁访问的空间小文件数据相互之间的关联度数值组成关联矩阵RS,Spatial small file data association matrix generating module (304), for forming the association matrix R S with the correlation degree values between frequently accessed small spatial file data,
关联矩阵转换重排单元(400),用于对关联矩阵中各元素数值进行大小转换后利用RCM排序算法重排后输出;An association matrix conversion and rearrangement unit (400), which is used to convert the value of each element in the association matrix and output it after rearranging using the RCM sorting algorithm;
关联矩阵最佳组合搜索单元(500),用于对重排后的关联矩阵利用局部逼近搜索法寻找最佳组合;Incidence matrix best combination search unit (500), used for utilizing local approximation search method to find the best combination for rearranged incidence matrix;
空间小文件数据分布存储单元(600),用于利用关联矩阵最佳组合搜索单元(500)所得最佳组合对频繁访问的空间小文件数据进行分布存储,以及对非频繁访问的空间小文件数据根据空间位置相邻关系分开存储。The small space file data distribution storage unit (600) is used to use the optimal combination obtained by the correlation matrix optimal combination search unit (500) to distribute and store the frequently accessed small space file data, and to store the infrequently accessed small space file data Stored separately according to the adjacent relationship of spatial position.
参见图4,关联矩阵转换重排单元(400)进一步包括以下模块,Referring to Fig. 4, the association matrix conversion and rearrangement unit (400) further includes the following modules,
关联矩阵元素最大值获取模块(401),用于获取关联矩阵中元素最大值,包括遍历关联矩阵所有元素值,并获取最大值Rmax;The maximum element value acquisition module (401) of the correlation matrix is used to obtain the maximum value of the elements in the correlation matrix, including traversing all element values of the correlation matrix, and obtaining the maximum value Rmax ;
关联矩阵元素值大小转换模块(402),用于对关联矩阵元素数值进行大小转换,包括遍历关联矩阵所有元素值,并执行操作RS(i1,j1)=Rmax-RS(i1,j1);Incidence matrix element value size conversion module (402), used for performing size conversion on the incidence matrix element value, including traversing all element values of the incidence matrix, and performing the operation R S (i 1 , j 1 )=R max -R S (i 1 ,j 1 );
关联矩阵重排模块(403),用于利用标准RCM排序算法对关联矩阵进行重排。A correlation matrix rearrangement module (403), configured to use a standard RCM sorting algorithm to rearrange the correlation matrix.
关联矩阵最佳组合搜索单元(500)包括以下模块,The optimal combination search unit (500) of the correlation matrix comprises the following modules,
初始化模块,用于初始化当前迭代次数d=1;The initialization module is used to initialize the current number of iterations d=1;
最佳组合搜索模块,用于采用局部逼近搜索法寻找一个最佳组合,包括在当前的矩阵中寻找一个n×n的块,使得该矩阵中n×n块内对应的矩阵元素值最大,相应的n个文件构成一个最佳组合;最佳组合搜索模第一次工作时,当前的矩阵为关联矩阵转换重排单元(400)所得重排后的关联矩阵;最佳组合搜索模后续工作时,当前的矩阵为前一次迭代所得的矩阵;The optimal combination search module is used to find an optimal combination by using the local approximation search method, including finding an n×n block in the current matrix, so that the value of the corresponding matrix element in the n×n block in the matrix is the largest, corresponding The n files constitute an optimal combination; when the optimal combination search module works for the first time, the current matrix is the correlation matrix after the rearrangement of the resulting rearrangement of the correlation matrix conversion rearrangement unit (400); , the current matrix is the matrix obtained from the previous iteration;
矩阵更新模块,用于在最佳组合搜索模块进行本次迭代工作搜索得到一个由n个文件组成的最佳组合后,将关联矩阵中对应n个文件的关联矩阵元素删除,得到(N1-dn)×(N1-dn)的矩阵;The matrix update module is used to delete the elements of the correlation matrix corresponding to n files in the correlation matrix after the optimal combination search module performs this iterative work search to obtain an optimal combination consisting of n files, and obtains (N 1 - dn)×(N 1 -dn) matrix;
判断输出模块,用于判断是否d=m-1,否则令d=d+1,以矩阵更新模块本次迭代工作所得(N1-dn)×(N1-dn)的矩阵为当前的矩阵,命令最佳组合搜索模块进行下一次迭代工作继续搜索下一个最近组合,是则停止搜索,共得到m个最佳组合。Judgment output module, used to judge whether d=m-1, otherwise let d=d+1, take the matrix of (N 1 -dn)×(N 1 -dn) obtained by the matrix update module this iteration work as the current matrix , command the best combination search module to carry out the next iteration and continue to search for the next closest combination, otherwise stop searching, and get m best combinations in total.
各模块具体实现可与方法具体步骤一致,本发明不予赘述。The specific implementation of each module may be consistent with the specific steps of the method, which will not be described in detail in the present invention.
本文中所描述的具体实施例仅仅是对本发明精神作举例说明。本发明所属技术领域的技术人员可以对所描述的具体实施例做各种各样的修改或补充或采用类似的方式替代,但并不会偏离本发明的精神或者超越所附权利要求书所定义的范围。The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art to which the present invention belongs can make various modifications or supplements to the described specific embodiments or adopt similar methods to replace them, but they will not deviate from the spirit of the present invention or go beyond the definition of the appended claims range.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510042456.9A CN104573082B (en) | 2015-01-28 | 2015-01-28 | Space small documents distributed data storage method and system based on access log information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510042456.9A CN104573082B (en) | 2015-01-28 | 2015-01-28 | Space small documents distributed data storage method and system based on access log information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104573082A CN104573082A (en) | 2015-04-29 |
CN104573082B true CN104573082B (en) | 2017-11-14 |
Family
ID=53089144
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510042456.9A Expired - Fee Related CN104573082B (en) | 2015-01-28 | 2015-01-28 | Space small documents distributed data storage method and system based on access log information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104573082B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107885463B (en) * | 2017-11-10 | 2021-08-31 | 下一代互联网重大应用技术(北京)工程研究中心有限公司 | Target file processing method and device |
CN109491594B (en) * | 2018-09-28 | 2021-12-03 | 北京寄云鼎城科技有限公司 | Method and device for optimizing data storage space in matrix inversion process |
CN109542857B (en) * | 2018-11-26 | 2021-06-29 | 杭州迪普科技股份有限公司 | Audit log storage method, audit log query method, audit log storage device, audit log query device and related equipment |
CN111104381A (en) * | 2019-11-30 | 2020-05-05 | 北京浪潮数据技术有限公司 | Log management method, device and equipment and computer readable storage medium |
CN114063888B (en) * | 2020-07-31 | 2024-11-26 | 中移(苏州)软件技术有限公司 | Data storage system, data processing method, terminal and storage medium |
CN111966950B (en) * | 2020-10-21 | 2021-01-15 | 北京每日优鲜电子商务有限公司 | Log sending method and device, electronic equipment and computer readable medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101453688A (en) * | 2007-12-04 | 2009-06-10 | 中兴通讯股份有限公司 | Method for fast responding scene switching in mobile stream media service |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101324948B (en) * | 2008-07-24 | 2015-11-25 | 阿里巴巴集团控股有限公司 | A kind of method of information recommendation and device |
-
2015
- 2015-01-28 CN CN201510042456.9A patent/CN104573082B/en not_active Expired - Fee Related
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101453688A (en) * | 2007-12-04 | 2009-06-10 | 中兴通讯股份有限公司 | Method for fast responding scene switching in mobile stream media service |
Non-Patent Citations (1)
Title |
---|
基于向量关系表的自动数据收集算法;杨婧 等;《计算机工程与应用》;20071231;第43卷(第15期);176-179 * |
Also Published As
Publication number | Publication date |
---|---|
CN104573082A (en) | 2015-04-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104573082B (en) | Space small documents distributed data storage method and system based on access log information | |
CN110059067B (en) | A method for storage and management of water conservancy space vector big data | |
KR101938953B1 (en) | Flash optimized columnar data layout and data access algorithms for big data query engines | |
Sun et al. | Rm-ssd: In-storage computing for large-scale recommendation inference | |
Mostak | An overview of MapD (massively parallel database) | |
Fu et al. | An experimental evaluation of large scale GBDT systems | |
CN104598565B (en) | A kind of K mean value large-scale data clustering methods based on stochastic gradient descent algorithm | |
CN105069703A (en) | Mass data management method of power grid | |
CN104391679A (en) | GPU (graphics processing unit) processing method for high-dimensional data stream in irregular stream | |
CN107451233B (en) | Storage method of time-attribute-priority spatiotemporal trajectory data file in auxiliary storage device | |
CN104156463A (en) | Big-data clustering ensemble method based on MapReduce | |
CN102693299A (en) | System and method for parallel video copy detection | |
CN109783441A (en) | Mass data inquiry method based on Bloom Filter | |
CN106570173B (en) | A Spark-based clustering method for high-dimensional sparse text data | |
Liu et al. | Segmented analysis for reducing data movement | |
CN106610792A (en) | Repeating data deleting algorithm in cloud storage | |
Aydin et al. | Mining spatiotemporal co-occurrence patterns in non-relational databases | |
CN106547890A (en) | Quick clustering preprocess method in large nuber of images characteristic vector | |
CN107239791A (en) | A kind of higher-dimension K means cluster centre method for optimizing based on LSH | |
Zhang et al. | Overcoming memory constraints in quantum circuit simulation with a high-fidelity compression framework | |
Anusha et al. | Big data techniques for efficient storage and processing of weather data | |
Colmenares et al. | A single-node datastore for high-velocity multidimensional sensor data | |
Li et al. | Block access pattern discovery via compressed full tensor transformer | |
Zhao et al. | Nms: Efficient edge dnn training via near-memory sampling on manifolds | |
AU2023285953A1 (en) | Method and apparatus of processing quantum noise circuit, device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20171114 Termination date: 20190128 |
|
CF01 | Termination of patent right due to non-payment of annual fee |