WO2024074042A1 - Data storage method and apparatus, data reading method and apparatus, and device - Google Patents

Info

Publication number
WO2024074042A1
WO2024074042A1 (PCT/CN2023/094310, CN2023094310W)
Authority
WO
WIPO (PCT)
Prior art keywords
data
size
target
block
item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2023/094310
Other languages
French (fr)
Chinese (zh)
Inventor
林楷智
蔡志恺
黄柏学
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Publication of WO2024074042A1 publication Critical patent/WO2024074042A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0602 - Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/061 - Improving I/O performance
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 - Organizing or formatting or addressing of data
    • G06F 3/0644 - Management of space entities, e.g. partitions, extents, pools
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0668 - Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067 - Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of data processing technology, and in particular to a data storage method and device, a data reading method and device, a device and a non-volatile storage medium.
  • Machine learning in artificial intelligence requires data sets to be collected, labeled and preprocessed before they can be read and used in the training and inference of machine learning and deep learning.
  • the reading and writing of data sets may have a great negative impact on the overall performance of artificial intelligence training and inference.
  • the main reasons include: (1) depending on the needs of different algorithms, a data set may contain tens of thousands of items or more (each item being, for example, an image, text or voice); (2) the data set needs to be preprocessed into usable training/test data and written to the hard disk; (3) after the data set is preprocessed, each data item will usually become smaller, and its size is fixed; (4) after the above three steps are completed, the training and inference process actually consists of "reading" tens of thousands of small data items for calculation.
  • An object of an embodiment of the present application is to provide a data storage method, which saves data reading time and improves data reading and writing efficiency; another object of the present application is to provide a data storage device, a data reading method and device, a device and a non-volatile storage medium.
  • a data storage method including:
  • Each data in the target data set is stored in each continuous target block of the hard disk with the same size; wherein the block size of each target block is determined according to the data size.
  • after receiving the target data set to be stored and before obtaining the data size of each item of data in the target data set, the method further includes:
  • a first preprocessing operation is performed on the target data set; wherein the first preprocessing operation is a preprocessing operation that does not increase the data size.
  • a first preprocessing operation is performed on the target data set, including:
  • receiving a target data set to be stored includes:
  • obtaining the data size of each item of data in the target data set includes:
  • a process of determining a block size of a target block according to the data size includes:
  • the block size of the target block is obtained by selecting from various optional block sizes that are larger than the data size.
  • the block size of the target block is selected from various optional block sizes that are larger than the data size, including:
  • the optional block size with the smallest difference from the data size is determined as the block size of the target block.
  • after obtaining the preset optional block sizes, the method further includes:
  • determining whether the data size is less than or equal to the maximum value among the optional block sizes;
  • if so, performing the step of selecting the block size of the target block from the optional block sizes that are larger than the data size;
  • if not, determining the maximum value among the optional block sizes as the block size of the target block.
  • a data reading method comprising:
  • the read data are returned to the sender of the data read command.
  • after reading each item of data of the target data set from the consecutive target blocks of the same size in the hard disk, and before returning each item of data read to the sender of the data read command, the method further includes:
  • a second preprocessing operation is performed on each of the read data; wherein the second preprocessing operation is a preprocessing operation for increasing the size of the data.
  • a second preprocessing operation is performed on each piece of data read, including:
  • receiving a data read command includes:
  • reading each item of data of the target data set from consecutive target blocks of the same size in the hard disk includes:
  • each data item of the target data set is read from each consecutive target block of the same size in the hard disk according to the one-to-one relationship between the target block and each data item.
  • reading each item of data of the target data set from consecutive target blocks of the same size in the hard disk includes:
  • each data item of the target data set is read from each consecutive target block of the same size in the hard disk according to the many-to-one relationship between the target block and each data item; wherein each data item is pre-stored in adjacent consecutive blocks.
  • a data storage device including:
  • a data set receiving module used for receiving a target data set to be stored
  • a data size acquisition module is used to acquire the data size of each item of data in the target data set; wherein the size of each item of data in the target data set is the same;
  • the data storage module is used to store each data in the target data set into each continuous target block of the same size in the hard disk; wherein the block size of each target block is determined according to the data size.
  • a data reading device including:
  • a read command receiving module used for receiving a data read command
  • the data reading module is used to read each item of data of the target data set from each consecutive target block of the same size in the hard disk; wherein the block size of each target block is determined according to the data size of each item of data, and the size of each item of data in the target data set is the same;
  • the data return module is used to return the read data to the sender of the data read command.
  • an electronic device including:
  • the processor is used to implement the steps of the above data storage method or data reading method when executing a computer program.
  • a non-volatile readable storage medium on which a computer program is stored.
  • the computer program is executed by a processor, the steps of the above data storage method or data reading method are implemented.
  • the data storage method provided in the embodiment of the present application receives a target data set to be stored; obtains the data size of each data item in the target data set; wherein the size of each data item in the target data set is the same; stores each data item in the target data set into consecutive target blocks of the same size in the hard disk; wherein the block size of each target block is determined according to the data size.
  • the embodiments of the present application also provide a data storage device, a data reading method and device, an apparatus and a non-volatile storage medium corresponding to the above-mentioned data storage method, which have the above-mentioned technical effects and will not be repeated here.
  • FIG1 is a flowchart of an implementation method of data storage in some embodiments of the present application.
  • FIG2 is another implementation flow chart of the data storage method in some embodiments of the present application.
  • FIG3 is a flowchart of an implementation of a data reading method in some embodiments of the present application.
  • FIG4 is another implementation flow chart of the data reading method in some embodiments of the present application.
  • FIG5 is a schematic diagram of a handwriting recognition data set image file in some embodiments of the present application.
  • FIG6 is a schematic diagram of a normalized handwriting recognition data set image file in some embodiments of the present application.
  • FIG7 is a schematic diagram of a data set image classification in some embodiments of the present application.
  • FIG8 is a structural block diagram of a data storage device in some embodiments of the present application.
  • FIG9 is a structural block diagram of a data reading device in some embodiments of the present application.
  • FIG10 is a structural block diagram of an electronic device in some embodiments of the present application.
  • FIG. 11 is a schematic diagram of a specific structure of an electronic device provided in some embodiments of the present application.
  • the operating system and the file system do not guarantee the continuity of each piece of data stored in the hard disk.
  • each piece of data will be cut into data blocks because of the block size planned by the hard disk and the file system.
  • the continuity cannot be guaranteed and the data is often stored discontinuously in the hard disk.
  • reading data still requires the CPU (Central Processing Unit), main memory and hard disk I/O (Input/Output) system, together with the related software and operating system, to complete data reading and writing.
  • this means that the GPU (Graphics Processing Unit), CPU, main memory, hard disk I/O and the related software and operating system need to communicate with each other frequently and transmit data whose addresses are not contiguous.
  • with the data storage method provided in the present application, it is ensured that all data in the target data set are stored in continuous blocks of the hard disk, so that data storage is greatly optimized.
  • when reading, the data can be read directly from the continuous blocks of the hard disk, which saves data reading time and improves data reading and writing efficiency.
  • FIG. 1 is a flowchart of an implementation method of a data storage method in an embodiment of the present application. The method may include the following steps:
  • the target data set to be stored is sent to the accelerator.
  • the target data set to be stored may be sent to the accelerator via a CPU or a GPU, and the accelerator receives the target data set to be stored.
  • the target dataset can be a dataset used for artificial intelligence machine learning training, such as a dataset used for training an image recognition model, or a dataset used for an item recommendation model, etc.
  • the data types in the target dataset can be pictures, text, voice, etc.
  • the CPU host or GPU host and the accelerator may be connected via a physical connection or via a network, which is not limited in the embodiments of the present application.
  • the sizes of each data item in the target data set are the same.
  • After receiving the target data set to be stored, the target data set can be preprocessed and transformed into usable training data or test data so that the size of each data item in the target data set is the same, and the data size of each data item in the target data set is obtained.
  • Image decoding: decode the compressed image. Color images are decoded and stored in three pixel channels: R (Red), G (Green), and B (Blue). Some model algorithms will need to be trained on one or more of the R, G, and B channels later;
  • Grayscale conversion simply converts an image from color to black and white. It is often used to reduce computational complexity in artificial intelligence algorithms. Since most images do not require color recognition, it is wise to use grayscale conversion, which reduces the number of pixels in the image and thus reduces the amount of computation required;
  • Normalization is the process of projecting image data pixels (intensities) to a predefined range, usually (0, 1) or (-1, 1), although different algorithms define it differently. Its purpose is to improve fairness across all images: for example, scaling all images to an equal range of [0, 1] or [-1, 1] allows all images to contribute equally to the total loss, rather than some images contributing disproportionately strong or weak losses because their pixel values span higher or lower ranges.
  • the purpose of normalization also includes providing a standard learning rate. Since high-pixel images require a low learning rate, while low-pixel images require a high learning rate, rescaling helps provide a standard learning rate for all images.
  • Data augmentation is the process of making small changes to existing data to increase its diversity without collecting new data. This is a technique used to expand a dataset. Standard data augmentation techniques include horizontal and vertical flipping, rotation, cropping, shearing, etc. Performing data augmentation helps prevent neural networks from learning irrelevant features and improves model performance.
  • Standardization is a method of scaling and preprocessing images to make them have similar or consistent height and width. When training, testing, and inference of artificial intelligence, if the size of the image is consistent, the processing will be more efficient.
  • the minimum write block size available for selection is pre-set, such as 256 bytes, 512 bytes, 1024 bytes, 2048 bytes, 4096 bytes, etc.
  • Some solid state drives can support a larger range of block sizes.
  • the data in the target data set is stored in each target block of the same size in the hard disk.
  • the present application can directly read from the continuous blocks of the hard disk when reading data, saving the time of data reading and improving the efficiency of data reading and writing.
  • the embodiments of the present application also provide corresponding improved solutions.
  • the same steps or corresponding steps as those in the above embodiments can be referenced to each other, and the corresponding beneficial effects can also be referenced to each other, which will not be repeated one by one in the following improved embodiments.
  • FIG. 2 is another implementation flow chart of the data storage method in an embodiment of the present application.
  • the method may include the following steps:
  • step S201 may include the following steps:
  • a target data set for artificial intelligence model training is collected in advance and sent to an accelerator, which receives the target data set for artificial intelligence model training to be stored.
  • the target dataset can contain a training set, a validation set and a test set.
  • the model is fitted on a training dataset.
  • the training set is a collection of examples used to fit parameters (such as the weights of the connections between neurons in an artificial neural network).
  • the training set is usually a data pair consisting of an input vector and an output vector.
  • the output vector is called the target.
  • the current model makes predictions for each example in the training set and compares the predictions with the target. Based on the results of the comparison, the learning algorithm updates the parameters of the model.
  • the process of model fitting may include both feature selection and parameter estimation.
  • the fitted model is used to make predictions on a validation dataset.
  • the validation set provides an unbiased evaluation of the model fitted on the training set when tuning the model's hyperparameters (e.g., the number of neurons in the hidden layer of a neural network).
  • the validation set can be used for early stopping in regularization, i.e., stopping training when the validation set error rises (a sign of overfitting on the training set).
  • test dataset can be used to provide an unbiased evaluation of the final model. If the test dataset is never used during training (for example, not used in cross-validation), it is also called a holdout set.
  • test set and validation set can be the same set.
  • the method may further include the following steps:
  • a first preprocessing operation is performed on the target data set; wherein the first preprocessing operation is a preprocessing operation that does not increase the data size.
  • a first preprocessing operation is performed on the target data set.
  • the first preprocessing operation includes partial preprocessing, which is generally performed without increasing the size of the data.
  • Some preprocessing operations, such as horizontal flipping, vertical flipping and rotation of the image to be recognized, can be performed at storage time without enlarging the data, which avoids data enlargement during data storage, saves storage space, and reduces costs.
  • performing a first preprocessing operation on the target data set may include the following steps:
  • the sizes of each data item in the target data set are the same.
  • step S202 may include the following steps:
  • each data item in the target data set also includes a data tag and a data file name, so that the data itself, the data tag and the data file name together constitute a complete data item.
  • the data label is the reference standard corresponding to the data item, and the data file name is the identification information that uniquely identifies the data item.
  • the minimum write block size that can be selected is preset. For example, for a solid state drive (SSD), it can usually be set to 256 bytes, 512 bytes, 1024 bytes, 2048 bytes, 4096 bytes, etc. Some solid state drives can support a larger range of block sizes. After obtaining the data size of each data item in the target data set, obtain the preset optional block sizes.
  • In step S204, it is determined whether the data size is less than or equal to the maximum value of the optional block sizes. If the data size is less than or equal to the maximum value of the optional block sizes, step S205 is executed; if the data size is greater than the maximum value of the optional block sizes, step S206 is executed.
  • If the data size is less than or equal to the maximum value of the optional block sizes, it means that there is a selectable block size such that each data item can be stored in a single complete block, and step S205 is executed. If the data size is greater than the maximum value of the optional block sizes, it means that the size of each data item exceeds the maximum block size supported by the hard disk, multiple blocks are required to accommodate each data item, and step S206 is executed.
  • S205 Selecting a block size of a target block from various optional block sizes that are larger than the data size.
  • the data size is less than or equal to the maximum value of the optional block sizes, it means that there is an optional block size, so that each data item is stored in only one complete block, and the block size of the target block is selected from the optional block sizes that are larger than the data size.
  • step S205 may include the following steps:
  • Step 1: select, from the optional block sizes that are larger than the data size, the optional block size with the smallest difference from the data size;
  • Step 2: determine the optional block size with the smallest difference from the data size as the block size of the target block.
  • the optional block size with the smallest difference from the data size is selected from the optional block sizes larger than the data size, and the optional block size with the smallest difference from the data size is determined as the block size of the target block.
  • the target block size is determined to be 4096 bytes.
  • S206 Determine the maximum value among the optional block sizes as the block size of the target block.
  • each data item size is 3200 bytes.
  • the block size of the target block is set to 2048 bytes.
  • FIG. 3 is a flowchart of an implementation of a data reading method in an embodiment of the present application. The method may include the following steps:
  • a data read command is sent to the accelerator, such as the CPU or GPU sending the data read command to the accelerator, and the accelerator receives the data read command.
  • S302 reading each data item of the target data set from each consecutive target block of the hard disk with the same size; wherein the block size of each target block is determined according to the data size of each data item, and the size of each data item in the target data set is the same.
  • the target data set pre-stored in the accelerator is pre-processed before storage so that the size of each data item in the target data set is the same, and the block size of the target block is determined according to the data size of each data item in the target data set, so that the target data set is stored in a continuous block.
  • the accelerator After receiving the data read command, the accelerator reads each data item of the target data set from each continuous and same-sized target block in the hard disk. By reading each data item of the target data set from continuous blocks, the data reading rate is greatly improved.
  • After reading each item of data in the target data set from the continuous blocks of the target block size in the hard disk, the read data is returned to the sender of the data read command, thereby completing the fast reading of each item of data in the target data set.
  • the sending end is generally the host CPU or host GPU that interacts with the accelerator to read and write data.
  • FIG. 4 is another implementation flow chart of the data reading method in the embodiment of the present application.
  • the method may include the following steps:
  • step S401 may include the following steps:
  • the target data set pre-stored in the accelerator may be a data set for artificial intelligence model training.
  • a data read command for reading the target data set for artificial intelligence model training is sent to the accelerator.
  • the accelerator receives the data read command for reading the target data set for artificial intelligence model training.
  • S402 Read each data item of the target data set from each consecutive target block of the hard disk with the same size; wherein the block size of each target block is determined according to the data size of each data item, and the size of each data item in the target data set is the same.
  • step S402 may include the following steps:
  • each data item of the target data set is read from each consecutive target block of the same size in the hard disk according to the one-to-one relationship between the target block and each data item.
  • each data item of the target data set is read from each continuous and same-sized target block in the hard disk, thereby realizing fast reading of each data item in the target data set.
  • step S402 may include the following steps:
  • each data item of the target data set is read from each consecutive target block of the same size in the hard disk according to the many-to-one relationship between the target block and each data item; wherein each data item is pre-stored in adjacent consecutive blocks.
  • each data item of the target data set is read from each continuous and same-sized target block in the hard disk. This enables continuous reading of each data item in the target data set, improving data reading efficiency.
  • S403 performing a second preprocessing operation on each item of the read data; wherein the second preprocessing operation is a preprocessing operation for increasing the size of the data.
  • a second preprocessing operation to increase the data size is performed on each data set read.
  • the accelerator first reads the previously partially preprocessed data set from the continuous blocks of the hard disk, then performs all the remaining preprocessing that was not performed when writing, and finally transmits the fully preprocessed data set back to the CPU host or GPU host system. In this way, the part of the preprocessing performed at write time (processing that does not increase the size) is moved from the CPU host or GPU host system to the accelerator, which further frees up the CPU host or GPU host, achieves a more coordinated arrangement of the preprocessing steps and their order, and reduces the size of the data set when the data is stored.
  • this process embodies the concept of computational storage.
  • the read and write performance of the data set is optimized, that is, "computing power" is exchanged for space reduction.
  • the accelerator implements normalization preprocessing, which actually only passes through the fast matrix operation circuit and then returns the data item to the CPU host or GPU host. This process only requires a small amount of additional computing time and circuit cost, which can greatly reduce the storage space of the data set.
  • In-storage computing/computing storage refers to placing computing components closer to the storage device, while minimizing the use of the CPU's computing power. This speeds up the overall storage performance and offloads the CPU.
  • step S403 may include the following steps:
  • normalization preprocessing operations may be performed on each item of data read.
  • the value of each point in the 28x28 array that has not yet been normalized is represented by 0 to 255 for its color intensity, which means that only 1 byte is needed to represent the value of each point, whereas after normalization preprocessing the floating-point value of each point requires 4 bytes. Therefore, during preprocessing, normalization is not performed first, and the data set obtained after partial preprocessing is written to the hard disk; during reading, normalization preprocessing is performed in the accelerator, and the result is then transmitted back to the CPU host or GPU host. In this way, compared with storing the normalized floating-point values, each data item occupies only 1/4 of the size when stored on the hard disk in the present application, so less space is used.
  • Figure 5 is a schematic diagram of a handwriting recognition data set image file in an embodiment of the present application
  • Figure 6 is a schematic diagram of a handwriting recognition data set image file after normalization in an embodiment of the present application.
  • the left side of Figure 5 is a 28x28 grayscale monochrome image of "9"
  • the right side shows the value of each image point in the 28x28 array (normalization preprocessing has not yet been performed), with its color intensity represented by 0 to 255.
  • the left side of Figure 6 is a picture, which is finally represented by a 28x28 floating point array after normalization preprocessing (each array element is divided by 255).
  • Figure 7 is a schematic diagram of a data set image classification in an embodiment of the present application.
  • Even the relatively simple MNIST data set has 60,000 training items and 10,000 test items, and all 70,000 records need to be preprocessed.
  • the image of each number is stored in a 28x28 array, and each one has a label to annotate its real number. If the label range is 0 to 9, only one more byte is needed to store the label annotation.
  • Subsequent model training will frequently read 60,000 items (training sets) depending on the training process. After each batch of training is completed, 10,000 items (test sets) may also be read multiple times to confirm whether the test is successful or not.
  • the CPU/GPU host systems 0/1/2 write to and read from the corresponding data set access accelerators 0/1/2, respectively.
  • the writing process of multiple data sets with a more advanced data set hardware core architecture is as follows:
  • the CPU/GPU host system 0/1/2 directly transmits the data set that is “not pre-processed at all” to the corresponding data set access accelerator 0/1/2;
  • Data set access accelerator 0/1/2 receives the written data set and first performs partial preprocessing that does not cause the data to become larger. In this way, the written data set requires the least storage space. This step transfers part of the preprocessing operations of the CPU/GPU to the accelerator;
  • the settings and parameters for read preprocessing set by the CPU/GPU host system 0/1/2 are received from the transmission medium. This step usually only needs to be performed once and is applicable to the reading of all data items;
  • the pre-processed data set is transmitted back to the CPU/GPU host system 0/1/2.
  • the present application also provides a data storage device.
  • the data storage device described below and the data storage method described above can be referenced to each other.
  • FIG. 8 is a structural block diagram of a data storage device in an embodiment of the present application, and the device may include:
  • the data set receiving module 81 is used to receive the target data set to be stored
  • the data size acquisition module 82 is used to acquire the data size of each item of data in the target data set; wherein the size of each item of data in the target data set is the same;
  • the data storage module 83 is used to store each data in the target data set into each continuous target block of the same size in the hard disk; wherein the block size of each target block is determined according to the data size.
  • the device may further include:
  • the first preprocessing module is used to perform a first preprocessing operation on the target data set after receiving the target data set to be stored and before obtaining the data size of each data item in the target data set; wherein the first preprocessing operation is a preprocessing operation that does not increase the data size.
  • the first preprocessing module is specifically a module that performs preprocessing operations on the target data set except for normalization preprocessing.
  • the data set receiving module 81 is specifically a module for receiving a target data set to be stored for artificial intelligence model training.
  • the data size acquisition module 82 is specifically a module for acquiring the data size of each data item in the target data set, which is composed of the data itself, the data label, and the data file name.
  • the device may further include a block size determination module, and the block size determination module includes:
  • the optional block size acquisition submodule is used to obtain preset optional block sizes
  • the block size selection submodule is used to select the block size of the target block from various optional block sizes that are larger than the data size.
  • the block size selection submodule includes:
  • the block size selection unit is used to select the block size with the smallest difference with the data size from the optional block sizes larger than the data size.
  • the block size determining unit is used to determine the optional block size with the smallest difference with the data size as the block size of the target block.
  • the device may further include:
  • a judging module used for judging whether the data size is less than or equal to the maximum value of the optional block sizes after obtaining the preset optional block sizes
  • the block size selection submodule is specifically a module for selecting the block size of the target block from the optional block sizes that are larger than the data size when it is determined that the data size is less than or equal to the maximum value of the optional block sizes;
  • the block size determination module is specifically a module that determines the maximum value among the optional block sizes as the block size of the target block when it is determined that the data size is greater than the maximum value among the optional block sizes.
  • the present application further provides a data reading device.
  • the data reading device described below and the data reading method described above can refer to each other.
  • FIG. 9 is a structural block diagram of a data reading device in an embodiment of the present application, and the device may include:
  • a read command receiving module 91 used for receiving a data read command
  • the data reading module 92 is used to read each data item of the target data set from each target block of the hard disk that is continuous and has the same size; wherein the block size of each target block is determined according to the data size of each data item, and the size of each data item in the target data set is the same;
  • the data return module 93 is used to return the read data to the sending end of the data read command.
  • the device may further include:
  • the second preprocessing module is used to perform a second preprocessing operation on each item of data read from each consecutive target block of the same size in the hard disk before returning the read data to the sending end of the data reading command; wherein the second preprocessing operation is a preprocessing operation to increase the data size.
  • the second preprocessing module is specifically a module that performs normalization preprocessing operations on each piece of data read.
  • the read command receiving module 91 is specifically a module that receives a data read command to read a target data set for artificial intelligence model training.
  • the data reading module 92 is specifically a module that reads each data item of the target data set from consecutive target blocks of the same size in the hard disk according to a one-to-one relationship between the target block and each data item when the block size of the target block is greater than or equal to the data size.
  • the data reading module 92 is specifically a module that reads each data item of the target data set from consecutive target blocks of the same size in the hard disk according to the many-to-one relationship between the target block and each data item when the block size of the target block is smaller than the data size; wherein each data item is pre-stored in adjacent consecutive blocks.
  • FIG. 10 is a schematic diagram of an electronic device provided by the present application, and the device may include:
  • the processor 322 is used to implement the steps of the data storage method or the data reading method of the above method embodiment when executing a computer program.
  • FIG. 11 is a schematic diagram of the specific structure of an electronic device provided in this embodiment.
  • the electronic device may have relatively large differences due to different configurations or performances, and may include a processor (central processing units, CPU) 322 (for example, one or more processors) and a memory 332, and the memory 332 stores one or more computer applications 342 or data 344.
  • the memory 332 can be a temporary storage or a permanent storage.
  • the program stored in the memory 332 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the data processing device.
  • the processor 322 can be configured to communicate with the memory 332 to execute a series of instruction operations in the memory 332 on the electronic device 301.
  • the electronic device 301 may further include one or more power supplies 326 , one or more wired or wireless network interfaces 350 , one or more input and output interfaces 358 , and/or one or more operating systems 341 .
  • the steps in the data storage method or data reading method described above can be implemented by the structure of an electronic device.
  • the present application further provides a non-volatile readable storage medium, on which a computer program is stored.
  • when the computer program is executed by a processor, the steps of the data storage method or data reading method described above are implemented.
  • the non-volatile readable storage medium may include: a U disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and other media that can store program codes.
  • each embodiment is described in a progressive manner, and each embodiment focuses on the differences from other embodiments.
  • the same or similar parts between the embodiments can be referred to each other.
  • the description is relatively simple, and the relevant parts can be referred to the method part description.
  • the solution provided by the embodiment of the present application can be applied to the field of data processing technology.
  • a target data set to be stored is received; the data size of each data item in the target data set is obtained, wherein the size of each data item in the target data set is the same; and each data item in the target data set is stored in each continuous target block of the same size in the hard disk, wherein the block size of each target block is determined according to the data size, thereby achieving the technical effect of saving data reading time and improving data reading and writing efficiency.
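As a worked example of the block-size choice and the space saving described in the bullets above, the following short calculation uses the 28x28 handwriting-recognition item (one byte per pixel plus a one-byte label) and the optional block sizes 256/512/1024/2048/4096 bytes mentioned in this description; the helper name and the exact byte accounting are illustrative assumptions rather than part of the disclosure.

```python
# Illustrative arithmetic only; the numbers follow the 28x28 handwriting example
# and the optional block sizes (256 ... 4096 bytes) given in this description.

OPTIONAL_BLOCK_SIZES = [256, 512, 1024, 2048, 4096]  # preset selectable block sizes

def pick_block_size(item_size: int) -> int:
    """Smallest optional block size that is not smaller than the item size,
    falling back to the largest optional size when the item exceeds all of them."""
    candidates = [size for size in OPTIONAL_BLOCK_SIZES if size >= item_size]
    return min(candidates) if candidates else max(OPTIONAL_BLOCK_SIZES)

raw_item = 28 * 28 * 1 + 1          # 784 one-byte pixels + 1-byte label = 785 bytes
normalized_item = 28 * 28 * 4 + 1   # 4-byte floats after normalization = 3137 bytes

print(pick_block_size(raw_item))         # 1024: one block per item
print(pick_block_size(normalized_item))  # 4096: four times the space per item
```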

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a data storage method. The method comprises the following steps: receiving a target data set to be stored; obtaining the data size of each piece of data in said target data set, wherein the size of each piece of data in said target data set is the same; and storing each piece of data in said target data set into target blocks in a hard disk which are continuous and have the same size, wherein the block size of each target block is determined according to the data size. The present application further discloses a data storage apparatus, a data reading method and apparatus, a device, and a storage medium.

Description

Data storage method and apparatus, data reading method and apparatus, and device

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to the Chinese patent application filed with the China Patent Office on October 8, 2022, with application number 202211219584.2 and application name "Data storage method and apparatus, data reading method and apparatus, and device", all contents of which are incorporated by reference in this application.

Technical Field

The present application relates to the field of data processing technology, and in particular to a data storage method and device, a data reading method and device, a device and a non-volatile storage medium.

Background Art

Artificial intelligence has developed rapidly in recent years. Machine learning in artificial intelligence requires data sets to be collected, labeled and preprocessed before they can be read and used in the training and inference of machine learning and deep learning.

However, the reading and writing of data sets may have a great negative impact on the overall performance of artificial intelligence training and inference. The main reasons include: (1) depending on the needs of different algorithms, a data set may contain tens of thousands of items or more (each item being, for example, an image, text or voice); (2) the data set needs to be preprocessed into usable training/test data and written to the hard disk; (3) after the data set is preprocessed, each data item will usually become smaller, and its size is fixed; (4) after the above three steps are completed, the training and inference process actually consists of "reading" tens of thousands of small data items for calculation. In other words, to access a data set, it is actually necessary to execute many system programs and to spend time searching the hard disk for all data items of the data set in order to restore the original data set. A lot of time is spent searching mostly discontinuous blocks of the hard disk before they can be combined into the original data set, resulting in low data reading and writing efficiency.

In summary, how to effectively solve the problem that a large amount of time is spent searching mostly discontinuous blocks of the hard disk before they can be combined into the original data set, resulting in low data reading and writing efficiency, is an issue that those skilled in the art urgently need to solve.

Summary of the Invention

An object of an embodiment of the present application is to provide a data storage method, which saves data reading time and improves data reading and writing efficiency; another object of the present application is to provide a data storage device, a data reading method and device, a device and a non-volatile storage medium.

In order to solve the above technical problems, the present application provides the following technical solutions:

According to a first aspect of the embodiments of the present application, a data storage method is provided, including:

receiving a target data set to be stored;

obtaining the data size of each item of data in the target data set, wherein the size of each item of data in the target data set is the same; and

storing each item of data in the target data set into continuous target blocks of the same size in a hard disk, wherein the block size of each target block is determined according to the data size.
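The storage flow summarized by these three steps can be illustrated with a minimal sketch. The snippet below is only an illustration of the layout idea, not the claimed implementation: it assumes each item already fits into a single target block (the many-to-one case is sketched with the reading method further below), and an ordinary file stands in for the hard disk; all names are made up for the example.

```python
import os

def store_dataset(path: str, items: list, block_size: int) -> None:
    """Store every fixed-size item of the target data set in consecutive,
    same-size blocks: item i starts exactly at offset i * block_size.
    Sketch only: an ordinary file stands in for the hard disk."""
    assert all(len(item) == len(items[0]) for item in items), "items must have the same size"
    assert len(items[0]) <= block_size, "this sketch assumes one block per item"
    with open(path, "wb") as disk:
        for i, item in enumerate(items):
            disk.seek(i * block_size)   # jump to the start of the i-th target block
            disk.write(item)            # the unused tail of the block is left as padding
        disk.truncate(len(items) * block_size)

# Example: three 785-byte items laid out in 1024-byte blocks.
store_dataset("dataset.bin", [bytes(785)] * 3, block_size=1024)
print(os.path.getsize("dataset.bin"))   # 3072 = 3 * 1024
```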

In a specific implementation of the present application, after receiving the target data set to be stored and before obtaining the data size of each item of data in the target data set, the method further includes:

performing a first preprocessing operation on the target data set, wherein the first preprocessing operation is a preprocessing operation that does not increase the data size.

In a specific implementation of the present application, performing the first preprocessing operation on the target data set includes:

performing preprocessing operations on the target data set other than normalization preprocessing.
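As one possible sketch of such a write-time first preprocessing pass (an assumption for illustration, not the disclosed implementation): operations that do not enlarge the data, such as grayscale conversion and flipping, are applied before storage, while normalization is deliberately left for read time. NumPy and the function name are chosen here only for the example.

```python
import numpy as np

def first_preprocessing(image: np.ndarray) -> np.ndarray:
    """Write-time preprocessing that does not increase the data size:
    grayscale conversion and a horizontal flip; normalization is skipped here
    and performed later, at read time, inside the accelerator."""
    if image.ndim == 3:                     # H x W x 3 color image
        image = image.mean(axis=2)          # grayscale: one channel instead of three
    image = image[:, ::-1]                  # horizontal flip keeps the byte count unchanged
    return image.astype(np.uint8)           # still 1 byte per pixel, no 4-byte floats yet

# A 28x28x3 color image (2352 bytes) becomes a 784-byte grayscale item.
item = first_preprocessing(np.random.randint(0, 256, (28, 28, 3), dtype=np.uint8))
print(item.shape, item.dtype, item.nbytes)  # (28, 28) uint8 784
```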

In a specific implementation of the present application, receiving a target data set to be stored includes:

receiving a target data set to be stored for artificial intelligence model training.

In a specific implementation of the present application, obtaining the data size of each item of data in the target data set includes:

obtaining the data size of each item of data in the target data set, where each item of data consists of the data itself, a data label and a data file name.

In a specific implementation of the present application, a process of determining the block size of a target block according to the data size is also included, and the process includes:

obtaining the preset optional block sizes; and

selecting the block size of the target block from the optional block sizes that are larger than the data size.

In a specific implementation of the present application, selecting the block size of the target block from the optional block sizes that are larger than the data size includes:

selecting, from the optional block sizes that are larger than the data size, the optional block size with the smallest difference from the data size; and

determining the optional block size with the smallest difference from the data size as the block size of the target block.

In a specific implementation of the present application, after obtaining the preset optional block sizes, the method further includes:

determining whether the data size is less than or equal to the maximum value among the optional block sizes;

if so, performing the step of selecting the block size of the target block from the optional block sizes that are larger than the data size; and

if not, determining the maximum value among the optional block sizes as the block size of the target block.
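The selection rule laid out in the preceding paragraphs can be sketched as a small helper. This is an illustrative sketch only; the function name and example sizes are assumptions, and a data size exactly equal to an optional size is treated here as fitting into that block.

```python
def determine_block_size(data_size: int, optional_sizes: list) -> int:
    """Pick the optional block size with the smallest difference from data_size
    among those that can hold one whole item; if data_size exceeds the maximum
    optional size, fall back to that maximum (the item then spans several blocks)."""
    candidates = [size for size in optional_sizes if size >= data_size]
    if candidates:                      # data size <= maximum optional size
        return min(candidates)          # smallest difference from the data size
    return max(optional_sizes)          # data larger than every block: use the maximum

sizes = [256, 512, 1024, 2048, 4096]                 # preset optional block sizes (example)
print(determine_block_size(785, sizes))              # 1024: each item fits in one block
print(determine_block_size(3200, [256, 512, 1024, 2048]))  # 2048: each item spans 2 blocks
```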

According to a second aspect of the embodiments of the present application, a data reading method is provided, including:

receiving a data read command;

reading each item of data of the target data set from continuous target blocks of the same size in the hard disk, wherein the block size of each target block is determined according to the data size of each item of data, and the size of each item of data in the target data set is the same; and

returning the read data to the sender of the data read command.

In a specific implementation of the present application, after reading each item of data of the target data set from the continuous target blocks of the same size in the hard disk and before returning the read data to the sender of the data read command, the method further includes:

performing a second preprocessing operation on the read data, wherein the second preprocessing operation is a preprocessing operation that increases the data size.

In a specific implementation of the present application, performing the second preprocessing operation on the read data includes:

performing a normalization preprocessing operation on the read data.
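A minimal sketch of this read-time second preprocessing, using the 28x28 example from this description: pixel intensities stored as single bytes (0 to 255) are expanded to 4-byte floating-point values in [0, 1] only after they have been read from the hard disk. NumPy and the function name are assumptions for illustration, not the disclosed accelerator circuit.

```python
import numpy as np

def second_preprocessing(raw_item: bytes) -> np.ndarray:
    """Read-time normalization: each stored byte (0-255) becomes a 4-byte
    float32 in [0, 1], so the data only grows 4x after it leaves the disk."""
    pixels = np.frombuffer(raw_item, dtype=np.uint8).reshape(28, 28)
    return pixels.astype(np.float32) / 255.0

stored = bytes(range(256)) * 3 + bytes(16)      # 784 bytes, as stored on the hard disk
normalized = second_preprocessing(stored)
print(len(stored), normalized.nbytes)           # 784 vs 3136: disk copy is 1/4 the size
```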

In a specific implementation of the present application, receiving the data read command includes:

receiving a data read command for reading a target data set used for artificial intelligence model training.

In a specific implementation of the present application, reading each item of data of the target data set from the continuous target blocks of the same size in the hard disk includes:

when the block size of the target block is greater than or equal to the data size, reading each item of data of the target data set from the continuous target blocks of the same size in the hard disk according to a one-to-one relationship between the target blocks and the data items.

In a specific implementation of the present application, reading each item of data of the target data set from the continuous target blocks of the same size in the hard disk includes:

when the block size of the target block is smaller than the data size, reading each item of data of the target data set from the continuous target blocks of the same size in the hard disk according to a many-to-one relationship between the target blocks and each data item, wherein each data item is pre-stored in adjacent continuous blocks.
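Both reading cases can be sketched with one helper: when a block holds a whole item, item i sits in block i; when an item is larger than a block, it occupies ceil(data_size / block_size) adjacent blocks and is read in a single contiguous range. This is an illustrative sketch against an ordinary file, not the claimed accelerator implementation; all names are made up for the example.

```python
import math

def read_item(disk_path: str, index: int, data_size: int, block_size: int) -> bytes:
    """Read item `index` of the target data set from consecutive, same-size blocks.
    One-to-one case: block_size >= data_size, so item i occupies block i.
    Many-to-one case: the item spans ceil(data_size / block_size) adjacent blocks."""
    blocks_per_item = math.ceil(data_size / block_size)
    offset = index * blocks_per_item * block_size   # items are laid out back to back
    with open(disk_path, "rb") as disk:
        disk.seek(offset)
        return disk.read(data_size)                 # one contiguous read, no block searching

# Setup: write three 785-byte items into 1024-byte blocks, then read item 2.
with open("dataset.bin", "wb") as f:
    for i in range(3):
        f.seek(i * 1024)
        f.write(bytes([i]) * 785)
    f.truncate(3 * 1024)

print(len(read_item("dataset.bin", index=2, data_size=785, block_size=1024)))  # 785
```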

According to a third aspect of the embodiments of the present application, a data storage device is provided, including:

a data set receiving module, used for receiving a target data set to be stored;

a data size acquisition module, used for acquiring the data size of each item of data in the target data set, wherein the size of each item of data in the target data set is the same; and

a data storage module, used for storing each item of data in the target data set into continuous target blocks of the same size in a hard disk, wherein the block size of each target block is determined according to the data size.

According to a fourth aspect of the embodiments of the present application, a data reading device is provided, including:

a read command receiving module, used for receiving a data read command;

a data reading module, used for reading each item of data of the target data set from continuous target blocks of the same size in the hard disk, wherein the block size of each target block is determined according to the data size of each item of data, and the size of each item of data in the target data set is the same; and

a data return module, used for returning the read data to the sender of the data read command.

According to a fifth aspect of the embodiments of the present application, an electronic device is provided, including:

a memory, used for storing a computer program; and

a processor, used for implementing the steps of the above data storage method or data reading method when executing the computer program.

According to a sixth aspect of the embodiments of the present application, a non-volatile readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of the above data storage method or data reading method are implemented.

The data storage method provided in the embodiments of the present application receives a target data set to be stored; obtains the data size of each item of data in the target data set, wherein the size of each item of data in the target data set is the same; and stores each item of data in the target data set into continuous target blocks of the same size in a hard disk, wherein the block size of each target block is determined according to the data size.

It can be seen from the above technical solution that, by setting the block size of the target blocks of the hard disk according to the fixed size of each item of data in the target data set to be stored, it is ensured that every item of data in the target data set is stored in continuous blocks of the hard disk. Data storage is thus greatly optimized, and when reading, the data can be read directly from the continuous blocks of the hard disk, which saves data reading time and improves data reading and writing efficiency.

Correspondingly, the embodiments of the present application also provide a data storage device, a data reading method and device, a device and a non-volatile storage medium corresponding to the above data storage method, which have the above technical effects and are not repeated here.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本申请实施例或相关技术中的技术方案，下面将对实施例或相关技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art may obtain other drawings based on these drawings without creative effort.

图1为本申请一些实施例中数据存储方法的一种实施流程图;FIG1 is a flowchart of an implementation method of data storage in some embodiments of the present application;

图2为本申请一些实施例中数据存储方法的另一种实施流程图;FIG2 is another implementation flow chart of the data storage method in some embodiments of the present application;

图3为本申请一些实施例中数据读取方法的一种实施流程图;FIG3 is a flowchart of an implementation of a data reading method in some embodiments of the present application;

图4为本申请一些实施例中数据读取方法的另一种实施流程图;FIG4 is another implementation flow chart of the data reading method in some embodiments of the present application;

图5为本申请一些实施例中一种手写辨识数据集图档示意图;FIG5 is a schematic diagram of a handwriting recognition data set image file in some embodiments of the present application;

图6为本申请一些实施例中一种对手写辨识数据集图档归一化后的示意图;FIG6 is a schematic diagram of a normalized handwriting recognition data set image file in some embodiments of the present application;

图7为本申请一些实施例中一种数据集图档分类示意图;FIG7 is a schematic diagram of a data set image classification in some embodiments of the present application;

图8为本申请一些实施例中一种数据存储装置的结构框图;FIG8 is a structural block diagram of a data storage device in some embodiments of the present application;

图9为本申请一些实施例中一种数据读取装置的结构框图;FIG9 is a structural block diagram of a data reading device in some embodiments of the present application;

图10为本申请一些实施例中一种电子设备的结构框图;FIG10 is a structural block diagram of an electronic device in some embodiments of the present application;

图11为本申请一些实施例中提供的一种电子设备的具体结构示意图。FIG. 11 is a schematic diagram of a specific structure of an electronic device provided in some embodiments of the present application.

具体实施方式DETAILED DESCRIPTION OF THE EMBODIMENTS

现有的数据存储方法，储存在硬盘中的每笔数据，作业系统、档案系统是不保证其连续性的，也就是说每个数据，因为配合硬盘与档案系统规划的区块大小，会被切割为数据区块，无法保证连续，实际上是常常不连续地储存在硬盘之中。读取数据的部分仍需要CPU(Central Processing Unit，中央处理器)，主记忆体与硬盘I/O(Input/Output，输入/输出)系统加上相关软体与作业系统一起完成数据读写，这代表GPU(Graphics Processing Unit，图形处理器)、CPU、主记忆体与硬盘I/O以及相关软体与作业系统彼此需要频繁地沟通与传输非地址连续的数据，也就是说用户在应用层要存取一个数据，实际上需要执行很多作业系统、档案系统的程序，并且需要在硬盘中花费时间搜寻该数据的所有数据区块，以组合还原成原来的档案。从而导致数据集的读写对整体人工智能训练与推论的效能有可能造成极大的负面影响，进而导致读写效能下降。In existing data storage methods, the operating system and the file system do not guarantee the continuity of each piece of data stored on the hard disk. That is to say, each piece of data is cut into data blocks to match the block size planned by the hard disk and the file system, its continuity cannot be guaranteed, and in practice it is often stored discontinuously on the hard disk. Reading the data still requires the CPU (Central Processing Unit), the main memory and the hard disk I/O (Input/Output) system, together with the related software and the operating system, to complete data reads and writes. This means that the GPU (Graphics Processing Unit), CPU, main memory, hard disk I/O, related software and operating system need to communicate frequently with each other and transfer data whose addresses are not contiguous. In other words, for a user to access one piece of data at the application layer, many operating system and file system routines actually need to be executed, and time must be spent searching the hard disk for all the data blocks of that data so that they can be reassembled into the original file. As a result, reading and writing the data set may have a very large negative impact on the overall performance of artificial intelligence training and inference, degrading read and write performance.

为此,本申请中提供的数据存储方法中,保证目标数据集中各项数据存储至硬盘连续区块,使得数据存储得到较大优化,在数据读取时能够从硬盘连续区块中直接读取,节省了数据读取的时间,提升了数据读写效率。To this end, in the data storage method provided in the present application, it is ensured that all data in the target data set are stored in continuous blocks of the hard disk, so that data storage is greatly optimized. When reading data, it can be read directly from continuous blocks of the hard disk, saving data reading time and improving data reading and writing efficiency.

为了使本技术领域的人员更好地理解本申请方案,下面结合附图和具体实施方式对本申请作进一步的详细说明。显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。In order to enable those skilled in the art to better understand the present application, the present application is further described in detail below in conjunction with the accompanying drawings and specific implementation methods. Obviously, the described embodiments are only part of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by ordinary technicians in the field without making creative work are within the scope of protection of the present application.

参见图1,图1为本申请实施例中数据存储方法的一种实施流程图,该方法可以包括以下步骤:Referring to FIG. 1 , FIG. 1 is a flowchart of an implementation method of a data storage method in an embodiment of the present application. The method may include the following steps:

S101:接收待存储的目标数据集。S101: Receive a target data set to be stored.

当需要存储预先获取到的目标数据集(DataSet)时,向加速器发送待存储的目标数据集,如可以通过CPU或GPU向加速器发送待存储的目标数据集,加速器接收待存储的目标数据集。When a pre-acquired target data set (DataSet) needs to be stored, the target data set to be stored is sent to the accelerator. For example, the target data set to be stored may be sent to the accelerator via a CPU or a GPU, and the accelerator receives the target data set to be stored.

目标数据集可以为用于人工智能机器学习训练的数据集,如可以为用于训练图像识别模型的数据集,也可以为用于物品推荐模型的数据集,等等。目标数据集中的数据类型可以为图片、文字、语音等。The target dataset can be a dataset used for artificial intelligence machine learning training, such as a dataset used for training an image recognition model, or a dataset used for an item recommendation model, etc. The data types in the target dataset can be pictures, text, voice, etc.

CPU主机或GPU主机与加速器之间可以通过实体连线连接,也可以通过网路连接,本申请实施例对此不做限定。The CPU host or GPU host and the accelerator may be connected via a physical connection or via a network, which is not limited in the embodiments of the present application.

S102:获取目标数据集中每项数据的数据大小。S102: Obtain the data size of each item of data in the target data set.

其中,目标数据集中各项数据的大小相同。Among them, the sizes of each data item in the target data set are the same.

在接收到待存储的目标数据集之后,可以对目标数据集进行预处理,变换为可用的训练数据或测试数据,使得待存储的目标数据集中各项数据的大小相同,获取目标数据集中每项数据的数据大小。After receiving the target data set to be stored, the target data set can be preprocessed and transformed into usable training data or test data so that the size of each data item in the target data set to be stored is the same, and the data size of each data item in the target data set is obtained.

以深度学习的图片预处理为例。给定一张压缩过的图片,通常会进行以下一个或多个预处理步骤:Take image preprocessing for deep learning as an example. Given a compressed image, one or more of the following preprocessing steps are usually performed:

(1)图片解码(Image decode):将压缩的图片解码,彩色的图会解码分为R(Red,红色)、G(Green,绿色)、B(Blue,蓝色)三个像素通道的图片储存。有些模型演算法后续会需要针对R、G、B其中一个或多个通道进行训练;(1) Image decoding: Decode the compressed image. Color images are decoded and stored in three pixel channels: R (Red), G (Green), and B (Blue). Some model algorithms will need to be trained on one or more of the R, G, and B channels later;

(2)灰度转换(Grayscale conversion)：灰度转换只是将图像从彩色转换为黑白。它通常用于降低人工智能算法中的计算复杂度。由于大多数图片不需要识别颜色，因此使用灰度转换是明智的，它减少了图像中的像素数量，从而减少了所需的计算量；(2) Grayscale conversion: Grayscale conversion simply converts an image from color to black and white. It is often used to reduce computational complexity in artificial intelligence algorithms. Since most images do not require color recognition, it is wise to use grayscale conversion, which reduces the number of pixels in the image and thus reduces the amount of computation required;

(3)归一化(Normalization):归一化是将图像数据像素(强度)投影到预定义范围的过程,通常为(0,1)或(-1,1),但不同演算法有不同的定义,其目的是提高所有图像的公平性。例如,将所有图像缩放到[0,1]或[-1,1]的相等范围允许所有图像对总损失做出同等贡献,而不是当其他图像具有高像素和低像素范围时分别是强损失和弱损失。归一化的目的还包括提供标准学习率由于高像素图像需要低学习率,而低像素图像需要高学习率,重新缩放有助于为所有图像提供标准学习率;(3) Normalization: Normalization is the process of projecting image data pixels (intensities) to a predefined range, usually (0, 1) or (-1, 1), but different algorithms have different definitions. Its purpose is to improve fairness for all images. For example, scaling all images to an equal range of [0, 1] or [-1, 1] allows all images to contribute equally to the total loss, rather than having strong and weak losses when other images have high and low pixel ranges, respectively. The purpose of normalization also includes providing a standard learning rate. Since high-pixel images require a low learning rate, while low-pixel images require a high learning rate, rescaling helps provide a standard learning rate for all images.

(4)数据增强(Data Augmentation):数据增强是在不收集新数据的情况下对现有数据进行微小改动以增加其多样性的过程。这是一种用于扩大数据集的技术。标准数据增强技术包括水平和垂直翻转、旋转、裁剪、剪切等。执行数据增强有助于防止神经网络学习不相关的特征,提升模型性能;(4) Data Augmentation: Data augmentation is the process of making small changes to existing data to increase its diversity without collecting new data. This is a technique used to expand a dataset. Standard data augmentation techniques include horizontal and vertical flipping, rotation, cropping, shearing, etc. Performing data augmentation helps prevent neural networks from learning irrelevant features and improves model performance.

(5)标准化(Image standardization):标准化是一种缩放和预处理图像使其具有相似或一致性高度和宽度的方法。人工智能的训练、测试、推论时,如果图像的尺寸是一致的,则处理起来效率更高。 (5) Image standardization: Standardization is a method of scaling and preprocessing images to make them have similar or consistent height and width. When training, testing, and inference of artificial intelligence, if the size of the image is consistent, the processing will be more efficient.
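
As a minimal illustrative sketch only (assuming Python with the Pillow and NumPy libraries, neither of which is prescribed by this application), steps (1), (2), (3) and (5) above could be combined as follows; the function name and the 224x224 target size are hypothetical:

```python
from PIL import Image
import numpy as np

def preprocess_image(path, size=(224, 224)):
    # (1) decode the compressed picture and (2) convert it to grayscale
    img = Image.open(path).convert("L")
    # (5) standardize: rescale every picture to a consistent height and width
    img = img.resize(size)
    # before normalization each point is one byte with intensity 0~255
    pixels = np.asarray(img, dtype=np.uint8)
    # (3) normalize: project intensities into the predefined range [0, 1]
    return pixels.astype(np.float32) / 255.0
```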

S103:将目标数据集中各项数据存储至硬盘中连续且大小相同的各目标区块;其中,各目标区块的区块大小根据数据大小确定。S103: storing each data item in the target data set into consecutive target blocks of the same size in the hard disk; wherein the block size of each target block is determined according to the data size.

预先设置可供选择的最小写入区块大小,如通常可以设定为256位元组(Byte)、512位元组、1024位元组、2048位元组、4096位元组等,有些固态硬盘可以支持更大范围的区块大小。在获取到目标数据集中每项数据的数据大小之后,根据数据大小确定目标区块的区块大小。如在确定存在大于等于数据大小的可选择区块大小时,从大于等于数据大小的各可选择区块大小中选择与数据大小最接近的区块大小。The minimum write block size available for selection is pre-set, such as 256 bytes, 512 bytes, 1024 bytes, 2048 bytes, 4096 bytes, etc. Some solid state drives can support a larger range of block sizes. After obtaining the data size of each data item in the target data set, the block size of the target block is determined according to the data size. If it is determined that there is a selectable block size greater than or equal to the data size, the block size closest to the data size is selected from the selectable block sizes greater than or equal to the data size.

在根据数据大小确定目标区块的区块大小之后,将目标数据集中各项数据存储至硬盘中连续且大小相同的各目标区块。相较于现有的读取不连续地储存在硬盘中的数据,本申请在数据读取时能够从硬盘连续区块中直接读取,节省了数据读取的时间,提升了数据读写效率。After determining the block size of the target block according to the data size, the data in the target data set is stored in each target block of the same size in the hard disk. Compared with the existing method of reading data stored discontinuously in the hard disk, the present application can directly read from the continuous blocks of the hard disk when reading data, saving the time of data reading and improving the efficiency of data reading and writing.

由上述技术方案可知,通过依据待存储的目标数据集中每项数据的固定大小设定硬盘的目标区块的区块大小,保证目标数据集中各项数据存储至硬盘连续区块。使得数据存储得到较大优化,在数据读取时能够从硬盘连续区块中直接读取,节省了数据读取的时间,提升了数据读写效率。It can be seen from the above technical solution that by setting the block size of the target block of the hard disk according to the fixed size of each data item in the target data set to be stored, it is ensured that each data item in the target data set is stored in a continuous block of the hard disk. This greatly optimizes data storage, and when reading data, it can be directly read from the continuous block of the hard disk, saving data reading time and improving data reading and writing efficiency.

需要说明的是,基于上述实施例,本申请实施例还提供了相应的改进方案。在后续实施例中涉及与上述实施例中相同步骤或相应步骤之间可相互参考,相应的有益效果也可相互参照,在下文的改进实施例中不再一一赘述。It should be noted that, based on the above embodiments, the embodiments of the present application also provide corresponding improved solutions. In the subsequent embodiments, the same steps or corresponding steps as those in the above embodiments can be referenced to each other, and the corresponding beneficial effects can also be referenced to each other, which will not be repeated one by one in the following improved embodiments.

参见图2,图2为本申请实施例中数据存储方法的另一种实施流程图,该方法可以包括以下步骤:Referring to FIG. 2 , FIG. 2 is another implementation flow chart of the data storage method in an embodiment of the present application. The method may include the following steps:

S201:接收待存储的目标数据集。S201: Receive a target data set to be stored.

在本申请的一种具体实施方式中,步骤S201可以包括以下步骤:In a specific implementation of the present application, step S201 may include the following steps:

接收待存储的用于人工智能模型训练的目标数据集。Receive a target data set to be stored for artificial intelligence model training.

当需要训练人工智能模型时,预先收集用于人工智能模型训练的目标数据集,并将用于人工智能模型训练的目标数据集发送给加速器,加速器接收待存储的用于人工智能模型训练的目标数据集。When an artificial intelligence model needs to be trained, a target data set for artificial intelligence model training is collected in advance and sent to an accelerator, which receives the target data set for artificial intelligence model training to be stored.

目标数据集可以包含训练集、验证集和测试集。The target dataset can contain training set, validation set and test set.

首先,模型在训练集(training dataset)上进行拟合。对于监督式学习,训练集是由用来拟合参数(例如人工神经网络中神经元之间连接的权重)的样本组成的集合。在实践中,训练集通常是由输入向量和输出向量组成的数据对。其中输出向量被称为目标。在训练过程中,当前模型会对训练集中的每个样本进行预测,并将预测结果与目标进行比较。根据比较的结果,学习算法会更新模型的参数。模型拟合的过程可能同时包括特征选择和参数估计。First, the model is fitted on a training dataset. For supervised learning, the training set is a collection of examples used to fit parameters (such as the weights of the connections between neurons in an artificial neural network). In practice, the training set is usually a data pair consisting of an input vector and an output vector. The output vector is called the target. During training, the current model makes predictions for each example in the training set and compares the predictions with the target. Based on the results of the comparison, the learning algorithm updates the parameters of the model. The process of model fitting may include both feature selection and parameter estimation.

接下来,拟合得到的模型会在验证集(validation dataset)上进行预测。在对模型的超参数(例如神经网络中隐藏层的神经元数量)进行调整时,验证集提供了对在训练集上拟合得到模型的无偏评估。验证集可用于正则化中的提前停止,即在验证集误差上升时(此为在训练集上过拟合的信号),停止训练。Next, the fitted model is used to make predictions on a validation dataset. The validation set provides an unbiased evaluation of the model fitted on the training set when tuning the model's hyperparameters (e.g., the number of neurons in the hidden layer of a neural network). The validation set can be used for early stopping in regularization, i.e., stopping training when the validation set error rises (a sign of overfitting on the training set).

最后,测试集(test dataset)可被用来提供对最终模型的无偏评估。若测试集在训练过程中从未用到(例如,没有被用在交叉验证当中),则它也被称之为预留集。Finally, the test dataset can be used to provide an unbiased evaluation of the final model. If the test dataset is never used during training (for example, not used in cross-validation), it is also called a holdout set.

很多人工智能的算法只需使用训练集作为训练，之后使用测试集或验证集作为训练完成的测试，之后便部署进行推论。也就是说测试集或验证集可以是同一集合。Many artificial intelligence algorithms only need to use the training set for training, then use the test set or validation set to verify that training is complete, and are then deployed for inference. That is to say, the test set and the validation set can be the same set.

在本申请的一种具体实施方式中,在步骤S201之后,在步骤S202之前,该方法还可以包括以下步骤:In a specific implementation of the present application, after step S201 and before step S202, the method may further include the following steps:

对目标数据集进行第一预处理操作;其中,第一预处理操作为未增加数据大小的预处理操作。A first preprocessing operation is performed on the target data set; wherein the first preprocessing operation is a preprocessing operation that does not increase the data size.

在接收到待存储的目标数据集之后，对目标数据集进行第一预处理操作。在将目标数据集存储至加速器的过程中，第一预处理操作包括部分预处理，一般是进行未增加数据大小的一些预处理操作，如对待识别的图片进行水平翻转、垂直翻转、旋转等。从而避免数据存储过程中的数据预处理导致数据增大，节省存储空间，降低成本。After receiving the target data set to be stored, a first preprocessing operation is performed on the target data set. In the process of storing the target data set to the accelerator, the first preprocessing operation includes partial preprocessing, generally preprocessing operations that do not increase the data size, such as horizontal flipping, vertical flipping, or rotation of the images to be recognized. This avoids data growth caused by preprocessing during data storage, saving storage space and reducing cost.

在本申请的一种具体实施方式中,对目标数据集进行第一预处理操作,可以包括以下步骤:In a specific implementation of the present application, performing a first preprocessing operation on the target data set may include the following steps:

对目标数据集进行除归一化预处理之外的预处理操作。Perform preprocessing operations on the target dataset except normalization preprocessing.

由于归一化处理得到的浮点数值需要占用更多的存储空间,因此在接收到待存储的目标数据集之后,对目标数据集进行预处理时,对目标数据集进行除归一化预处理之外的预处理操作,从而节省存储空间。Since the floating-point values obtained by normalization processing need to occupy more storage space, after receiving the target data set to be stored, when preprocessing the target data set, preprocessing operations other than normalization preprocessing are performed on the target data set, thereby saving storage space.

S202:获取目标数据集中每项数据的数据大小。S202: Obtain the data size of each item of data in the target data set.

其中,目标数据集中各项数据的大小相同。Among them, the sizes of each data item in the target data set are the same.

在本申请的一种具体实施方式中,步骤S202可以包括以下步骤:In a specific implementation of the present application, step S202 may include the following steps:

获取目标数据集中由数据本身、数据标签以及数据档名构成的每项数据的数据大小。Get the data size of each data item in the target data set, which consists of the data itself, data label, and data file name.

目标数据集中每项数据除了包含数据本身外,还包括数据标签以及数据档名,从而由数据本身、数据标签以及数据档名共同构成一项完整数据。在接收到待存储的目标数据集之后,获取目标数据集中由数据本身、数据标签以及数据档名构成的每项数据的数据大小。In addition to the data itself, each data item in the target data set also includes a data tag and a data file name, so that the data itself, the data tag and the data file name together constitute a complete data item. After receiving the target data set to be stored, the data size of each data item in the target data set consisting of the data itself, the data tag and the data file name is obtained.

数据标签为数据项对应的参考标准,数据档名为对数据项进行唯一标识的标识信息。The data label is the reference standard corresponding to the data item, and the data file name is the identification information that uniquely identifies the data item.
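
Purely as an illustration of how the data itself, the data label and the data file name can form one fixed-size item (the application does not prescribe a concrete layout), a hypothetical packed record might look like this; the field widths shown are assumptions:

```python
import struct

# hypothetical field widths: 784-byte payload, 1-byte label, 32-byte file name
RECORD_FORMAT = "784s B 32s"
ITEM_SIZE = struct.calcsize(RECORD_FORMAT)  # 817 bytes, identical for every item

def pack_item(payload: bytes, label: int, file_name: str) -> bytes:
    # the packed item always has the same size, which is what later allows
    # a single fixed block size to be chosen for the whole data set
    return struct.pack(RECORD_FORMAT, payload, label, file_name.encode("utf-8"))
```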

S203:获取预设的各可选区块大小。S203: Obtain preset optional block sizes.

预先设置可供选择的最小写入区块大小,如以固态硬盘(Solid State Disk,SSD)为例,通常可以设定为256位元组(Byte)、512位元组、1024位元组、2048位元组、4096位元组等,有些固态硬盘可以支持更大范围的区块大小。在获取到目标数据集中每项数据的数据大小之后,获取预设的各可选区块大小。The minimum write block size that can be selected is preset. For example, for a solid state drive (SSD), it can usually be set to 256 bytes, 512 bytes, 1024 bytes, 2048 bytes, 4096 bytes, etc. Some solid state drives can support a larger range of block sizes. After obtaining the data size of each data item in the target data set, obtain the preset optional block sizes.

S204:判断数据大小是否小于等于各可选区块大小中的最大值,若数据大小小于等于各可选区块大小中的最大值,则执行步骤S205,若数据大小大于各可选区块大小中的最大值,则执行步骤S206。S204: Determine whether the data size is less than or equal to the maximum value of the optional block sizes. If the data size is less than or equal to the maximum value of the optional block sizes, execute step S205; if the data size is greater than the maximum value of the optional block sizes, execute step S206.

在获取到预设的各可选区块大小之后,判断数据大小是否小于等于各可选区块大小中的最大值,若是数据大小小于等于各可选区块大小中的最大值,则说明存在可选择的区块大小,使得每个数据项仅存储在一个完整区块中,执行步骤S205,若数据大小大于各可选区块大小中的最大值,则说明每项数据的大小已超过硬盘最大的支持区块大小,则需要多个区块才能容纳每个数据项,执行步骤S206。After obtaining the preset optional block sizes, determine whether the data size is less than or equal to the maximum value of the optional block sizes. If the data size is less than or equal to the maximum value of the optional block sizes, it means that there is a selectable block size so that each data item is stored in only one complete block, and step S205 is executed. If the data size is greater than the maximum value of the optional block sizes, it means that the size of each data item has exceeded the maximum supported block size of the hard disk, and multiple blocks are required to accommodate each data item, and step S206 is executed.

S205:从大于数据大小的各可选区块大小中选取得到目标区块的区块大小。S205: Selecting a block size of a target block from various optional block sizes that are larger than the data size.

当确定数据大小小于等于各可选区块大小中的最大值时,说明存在可选择的区块大小,使得每个数据项仅存储在一个完整区块中,从大于数据大小的各可选区块大小中选取得到目标区块的区块大小。从而实现了目标数据集中各项数据存储至硬盘连续单独区块,实现目标数据集写入硬盘的最佳化,使得在后续数据读取时能够从硬盘连续区块中直接读取,节省了数据读取的时间,提升了数据读写效率。When it is determined that the data size is less than or equal to the maximum value of the optional block sizes, it means that there is an optional block size, so that each data item is stored in only one complete block, and the block size of the target block is selected from the optional block sizes that are larger than the data size. This enables the storage of each data item in the target data set to a continuous separate block on the hard disk, and optimizes the writing of the target data set to the hard disk, so that subsequent data reading can be directly read from the continuous blocks on the hard disk, saving data reading time and improving data reading and writing efficiency.

在本申请的一种具体实施方式中,步骤S205可以包括以下步骤:In a specific implementation of the present application, step S205 may include the following steps:

步骤一:从大于数据大小的各可选区块大小中选取与数据大小差值最小的可选区块大小;Step 1: Select the optional block size with the smallest difference with the data size from the optional block sizes that are larger than the data size;

步骤二:将与数据大小差值最小的可选区块大小确定为目标区块的区块大小。Step 2: Determine the optional block size with the smallest difference with the data size as the block size of the target block.

为方便描述,可以将上述两个步骤结合起来进行说明。For the convenience of description, the above two steps can be combined for explanation.

当确定数据大小小于等于各可选区块大小中的最大值时,从大于数据大小的各可选区块大小中选取与数据大小差值最小的可选区块大小,将与数据大小差值最小的可选区块大小确定为目标区块的区块大小。从而实现了目标数据集中各项数据存储至硬盘连续单独区块,使得在后续数据读取时能够从硬盘连续区块中直接读取,节省了数据读取的时间,提升了数据读写效率。When it is determined that the data size is less than or equal to the maximum value of each optional block size, the optional block size with the smallest difference from the data size is selected from the optional block sizes larger than the data size, and the optional block size with the smallest difference from the data size is determined as the block size of the target block. This enables the storage of each data in the target data set to a continuous separate block on the hard disk, so that subsequent data reading can be directly read from the continuous blocks on the hard disk, saving data reading time and improving data reading and writing efficiency.

例如，当每项数据的大小为3136位元组时，若可供选择的区块大小为256位元组、512位元组、1024位元组、2048位元组、4096位元组等，则确定目标区块大小为4096位元组。For example, when the size of each data item is 3136 bytes, if the selectable block sizes are 256 bytes, 512 bytes, 1024 bytes, 2048 bytes, 4096 bytes, and so on, the target block size is determined to be 4096 bytes.

S206:将各可选区块大小中的最大值确定为目标区块的区块大小。S206: Determine the maximum value among the optional block sizes as the block size of the target block.

当确定数据大小大于各可选区块大小中的最大值时,说明每项数据的大小已超过硬盘最大的支持区块大小,则需要多个区块才能容纳每个数据项,在这种情况下,将各可选区块大小中的最大值确定为目标区块的区块大小。以手写辨识数据集每项大小3200位元组为例,当可选区块大小中的最大值为2048位元组时,选择设定目标区块的区块大小为2048位元组,虽然每项数据存储至多个区块中,但在数据读取时仍是读取连续区块,效能依然很高。When it is determined that the data size is larger than the maximum value of each optional block size, it means that the size of each data item has exceeded the maximum supported block size of the hard disk, and multiple blocks are required to accommodate each data item. In this case, the maximum value of each optional block size is determined as the block size of the target block. Taking the handwriting recognition data set as an example, each item size is 3200 bytes. When the maximum value of the optional block size is 2048 bytes, the block size of the target block is set to 2048 bytes. Although each data item is stored in multiple blocks, the data is still read in continuous blocks when it is read, and the performance is still very high.
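
A minimal sketch of the block-size decision in steps S203 to S206, using the selectable sizes mentioned above as an assumed input; the two checks at the end reproduce the 3136-byte and 3200-byte examples from the description:

```python
def choose_block_size(item_size, selectable=(256, 512, 1024, 2048, 4096)):
    """Return the block size used for every target block."""
    candidates = [s for s in selectable if s >= item_size]
    if candidates:
        # S205: the selectable size with the smallest difference from the
        # item size, so that every item fits in exactly one block
        return min(candidates)
    # S206: the item exceeds the largest supported block size, so several
    # adjacent blocks will hold one item
    return max(selectable)

# examples taken from the description above
assert choose_block_size(3136) == 4096
assert choose_block_size(3200, selectable=(256, 512, 1024, 2048)) == 2048
```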

S207:将目标数据集中各项数据存储至硬盘中连续且大小相同的各目标区块;其中,各目标区块的区块大小根据数据大小确定。S207: storing each data item in the target data set into consecutive target blocks of the same size in the hard disk; wherein the block size of each target block is determined according to the data size.

参见图3,图3为本申请实施例中数据读取方法的一种实施流程图,该方法可以包括以下步骤:Referring to FIG. 3 , FIG. 3 is a flowchart of an implementation of a data reading method in an embodiment of the present application. The method may include the following steps:

S301:接收数据读取命令。S301: Receive a data read command.

在将目标数据集存储至加速器之后,当需要读取目标数据集时,向加速器发送数据读取命令,如CPU或GPU向加速器发送数据读取命令,加速器接收数据读取命令。After the target data set is stored in the accelerator, when the target data set needs to be read, a data read command is sent to the accelerator, such as the CPU or GPU sending the data read command to the accelerator, and the accelerator receives the data read command.

S302:从硬盘中连续且大小相同的各目标区块中读取目标数据集的每项数据;其中,各目标区块的区块大小根据每项数据的数据大小确定,且目标数据集中各项数据的大小相同。S302: reading each data item of the target data set from each consecutive target block of the hard disk with the same size; wherein the block size of each target block is determined according to the data size of each data item, and the size of each data item in the target data set is the same.

预先存储在加速器的目标数据集在存储前会通过对目标数据集进行预处理,使得目标数据集中各项数据的大小相同,并根据目标数据集中每项数据的数据大小确定目标区块的区块大小,从而使得目标数据集在连续区块中存储。加速器在接收到数据读取命令之后,从硬盘中连续且大小相同的各目标区块中读取目标数据集的每项数据。通过从连续区块中读取目标数据集的各项数据,较大地提升了数据读取速率。The target data set pre-stored in the accelerator is pre-processed before storage so that the size of each data item in the target data set is the same, and the block size of the target block is determined according to the data size of each data item in the target data set, so that the target data set is stored in a continuous block. After receiving the data read command, the accelerator reads each data item of the target data set from each continuous and same-sized target block in the hard disk. By reading each data item of the target data set from continuous blocks, the data reading rate is greatly improved.

S303:将读取到的各项数据返回给数据读取命令的发送端。S303: Return the read data to the sender of the data read command.

在从硬盘中目标区块的区块大小的各连续区块中读取目标数据集的每项数据之后,将读取到的各项数据返回给数据读取命令的发送端,从而完成对目标数据集中各项数据的快速读取。After reading each item of data in the target data set from each continuous block of the block size of the target block in the hard disk, the read data is returned to the sender of the data read command, thereby completing the fast reading of each item of data in the target data set.

发送端一般为与加速器进行数据读写交互的主机CPU或主机GPU。The sending end is generally the host CPU or host GPU that interacts with the accelerator to read and write data.

参见图4,图4为本申请实施例中数据读取方法的另一种实施流程图,该方法可以包括以下步骤:Referring to FIG. 4 , FIG. 4 is another implementation flow chart of the data reading method in the embodiment of the present application. The method may include the following steps:

S401:接收数据读取命令。S401: Receive a data read command.

在本申请的一种具体实施方式中,步骤S401可以包括以下步骤:In a specific implementation of the present application, step S401 may include the following steps:

接收读取用于人工智能模型训练的目标数据集的数据读取命令。Receive a data read command to read a target data set for artificial intelligence model training.

预先存储至加速器的目标数据集可以为用于人工智能模型训练的数据集,当需要训练人工智能模型时,向加速器发送读取用于人工智能模型训练的目标数据集的数据读取命令。加速器接收读取用于人工智能模型训练的目标数据集的数据读取命令。The target data set pre-stored in the accelerator may be a data set for artificial intelligence model training. When the artificial intelligence model needs to be trained, a data read command for reading the target data set for artificial intelligence model training is sent to the accelerator. The accelerator receives the data read command for reading the target data set for artificial intelligence model training.

S402:从硬盘中连续且大小相同的各目标区块中读取目标数据集的每项数据;其中,各目标区块的区块大小根据每项数据的数据大小确定,且目标数据集中各项数据的大小相同。S402: Read each data item of the target data set from each consecutive target block of the hard disk with the same size; wherein the block size of each target block is determined according to the data size of each data item, and the size of each data item in the target data set is the same.

在本申请的一种具体实施方式中,步骤S402可以包括以下步骤:In a specific implementation of the present application, step S402 may include the following steps:

当目标区块的区块大小大于等于数据大小时,按照目标区块与每项数据的一对一关系,从硬盘中连续且大小相同的各目标区块中读取目标数据集的每项数据。When the block size of the target block is greater than or equal to the data size, each data item of the target data set is read from each consecutive target block of the same size in the hard disk according to the one-to-one relationship between the target block and each data item.

当目标区块的区块大小大于等于数据大小时,说明在数据存储时每项数据存储在一个存储区块中,按照目标区块与每项数据的一对一关系,从硬盘中连续且大小相同的各目标区块中读取目标数据集的每项数据,从而实现对目标数据集中各项数据的快速读取。When the block size of the target block is greater than or equal to the data size, it means that each data item is stored in a storage block during data storage. According to the one-to-one relationship between the target block and each data item, each data item of the target data set is read from each continuous and same-sized target block in the hard disk, thereby realizing fast reading of each data item in the target data set.

在本申请的一种具体实施方式中,步骤S402可以包括以下步骤:In a specific implementation of the present application, step S402 may include the following steps:

当目标区块的区块大小小于数据大小时,按照目标区块与每项数据的多对一关系,从硬盘中连续且大小相同的各目标区块中读取目标数据集的每项数据;其中,每项数据预先存储在相邻连续区块中。 When the block size of the target block is smaller than the data size, each data item of the target data set is read from each consecutive target block of the same size in the hard disk according to the many-to-one relationship between the target block and each data item; wherein each data item is pre-stored in adjacent consecutive blocks.

当目标区块的区块大小小于数据大小时,说明在数据存储时每项数据存储在相邻连续区块中,按照目标区块与每项数据的多对一关系,从硬盘中连续且大小相同的各目标区块中读取目标数据集的每项数据。从而实现对目标数据集中各项数据的连续读取,提升了数据读取效率。When the block size of the target block is smaller than the data size, it means that each data item is stored in adjacent continuous blocks during data storage. According to the many-to-one relationship between the target block and each data item, each data item of the target data set is read from each continuous and same-sized target block in the hard disk. This enables continuous reading of each data item in the target data set, improving data reading efficiency.
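
Both read paths reduce to one sequential read starting at an offset that can be computed directly from the item index; a sketch, assuming a file-like block device object (an illustrative assumption, not part of this application):

```python
import math

def read_item(device, index, item_size, block_size):
    # blocks occupied by one item: 1 in the one-to-one case
    # (block size >= data size), several adjacent blocks otherwise
    blocks_per_item = math.ceil(item_size / block_size)
    # items sit in consecutive, equally sized target blocks, so the start
    # of item `index` is a simple multiplication rather than a lookup
    device.seek(index * blocks_per_item * block_size)
    return device.read(blocks_per_item * block_size)[:item_size]
```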

S403:对读取到的各项数据进行第二预处理操作;其中,第二预处理操作为增加数据大小的预处理操作。S403: performing a second preprocessing operation on each item of the read data; wherein the second preprocessing operation is a preprocessing operation for increasing the size of the data.

在读取到目标数据集的各项数据之后,对读取到的各项数据进行增加数据大小的第二预处理操作。CPU主机或GPU主机系统需要读取数据集时,加速器先由硬盘连续区块读取先前部分预处理数据集,再进行写入时没有进行的余下所有预处理,最后再将完成所有预处理的数据集传送回CPU主机或GPU主机系统,从而写入的部分预处理(不增加大小的处理)由CPU主机或GPU主机系统移到加速器,进一步释放CPU主机或GPU主机,实现更配合调整预处理的步骤与次序,减少数据存储时数据集的大小。After reading each data of the target data set, a second preprocessing operation to increase the data size is performed on each data set read. When the CPU host or GPU host system needs to read a data set, the accelerator first reads the previously partially preprocessed data set from the continuous blocks of the hard disk, and then performs all the remaining preprocessing that was not performed when writing, and finally transmits the data set that has completed all preprocessing back to the CPU host or GPU host system, so that the written part of the preprocessing (processing that does not increase the size) is moved from the CPU host or GPU host system to the accelerator, further freeing up the CPU host or GPU host, achieving a more coordinated adjustment of the preprocessing steps and order, and reducing the size of the data set when storing data.

通过在数据存储时不进行增加数据大小的第一预处理操作,在读取数据时,才在加速器中对目标数据集中各项数据进行增加数据大小的第二预处理操作,此过程即体现了计算存储的概念。通过采用计算存储的策略使得数据集读写效能最佳化,也就是用“计算能力”换取空间减少,加速器实现归一化预处理实际上只是经由快速矩阵运算电路之后,再传回数据项给CPU主机或GPU主机,该过程仅需要额外花费少量的计算时间和电路成本,可大幅减少数据集的储存空间。By not performing the first preprocessing operation to increase the data size when storing data, and only performing the second preprocessing operation to increase the data size on each item in the target data set in the accelerator when reading the data, this process embodies the concept of computational storage. By adopting the strategy of computational storage, the read and write performance of the data set is optimized, that is, "computing power" is exchanged for space reduction. The accelerator implements normalization preprocessing, which actually only passes through the fast matrix operation circuit and then returns the data item to the CPU host or GPU host. This process only requires a small amount of additional computing time and circuit cost, which can greatly reduce the storage space of the data set.

存储计算/计算存储(In-storage computing/computing storage)指的是更靠近在存储装置的地方,配置计算元件,而尽量不使用中央处理器的运算能力。以此加速整体存储的效能并卸载中央处理器。In-storage computing/computing storage refers to placing computing components closer to the storage device, while minimizing the use of the CPU's computing power. This speeds up the overall storage performance and offloads the CPU.

在本申请的一种具体实施方式中,步骤S403可以包括以下步骤:In a specific implementation of the present application, step S403 may include the following steps:

对读取到的各项数据进行归一化预处理操作。Perform normalization preprocessing operations on the read data.

在读取到目标数据集的各项数据之后,可以是对读取到的各项数据进行归一化预处理操作。例如,以手写辨识数据集说明,尚未进行归一化预处理的28x28阵列每个点的值,以0~255表示其颜色强度,这代表只需要1个位元组即可表示每个点的值,但归一化预处理后,每个点的浮点数值需要4个位元组。所以在预处理时,先不进行归一化预处理,将经过部分预处理后得到的数据集写入硬盘,但在读取时,在加速器中进行归一化预处理,然后再传回给CPU主机或GPU主机,如此一来,每笔数据项,相较于存储归一化后的浮点数值,本申请在硬盘储存时只用了1/4的大小,所占空间更小。After reading each item of data in the target data set, normalization preprocessing operations may be performed on each item of data read. For example, using the handwriting recognition data set as an example, the value of each point in the 28x28 array that has not yet been normalized and preprocessed is represented by 0 to 255 to represent its color intensity, which means that only 1 byte is needed to represent the value of each point, but after normalization preprocessing, the floating-point value of each point requires 4 bytes. Therefore, during preprocessing, normalization preprocessing is not performed first, and the data set obtained after partial preprocessing is written to the hard disk, but during reading, normalization preprocessing is performed in the accelerator, and then transmitted back to the CPU host or GPU host. In this way, each data item, compared to storing the normalized floating-point value, only 1/4 of the size is used when storing on the hard disk in this application, and the space occupied is smaller.
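
The space saving can be reproduced with a short NumPy sketch (illustrative only; the stand-in bytes below replace an item actually read from consecutive blocks):

```python
import numpy as np

stored_bytes = bytes(784)  # stand-in for one 28x28 item read from the hard disk

raw = np.frombuffer(stored_bytes, dtype=np.uint8).reshape(28, 28)
print(raw.nbytes)         # 784: one byte per point, the size kept on the hard disk

# normalization is deferred to read time in the accelerator, so the
# 4-bytes-per-point floating point representation never reaches the hard disk
normalized = raw.astype(np.float32) / 255.0
print(normalized.nbytes)  # 3136: four times the stored size
```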

S404:将读取到的各项数据返回给数据读取命令的发送端。S404: Return the read data to the sender of the data read command.

在一种具体实例应用中,参见图5和图6,图5为本申请实施例中一种手写辨识数据集图档示意图,图6为本申请实施例中一种对手写辨识数据集图档归一化后的示意图。以著名入门的0~9手写辨识数据集(MNIST handwritten digit database)为例,图5左方是一个“9”的28x28的灰阶单色图片,右方显示28x28阵列每个图像点的值(尚未进行归一化预处理),以0~255表示其颜色强度。图6左方是图片,在经过归一化预处理(每个阵列元素除以255)之后,最后以一个28x28的浮点数阵列表示。In a specific example application, see Figures 5 and 6. Figure 5 is a schematic diagram of a handwriting recognition data set image file in an embodiment of the present application, and Figure 6 is a schematic diagram of a handwriting recognition data set image file after normalization in an embodiment of the present application. Taking the famous entry-level 0-9 handwriting recognition data set (MNIST handwritten digit database) as an example, the left side of Figure 5 is a 28x28 grayscale monochrome image of "9", and the right side shows the value of each image point in the 28x28 array (normalization preprocessing has not yet been performed), with its color intensity represented by 0 to 255. The left side of Figure 6 is a picture, which is finally represented by a 28x28 floating point array after normalization preprocessing (each array element is divided by 255).

在一种具体实例应用中，参见图7，图7为本申请实施例中一种数据集图档分类示意图。即使是相对入门简单的MNIST数据集也有训练集60000项、测试数据10000项。需要对这70000笔进行预处理，每个数字的影像以28x28的阵列储存，而且每一个皆有标签(label)注记其真实的数字，如果标签范围是0~9，则只需多用一个位元组(Byte)储存标签注记。后续的模型训练会视训练过程频繁读取60000项(训练集)，每批次训练完成后，也可能多次读取10000项(测试集)确认测试成功与否。In a specific example application, see FIG. 7, which is a schematic diagram of a data set image classification in an embodiment of the present application. Even the relatively simple entry-level MNIST data set has 60,000 training items and 10,000 test items. These 70,000 records need to be preprocessed. The image of each digit is stored as a 28x28 array, and each one has a label annotating its true digit; if the label range is 0 to 9, only one extra byte is needed to store the label annotation. Subsequent model training will read the 60,000 items (training set) frequently during the training process, and after each batch of training is completed, the 10,000 items (test set) may also be read multiple times to confirm whether the test is successful.

在一种具体实例应用中,CPU/GPU主机系统0/1/2分别对数据集存取加速器0/1/2进行写入与读取,多路资料集配合更进阶的资料集硬体核心架构写入流程如下: In a specific example application, the CPU/GPU host system 0/1/2 writes and reads the data set access accelerator 0/1/2 respectively. The writing process of multiple data sets with a more advanced data set hardware core architecture is as follows:

(1)CPU/GPU主机系统0/1/2把“完全不”进行预处理的数据集直接传送给对应的数据集存取加速器0/1/2;(1) The CPU/GPU host system 0/1/2 directly transmits the data set that is “not pre-processed at all” to the corresponding data set access accelerator 0/1/2;

(2)资料集存取加速器0/1/2收到写入数据集,先进行不会导致数据变大的部分预处理,如此写入的数据集所需存储空间最小,此步骤将CPU/GPU的部分预处理运算转移到加速器;(2) Data set access accelerator 0/1/2 receives the written data set and first performs partial preprocessing that does not cause the data to become larger. In this way, the written data set requires the least storage space. This step transfers part of the preprocessing operations of the CPU/GPU to the accelerator;

(3)判断经过部分预处理的数据集的大小(含标签或档名);(3) Determine the size of the partially preprocessed data set (including labels or file names);

(4)依据每项数据集的大小,设定硬盘最佳化的最小读取区块;(4) Setting the minimum read block size for hard disk optimization based on the size of each data set;

(5)将所有数据集依序写入硬盘连续区块。(5) Write all data sets sequentially to continuous blocks on the hard disk.
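
A condensed sketch of the write flow above; the `disk` object and the `partial_preprocess` callable are illustrative assumptions, and `choose_block_size` reuses the hypothetical helper from the earlier sketch:

```python
def accelerator_write(dataset, disk, partial_preprocess,
                      selectable=(256, 512, 1024, 2048, 4096)):
    # (2) partial preprocessing that does not enlarge the data; the
    # `partial_preprocess` callable is an assumed argument returning bytes
    items = [partial_preprocess(item) for item in dataset]
    # (3) size of one partially preprocessed item (with its label or file name)
    item_size = len(items[0])
    # (4) set the hard disk's optimal minimum block size for that item size,
    # reusing the hypothetical `choose_block_size` helper sketched earlier
    block_size = choose_block_size(item_size, selectable)
    blocks_per_item = -(-item_size // block_size)  # ceiling division
    # (5) write every item sequentially into consecutive target blocks
    for i, item in enumerate(items):
        disk.seek(i * blocks_per_item * block_size)
        disk.write(item.ljust(blocks_per_item * block_size, b"\x00"))
    return block_size
```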

多路数据集加速读取流程如下:The process of accelerating reading of multi-channel data sets is as follows:

(1)由传输介质收到CPU/GPU主机系统0/1/2设定读取预处理的设定与参数，此步骤通常只需执行一次便适用于所有数据项的读取；(1) Receive, via the transmission medium, the read-preprocessing settings and parameters configured by the CPU/GPU host system 0/1/2; this step usually only needs to be performed once and then applies to the reading of all data items;

(2)由传输介质收到CPU/GPU主机系统读取预处理数据集的命令;(2) receiving a command from the CPU/GPU host system to read the preprocessed data set through the transmission medium;

(3)依据命令由硬盘连续区块读取之前写入的部分预处理数据集；(3) According to the command, read the previously written partially preprocessed data set from consecutive blocks of the hard disk;

(4)进行写入时未处理的"余下所有预处理"，完成人工智能模型所需的全部预处理；(4) Perform all the remaining preprocessing that was not done at write time, completing all the preprocessing required by the artificial intelligence model;

(5)回传已完成全部预处理的数据集到CPU/GPU主机系统0/1/2。(5) The pre-processed data set is transmitted back to the CPU/GPU host system 0/1/2.
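
And the matching read flow, again only as a sketch; `full_preprocess` is an assumed callable standing in for the remaining preprocessing (for example normalization) that was skipped at write time:

```python
def accelerator_read(disk, indices, item_size, block_size, full_preprocess):
    blocks_per_item = -(-item_size // block_size)  # ceiling division
    results = []
    for i in indices:
        # (3) read the previously written, partially preprocessed item
        #     from consecutive blocks according to the command
        disk.seek(i * blocks_per_item * block_size)
        raw = disk.read(blocks_per_item * block_size)[:item_size]
        # (4) perform the remaining preprocessing not done at write time
        results.append(full_preprocess(raw))
    # (5) the fully preprocessed items go back to the CPU/GPU host system
    return results
```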

相应于上面的数据存储方法实施例,本申请还提供了一种数据存储装置,下文描述的数据存储装置与上文描述的数据存储方法可相互对应参照。Corresponding to the above data storage method embodiment, the present application also provides a data storage device. The data storage device described below and the data storage method described above can be referenced to each other.

参见图8,图8为本申请实施例中一种数据存储装置的结构框图,该装置可以包括:Referring to FIG. 8 , FIG. 8 is a structural block diagram of a data storage device in an embodiment of the present application, and the device may include:

数据集接收模块81,用于接收待存储的目标数据集;The data set receiving module 81 is used to receive the target data set to be stored;

数据大小获取模块82,用于获取目标数据集中每项数据的数据大小;其中,目标数据集中各项数据的大小相同;The data size acquisition module 82 is used to acquire the data size of each item of data in the target data set; wherein the size of each item of data in the target data set is the same;

数据存储模块83,用于将目标数据集中各项数据存储至硬盘中连续且大小相同的各目标区块;其中,各目标区块的区块大小根据数据大小确定。The data storage module 83 is used to store each data in the target data set into each continuous target block of the same size in the hard disk; wherein the block size of each target block is determined according to the data size.

由上述技术方案可知,通过依据待存储的目标数据集中每项数据的固定大小设定硬盘的目标区块的区块大小,保证目标数据集中各项数据存储至硬盘连续区块。使得数据存储得到较大优化,在数据读取时能够从硬盘连续区块中直接读取,节省了数据读取的时间,提升了数据读写效率。It can be seen from the above technical solution that by setting the block size of the target block of the hard disk according to the fixed size of each data item in the target data set to be stored, it is ensured that each data item in the target data set is stored in a continuous block of the hard disk. This greatly optimizes data storage, and when reading data, it can be directly read from the continuous block of the hard disk, saving data reading time and improving data reading and writing efficiency.

在本申请的一种具体实施方式中,该装置还可以包括:In a specific embodiment of the present application, the device may further include:

第一预处理模块,用于在接收待存储的目标数据集之后,获取目标数据集中每项数据的数据大小之前,对目标数据集进行第一预处理操作;其中,第一预处理操作为未增加数据大小的预处理操作。The first preprocessing module is used to perform a first preprocessing operation on the target data set after receiving the target data set to be stored and before obtaining the data size of each data item in the target data set; wherein the first preprocessing operation is a preprocessing operation that does not increase the data size.

在本申请的一种具体实施方式中,第一预处理模块具体为对目标数据集进行除归一化预处理之外的预处理操作的模块。In a specific implementation of the present application, the first preprocessing module is specifically a module that performs preprocessing operations on the target data set except for normalization preprocessing.

在本申请的一种具体实施方式中,数据集接收模块81具体为接收待存储的用于人工智能模型训练的目标数据集的模块。In a specific embodiment of the present application, the data set receiving module 81 is specifically a module for receiving a target data set to be stored for artificial intelligence model training.

在本申请的一种具体实施方式中,数据大小获取模块82具体为获取目标数据集中由数据本身、数据标签以及数据档名构成的每项数据的数据大小的模块。In a specific implementation of the present application, the data size acquisition module 82 is specifically a module for acquiring the data size of each data item in the target data set, which is composed of the data itself, the data label, and the data file name.

在本申请的一种具体实施方式中，该装置还可以包括区块大小确定模块，区块大小确定模块包括：In a specific implementation of the present application, the device may further include a block size determination module, and the block size determination module includes:

可选区块大小获取子模块,用于获取预设的各可选区块大小;The optional block size acquisition submodule is used to obtain preset optional block sizes;

区块大小选取子模块,用于从大于数据大小的各可选区块大小中选取得到目标区块的区块大小。The block size selection submodule is used to select the block size of the target block from various optional block sizes that are larger than the data size.

在本申请的一种具体实施方式中,区块大小选取子模块包括:In a specific implementation of the present application, the block size selection submodule includes:

区块大小选取单元，用于从大于数据大小的各可选区块大小中选取与数据大小差值最小的可选区块大小；The block size selection unit is used to select, from the optional block sizes that are larger than the data size, the optional block size with the smallest difference from the data size;

区块大小确定单元,用于将与数据大小差值最小的可选区块大小确定为目标区块的区块大小。The block size determining unit is used to determine the optional block size with the smallest difference with the data size as the block size of the target block.

在本申请的一种具体实施方式中,该装置还可以包括:In a specific embodiment of the present application, the device may further include:

判断模块,用于在获取预设的各可选区块大小之后,判断数据大小是否小于等于各可选区块大小中的最大值;A judging module, used for judging whether the data size is less than or equal to the maximum value of the optional block sizes after obtaining the preset optional block sizes;

区块大小选取子模块具体为当确定数据大小小于等于各可选区块大小中的最大值时,从大于数据大小的各可选区块大小中选取得到目标区块的区块大小的模块;The block size selection submodule is specifically a module for selecting the block size of the target block from the optional block sizes that are larger than the data size when it is determined that the data size is less than or equal to the maximum value of the optional block sizes;

区块大小确定模块具体为当确定数据大小大于各可选区块大小中的最大值时,将各可选区块大小中的最大值确定为目标区块的区块大小的模块。The block size determination module is specifically a module that determines the maximum value among the optional block sizes as the block size of the target block when it is determined that the data size is greater than the maximum value among the optional block sizes.

相应于上面的数据读取方法实施例,本申请还提供了一种数据读取装置,下文描述的数据读取装置与上文描述的数据读取方法可相互对应参照。Corresponding to the above data reading method embodiment, the present application further provides a data reading device. The data reading device described below and the data reading method described above can refer to each other.

参见图9,图9为本申请实施例中一种数据读取装置的结构框图,该装置可以包括:Referring to FIG. 9 , FIG. 9 is a structural block diagram of a data reading device in an embodiment of the present application, and the device may include:

读取命令接收模块91,用于接收数据读取命令;A read command receiving module 91, used for receiving a data read command;

数据读取模块92,用于从硬盘中连续且大小相同的各目标区块中读取目标数据集的每项数据;其中,各目标区块的区块大小根据每项数据的数据大小确定,且目标数据集中各项数据的大小相同;The data reading module 92 is used to read each data item of the target data set from each target block of the hard disk that is continuous and has the same size; wherein the block size of each target block is determined according to the data size of each data item, and the size of each data item in the target data set is the same;

数据返回模块93,用于将读取到的各项数据返回给数据读取命令的发送端。The data return module 93 is used to return the read data to the sending end of the data read command.

在本申请的一种具体实施方式中,该装置还可以包括:In a specific embodiment of the present application, the device may further include:

第二预处理模块,用于在从硬盘中连续且大小相同的各目标区块中读取目标数据集的每项数据之后,将读取到的各项数据返回给数据读取命令的发送端之前,对读取到的各项数据进行第二预处理操作;其中,第二预处理操作为增加数据大小的预处理操作。The second preprocessing module is used to perform a second preprocessing operation on each item of data read from each consecutive target block of the same size in the hard disk before returning the read data to the sending end of the data reading command; wherein the second preprocessing operation is a preprocessing operation to increase the data size.

在本申请的一种具体实施方式中,第二预处理模块具体为对读取到的各项数据进行归一化预处理操作的模块。In a specific implementation of the present application, the second preprocessing module is specifically a module that performs normalization preprocessing operations on each piece of data read.

在本申请的一种具体实施方式中,读取命令接收模块91具体为接收读取用于人工智能模型训练的目标数据集的数据读取命令的模块。In a specific embodiment of the present application, the read command receiving module 91 is specifically a module that receives a data read command to read a target data set for artificial intelligence model training.

在本申请的一种具体实施方式中,数据读取模块92具体为当目标区块的区块大小大于等于数据大小时,按照目标区块与每项数据的一对一关系,从硬盘中连续且大小相同的各目标区块中读取目标数据集的每项数据的模块。In a specific embodiment of the present application, the data reading module 92 is specifically a module that reads each data item of the target data set from consecutive target blocks of the same size in the hard disk according to a one-to-one relationship between the target block and each data item when the block size of the target block is greater than or equal to the data size.

在本申请的一种具体实施方式中,数据读取模块92具体为当目标区块的区块大小小于数据大小时,按照目标区块与每项数据的多对一关系,从硬盘中连续且大小相同的各目标区块中读取目标数据集的每项数据的模块;其中,每项数据预先存储在相邻连续区块中。In a specific embodiment of the present application, the data reading module 92 is specifically a module that reads each data item of the target data set from consecutive target blocks of the same size in the hard disk according to the many-to-one relationship between the target block and each data item when the block size of the target block is smaller than the data size; wherein each data item is pre-stored in adjacent consecutive blocks.

相应于上面的方法实施例,参见图10,图10为本申请所提供的电子设备的示意图,该设备可以包括:Corresponding to the above method embodiment, referring to FIG. 10 , FIG. 10 is a schematic diagram of an electronic device provided by the present application, and the device may include:

存储器332,用于存储计算机程序;A memory 332, for storing computer programs;

处理器322,用于执行计算机程序时实现上述方法实施例的数据存储方法或数据读取方法的步骤。The processor 322 is used to implement the steps of the data storage method or the data reading method of the above method embodiment when executing a computer program.

具体的,请参考图11,图11为本实施例提供的一种电子设备的具体结构示意图,该电子设备可因配置或性能不同而产生比较大的差异,可以包括处理器(central processing units,CPU)322(例如,一个或一个以上处理器)和存储器332,存储器332存储有一个或一个以上的计算机应用程序342或数据344。其中,存储器332可以是短暂存储或持久存储。存储在存储器332的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对数据处理设备中的一系列指令操作。可选地,处理器322可以设置为与存储器332通信,在电子设备301上执行存储器332中的一系列指令操作。 Specifically, please refer to Figure 11, which is a schematic diagram of the specific structure of an electronic device provided in this embodiment. The electronic device may have relatively large differences due to different configurations or performances, and may include a processor (central processing units, CPU) 322 (for example, one or more processors) and a memory 332, and the memory 332 stores one or more computer applications 342 or data 344. Among them, the memory 332 can be a temporary storage or a permanent storage. The program stored in the memory 332 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the data processing device. Optionally, the processor 322 can be configured to communicate with the memory 332 to execute a series of instruction operations in the memory 332 on the electronic device 301.

电子设备301还可以包括一个或一个以上电源326,一个或一个以上有线或无线网络接口350,一个或一个以上输入输出接口358,和/或,一个或一个以上操作系统341。The electronic device 301 may further include one or more power supplies 326 , one or more wired or wireless network interfaces 350 , one or more input and output interfaces 358 , and/or one or more operating systems 341 .

上文所描述的数据存储方法或数据读取方法中的步骤可以由电子设备的结构实现。The steps in the data storage method or data reading method described above can be implemented by the structure of an electronic device.

相应于上面的方法实施例,本申请还提供一种非易失性可读存储介质,非易失性可读存储介质上存储有计算机程序,计算机程序被处理器执行时可实现如下步骤:Corresponding to the above method embodiment, the present application further provides a non-volatile readable storage medium, on which a computer program is stored. When the computer program is executed by a processor, the following steps can be implemented:

接收待存储的目标数据集;获取目标数据集中每项数据的数据大小;其中,目标数据集中各项数据的大小相同;将目标数据集中各项数据存储至硬盘中连续且大小相同的各目标区块;其中,各目标区块的区块大小根据数据大小确定。Receive a target data set to be stored; obtain the data size of each data item in the target data set; wherein the sizes of each data item in the target data set are the same; store each data item in the target data set into target blocks of the same size that are continuous in the hard disk; wherein the block size of each target block is determined according to the data size.

该非易失性可读存储介质可以包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。The non-volatile readable storage medium may include: a U disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and other media that can store program codes.

对于本申请提供的非易失性可读存储介质的介绍请参照上述方法实施例,本申请在此不做赘述。For an introduction to the non-volatile readable storage medium provided in this application, please refer to the above method embodiment, and this application will not go into details here.

本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其它实施例的不同之处,各个实施例之间相同或相似部分互相参见即可。对于实施例公开的装置、设备及非易失性存储介质而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。In this specification, each embodiment is described in a progressive manner, and each embodiment focuses on the differences from other embodiments. The same or similar parts between the embodiments can be referred to each other. For the devices, equipment and non-volatile storage media disclosed in the embodiments, since they correspond to the methods disclosed in the embodiments, the description is relatively simple, and the relevant parts can be referred to the method part description.

本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的技术方案及其核心思想。应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以对本申请进行若干改进和修饰,这些改进和修饰也落入本申请权利要求的保护范围内。Specific examples are used herein to illustrate the principles and implementation methods of the present application, and the description of the above embodiments is only used to help understand the technical solution and core ideas of the present application. It should be pointed out that for ordinary technicians in this technical field, without departing from the principles of the present application, several improvements and modifications can be made to the present application, and these improvements and modifications also fall within the scope of protection of the claims of the present application.

工业实用性Industrial Applicability

本申请实施例提供的方案可应用于数据处理技术领域,在本申请实施例中,采用接收待存储的目标数据集;获取目标数据集中每项数据的数据大小,其中,目标数据集中各项数据的大小相同;将目标数据集中各项数据存储至硬盘中连续且大小相同的各目标区块,其中,各目标区块的区块大小根据数据大小确定,取得了节省数据读取的时间,提升数据读写效率的技术效果。 The solution provided by the embodiment of the present application can be applied to the field of data processing technology. In the embodiment of the present application, a target data set to be stored is received; the data size of each data item in the target data set is obtained, wherein the size of each data item in the target data set is the same; and each data item in the target data set is stored in each continuous target block of the same size in the hard disk, wherein the block size of each target block is determined according to the data size, thereby achieving the technical effect of saving data reading time and improving data reading and writing efficiency.

Claims (22)

一种数据存储方法,其特征在于,包括:A data storage method, comprising: 接收待存储的目标数据集;receiving a target data set to be stored; 获取所述目标数据集中每项数据的数据大小,其中,所述目标数据集中各项数据的大小相同;Acquire the data size of each item of data in the target data set, wherein the size of each item of data in the target data set is the same; 将所述目标数据集中各项数据存储至硬盘中连续且大小相同的各目标区块,其中,各所述目标区块的区块大小根据所述数据大小确定。The data in the target data set are stored in target blocks of the same size and continuous in the hard disk, wherein the block size of each target block is determined according to the data size. 根据权利要求1所述的数据存储方法,其特征在于,在接收待存储的目标数据集之后,获取所述目标数据集中每项数据的数据大小之前,还包括:The data storage method according to claim 1, characterized in that after receiving the target data set to be stored and before obtaining the data size of each data item in the target data set, it also includes: 对所述目标数据集进行第一预处理操作,其中,所述第一预处理操作为未增加数据大小的预处理操作。A first preprocessing operation is performed on the target data set, wherein the first preprocessing operation is a preprocessing operation that does not increase the data size. 根据权利要求2所述的数据存储方法,其特征在于,对所述目标数据集进行第一预处理操作,包括:The data storage method according to claim 2, characterized in that performing a first preprocessing operation on the target data set comprises: 对所述目标数据集进行除归一化预处理之外的预处理操作。The target data set is subjected to a preprocessing operation other than normalization preprocessing. 根据权利要求2所述的数据存储方法,其特征在于,所述目标数据集中的数据类型为图片的情况下,对所述目标数据集进行第一预处理操作,还包括:The data storage method according to claim 2, characterized in that when the data type in the target data set is a picture, performing a first preprocessing operation on the target data set further comprises: 对所述图片进行如下至少之一处理操作:水平翻转、垂直翻转以及旋转。Perform at least one of the following processing operations on the image: horizontal flipping, vertical flipping, and rotation. 根据权利要求1所述的数据存储方法,其特征在于,接收待存储的目标数据集,包括:The data storage method according to claim 1, wherein receiving a target data set to be stored comprises: 接收待存储的用于人工智能模型训练的目标数据集。Receive a target data set to be stored for artificial intelligence model training. 根据权利要求5所述的数据存储方法,其中,所述目标数据集包括:训练集、验证集和测试集,其中,The data storage method according to claim 5, wherein the target data set comprises: a training set, a validation set and a test set, wherein: 所述训练集,设置为对所述人工智能模型进行拟合;The training set is configured to fit the artificial intelligence model; 所述验证集,设置为对拟合完成的所述人工智能模型进行预测;The validation set is set to make predictions on the artificial intelligence model that has been fitted; 所述测试集,设置为对最终的额所述人工智能模型进行评估。The test set is set to evaluate the final artificial intelligence model. 根据权利要求1所述的数据存储方法,其特征在于,获取所述目标数据集中每项数据的数据大小,包括:The data storage method according to claim 1, characterized in that obtaining the data size of each item of data in the target data set comprises: 获取所述目标数据集中由数据本身、数据标签以及数据档名构成的每项数据的数据大小。The data size of each data item in the target data set, which is composed of the data itself, the data label and the data file name, is obtained. 根据权利要求1至7任一项所述的数据存储方法,其特征在于,还包括根据所述 数据大小确定所述目标区块的区块大小的过程,根据所述数据大小确定所述目标区块的区块大小的过程,包括:The data storage method according to any one of claims 1 to 7, characterized in that it also includes The process of determining the block size of the target block according to the data size includes: 获取预设的各可选区块大小;Get the preset optional block sizes; 从大于所述数据大小的各所述可选区块大小中选取得到所述目标区块的区块大小。The block size of the target block is obtained by selecting from the optional block sizes that are larger than the data size. 
9. The data storage method according to claim 8, characterized in that selecting the block size of the target block from the optional block sizes that are larger than the data size comprises:
selecting, from the optional block sizes that are larger than the data size, the optional block size with the smallest difference from the data size; and
determining the optional block size with the smallest difference from the data size as the block size of the target block.

10. The data storage method according to claim 8, characterized in that, after obtaining the preset optional block sizes, the method further comprises:
judging whether the data size is less than or equal to the maximum of the optional block sizes;
if the data size is less than or equal to the maximum of the optional block sizes, executing the step of selecting the block size of the target block from the optional block sizes that are larger than the data size; and
if the data size is greater than the maximum of the optional block sizes, determining the maximum of the optional block sizes as the block size of the target block.

11. The data storage method according to claim 1, characterized in that receiving the target data set to be stored comprises:
receiving the target data set to be stored through an accelerator, wherein the target data set to be stored is sent to the accelerator by a central processing unit or a graphics processing unit.

12. A data reading method, characterized by comprising:
receiving a data read command;
reading each data item of a target data set from consecutive, equally sized target blocks in a hard disk, wherein a block size of each target block is determined according to the data size of each data item, and all data items in the target data set are the same size; and
returning the read data items to a sending end of the data read command.

13. The data reading method according to claim 12, characterized in that the sending end is a central processing unit or a graphics processing unit that interacts with an accelerator for data reading and writing.

14. The data reading method according to claim 10, characterized in that, after reading each data item of the target data set from the consecutive, equally sized target blocks in the hard disk and before returning the read data items to the sending end of the data read command, the method further comprises:
performing a second preprocessing operation on the read data items, wherein the second preprocessing operation is a preprocessing operation that increases the data size.
15. The data reading method according to claim 14, characterized in that performing the second preprocessing operation on the read data items comprises:
performing a normalization preprocessing operation on the read data items.

16. The data reading method according to claim 12, characterized in that receiving the data read command comprises:
receiving a data read command for reading a target data set used for artificial intelligence model training.

17. The data reading method according to claim 12, characterized in that reading each data item of the target data set from the consecutive, equally sized target blocks in the hard disk comprises:
when the block size of the target block is greater than or equal to the data size, reading each data item of the target data set from the consecutive, equally sized target blocks in the hard disk according to a one-to-one relationship between the target blocks and the data items.

18. The data reading method according to claim 12, characterized in that reading each data item of the target data set from the consecutive, equally sized target blocks in the hard disk comprises:
when the block size of the target block is smaller than the data size, reading each data item of the target data set from the consecutive, equally sized target blocks in the hard disk according to a many-to-one relationship between the target blocks and each data item, wherein each data item is pre-stored in adjacent consecutive blocks.

19. A data storage apparatus, characterized by comprising:
a data set receiving module, configured to receive a target data set to be stored;
a data size acquisition module, configured to obtain a data size of each data item in the target data set, wherein all data items in the target data set are the same size; and
a data storage module, configured to store the data items in the target data set into consecutive, equally sized target blocks in a hard disk, wherein a block size of each target block is determined according to the data size.

20. A data reading apparatus, characterized by comprising:
a read command receiving module, configured to receive a data read command;
a data reading module, configured to read each data item of a target data set from consecutive, equally sized target blocks in a hard disk, wherein a block size of each target block is determined according to the data size of each data item, and all data items in the target data set are the same size; and
a data return module, configured to return the read data items to a sending end of the data read command.
21. An electronic device, characterized by comprising:
a memory, configured to store a computer program; and
a processor, configured to implement, when executing the computer program, the steps of the data storage method according to any one of claims 1 to 11 or the data reading method according to any one of claims 12 to 18.

22. A non-volatile readable storage medium, characterized in that a computer program is stored on the non-volatile readable storage medium, and the computer program, when executed by a processor, implements the steps of the data storage method according to any one of claims 1 to 11 or the data reading method according to any one of claims 12 to 18.
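The sketches below are non-normative illustrations of individual claims above; the preset sizes, function names, and values used are assumptions made for illustration and are not recited in the application. First, the block-size determination of claims 8 and 9: from a preset list of optional block sizes, keep those larger than the data size and take the one with the smallest difference from it.

PRESET_BLOCK_SIZES = [4096, 8192, 16384, 32768, 65536]   # hypothetical presets, in bytes

def pick_block_size(data_size: int, presets=PRESET_BLOCK_SIZES) -> int:
    candidates = [size for size in presets if size > data_size]       # claim 8: sizes larger than the data size
    if not candidates:
        raise ValueError("data size exceeds every preset; see the claim 10 sketch below")
    return min(candidates, key=lambda size: size - data_size)         # claim 9: smallest difference

assert pick_block_size(10_000) == 16_384                              # a 10 000-byte item gets 16 384-byte blocks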
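Under the same assumptions, the next sketch adds the branch of claim 10: if the data size does not exceed the largest preset, the selection above applies; otherwise the largest preset becomes the block size and a single item occupies several adjacent blocks.

import math

PRESET_BLOCK_SIZES = [4096, 8192, 16384, 32768, 65536]   # hypothetical presets, in bytes

def determine_block_size(data_size: int, presets=PRESET_BLOCK_SIZES) -> tuple:
    """Return (block_size, blocks_per_item) for one data item."""
    largest = max(presets)
    if data_size <= largest:                              # claim 10: compare against the largest preset
        candidates = [size for size in presets if size > data_size]
        # When the data size equals the largest preset, the strict "larger than" filter
        # is empty; this sketch falls back to the largest preset in that case.
        return (min(candidates) if candidates else largest), 1
    # Data larger than every preset: use the largest preset and span adjacent blocks.
    return largest, math.ceil(data_size / largest)

assert determine_block_size(10_000) == (16_384, 1)
assert determine_block_size(200_000) == (65_536, 4)       # 200 000 bytes span four 65 536-byte blocks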
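Finally, a sketch of the read-side mapping of claims 17 and 18, assuming the block-aligned layout of the sketches above: with a one-to-one relationship the offset of an item is its index times the block size, and with a many-to-one relationship each item spans a fixed number of adjacent blocks, so the offset scales by that factor. The helper name and file-based layout are illustrative assumptions.

import math

def read_item_from_blocks(path: str, index: int, item_size: int, block_size: int) -> bytes:
    if block_size >= item_size:
        blocks_per_item = 1                                   # claim 17: one target block per data item
    else:
        blocks_per_item = math.ceil(item_size / block_size)   # claim 18: several adjacent blocks per data item
    offset = index * blocks_per_item * block_size             # items are stored in consecutive block groups
    with open(path, "rb") as disk:
        disk.seek(offset)
        return disk.read(item_size)                           # padding in the last block, if any, is not read

For example, 200 000-byte items stored in 65 536-byte blocks occupy four blocks each, so item 3 starts at byte offset 3 * 4 * 65 536.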
PCT/CN2023/094310 2022-10-08 2023-05-15 Data storage method and apparatus, data reading method and apparatus, and device Ceased WO2024074042A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211219584.2A CN115291813B (en) 2022-10-08 2022-10-08 Data storage method and device, data reading method and device, and equipment
CN202211219584.2 2022-10-08

Publications (1)

Publication Number Publication Date
WO2024074042A1 true WO2024074042A1 (en) 2024-04-11

Family

ID=83834698

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/094310 Ceased WO2024074042A1 (en) 2022-10-08 2023-05-15 Data storage method and apparatus, data reading method and apparatus, and device

Country Status (2)

Country Link
CN (1) CN115291813B (en)
WO (1) WO2024074042A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115291813B (en) * 2022-10-08 2023-04-07 苏州浪潮智能科技有限公司 Data storage method and device, data reading method and device, and equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063379A (en) * 2010-12-28 2011-05-18 天津市亚安科技电子有限公司 Data storage method of FLASH memory
CN112394876A (en) * 2019-08-14 2021-02-23 深圳市特思威尔科技有限公司 Large file storage/reading method, storage/reading device and computer equipment
CN115291813A (en) * 2022-10-08 2022-11-04 苏州浪潮智能科技有限公司 Data storage method and device, data reading method and device, and equipment

Also Published As

Publication number Publication date
CN115291813A (en) 2022-11-04
CN115291813B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
US11967151B2 (en) Video classification method and apparatus, model training method and apparatus, device, and storage medium
JP2022058915A (en) Methods and equipment for training image recognition models, methods and equipment for image recognition, electronic devices, storage media, and computer programs.
EP4220555B1 (en) Training method and apparatus for image segmentation model, image segmentation method and apparatus, and device
WO2022116856A1 (en) Model structure, model training method, and image enhancement method and device
WO2021135254A1 (en) License plate number recognition method and apparatus, electronic device, and storage medium
EP4390725A1 (en) Video retrieval method and apparatus, device, and storage medium
CN113343958B (en) Text recognition method, device, equipment and medium
CN111832666B (en) Medical image data amplification method, device, medium, and electronic apparatus
CN112434746B (en) Pre-labeling method based on hierarchical migration learning and related equipment thereof
CN119810428B (en) Light-weight real-time target detection method, device, server and storage medium
CN118470581A (en) Unmanned aerial vehicle cross-modal semantic communication method and system for target detection task
KR20230103790A (en) Adversarial learning-based image correction method and apparatus for deep learning analysis of heterogeneous images
WO2024074042A1 (en) Data storage method and apparatus, data reading method and apparatus, and device
CN114742999A (en) A deep three-network semi-supervised semantic segmentation method and system
CN110717405B (en) Face feature point positioning method, device, medium and electronic equipment
CN116051686B (en) Method, system, device and storage medium for erasing text on a picture
CN115188000A (en) OCR-based text recognition method, device, storage medium and electronic device
CN113596576A (en) Video super-resolution method and device
CN114756425B (en) Intelligent monitoring method, device, electronic device and computer readable storage medium
CN117459773A (en) A smart TV image display method and related devices for cross-device content synchronization
CN116189228A (en) Bovine face detection method, device, computer equipment and readable storage medium
CN114693985A (en) Optimization device and method for image segmentation and prediction device and method
CN119762510B (en) Medical image segmentation method, device, equipment and storage medium
US20230237780A1 (en) Method, device, and computer program product for data augmentation
TWI851149B (en) Data augmentation device, method, and non-transitory computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23874237

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 23874237

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 04/09/2025)