HK40050593B

HK40050593B - Data sampling method, device and computer readable storage medium

Info

Publication number: HK40050593B
Application number: HK42021040424.0A
Authority: HK
Inventors: 袁建伟
Original assignee: 腾讯科技（深圳）有限公司
Filing date: 2021-10-14
Publication date: 2025-07-18

Description

A data sampling method, apparatus, and computer-readable storage medium

Technical Field

This application relates to the field of data processing, and more particularly to a data sampling method, an apparatus, and a computer-readable storage medium.

Background

With the continuous development of computer networks, the amount of data generated in them grows day by day. When big data statistical analysis is to be performed on the full data of a computer network, the full data first needs to be sampled, and the sampled data can then be used to characterize the full data. For example, the data features of the sampled data can be taken as the data features of the full data.

In the prior art, when data generated within a certain time period needs to be sampled, all the data within that time period usually has to be read in advance, and the read data is then sampled at a set sampling rate. Because all the data within the time period must be read in advance, a large data volume within that period places high demands on system capacity.

Summary of the Invention

This application provides a data sampling method, an apparatus, and a computer-readable storage medium that can save system capacity when sampling data.

In one aspect, this application provides a data sampling method, including:

sampling data of a target type at a first sampling rate in a first time window to obtain target sampled data;

merging the target sampled data with historical sampled data in a sampling database to obtain merged sampled data, where a third sampling rate of the merged sampled data is determined according to the first sampling rate and a second sampling rate of the historical sampled data, the historical sampled data is sampled data of the target type obtained by sampling data of the target type before the target sampled data was obtained, the second sampling rate is the sampling rate of the historical sampled data relative to the data of the target type, and the third sampling rate is the sampling rate of the merged sampled data relative to the data of the target type;

when the quantity of the merged sampled data is greater than a sampling quantity threshold, sampling the merged sampled data based on an adaptive sampling parameter to obtain updated historical sampled data, where the sampling quantity threshold is associated with the data of the target type, and the adaptive sampling parameter is used to keep the quantity of the merged sampled data within the sampling quantity threshold; and

replacing the historical sampled data in the sampling database with the updated historical sampled data.

The method further includes:

determining a fourth sampling rate according to the adaptive sampling parameter and the third sampling rate, so that in a second time window the fourth sampling rate is used as the first sampling rate to sample the data of the target type, where the fourth sampling rate is the sampling rate of the updated historical sampled data relative to the data of the target type, and the second time window is the time window following the first time window.
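One iteration of the method above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `sample` and `process_window`, the use of independent random sampling, the repeated shrinking loop, and the reading of the adaptive sampling parameter as a multiplier `shrink` in (0, 1) are all hypothetical choices made for illustration. The sketch also assumes the history was kept at the same rate that is used for the new window, so the merge needs no rate alignment.

```python
import random


def sample(items, rate, rng):
    """Keep each item independently with probability `rate`."""
    return [x for x in items if rng.random() < rate]


def process_window(window_items, history, rate, threshold, shrink, rng):
    """Process one time window and return (updated_history, updated_rate).

    `shrink` stands in for the adaptive sampling parameter; treating it as
    a multiplier in (0, 1) is an assumption of this sketch.
    """
    if not 0 < shrink < 1:
        raise ValueError("shrink must lie strictly between 0 and 1")
    target = sample(window_items, rate, rng)  # first sampling rate
    merged = history + target                 # history is assumed to be at `rate`
    merged_rate = rate                        # third sampling rate
    while len(merged) > threshold:
        merged = sample(merged, shrink, rng)  # thin the merged sample
        merged_rate *= shrink                 # becomes the fourth sampling rate
    return merged, merged_rate


if __name__ == "__main__":
    rng = random.Random(0)
    history, rate = [], 1.0
    for w in range(3):
        window = list(range(w * 1000, (w + 1) * 1000))
        # The rate returned for this window is reused for the next one.
        history, rate = process_window(window, history, rate, 100, 0.5, rng)
        print(w, len(history), rate)
```

Because the returned rate is fed back as the rate for the next window, the stored sample never has to be re-read from the raw data; only the bounded history is carried forward.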

Merging the target sampled data with the historical sampled data in the sampling database to obtain the merged sampled data includes:

when the first sampling rate is greater than the second sampling rate, determining the ratio of the second sampling rate to the first sampling rate as a first ratio sampling rate;

sampling the target sampled data according to the first ratio sampling rate to obtain sampled target sampled data; and

determining the sampled target sampled data and the historical sampled data as the merged sampled data;

when the first sampling rate is less than the second sampling rate, determining the ratio of the first sampling rate to the second sampling rate as a second ratio sampling rate;

sampling the historical sampled data according to the second ratio sampling rate to obtain sampled historical sampled data; and

determining the sampled historical sampled data and the target sampled data as the merged sampled data; and

when the first sampling rate is equal to the second sampling rate, determining the target sampled data and the historical sampled data as the merged sampled data.
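The three merge cases above can be sketched as follows. This is an illustrative sketch only: the function name `merge_samples` and the use of independent random thinning are assumptions. The returned rate is the lower of the two input rates, which follows from the arithmetic (a sample kept at rate r1 and then thinned by r2/r1 is effectively kept at rate r2); the text itself only states that the third sampling rate is determined from the first and second.

```python
import random


def merge_samples(target, r1, history, r2, rng):
    """Merge two samples taken at rates r1 and r2; return (merged, r3).

    The sample with the higher rate is thinned by the ratio of the lower
    rate to the higher rate, so both parts end up at the same rate.
    """
    if r1 > r2:
        ratio = r2 / r1  # first ratio sampling rate
        target = [x for x in target if rng.random() < ratio]
        return history + target, r2
    if r1 < r2:
        ratio = r1 / r2  # second ratio sampling rate
        history = [x for x in history if rng.random() < ratio]
        return history + target, r1
    return history + target, r1  # equal rates: plain union


if __name__ == "__main__":
    rng = random.Random(1)
    merged, r3 = merge_samples(list(range(1000)), 0.5,
                               list(range(1000, 1100)), 0.25, rng)
    print(len(merged), r3)
```

Thinning the higher-rate side, rather than re-sampling the raw data, keeps every item in the merged set at one common rate relative to the original data.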

Sampling the data of the target type at the first sampling rate in the first time window to obtain the target sampled data includes:

determining the second sampling rate of the historical sampled data as the first sampling rate; and

sampling the data of the target type at the first sampling rate within the first time window to obtain the target sampled data.

In this case, merging the target sampled data with the historical sampled data in the sampling database to obtain the merged sampled data includes:

determining the target sampled data and the historical sampled data as the merged sampled data.

When the quantity of the merged sampled data is greater than the sampling quantity threshold, sampling the merged sampled data based on the adaptive sampling parameter to obtain the updated historical sampled data includes:

when the quantity of the merged sampled data is greater than the sampling quantity threshold, obtaining a data identifier string corresponding to the merged sampled data;

mapping the data identifier string into a uniform sampling space to obtain a hash value corresponding to the data identifier string; and

sampling the merged sampled data according to the third sampling rate, the adaptive sampling parameter, and the hash value to obtain the updated historical sampled data.

Sampling the merged sampled data according to the third sampling rate, the adaptive sampling parameter, and the hash value to obtain the updated historical sampled data includes:

obtaining a fourth sampling rate according to the third sampling rate and the adaptive sampling parameter; and

sampling the merged sampled data based on the fourth sampling rate and the hash value to obtain the updated historical sampled data, where the fourth sampling rate is the sampling rate of the updated historical sampled data relative to the data of the target type.
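A hash-based version of this downsampling step might look like the sketch below. Several points are assumptions rather than details from the text: the record layout (a dict with an `id` key), the helper names `uniform_hash` and `downsample`, the choice of SHA-256, the rule that the fourth rate equals the third rate multiplied by the adaptive parameter, and the premise that every record in the merged sample was originally admitted because its hash fell below the third sampling rate.

```python
import hashlib


def uniform_hash(identifier: str) -> float:
    """Map an identifier string to a value in [0, 1)."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # 48 bits are exactly representable as a float, so the result is < 1.
    return int.from_bytes(digest[:6], "big") / 2**48


def downsample(merged, third_rate, adaptive_param):
    """Keep records whose hash falls below the fourth sampling rate.

    Assumes fourth_rate = third_rate * adaptive_param and that `merged`
    already holds only records with uniform_hash(id) < third_rate.
    """
    fourth_rate = third_rate * adaptive_param
    kept = [rec for rec in merged if uniform_hash(rec["id"]) < fourth_rate]
    return kept, fourth_rate


if __name__ == "__main__":
    ids = [f"order-{i}" for i in range(10000)]
    merged = [{"id": i} for i in ids if uniform_hash(i) < 0.5]
    kept, rate = downsample(merged, 0.5, 0.5)
    print(len(merged), len(kept), rate)
```

Under these assumptions the decision for a record depends only on its identifier, so different threads or machines that see the same record make the same keep-or-drop decision.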

The method further includes:

when the quantity of the merged sampled data is less than or equal to the sampling quantity threshold, replacing the historical sampled data in the sampling database with the merged sampled data.

The time window preceding the first time window is a third time window; the first time window and the third time window share an intersection time window; and the historical sampled data is sampled data obtained by sampling the data of the target type within the intersection time window of the third time window.

Sampling the data of the target type at the first sampling rate in the first time window to obtain the target sampled data includes:

sampling the data of the target type at the first sampling rate within the part of the first time window outside the intersection time window to obtain the target sampled data.

In this case, merging the target sampled data with the historical sampled data in the sampling database to obtain the merged sampled data includes:

when the quantity of the historical sampled data is less than a merge quantity threshold and the ratio of the first sampling rate to the second sampling rate is less than a ratio threshold, deleting the historical sampled data and determining the target sampled data as the merged sampled data.

Data of the target type is read from a data stream in the first time window, and field parsing is performed on the read data to obtain initial parsed data;

multiple pieces of field information in the initial parsed data are filtered based on a filtering mechanism to obtain filtered parsed data;

associated field information is added to the filtered parsed data based on a vocabulary association mechanism to obtain sampling service data; and

the sampling service data is sampled at the first sampling rate to obtain the target sampled data.
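The parse, filter, associate, and sample steps above could be sketched as a small streaming pipeline. Everything specific here is hypothetical: the text does not give a record format, so JSON lines are assumed, and the field names, the `KEEP_FIELDS` set, and the `ITEM_CATEGORY` lookup table are invented for illustration.

```python
import json
import random

# Hypothetical filter: the only fields kept after parsing.
KEEP_FIELDS = {"id", "type", "item_id"}
# Hypothetical vocabulary table used to attach an associated field.
ITEM_CATEGORY = {"sku-1": "books", "sku-2": "toys"}


def prepare(raw_line):
    """Parse one raw record, drop unwanted fields, add an associated field."""
    record = json.loads(raw_line)  # field parsing
    filtered = {k: v for k, v in record.items() if k in KEEP_FIELDS}
    filtered["category"] = ITEM_CATEGORY.get(filtered.get("item_id"), "unknown")
    return filtered


def sample_stream(lines, target_type, rate, rng):
    """Yield prepared records of `target_type`, each kept with probability `rate`."""
    for line in lines:
        record = prepare(line)
        if record.get("type") == target_type and rng.random() < rate:
            yield record


if __name__ == "__main__":
    lines = [
        '{"id": "1", "type": "order", "item_id": "sku-1", "debug": "x"}',
        '{"id": "2", "type": "click", "item_id": "sku-2"}',
    ]
    for rec in sample_stream(lines, "order", 1.0, random.Random(0)):
        print(rec)
```

Writing the pipeline as a generator lets records be sampled as they are read, which matches the idea of not holding the full window in memory.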

In the first time window, the data of the target type is sampled at the first sampling rate by a first thread to obtain the target sampled data; and

the target sampled data and the historical sampled data in the sampling database are merged by a second thread to obtain the merged sampled data.
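The split between a sampling thread and a merging thread can be sketched with a queue between them. This is only one possible arrangement: the queue hand-off, the `None` sentinel, the function names, and the `keep` predicate that stands in for sampling at the first sampling rate are assumptions, and the merge shown is a plain append that presumes equal sampling rates.

```python
import queue
import threading


def sampler(windows, out_q, keep):
    """First thread: sample each window and hand the result to the merger."""
    for window in windows:
        out_q.put([x for x in window if keep(x)])
    out_q.put(None)  # sentinel: no more windows


def merger(in_q, history):
    """Second thread: merge each sampled batch into the history."""
    while True:
        batch = in_q.get()
        if batch is None:
            break
        history.extend(batch)


def run(windows, keep):
    """Run both threads to completion and return the merged history."""
    q = queue.Queue()
    history = []
    t1 = threading.Thread(target=sampler, args=(windows, q, keep))
    t2 = threading.Thread(target=merger, args=(q, history))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return history


if __name__ == "__main__":
    # `x % 4 == 0` is a deterministic stand-in for sampling at rate 1/4.
    print(run([range(0, 8), range(8, 16)], lambda x: x % 4 == 0))
```

With this arrangement the sampler can start on the next window while the merger is still processing the previous batch.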

In one aspect, this application provides a data sampling apparatus, including:

a sampling module, configured to sample data of a target type at a first sampling rate in a first time window to obtain target sampled data;

a merging module, configured to merge the target sampled data with historical sampled data in a sampling database to obtain merged sampled data, where a third sampling rate of the merged sampled data is determined according to the first sampling rate and a second sampling rate of the historical sampled data, the historical sampled data is sampled data of the target type obtained by sampling data of the target type before the target sampled data was obtained, the second sampling rate is the sampling rate of the historical sampled data relative to the data of the target type, and the third sampling rate is the sampling rate of the merged sampled data relative to the data of the target type;

an adaptive sampling module, configured to, when the quantity of the merged sampled data is greater than a sampling quantity threshold, sample the merged sampled data based on an adaptive sampling parameter to obtain updated historical sampled data, where the sampling quantity threshold is associated with the data of the target type, and the adaptive sampling parameter is used to keep the quantity of the merged sampled data within the sampling quantity threshold; and

a replacement module, configured to replace the historical sampled data in the sampling database with the updated historical sampled data.

The data sampling apparatus is further configured to:

The merging module includes:

a first ratio determination unit, configured to, when the first sampling rate is greater than the second sampling rate, determine the ratio of the second sampling rate to the first sampling rate as a first ratio sampling rate;

a first ratio sampling unit, configured to sample the target sampled data according to the first ratio sampling rate to obtain sampled target sampled data; and

a first merging unit, configured to determine the sampled target sampled data and the historical sampled data as the merged sampled data.

The merging module includes:

a second ratio determination unit, configured to, when the first sampling rate is less than the second sampling rate, determine the ratio of the first sampling rate to the second sampling rate as a second ratio sampling rate;

a second ratio sampling unit, configured to sample the historical sampled data according to the second ratio sampling rate to obtain sampled historical sampled data;

a second merging unit, configured to determine the sampled historical sampled data and the target sampled data as the merged sampled data; and

a third merging unit, configured to, when the first sampling rate is equal to the second sampling rate, determine the target sampled data and the historical sampled data as the merged sampled data.

The sampling module includes:

a sampling rate determination unit, configured to determine the second sampling rate of the historical sampled data as the first sampling rate; and

a window sampling unit, configured to sample the data of the target type at the first sampling rate within the first time window to obtain the target sampled data.

In this case, the merging module is further specifically configured to:

The adaptive sampling module includes:

a string acquisition unit, configured to, when the quantity of the merged sampled data is greater than the sampling quantity threshold, obtain a data identifier string corresponding to the merged sampled data;

a mapping unit, configured to map the data identifier string into a uniform sampling space to obtain a hash value corresponding to the data identifier string; and

a merged sampling unit, configured to sample the merged sampled data according to the third sampling rate, the adaptive sampling parameter, and the hash value to obtain the updated historical sampled data.

The merged sampling unit includes:

an adaptive sampling rate subunit, configured to obtain a fourth sampling rate according to the third sampling rate and the adaptive sampling parameter; and

a merged sampling subunit, configured to sample the merged sampled data based on the fourth sampling rate and the hash value to obtain the updated historical sampled data, where the fourth sampling rate is the sampling rate of the updated historical sampled data relative to the data of the target type.

The data sampling apparatus is further specifically configured to:

The sampling module is further specifically configured to:

sample the data of the target type at the first sampling rate within the part of the first time window outside the intersection time window to obtain the target sampled data.

The sampling module includes:

a data reading unit, configured to read data of the target type from a data stream in the first time window and perform field parsing on the read data to obtain initial parsed data;

a filtering unit, configured to filter multiple pieces of field information in the initial parsed data based on a filtering mechanism to obtain filtered parsed data;

an association unit, configured to add associated field information to the filtered parsed data based on a vocabulary association mechanism to obtain sampling service data; and

a target sampling unit, configured to sample the sampling service data at the first sampling rate to obtain the target sampled data.

The sampling module is further specifically configured to:

In one aspect, this application provides a computer device, including a memory and a processor, where the memory stores a computer program that, when executed by the processor, causes the processor to perform the method in the foregoing aspect of this application.

In one aspect, this application provides a computer-readable storage medium storing a computer program, where the computer program includes program instructions that, when executed by a processor, cause the processor to perform the method in the foregoing aspect.

In this application, data of a target type can be sampled at a first sampling rate in a first time window to obtain target sampled data. The target sampled data is merged with historical sampled data in a sampling database to obtain merged sampled data, where the third sampling rate of the merged sampled data is determined according to the first sampling rate and the second sampling rate of the historical sampled data, and the historical sampled data is sampled data of the target type obtained by sampling data of the target type before the target sampled data was obtained. When the quantity of the merged sampled data is greater than a sampling quantity threshold, the merged sampled data is sampled based on an adaptive sampling parameter to obtain updated historical sampled data, where the sampling quantity threshold is associated with the data of the target type, and the adaptive sampling parameter is used to keep the quantity of the merged sampled data within the sampling quantity threshold. The historical sampled data in the sampling database is then replaced with the updated historical sampled data. The method proposed in this application therefore samples the sampled data further whenever the quantity of the merged sampled data exceeds the sampling quantity threshold. In this way, both during sampling and after sampling is complete, the quantity of sampled data stored in the sampling database stays within the sampling quantity threshold, which saves system capacity and achieves adaptive adjustment of the sampling rate.

Brief Description of the Drawings

To describe the technical solutions in this application or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of this application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Figure 1 is a schematic diagram of a system architecture provided in this application;

Figure 2a is a schematic diagram of a data sampling scenario provided in this application;

Figure 2b is a schematic diagram of a thread sampling scenario provided in this application;

Figure 3 is a schematic flowchart of a data sampling method provided in this application;

Figure 4 is a schematic diagram of a time window scenario provided in this application;

Figure 5 is a schematic diagram of a scenario for obtaining merged sampled data provided in this application;

Figure 6 is a schematic flowchart of obtaining updated historical sampled data provided in this application;

Figure 7 is another schematic flowchart of obtaining updated historical sampled data provided in this application;

Figure 8 is a schematic diagram of another data sampling scenario provided in this application;

Figure 9 is a schematic structural diagram of a data sampling apparatus provided in this application;

Figure 10 is a schematic structural diagram of a computer device provided in this application.

Detailed Description

The technical solutions of this application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.

Refer to Figure 1, which is a schematic diagram of a system architecture provided in this application. As shown in Figure 1, the system architecture includes device cluster 400a, server 100, and device cluster 400b. Device cluster 400a may include multiple terminal devices, specifically terminal devices 200a, 200b, and 200c. A terminal device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a mobile internet device (MID), or a wearable device (for example, a smart watch or a smart band). Device cluster 400b may include multiple sampling devices. A sampling device may be a server, so the sampling devices specifically include servers 300a, 300b, and 300c. The terminal devices in device cluster 400a generate the data to be sampled, and server 100 stores that data. The sampling devices in device cluster 400b obtain the data to be sampled from server 100 and sample it.

Refer also to Figure 2a, which is a schematic diagram of a data sampling scenario provided in this application. As shown in Figure 2a, device cluster 106a includes multiple terminal devices (device cluster 106a is equivalent to device cluster 400a in Figure 1). Each terminal device can respond to a user's operation instructions and generate corresponding data. For example, a terminal device can respond to a user's instruction to place an order for a product and generate the corresponding order information; this order information is the data generated from the user's operation instruction. The data generated by a terminal device is not limited to one type (for example, order-type data) and may include multiple types, such as click records of a product clicked by users. It can therefore be understood that a large amount of data, covering multiple types, can be obtained through the terminal devices in device cluster 106a. The data generated by device cluster 106a may be the data in data set 100a. Data set 100a includes n data items, where the value of n depends on the actual application scenario and is not limited here; for example, n may be 10,000 or 100 million. One data item in data set 100a may be one piece of order information generated as described above, or one click record of a product clicked by a user. Data set 100a may be stored in server 100 in Figure 1.

This application mainly provides a data sampling method in which the sampled data may be the data generated by the terminal devices in device cluster 106a. In other words, the method is described here using the sampling of the n data items in data set 100a as an example. Multiple threads can read the n data items in data set 100a at the same time and sample them. These threads may run on sampling devices, and multiple sampling devices may be provided to read and sample the data in data set 100a. A sampling device may be a server, and one sampling device may contain multiple threads, so the sampling results obtained by all threads on all sampling devices must finally be merged. Specifically, each sampling device first merges the sampling results of its own threads to obtain the sampling result of that device. The sampling results of the individual sampling devices are then merged (each sampling device can be regarded as a thread at a higher level), which finally yields the sampling result for all data in data set 100a. The description here uses three threads (threads s1, s2, and s3, as shown in Figure 2a) that read and sample the data in data set 100a. The three threads may be three different threads in the same sampling device, or threads in different sampling devices. For example, they may be threads in the sampling devices of device cluster 400b in Figure 1. In practice, more threads (possibly spread over multiple sampling devices) can be used to read and sample the data in data set 100a. The number of threads and the number of sampling devices depend on the actual application scenario and are not limited here.

Each of threads s1, s2, and s3 includes multiple data pools, and different data pools store different types of data sampled by the thread. Specifically, thread s1 includes data pools a1, b1, and c1; thread s2 includes data pools a2, b2, and c2; and thread s3 includes data pools a3, b3, and c3. Data pools a1, a2, and a3 store data of one and the same type sampled from data set 100a, as do data pools b1, b2, and b3, and data pools c1, c2, and c3. Assume that data pools a1, a2, and a3 store data of a first type, data pools b1, b2, and b3 store data of a second type, and data pools c1, c2, and c3 store data of a third type. Threads s1, s2, and s3 can all read data from data set 100a at the same time and sample the data while reading it, that is, they sample as they read.

It should be noted that the data in data set 100a may be a data stream. That is, while thread 1, thread 2, and thread 3 read data from data set 100a and sample what they read, data generated by device cluster 106a may continue to be added to data set 100a. The sampling provided in this application can therefore be real-time sampling: the sampled data may be generated in real time, and read and sampled in real time by multiple threads. Each piece of data in data set 100a may include multiple fields, and each thread (thread s1, thread s2, and thread s3, that is, threads 1, 2, and 3 above) can use those fields to identify and classify the data it reads. For example, each thread can classify the data it reads as income-type data, age-type data, click-type data, order-type data, and so on. Here it is assumed that there are three types of data in total: the first type (for example, income-type data), the second type (for example, age-type data), and the third type (for example, order-type data).

Each thread samples each type of data it reads into a separate data pool, and the data finally retained in a pool is that pool's sampling result. Thread 1 samples the first-type data it reads into data pool a1 (sampling result A1), the second-type data into data pool b1 (sampling result B1), and the third-type data into data pool c1 (sampling result C1). Thread 2 samples the first-type data it reads into data pool a2 (sampling result A2), the second-type data into data pool b2 (sampling result B2), and the third-type data into data pool c2 (sampling result C2). Thread 3 samples the first-type data it reads into data pool a3 (sampling result A3), the second-type data into data pool b3 (sampling result B3), and the third-type data into data pool c3 (sampling result C3).

It can therefore be understood that each type of data corresponds to its own sampling result, and that the sampling processes for different types of data are independent of one another. The sampling result for the first type of data is obtained by merging sampling results A1, A2, and A3, shown as sampling result 101a in Figure 2a. The sampling result for the second type of data is obtained by merging sampling results B1, B2, and B3, shown as sampling result 102a in Figure 2a. The sampling result for the third type of data is obtained by merging sampling results C1, C2, and C3, shown as sampling result 103a in Figure 2a. Sampling results 101a, 102a, and 103a can be collectively referred to as sampling result 104a, obtained by sampling the data in data set 100a.
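The per-type routing described above can be sketched as follows. This is a minimal illustration only: the record layout (dictionaries), the field names `income`, `age`, and `order_id`, and the functions `classify` and `route` are all assumptions made for the example and do not come from the application.

```python
from collections import defaultdict


def classify(record):
    """Return a type label from the record's fields (illustrative rule, assumed)."""
    if "income" in record:
        return "income"
    if "age" in record:
        return "age"
    if "order_id" in record:
        return "order"
    return "other"


def route(records):
    """Group the records read by one thread into independent per-type pools."""
    pools = defaultdict(list)
    for record in records:
        pools[classify(record)].append(record)
    return dict(pools)


# Hypothetical records read by thread 1.
thread1_records = [{"income": 3000}, {"age": 41}, {"order_id": "x1"}, {"income": 8000}]
pools = route(thread1_records)
```

Each pool here plays the role of one of the data pools a1, b1, and c1; the sampling applied inside each pool is a separate step and is not shown in this sketch.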

Sampling result 104a includes sampling results 101a, 102a, and 103a. As shown in Figure 2a, sampling result 101a consists of three pieces of first-type data (first type data 1, first type data 2, and first type data 3), sampled at a sampling rate of 1/24 (that is, each piece of data had a probability of 1/24 of being sampled and retained). Sampling result 102a consists of four pieces of second-type data (second type data 1, second type data 2, second type data 3, and second type data 4), sampled at a sampling rate of 1/8 (each piece of data had a probability of 1/8 of being retained). Sampling result 103a consists of five pieces of third-type data (third type data 1, third type data 2, third type data 3, third type data 4, and third type data 5), sampled at a sampling rate of 1/4 (each piece of data had a probability of 1/4 of being retained). The sampling rate for each type of data is adjusted adaptively during sampling according to the corresponding sampling quantity threshold; the adjustment process is described in the embodiment corresponding to Figure 2b below. It should be noted that in practical application scenarios sampling results 101a, 102a, and 103a may contain far more sampled data (for example, 100,000 or 1,000,000 pieces); three, four, and five pieces are used here only as examples. The data features of the three pieces of first-type data in sampling result 101a can be used to characterize the data features of all first-type data in data set 100a; the data features of the four pieces of second-type data in sampling result 102a can be used to characterize the data features of all second-type data in data set 100a; and the data features of the five pieces of third-type data in sampling result 103a can be used to characterize the data features of all third-type data in data set 100a.

For example, suppose the first type of data describes the income of each citizen. When the per capita income of citizens in each region (here assumed to be region 1, region 2, and region 3) needs to be calculated, a statistical calculation can be performed on the first-type data in sampling result 101a, giving a per capita income of 3000 yuan for region 1, 6000 yuan for region 2, and 8000 yuan for region 3. This calculation may be performed by a separate device dedicated to statistical calculation, or by the sampling device hosting any one of threads s1, s2, and s3; the device that performs it depends on the actual application scenario and is not limited here. After the calculation result is obtained, the device that computed it can send it to user terminal 105a, and user terminal 105a can display it on a terminal page. User terminal 105a may be a terminal held by an administrator or a terminal containing an administrator account. The administrator may be an authenticated user with data management permissions, for example a senior manager at a government agency that has access to citizens' income information. As shown in Figure 2a, the calculation result displayed on user terminal 105a is: "Statistical analysis results of income distribution: Region 1: per capita income 3000 yuan; Region 2: per capita income 6000 yuan; Region 3: per capita income 8000 yuan."
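The per-region calculation over a sampling result can be sketched as below. The record fields `region` and `income`, the function name `per_capita_income`, and the four sample records are hypothetical and chosen only so that the arithmetic is easy to follow; the application does not specify how the statistic is computed.

```python
from collections import defaultdict


def per_capita_income(sample):
    """Average the income field per region over the sampled records only."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in sample:
        totals[record["region"]] += record["income"]
        counts[record["region"]] += 1
    return {region: totals[region] / counts[region] for region in totals}


# Hypothetical stand-in for the first-type data in sampling result 101a.
sample_101a = [
    {"region": "region 1", "income": 2500},
    {"region": "region 1", "income": 3500},
    {"region": "region 2", "income": 6000},
    {"region": "region 3", "income": 8000},
]
result = per_capita_income(sample_101a)
```

Because every record of a given type is retained with the same probability, an unweighted mean over the sample is a reasonable estimate of the mean over the full data; no per-record weights are needed in this sketch.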

In practical application scenarios, the total amount of data collected for each type is usually enormous. Each type of data can therefore first be sampled in the manner described above; later, when the data features of each type of data in data set 100a need to be calculated, the calculation can be performed directly on the sampled data of that type. Because fewer pieces of data take part in the calculation, both the amount of computation and the computational complexity of calculating the data features of each type of data are reduced. In addition, because the sampling process for each type of data guarantees that every piece of data is sampled with equal probability, the data features of the sampled data can be used to characterize the data features of all the data.

In the method provided in this application, a sampling quantity threshold can be set for each type of sampled data. This threshold can be understood as the capacity of each data pool: each data pool samples the data that is read according to its sampling quantity threshold, so that the number of pieces of data finally retained in the pool does not exceed that threshold. Specifically, for a given type of data, the data pools in the different threads all have the same sampling quantity threshold. For example, if the sampling quantity threshold for the first type of data is set to 50, meaning that at most 50 pieces of first-type data are to be obtained in the end, then data pools a1, a2, and a3 each have a sampling quantity threshold of 50. After sampling result A1 of data pool a1, sampling result A2 of data pool a2, and sampling result A3 of data pool a3 are merged, the total number of pieces of first-type data finally obtained must also be less than or equal to 50. As another example, if the sampling quantity threshold for the second type of data is set to 100, meaning that at most 100 pieces of second-type data are to be obtained in the end, then data pools b1, b2, and b3 each have a sampling quantity threshold of 100, and after sampling results B1, B2, and B3 are merged, the total number of pieces of second-type data finally obtained must also be less than or equal to 100. As a further example, if the sampling quantity threshold for the third type of data is set to 150, meaning that at most 150 pieces of third-type data are to be obtained in the end, then data pools c1, c2, and c3 each have a sampling quantity threshold of 150, and after sampling results C1, C2, and C3 are merged, the total number of pieces of third-type data finally obtained must also be less than or equal to 150. How the threads sample each type of data, and how the sampling results of different threads for the same type of data are merged, are described in the embodiment corresponding to Figure 2b below.

Please refer to Figure 2b, which is a schematic diagram of a thread sampling scenario provided in this application. The embodiment described in Figure 2b shows how one thread samples one type of data that it reads; in other words, it explains the sampling principle of one data pool (corresponding to the sampling of one type of data) in one thread. It can be understood that every data pool (the pools in different threads that store different types of data) follows the same sampling principle and operates independently. As shown in Figure 2b, data pool 100g is assumed to be any data pool in a thread of a sampling device used to sample data (for example, data pool 100g may be any one of data pools a1, a2, a3, b1, b2, b3, c1, c2, and c3 in Figure 2a). Here it is assumed that data pool 100g is data pool b3 in thread s3. Data pool 100g can store the second-type data that thread s3 reads from the data layer and samples; the data layer can be understood as storing data set 100a of Figure 2a. Thread s3 can read and sample the data in the data layer at the same time: thread s3 continuously reads data from the data layer and stores it in data pool 100g, and the data pool filters the stored data (this filtering is the sampling process), retaining some data and discarding the rest. In other words, the method provided in this application does not need to read all the data before sampling. It can sample while reading (according to the set sampling quantity threshold), that is, filter while reading, so that the amount of data stored in the system is kept within the sampling quantity threshold. This saves system capacity and improves sampling efficiency.

As shown in Figure 2b, assume that the sampling quantity threshold of data pool 100g is 5. Initially, thread s3 reads data from the data layer at a sampling rate of 1. While the number of pieces of data read has not exceeded the sampling quantity threshold of 5, thread s3 can store all the data it reads in data pool 100g (the data here is always of one type, because one data pool stores one type of data). That is, when thread s3 has read the 1st through the 5th piece of data, all 5 pieces can be stored in data pool 100g, and the data stored in data pool 100g may be the 5 pieces of data in data set 101g. When thread s3 reads the 6th piece of data, the data stored in data pool 100g may be the 6 pieces of data in data set 102g. Because the sampling quantity threshold of data pool 100g is 5 and the number of stored pieces, 6, exceeds it, the 6 stored pieces need to be sampled (that is, filtered), which discards some of them. The 6 pieces can be sampled at a sampling rate of 1/2; after this sampling, the data stored in data pool 100g may be the 3 pieces of data in data set 103g.

Next, each new piece of data is read in at a sampling rate of 1 (that is, the new data has not yet been sampled). The newly read data (assume that, after data set 103g is obtained, the newly read data is data 105g) has to be added to data pool 100g, that is, merged with data set 103g already in the pool. The sampling rates therefore have to be unified to guarantee that every piece of data is sampled with equal probability. Specifically, because the data in data set 103g has already been sampled at a sampling rate of 1/2, data 105g also needs to be sampled at a sampling rate of 1/2. For the process of sampling the data in data set 102g and data 105g at a sampling rate of 1/2, see the description of step S102 in the embodiment corresponding to Figure 3 below. Assume that data set 104g is obtained after data 105g is sampled at a sampling rate of 1/2: data pool 100g now stores the data in data set 104g, that is, the data of data set 103g plus data 105g. At this point, all the data in data set 104g has been sampled at a sampling rate of 1/2.

Thread s3 then keeps reading data, and unifies the sampling rate of each piece of data it reads with that of the data in data set 104g; that is, thread s3 samples each piece of data it reads at a sampling rate of 1/2. During this period, some of the data read and sampled by thread s3 is added to data pool 100g, and some of it is filtered out (discarded). When the data in data pool 100g again exceeds the sampling quantity threshold of 5, the data stored in data pool 100g may be the 6 pieces of data in data set 106g, all of which were sampled at a sampling rate of 1/2. Thread s3 can again sample the data in data set 106g at a sampling rate of 1/2, which yields data set 107g: data pool 100g now stores the 3 pieces of data in data set 107g, sampled from data set 106g at a sampling rate of 1/2. It can be understood that the data in data set 107g has an overall sampling rate of 1/4, because it has gone through two rounds of sampling at a rate of 1/2.

Subsequently, thread s3 can continue to read new data, and likewise unifies the sampling rate of each new piece of data with that of the data in data set 107g; that is, thread s3 samples each new piece of data it reads with a probability of 1/4. During this period, data pool 100g retains some of the new data read and sampled by thread s3 and discards the rest. When the number of pieces of data stored in data pool 100g exceeds the sampling quantity threshold again (that is, when it reaches 6 again), the 6 stored pieces have all been sampled at a sampling rate of 1/4, and thread s3 can once more sample the data stored in data pool 100g at a sampling rate of 1/2. After this sampling, the number of pieces of data stored in data pool 100g can again be 3, and it can be understood that these 3 pieces have an overall sampling rate of 1/8, because they have gone through three rounds of sampling at a rate of 1/2. Thread s3 can then keep reading new data and repeat the steps described above on the new data to obtain the sampling result.
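The rates 1/2, 1/4, and 1/8 above follow one pattern, which can be written out as a short derivation. It assumes, as an illustration only, that each halving keeps every stored piece of data independently with probability 1/2; the symbols k (number of halvings so far), j (number of halvings that had occurred when a piece of data was read), and p_k (the pool's current sampling rate) are introduced here and do not appear in the application.

```latex
p_k = \left(\tfrac{1}{2}\right)^{k}, \qquad
\Pr[\text{data admitted at level } j \text{ is still stored at level } k]
  = \underbrace{2^{-j}}_{\text{admission}} \cdot
    \underbrace{2^{-(k-j)}}_{\text{later halvings}}
  = 2^{-k}, \quad 0 \le j \le k .
```

Under that assumption the retention probability does not depend on j, so data read early and data read late are retained with the same probability; for k = 3 this gives the rate of 1/8 stated above.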

Finally, when thread s3 stops reading new data (that is, when sampling is complete), data pool 100g stores no more pieces of data than the sampling quantity threshold of 5, which gives the final sampling result. For example, the data finally stored in data pool 100g may be the 5 pieces of data in data set 108g. The sampling rate of these 5 pieces is determined jointly by the number of pieces of data of the type stored in data pool 100g that thread s3 has read and by the sampling quantity threshold of data pool 100g. The sampling result of data pool 100g is therefore the 5 pieces of data in data set 108g. In the same way, the following sampling results in Figure 2a can be obtained: sampling result A1 of data pool a1, sampling result B1 of data pool b1, sampling result C1 of data pool c1, sampling result A2 of data pool a2, sampling result B2 of data pool b2, sampling result C2 of data pool c2, sampling result A3 of data pool a3, sampling result B3 of data pool b3, and sampling result C3 of data pool c3.
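The behaviour of a single data pool described for Figure 2b can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the application: the class name `DataPool` and the method `offer` are hypothetical, and each halving is assumed to keep every stored piece of data independently with probability 1/2. As a result, a halving of 6 pieces does not always leave exactly 3 as in the figure, and the sketch halves repeatedly until the pool fits its threshold.

```python
import random


class DataPool:
    """One pool for one data type in one thread (hypothetical class name)."""

    def __init__(self, threshold, rng=None):
        self.threshold = threshold        # sampling quantity threshold (pool capacity)
        self.rate = 1.0                   # current sampling rate, starts at 1
        self.items = []
        self.rng = rng or random.Random()

    def _halve(self):
        # Assumption: keep each stored piece independently with probability 1/2.
        self.items = [x for x in self.items if self.rng.random() < 0.5]
        self.rate /= 2

    def offer(self, record):
        # Unify the new piece's sampling rate with the pool's current rate.
        if self.rng.random() < self.rate:
            self.items.append(record)
        # Halve until the pool fits within the threshold again.
        while len(self.items) > self.threshold:
            self._halve()


pool = DataPool(threshold=5, rng=random.Random(0))
for i in range(1000):                     # a stream whose length is not known in advance
    pool.offer(i)
```

After the loop, `pool.items` holds at most 5 records and `pool.rate` is the power of 1/2 at which all of them were sampled. Memory use is bounded by the threshold regardless of how long the stream is.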

When the sampling results obtained by multiple threads are merged, the merging principle is the same as the sampling principle described above for thread s3. When there are multiple threads, their sampling results can be merged in pairs (only sampling results for the same type of data are merged with each other). For example, with thread 1, thread 2, thread 3, and thread 4, the sampling results of thread 1 and thread 2 can be merged to obtain merged result 1, and the sampling results of thread 3 and thread 4 can be merged to obtain merged result 2; merged result 1 and merged result 2 can then be merged to obtain the overall sampling result of the four threads. Optionally, the merging can instead proceed as follows: the sampling results of thread 1 and thread 2 are merged to obtain merged result 1; merged result 1 is merged with the sampling result of thread 3 to obtain merged result 3; and merged result 3 is merged with the sampling result of thread 4 to obtain merged result 4, which is the overall sampling result of the four threads.
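The two merge orders described above can be sketched as below. To keep the example deterministic, `merge` is a placeholder that only records the order in which results are combined; the real two-way merge is not modelled here. The function names `merge_pairwise` and `merge_sequential` are hypothetical, and both expect a non-empty list.

```python
from functools import reduce


def merge(a, b):
    """Placeholder for the real two-way merge; it only records the merge order."""
    return f"({a}+{b})"


def merge_pairwise(results):
    """Merge neighbouring results round by round until one result remains."""
    while len(results) > 1:
        merged = [merge(results[i], results[i + 1])
                  for i in range(0, len(results) - 1, 2)]
        if len(results) % 2:              # an odd one out is carried to the next round
            merged.append(results[-1])
        results = merged
    return results[0]


def merge_sequential(results):
    """Fold each further result into the running merged result."""
    return reduce(merge, results)


threads = ["T1", "T2", "T3", "T4"]
tree = merge_pairwise(threads)            # (T1+T2) and (T3+T4) first, then both together
chain = merge_sequential(threads)         # T1+T2, then +T3, then +T4
```

The pairwise order lets the merges within one round run independently of each other, while the sequential order needs only one running result; which is preferable depends on the deployment and is not prescribed by the text.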

For example, when the sampling result of thread 1 is merged with the sampling result of thread 2, the sampling rates of the two results first need to be unified. Suppose the data in the sampling result of thread 1 has a sampling rate of 1/8 and the data in the sampling result of thread 2 has a sampling rate of 1/16. The data in the sampling result of thread 1 then needs to be sampled again at a sampling rate of 1/2, so that both sampling results have a sampling rate of 1/16 (because 1/8 * 1/2 = 1/16). In this step, some of the data in the sampling result of thread 1 is discarded. The sampling result of thread 1 after its rate has been unified to 1/16 can be called the unified sampling result, and it can be merged directly with the sampling result of thread 2 to obtain a merged sampling result. The merged sampling result contains the data of the unified sampling result and the data of the sampling result of thread 2, all with a sampling rate of 1/16.

Next, the number of pieces of data in the merged sampling result is checked. If it exceeds the sampling quantity threshold, the data in the merged sampling result needs to be sampled again at a sampling rate of 1/2, which gives the final merged result of thread 1 and thread 2; it can be understood that the data in this merged result has a sampling rate of 1/32 (that is, 1/16 * 1/2 = 1/32). If it does not exceed the sampling quantity threshold, the merged sampling result can be used directly as the final merged result of thread 1 and thread 2. Merging the sampling results of more than two threads follows the same principle: first unify the sampling rates, then check whether the number of pieces of merged data exceeds the sampling quantity threshold. If it does, sample again with a probability of 1/2 to obtain the final result, which guarantees that all the data in the final merged result has the same sampling rate; if it does not, the merged data can be used directly as the final merged result.
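The merge of two sampling results can be sketched as follows. This is an illustrative sketch rather than the implementation of the application: the function names `downsample` and `merge_samples` are hypothetical, thinning is assumed to keep each piece of data independently with the stated probability, and the sketch repeats the 1/2 sampling until the merged result fits the threshold, because a single independent halving is not guaranteed to do so.

```python
import random


def downsample(items, keep_prob, rng):
    """Keep each piece independently with probability keep_prob (assumed scheme)."""
    return [x for x in items if rng.random() < keep_prob]


def merge_samples(items_a, rate_a, items_b, rate_b, threshold, rng=None):
    """Merge two sampling results of the same data type into one."""
    rng = rng or random.Random()
    target = min(rate_a, rate_b)
    # Unify the rates: thin the result with the higher rate down to the lower one.
    if rate_a > target:
        items_a = downsample(items_a, target / rate_a, rng)
    if rate_b > target:
        items_b = downsample(items_b, target / rate_b, rng)
    merged, rate = items_a + items_b, target
    # Sample at 1/2 again while the merged result exceeds the threshold.
    while len(merged) > threshold:
        merged = downsample(merged, 0.5, rng)
        rate /= 2
    return merged, rate


# Thread 1 at rate 1/8 and thread 2 at rate 1/16, as in the example above.
thread1 = ["a1", "a2", "a3", "a4"]
thread2 = ["b1", "b2", "b3"]
merged, rate = merge_samples(thread1, 1 / 8, thread2, 1 / 16,
                             threshold=5, rng=random.Random(7))
```

The returned `rate` is the single sampling rate shared by every record in `merged`, so the result can be passed straight into another call to `merge_samples` when more threads or devices are combined.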

The method provided in this application supports sampling data in parallel with reading it when large amounts of data are read, that is, filtering while reading. The amount of data stored in a data pool can thus be kept within the sampling quantity threshold, which saves system capacity and improves sampling efficiency. In addition, during sampling, the sampling rate applied to the data that is read can be adjusted adaptively according to the sampling quantity threshold of the data pool, which makes the method suitable for sampling data in the form of a data stream. That is, even when the number of pieces of data to be read is not known in advance, the sampling rate can be adjusted adaptively according to the number of pieces of data that are to be obtained in the end, while guaranteeing that every piece of data read is sampled with equal probability (data of the same type is sampled at the same sampling rate; different types of data may have different sampling rates). This improves the accuracy of sampling.

Please refer to Figure 3, which is a schematic flowchart of a data sampling method provided in this application. As shown in Figure 3, the method may include:

Step S101: sample data of a target type at a first sampling rate in a first time window to obtain target sampled data;

Specifically, the data sampling process involved in this application is explained first. The data to be sampled in this application may be any data. Multiple sampling devices can be used to read and sample the data, and each sampling device can include multiple threads. The final sampling result is therefore obtained in two stages: first, each sampling device merges the sampling results of its own threads to obtain the sampling result of that device; second, the sampling results of all the sampling devices are merged to obtain the final sampling result. The execution subject in this embodiment may be a sampling system that includes multiple threads (which may be the threads of multiple sampling devices). The method provided in this application can be applied to scenarios in which data in a data stream is sampled (the data is generated in real time, and the sampling is also performed in real time). Such a scenario can have multiple time windows, each with its own sampling result: the sampling result of a time window is the result of sampling all the data generated in that time window.

For example, suppose that the data generated in the 1 hour before the current time (24 hours or any other configured duration is also possible) needs to be read and sampled, and that a sampling result is produced every 10 minutes. Each time window then has a length of 1 hour (60 minutes), the start times of two adjacent time windows differ by 10 minutes, and their end times also differ by 10 minutes. Please refer to Figure 4, which is a schematic diagram of a time window scenario provided in this application. As shown in Figure 4, the width of a time window can be understood as the length of the time period it covers. Because this length (for example, 1 hour) is usually greater than the gap between the start times of two adjacent time windows (for example, 10 minutes), two adjacent time windows usually share an intersection time window, that is, the time window covering the time period common to both. As shown in Figure 4, assuming that every time window is 1 hour long, time windows 100f, 101f, 102f, 103f, and 104f each represent a time period of 1 hour. Time windows 100f and 101f are adjacent, and they share intersection time window 1 (in this application, an intersection time window always refers to the intersection of the later time window with the earlier one). Time window 100f is the first time window, so no other time window precedes it. Time window 100f is used to read in and sample the data generated in the 1 hour before the current time; for example, if reading starts at 13:00, the data read and sampled in time window 100f is the data generated between 12:00 and 13:00. Every piece of data has a data generation timestamp that indicates when it was generated, so when data is read for a time window, the data generated within that window can be selected according to these timestamps.

The data in each time window can be read and sampled in parallel by multiple threads (which may reside on multiple sampling devices). Time window 101f also includes a part that does not intersect time window 100f, namely non-intersection time window 1; intersection time window 1 and non-intersection time window 1 together make up time window 101f. Intersection time window 1 can be called the intersection time window of time window 101f, and non-intersection time window 1 the non-intersection time window of time window 101f.

Similarly, time windows 101f and 102f share intersection time window 2, and time window 102f also includes a part that does not intersect time window 101f, namely non-intersection time window 2; together they make up time window 102f, and they can be called the intersection time window and the non-intersection time window of time window 102f. Time windows 102f and 103f share intersection time window 3, and time window 103f also includes a part that does not intersect time window 102f, namely non-intersection time window 3; together they make up time window 103f, and they can be called the intersection time window and the non-intersection time window of time window 103f. Time windows 103f and 104f share intersection time window 4, and time window 104f also includes a part that does not intersect time window 103f, namely non-intersection time window 4; together they make up time window 104f, and they can be called the intersection time window and the non-intersection time window of time window 104f.

Because the sampling processes of the time windows run serially, sampling of the data in the next time window starts only after the sampling result of the previous time window has been obtained. Therefore, the data in the previous time window's sampling result whose data generation timestamps fall within the intersection time window with the next time window can be used directly as part of the next time window's sampling result. The next time window then only needs to read and sample the new data generated in the portion that does not intersect the previous time window (the non-intersection time window), and does not need to re-read and re-sample the data generated in the intersection time window. This improves the efficiency of sampling data within a time window. For example, the sampling result for time window 100f is obtained by reading and sampling all data generated within the hour preceding the current time. Once that result has been obtained, the sampling process for time window 101f can proceed. Assuming the start times of time windows 100f and 101f differ by 10 minutes, the sampling process for time window 101f only needs to read and sample the new data generated in non-intersection time window 1 (10 minutes). After that, the data in the sampling result of time window 100f whose data generation timestamps fall within intersection time window 1 (50 minutes) can be taken directly and merged with the sampling result obtained from non-intersection time window 1, which yields the sampling result for time window 101f.
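The reuse of the previous window's result can be sketched as follows. This is a minimal illustration, not the application's implementation: the `Sample` type, the `window_result` function, and the use of minutes as the timestamp unit are all assumptions made for the example, and both inputs are assumed to have already been sampled at the same rate.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    data_id: str
    generated_at: int  # data generation timestamp, in minutes (hypothetical unit)


def window_result(prev_result: List[Sample], new_samples: List[Sample],
                  window_start: int, window_end: int) -> List[Sample]:
    """Build a window's sampling result from reused and newly sampled data."""
    # Reuse samples of the previous window whose timestamps fall inside the
    # new window's bounds, i.e. inside the intersection time window.
    reused = [s for s in prev_result
              if window_start <= s.generated_at < window_end]
    # new_samples come from the non-intersection time window only.
    return reused + list(new_samples)


# Window 100f covers minutes [0, 60); window 101f covers [10, 70).
prev = [Sample("a", 5), Sample("b", 30), Sample("c", 55)]
new = [Sample("d", 65)]
result_101f = window_result(prev, new, window_start=10, window_end=70)
```

Here sample `a` (minute 5) lies outside window 101f and is dropped, while `b` and `c` are reused without being read or sampled again.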

Because multiple threads read and sample the data of each time window in parallel, every thread reads and samples data generated in that time window while its sampling process is executed. The data read by each thread differs, and the amount of data each thread reads and samples is not fixed; as described above, the data read here is the new data generated in the non-intersection time window, that is, the part of this time window that does not intersect the previous time window. Finally, the sampling result for the time window is obtained by merging the sampling results of all threads and then merging in the sampling data taken from the intersection time window with the previous time window.
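A rough sketch of this per-window flow, assuming the new data has already been split into one partition per thread. The function names, the MD5-based `id_hash`, and the use of `ThreadPoolExecutor` are illustrative choices rather than details taken from this application; the keep-if-divisible rule follows the hash-based sampling described later in this section.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def id_hash(data_id: str) -> int:
    """Map a data identifier string to an integer (hash choice is an assumption)."""
    digest = hashlib.md5(data_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sample_partition(ids: List[str], denominator: int) -> List[str]:
    """One thread's work: keep the IDs whose hash is divisible by the denominator."""
    return [i for i in ids if id_hash(i) % denominator == 0]


def sample_window(partitions: List[List[str]], reused: List[str],
                  denominator: int, workers: int = 4) -> List[str]:
    """Sample each partition in its own thread, then merge with reused samples."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_thread = list(pool.map(
            lambda part: sample_partition(part, denominator), partitions))
    merged = list(reused)  # samples taken over from the intersection time window
    for part in per_thread:
        merged.extend(part)
    return merged


partitions = [[f"id-{i}" for i in range(0, 100)],
              [f"id-{i}" for i in range(100, 200)]]
window_samples = sample_window(partitions, reused=["old-1"], denominator=4)
```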

The reading and sampling processes are the same for every thread, so the first thread in the sampling system is used here as an example to explain how data is read and sampled. The first thread can be a thread, residing in any sampling device, that reads data and samples it. The data sampled by the first thread can be stored in a sampling database. Each thread also reads and samples different types of data in the same way, so the explanation below takes the first thread sampling data of a target type as an example (the target type can be any one of the types of data being sampled). The first thread can have multiple sampling databases, one per type of sampling data; that is, each sampling database stores the sampling data obtained by sampling one type of data. The data of the target type therefore also has a corresponding sampling database, which stores the sampling data obtained by sampling data of the target type. In the process described below, the sampling database always refers to the sampling database corresponding to the data of the target type.

Data that already exists in the sampling database before the target sampling data is acquired can be called historical sampling data. The historical sampling data can be sampling data that was obtained by sampling data of the target type before the first thread acquired the target sampling data; it can have been sampled by the first thread itself or received from other threads. The historical sampling data in the sampling database can also be empty.

The process of acquiring the target sampling data can be as follows. Each piece of data to be sampled can include multiple pieces of field information. After the first thread reads original business data (which includes data of the target type among other types) from the data stream (which can be in the underlying data layer), it first performs field parsing on the original business data. Parsing yields the multiple pieces of field information of each piece of original business data that was read; the original business data after field parsing can be called initial parsed data (which includes the initial parsed data corresponding to the data of the target type). The first thread can then filter the field information in the initial parsed data through a filtering mechanism: a filtering rule can be preset that specifies which field information in the initial parsed data is retained and which is filtered out (deleted). This filtering operation is optional; the field information in the initial parsed data can be filtered, or all of it can be retained without filtering. The data obtained after filtering the field information of the initial parsed data can be called filtered parsed data (which includes the filtered parsed data corresponding to the data of the target type).

Furthermore, associated field information can be added to the filtered parsed data through a thesaurus association mechanism: the thesaurus to be associated can be preset, and the field information in that thesaurus is added to the filtered parsed data. For example, if the field information in the filtered parsed data includes personal income, thesaurus association can add field information such as the person's birthplace. As another example, if the field information in the filtered parsed data includes marital status, thesaurus association can add field information such as the names of the person's parents. The thesaurus association operation is also optional; additional field information may or may not be added to the filtered parsed data. The filtered parsed data after the thesaurus association operation can be called sampled business data (which includes the sampled business data corresponding to the data of the target type). The sampled business data is the result of the first thread's processing of the original business data read from the data layer, and the sampling operation can subsequently be performed on it.
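The three preprocessing stages (field parsing, optional filtering, optional thesaurus association) can be sketched as below. The `key=value;key=value` record format, the function names, and the lookup of associated fields by a `user` key are assumptions made only for illustration; the application does not specify a record format or a lookup key.

```python
from typing import Dict, Optional, Set


def parse_fields(raw_line: str) -> Dict[str, str]:
    """Field parsing: split 'key=value;key=value' into field information."""
    return dict(pair.split("=", 1) for pair in raw_line.split(";") if pair)


def filter_fields(record: Dict[str, str],
                  keep: Optional[Set[str]] = None) -> Dict[str, str]:
    """Optional filtering: retain only the fields named by the filtering rule."""
    if keep is None:  # no rule preset, so every field is retained
        return dict(record)
    return {k: v for k, v in record.items() if k in keep}


def associate(record: Dict[str, str],
              thesaurus: Optional[Dict[str, Dict[str, str]]] = None,
              key: str = "user") -> Dict[str, str]:
    """Optional thesaurus association: add extra fields looked up by a key field."""
    if not thesaurus:
        return dict(record)
    extra = thesaurus.get(record.get(key, ""), {})
    return {**record, **extra}


def prepare(raw_line, keep=None, thesaurus=None):
    """Original business data -> sampled business data."""
    return associate(filter_fields(parse_fields(raw_line), keep), thesaurus)


raw = "id=tx1;user=u7;income=5000;debug=x"
sampled_business_data = prepare(
    raw,
    keep={"id", "user", "income"},
    thesaurus={"u7": {"birthplace": "Shenzhen"}},
)
```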

The first thread can identify the data type of the data it has read from the multiple pieces of field information in the sampled business data, and thereby pick out the sampled business data of the target type. The first thread can sample the identified data of the target type to obtain the target sampling data. The target sampling data can be sampling data of the target type obtained by real-time sampling in a first time window, and the first time window can be any time window in the sampling process. In the first thread, each type of data (including the data of the target type) has its own sampling result, and the sampling data in that result is the data ultimately stored in the sampling database corresponding to that type of data. The target sampling data can also be data obtained by the first thread sampling the data it has read at a first sampling rate (which may differ depending on the actual sampling scenario). For example, when new data is read, the target sampling data can be the new data read by the first thread at a sampling rate of 1 (that is, the first sampling rate is 1).

This application can involve two sampling scenarios, which differ in how the target sampling data is acquired. In the first sampling scenario, each thread mainly reads the data of the target type within a certain time window by itself and samples it. Again take as an example the process in which the first thread samples data of the target type (which can be raw data read from the data stream) at the first sampling rate in the first time window to obtain the target sampling data: the first thread samples the data in the data layer while reading it. At this stage a time t1 can be introduced. The historical sampling data in the sampling database can be the sampling data of the target type that the first thread obtained by sampling before time t1. After time t1, the first thread can continue to read new data, and the new data read after time t1 can be called the target sampling data (in this special case the first sampling rate of the target sampling data is 1; that is, the target sampling data is data of the target type read with probability 1). Because the first thread keeps reading new data in the first sampling scenario, historical sampling data and target sampling data exist at every moment at which the first thread reads new data (for example, time t2): the historical sampling data is the data of the target type sampled before time t2, and the target sampling data is the new data of the target type read after time t2.

The second of the two sampling scenarios is described using a first time window and a third time window as an example. The first time window and the third time window are two adjacent time windows, and the first time window and the second time window have the same window size. The first time window and the third time window share an intersection time window, and the third time window is the time window immediately preceding the first time window. In other words, the start time of the first time window is later than the start time of the third time window, and the end time of the first time window is also later than the end time of the third time window. For example, if a sampling operation is performed every 10 minutes, the offset between two adjacent time windows can be 10 minutes; that is, the difference between the start time of the first time window and the start time of the second time window can be 10 minutes, and the difference between the end time of the first time window and the end time of the second time window can also be 10 minutes.

In this sampling scenario, the historical sampling data in the sampling database can be the sampling data obtained, during the sampling for the third time window, by real-time sampling of data of the target type within the intersection time window between the third time window and the first time window. In other words, the historical sampling data is the sampling data obtained in the third time window whose generation time lies within the intersection time window of the first and third time windows. The target sampling data is obtained, during the sampling for the first time window, by sampling data of the target type at the first sampling rate within the part of the first time window that lies outside the intersection time window with the third time window.

Step S102: Merge the target sampling data with the historical sampling data in the sampling database to obtain merged sampling data. The third sampling rate of the merged sampling data is determined from the first sampling rate and the second sampling rate of the historical sampling data. The historical sampling data is sampling data of the target type obtained by sampling data of the target type before the target sampling data was obtained. The second sampling rate is the sampling rate of the historical sampling data relative to the data of the target type, and the third sampling rate is the sampling rate of the merged sampling data relative to the data of the target type.

Specifically, the first thread can merge the target sampling data with the historical sampling data in the sampling database to obtain merged sampling data. When the historical sampling data and the target sampling data come from the first sampling scenario described above, the historical sampling data was obtained by the first thread sampling the data of the target type it had read at a certain sampling rate. Call the sampling rate of the historical sampling data the second sampling rate. The second sampling rate is the sampling rate of the historical sampling data relative to the data of the target type; in other words, it is the sampling rate relative to the original (unsampled) business data corresponding to the historical sampling data. The first sampling rate of the target sampling data is 1; in this sampling scenario the target sampling data has in fact not been sampled yet, only read. The first thread therefore needs to unify the sampling rates of the target sampling data and the historical sampling data, so that all data of the target type is sampled at the same sampling rate. At this sampling stage the first sampling rate is necessarily greater than or equal to the second sampling rate (this application takes a sampling-rate range of 0 to 1 as an example).

The ratio of the second sampling rate to the first sampling rate can therefore be computed. This ratio can be called the first ratio sampling rate; its range is also 0 to 1, because it is obtained by using the smaller sampling rate as the dividend and the larger sampling rate as the divisor. The target sampling data can be sampled at this ratio, after which the sampled target sampling data has the same sampling rate as the current historical sampling data. The sampled target sampling data and the current historical sampling data can then be merged directly to obtain the merged sampling data. The sampling rate of the merged sampling data can be called the third sampling rate, and in this case the third sampling rate equals the second sampling rate. The third sampling rate is likewise a sampling rate relative to the data of the target type; that is, it is the sampling rate relative to the original (unsampled) business data corresponding to the merged sampling data.

In this application, every sampling rate can be 1/2 raised to the power n (explained in step S103 below); for example, the sampling rate can be 1, ……. As an example, if the second sampling rate above is 1/4, the first ratio sampling rate is 1/4 divided by 1, which is 1/4, and the target sampling data can be sampled at a sampling rate of 1/4. The sampling is performed as follows. Every piece of data taking part in sampling has a data identifier (a data ID), which can be a string combining letters and digits; the data identifier of each piece of data can be called its data identifier string. Each piece of target sampling data therefore has a corresponding data identifier string. Through its data identifier string, each piece of target sampling data can be mapped into a uniform sampling space, in which every piece of target sampling data can be sampled at the same sampling rate (here 1/4). One way to map the target sampling data into the uniform sampling space is to compute the hash value of each piece's data identifier string. Sampling the target sampling data at a rate of 1/4 then means dividing the hash value of each piece of target sampling data by 4: if the division is exact (the remainder equals 0), the corresponding target sampling data is stored as sampling data; if the division is not exact (the remainder is not 0), the corresponding target sampling data is discarded. Sampling the target sampling data in this way yields the sampled target sampling data. In this sampling scenario, the sampled target sampling data and the historical sampling data can be merged directly to obtain the merged sampling data described above.
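The hash-and-remainder rule can be sketched as follows. The application does not name a hash function, so the use of MD5 truncated to 64 bits is an assumption, as are the function names; the only property relied on is that the same data identifier string always maps to the same integer.

```python
import hashlib
from typing import List


def id_hash(data_id: str) -> int:
    """Map a data identifier string into a (roughly) uniform integer space."""
    digest = hashlib.md5(data_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def keep(data_id: str, denominator: int) -> bool:
    """Sampling rate 1/denominator: keep only if the hash divides exactly."""
    return id_hash(data_id) % denominator == 0


def sample(ids: List[str], denominator: int) -> List[str]:
    return [i for i in ids if keep(i, denominator)]


ids = [f"order-{n}" for n in range(1000)]
quarter = sample(ids, 4)     # sampling rate 1/4
sixteenth = sample(ids, 16)  # sampling rate 1/16
```

Because the decision depends only on the identifier, sampling is repeatable, and every item kept at rate 1/16 is also kept at rate 1/4. This nesting is what allows two sample sets taken at different power-of-1/2 rates to be brought to a common rate later.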

In the first sampling scenario described above, it is also possible to take the second sampling rate of the historical sampling data directly as the first sampling rate (that is, to make the first sampling rate equal to the second sampling rate), and to sample newly read data of the target type at this first sampling rate as it is read, which yields the target sampling data. In that case the sampling rate of the historical sampling data already matches that of the target sampling data, so the two can be merged directly to obtain the merged sampling data. This improves the efficiency of merging the target sampling data with the historical sampling data.

Now consider the case in which the historical sampling data and the target sampling data come from the second sampling scenario described above: the historical sampling data was sampled, during the sampling for the third time window, within the intersection time window of the third and first time windows, and the target sampling data was sampled, during the sampling for the first time window, within the part of the first time window outside the intersection time window with the third time window. In the second sampling scenario the target sampling data and the historical sampling data also need to be merged. The first step is to unify their sampling rates. In this case, which of the first sampling rate and the second sampling rate is larger depends on the actual sampling scenario. When the first sampling rate of the target sampling data is smaller than the second sampling rate of the historical sampling data, the sampling rate of the historical sampling data needs to be brought from the second sampling rate down to the first sampling rate. For example, if the first sampling rate is 1/16 and the second sampling rate is 1/4, the historical sampling data was obtained at a sampling rate of 1/4 and the target sampling data at a sampling rate of 1/16. In principle, the historical sampling data then needs to be sampled again at a sampling rate of 1/4, because the first sampling rate 1/16 divided by the second sampling rate 1/4 equals 1/4; this ratio of the first sampling rate to the second sampling rate can be called the second ratio sampling rate.

When the historical sampling data was originally sampled at the second sampling rate, the hash value of its data identifier string was divided by 4; data was stored if the remainder was 0 and discarded otherwise (here the remainder is 0, because the historical sampling data is stored in the sampling database). Since the historical sampling data was sampled at a rate of 1/4, the result (the quotient) of dividing the hash value of its data identifier string by 4 can be called the first sampling hash value. If the sampling system stored this first sampling hash value when it obtained the historical sampling data, then sampling the historical sampling data again at a rate of 1/4 can be done by dividing the first sampling hash value by 4: if the remainder is 0, the corresponding historical sampling data is stored; if the remainder is not 0, it is discarded. If the sampling system did not store the first sampling hash value, the step in step box 100b (unifying the sampling rate) can be executed instead, that is, the historical sampling data is sampled directly at the first sampling rate of 1/16: the hash value of the data identifier string of the historical sampling data is divided by 16, and the corresponding historical sampling data is stored if the remainder is 0 and discarded otherwise. Either route yields the sampled historical sampling data.
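The two routes give the same result: for a hash value h that is already divisible by 4, h is divisible by 16 exactly when h/4 is divisible by 4. A small check of this equivalence, using plain integers as stand-in hash values (the function names are illustrative):

```python
def passes_with_stored_quotient(first_sampling_hash: int) -> bool:
    """Route 1: the stored quotient (hash // 4) is divided by 4 again."""
    return first_sampling_hash % 4 == 0


def passes_direct(hash_value: int) -> bool:
    """Route 2: the original hash is divided by 16 directly."""
    return hash_value % 16 == 0


# Historical sampling data was kept at rate 1/4, so its hashes are multiples of 4.
historical_hashes = list(range(0, 400, 4))

via_quotient = [h for h in historical_hashes
                if passes_with_stored_quotient(h // 4)]
via_direct = [h for h in historical_hashes if passes_direct(h)]
```

Storing the quotient only saves recomputing the hash; it does not change which historical sampling data survives.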

Since there are usually multiple pieces of historical sampling data, sampling the historical sampling data again filters out some of them. The sampled historical sampling data and the target sampling data can then be merged directly to obtain the merged sampling data.

Similarly, in the second sampling scenario, if the first sampling rate is greater than the second sampling rate, the sampling rate of the target sampling data needs to be brought from the first sampling rate down to the second sampling rate. For example, if the first sampling rate is 1/4 and the second sampling rate is 1/16, the target sampling data needs to be sampled again at a rate of 1/4. As above, this can be done by dividing the hash value of the data identifier string of the target sampling data directly by 16: if the remainder is 0, the corresponding target sampling data is stored; if the remainder is not 0, it is discarded. Alternatively, if the result of dividing the hash value of the data identifier string by 4 was stored when the target sampling data was sampled at the first sampling rate, that stored value can be divided by 4 again: if the remainder is 0, the corresponding target sampling data is stored; if the remainder is not 0, it is discarded. Since there are usually multiple pieces of target sampling data, sampling them again at a rate of 1/4 filters out some of them. This process yields the sampled target sampling data, which can likewise be merged directly with the historical sampling data to obtain the merged sampling data. When the first sampling rate equals the second sampling rate, the target sampling data and the historical sampling data can be merged directly to obtain the merged sampling data.

Optionally, in the second sampling scenario, if the amount of historical sampling data is less than a merge quantity threshold (configurable, for example 10,000 pieces of data) and the ratio of the second sampling rate of the historical sampling data to the first sampling rate of the target sampling data (the second sampling rate divided by the first sampling rate) is less than a ratio threshold (configurable, for example 1/512), the historical sampling data can be discarded directly and only the target sampling data retained. In this case, sampling data is accumulated afresh starting from the non-intersection time window corresponding to the second time window, and all previously sampled data is discarded. This prevents a previous time window (here the third time window) that yielded too little data at too low a sampling rate from affecting the sampling process in later time windows, for example by causing the later sampling data to be too few in number and too low in sampling rate.
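The optional discard rule is a conjunction of the two conditions. A minimal sketch, using the example thresholds quoted above as defaults; the function and parameter names are hypothetical:

```python
from fractions import Fraction


def should_discard_history(history_count: int,
                           second_rate: Fraction,
                           first_rate: Fraction,
                           count_threshold: int = 10_000,
                           ratio_threshold: Fraction = Fraction(1, 512)) -> bool:
    """True when the historical sampling data is both too small and too sparse."""
    too_few = history_count < count_threshold
    too_sparse = second_rate / first_rate < ratio_threshold
    return too_few and too_sparse


# 500 historical samples at rate 1/2048 versus new data sampled at rate 1/2:
# the ratio is 1/1024, below 1/512, so the history would be dropped.
drop = should_discard_history(500, Fraction(1, 2048), Fraction(1, 2))
```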

As the above process shows, the third sampling rate of the resulting merged sampling data is simply the smaller of the first sampling rate and the second sampling rate.
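Putting the unification and merge together gives the sketch below. It assumes sampling rates of the form 1/2^n represented as `Fraction` values, uses the direct route (dividing the hash by the final denominator), and again uses an MD5-based `id_hash`, which is an illustrative choice rather than something this application specifies.

```python
import hashlib
from fractions import Fraction
from typing import List, Tuple


def id_hash(data_id: str) -> int:
    digest = hashlib.md5(data_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def downsample(ids: List[str], to_rate: Fraction) -> List[str]:
    """Thin a sample set down to to_rate; rates are 1/2**n, so the numerator is 1."""
    return [i for i in ids if id_hash(i) % to_rate.denominator == 0]


def merge(historical: List[str], second_rate: Fraction,
          target: List[str], first_rate: Fraction) -> Tuple[List[str], Fraction]:
    """Unify both sets to the smaller rate, then concatenate them."""
    third_rate = min(first_rate, second_rate)
    merged = downsample(historical, third_rate) + downsample(target, third_rate)
    return merged, third_rate


# Intersection-window data was sampled at 1/4, non-intersection data at 1/16.
old_raw = [f"id-{n}" for n in range(500)]
new_raw = [f"id-{n}" for n in range(500, 1000)]
historical = downsample(old_raw, Fraction(1, 4))
target = downsample(new_raw, Fraction(1, 16))
merged, third_rate = merge(historical, Fraction(1, 4), target, Fraction(1, 16))
```

The merged set is identical to what sampling all of the raw data once at rate 1/16 would have produced, which is why the merged sampling data can be treated as a uniform sample at the third sampling rate.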

Refer to Figure 5, which is a schematic diagram of a scenario for obtaining merged sampling data provided in this application. As shown in Figure 5, the first time window 101e and the third time window 100e share an intersection time window, and the first time window 101e also includes a non-intersection time window that does not intersect the third time window 100e. Sampling all the data generated in the third time window 100e yields sampling result 102e. As shown in Figure 5, sampling result 102e includes sampling data 1, sampling data 2, sampling data 3, sampling data 4, sampling data 5, and sampling data 6. Each piece of sampling data in sampling result 102e has a corresponding data generation timestamp, which represents the time at which that sampling data was generated. Specifically, sampling data 1 corresponds to data generation timestamp 1, sampling data 2 to data generation timestamp 2, sampling data 3 to data generation timestamp 3, sampling data 4 to data generation timestamp 4, sampling data 5 to data generation timestamp 5, and sampling data 6 to data generation timestamp 6.

其中，从采样结果102e中获取历史采样数据的过程可以是：可以将上述采样数据1对应的数据生成时间戳1记为t1，将上述采样数据2对应的数据生成时间戳2记为t2，将上述采样数据3对应的数据生成时间戳3记为t3，将上述采样数据4对应的数据生成时间戳4记为t4，将上述采样数据5对应的数据生成时间戳5记为t5，将上述采样数据6对应的数据生成时间戳6记为t6。如图5中的时间戳示意框104e所示，上述数据生成时间戳t4和数据生成时间戳t6在第三时间窗口100e对应的非交集时间窗口内，上述数据生成时间戳t1、数据生成时间戳t2、数据生成时间戳t3和数据生成时间戳t5在第三时间窗口100e对应的交集时间窗口内。由于，此处历史采样数据为第三窗口采样数据中数据生成时间戳在交集时间窗口内，因此，可以将上述采样数据1、采样数据2、采样数据3和采样数据5作为历史采样数据，即数据集合105e中的采样数据即为历史采样数据。并且，数据集合105e中的每个历史采样数据对应的采样率为第二采样率，即采样数据1、采样数据2、采样数据3和采样数据5对应的采样率为第二采样率。The process of obtaining the historical sampling data from sampling result 102e can be as follows. Data generation timestamp 1 of sampling data 1 is denoted t1, data generation timestamp 2 of sampling data 2 is denoted t2, data generation timestamp 3 of sampling data 3 is denoted t3, data generation timestamp 4 of sampling data 4 is denoted t4, data generation timestamp 5 of sampling data 5 is denoted t5, and data generation timestamp 6 of sampling data 6 is denoted t6. As shown in the timestamp schematic box 104e in Figure 5, timestamps t4 and t6 fall within the non-intersecting part of the third time window 100e, while timestamps t1, t2, t3, and t5 fall within the intersection time window of the third time window 100e. Since the historical sampling data here is the third-window sampling data whose data generation timestamps fall within the intersection time window, sampling data 1, sampling data 2, sampling data 3, and sampling data 5 can be taken as the historical sampling data. In other words, the sampling data in data set 105e is the historical sampling data. Furthermore, the sampling rate corresponding to each historical sampling data in data set 105e is the second sampling rate; that is, the sampling rate corresponding to sampling data 1, sampling data 2, sampling data 3, and sampling data 5 is the second sampling rate.
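The selection of historical sampling data described above can be sketched as follows. This is only an illustrative sketch: the `Sample` class, the helper names, the minute-based timestamps, and the half-open window convention are assumptions made for the example and are not specified in this application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    name: str
    timestamp: int  # data generation timestamp, in minutes (hypothetical unit)


def intersection(window_a, window_b):
    """Return the overlap of two (start, end) windows, or None if they do not overlap."""
    start = max(window_a[0], window_b[0])
    end = min(window_a[1], window_b[1])
    return (start, end) if start < end else None


def select_historical(samples, earlier_window, later_window):
    """Keep the samples whose timestamp lies in the intersection time window."""
    overlap = intersection(earlier_window, later_window)
    if overlap is None:
        return []
    start, end = overlap
    # Half-open interval [start, end) is an assumption of this sketch.
    return [s for s in samples if start <= s.timestamp < end]


third_window = (0, 60)   # stands in for the third time window 100e
first_window = (10, 70)  # stands in for the first time window 101e
result_102e = [
    Sample("sample 1", 15), Sample("sample 2", 20), Sample("sample 3", 30),
    Sample("sample 4", 5), Sample("sample 5", 50), Sample("sample 6", 8),
]
historical = select_historical(result_102e, third_window, first_window)
print([s.name for s in historical])
```

With these made-up timestamps, samples 4 and 6 fall outside the intersection, so the retained set mirrors data set 105e.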

更多的，在第一时间窗口101e对应的采样过程中，对未交集时间窗口内所产生的数据进行采样所得到的采样结果可以是采样结果103e。此时，可以将采样结果103e中的采样数据作为上述目标采样数据。如图5所示，假设此处采样结果103e中包括采样数据7、采样数据8和采样数据9。并且采样数据7、采样数据8和采样数据9对应的采样率为第一采样率。若是需要得到第一时间窗口101e内最终的采样结果，则需要对上述目标采样数据和历史采样数据进行合并。合并的时候，需要将目标采样数据和历史采样数据的采样率统一到一致。如图5中的数据集合106e所示，统一目标采样数据与历史采样数据之间的采样率之后，还剩下采样数据1、采样数据7、采样数据8和采样数据9，该采样数据1、采样数据7、采样数据8和采样数据9即为合并采样数据。并且，采样数据1、采样数据7、采样数据8和采样数据9对应的采样率为第三采样率(该第三采样率可以等于上述第一采样率)，即合并采样数据的采样率为第三采样率。More specifically, during the sampling process corresponding to the first time window 101e, the sampling result obtained by sampling the data generated within the non-overlapping time windows can be sampling result 103e. In this case, the sampled data in sampling result 103e can be used as the aforementioned target sampled data. As shown in Figure 5, assume that sampling result 103e includes sampled data 7, sampled data 8, and sampled data 9. Furthermore, the sampling rates corresponding to sampled data 7, sampled data 8, and sampled data 9 are the first sampling rate. If the final sampling result within the first time window 101e is required, the aforementioned target sampled data and historical sampled data need to be merged. During merging, the sampling rates of the target sampled data and historical sampled data need to be unified. As shown in data set 106e in Figure 5, after unifying the sampling rates between the target sampled data and historical sampled data, sampled data 1, sampled data 7, sampled data 8, and sampled data 9 remain. These sampled data 1, sampled data 7, sampled data 8, and sampled data 9 are the merged sampled data. Furthermore, the sampling rates corresponding to sampled data 1, sampled data 7, sampled data 8, and sampled data 9 are the third sampling rate (this third sampling rate can be equal to the first sampling rate mentioned above), that is, the sampling rate of the merged sampled data is the third sampling rate.
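The merging step in the Figure 5 example can be sketched as below. The sketch assumes that every sampling rate has the form 1/n and that a sample is retained when its hash value is divisible by n, which is the rule the later hash example in this description suggests; the tuple layout, the function names, and the concrete hash values are hypothetical.

```python
from fractions import Fraction


def keeps(hash_value, rate):
    """Retention rule assumed here: keep when hash % (1/rate) == 0."""
    return hash_value % rate.denominator == 0


def merge(target, first_rate, historical, second_rate):
    """Unify both sets to the lower sampling rate, then merge them."""
    third_rate = min(first_rate, second_rate)
    if first_rate > second_rate:
        target = [s for s in target if keeps(s[1], third_rate)]
    elif first_rate < second_rate:
        historical = [s for s in historical if keeps(s[1], third_rate)]
    return historical + target, third_rate


# (name, hash value) pairs; the hash values are invented for illustration.
historical = [("sample 1", 16), ("sample 2", 4), ("sample 3", 8), ("sample 5", 12)]
target = [("sample 7", 32), ("sample 8", 48), ("sample 9", 64)]

merged, third_rate = merge(target, Fraction(1, 16), historical, Fraction(1, 4))
print([name for name, _ in merged], third_rate)
```

Here the historical data was taken at 1/4 and the target data at 1/16, so only the historical set is thinned, and the third sampling rate equals the first sampling rate, as in data set 106e.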

可以理解的是，每个时间窗口内最终获取到的针对目标类型的数据的采样结果，可以是对多个线程在每个时间窗口内采样得到的目标类型的采样数据进行合并之后，所得到的。其中，对每个线程针对目标类型的数据的采样结果进行合并可以是由任意一个或者多个线程来执行。因此，上述在第一时间窗口以第一采样率对目标类型的数据进行采样，得到的目标采样数据，可以是由第一线程所采样得到的。第一线程可以将所采样得到的目标采样数据给到采样系统中的第二线程，由第二线程来对目标采样数据和历史采样数据进行合并，以得到合并采样数据。由此可知，在采样过程中，是由采样系统中的多个线程协同完成的采样工作，保证了采样的效率。It can be understood that the final sampling result for the target type of data within each time window may be obtained by merging the target-type sampled data collected by multiple threads within that time window. The merging of each thread's sampling results for the target type of data can be performed by any one or more threads. Therefore, the target sampled data obtained by sampling the target type of data at the first sampling rate in the first time window may be sampled by a first thread. The first thread can then pass the target sampled data to a second thread in the sampling system, and the second thread merges the target sampled data with the historical sampled data to obtain the merged sampled data. Thus, the sampling work is completed collaboratively by multiple threads in the sampling system, which helps ensure sampling efficiency.
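The hand-off between the first thread and the second thread can be illustrated with a minimal sketch. The use of `queue.Queue`, the function names, and the integer stand-ins for hash values are assumptions of this example; the application does not prescribe a particular hand-off mechanism.

```python
import queue
import threading


def sampler(raw_hashes, denominator, out_queue):
    """First thread: sample the raw data and hand the result over."""
    out_queue.put([h for h in raw_hashes if h % denominator == 0])


def merger(in_queue, historical, result):
    """Second thread: wait for the target sampled data, then merge it."""
    target = in_queue.get()  # blocks until the first thread delivers
    result.extend(historical + target)


handoff = queue.Queue()
merged = []
first_thread = threading.Thread(target=sampler, args=(range(0, 20), 4, handoff))
second_thread = threading.Thread(target=merger, args=(handoff, [100, 104], merged))
second_thread.start()
first_thread.start()
first_thread.join()
second_thread.join()
print(merged)
```

The sketch assumes that the historical data and the target data already share the same sampling rate (1/4 here), so the second thread can concatenate them directly.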

步骤S103，当合并采样数据的数量大于采样数量阈值时，基于自适应采样参数对合并采样数据进行采样，得到更新历史采样数据，采样数量阈值与目标类型的数据相关，自适应采样参数用于将合并采样数据的数量控制在采样数量阈值内；Step S103: When the number of merged sampled data is greater than the sampling number threshold, the merged sampled data is sampled based on the adaptive sampling parameter to obtain updated historical sampled data. The sampling number threshold is related to the target type of data. The adaptive sampling parameter is used to control the number of merged sampled data within the sampling number threshold.

具体的，当检测到上述合并采样数据的数量大于采样数量阈值时，可以通过自适应采样参数再次对合并采样数据进行采样，以得到更新历史采样数据。不同类型的数据可以对应于不同的采样数量阈值。某类数据对应的采样数量阈值，表示最终采样得到的该类数据的采样数据的数量不会超过(小于或者等于)其对应的采样数量阈值。因此，目标类型的数据也对应有一个采样数量阈值。Specifically, when the number of merged sampled data points exceeds the sampling number threshold, the merged sampled data can be sampled again using adaptive sampling parameters to obtain updated historical sampled data. Different types of data can correspond to different sampling number thresholds. The sampling number threshold for a certain type of data indicates that the number of sampled data points of that type obtained in the final sampling will not exceed (be less than or equal to) its corresponding sampling number threshold. Therefore, the target type of data also corresponds to a sampling number threshold.

其中，上述自适应采样参数用于将合并采样数据的数量控制在采样数量阈值内，该自适应采样参数可以取为1/2。原因为：此处还是以目标类型的数据对应的采样过程为例进行说明。由于在对目标类型的数据进行采样的过程中，每次采样到的采样数据的数量都会被控制在目标类型对应的采样数量阈值内，因此，上述目标采样数据的数量以及历史采样数据的数量均会小于或者等于采样数量阈值。那么，假设目标类型的数据对应的采样数量阈值为M，由此可以知道，对目标采样数据和历史采样数据进行合并之后(可以是采样之后合并，也可以是直接合并)，所得到的合并采样数据的数量会小于或者等于采样数量阈值的2倍(即小于或者等于2M)，因为2个小于或者等于M的值相加必然会小于或者等于2M。因此，将自适应采样参数取为1/2，再通过该自适应采样参数对合并采样数据进行1/2采样，会使得采样后的合并采样数据的数量小于或者等于M。其中，以1/2的采样率再次对合并采样数据进行采样，也就是过滤掉合并采样数据中一半的采样数据，保留下合并采样数据中一半的采样数据，得到更新历史采样数据(即采样后的合并采样数据)，由上可知，更新历史采样数据的数量不会超过采样数量阈值M。The aforementioned adaptive sampling parameter is used to keep the number of merged sampled data within the sampling number threshold, and it can be set to 1/2. The reason is as follows, again taking the sampling process for the target type of data as an example. During sampling of the target type of data, the number of sampled data obtained each time is kept within the sampling number threshold corresponding to the target type, so both the number of target sampled data and the number of historical sampled data are less than or equal to that threshold. Assuming the sampling number threshold for the target type of data is M, then after the target sampled data and the historical sampled data are merged (either merged after sampling or merged directly), the number of merged sampled data is less than or equal to twice the threshold (i.e., at most 2M), because the sum of two values that are each less than or equal to M is necessarily less than or equal to 2M. Therefore, setting the adaptive sampling parameter to 1/2 and sampling the merged sampled data at 1/2 brings the number of merged sampled data after sampling to at most M. Sampling the merged sampled data again at a rate of 1/2 means filtering out half of the merged sampled data and retaining the other half, which yields the updated historical sampled data (i.e., the merged sampled data after sampling). It follows that the number of updated historical sampled data does not exceed the sampling number threshold M.

其中，可以将更新历史采样数据的采样率称之为第四采样率，该第四采样率为更新历史采样数据相对于目标类型的数据的采样率，换句话说，该第四采样率为更新历史采样数据针对所对应的原始业务数据(未被采样过)的采样率。该第四采样率等于合并采样数据的第三采样率与自适应采样参数之间的乘积，例如，第三采样率为1/8，自适应采样参数为1/2，那么第四采样率就为1/16。可以通过第四采样率对合并采样数据进行采样得到更新历史采样数据，例如，当第四采样率为1/16，则可以用合并采样数据的数据标识字符串对应的哈希值除以16，若余数为0，则保留下对应的合并采样数据，若余数不为0，则丢弃对应的合并采样数据。The sampling rate of the updated historical sampled data can be referred to as the fourth sampling rate. The fourth sampling rate is the sampling rate of the updated historical sampled data relative to the target type of data; in other words, it is the sampling rate of the updated historical sampled data relative to the corresponding original business data (which has not been sampled). The fourth sampling rate is equal to the product of the third sampling rate of the merged sampled data and the adaptive sampling parameter. For example, if the third sampling rate is 1/8 and the adaptive sampling parameter is 1/2, then the fourth sampling rate is 1/16. The updated historical sampled data can be obtained by sampling the merged sampled data at the fourth sampling rate. For example, when the fourth sampling rate is 1/16, the hash value corresponding to the data identifier string of each merged sampled data can be divided by 16; if the remainder is 0, the corresponding merged sampled data is retained, and if the remainder is not 0, it is discarded.
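The hash-and-remainder rule described above can be sketched as follows. The choice of SHA-256 truncated to 64 bits is purely an assumption of this example, since the application does not name a hash function; `adaptive_resample` and the `record-N` identifiers are likewise hypothetical, and the sketch assumes every rate has the form 1/n.

```python
import hashlib
from fractions import Fraction

ADAPTIVE_PARAMETER = Fraction(1, 2)


def hash_value(identifier):
    """Map a data identifier string to an integer (assumed hash: SHA-256, first 8 bytes)."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def adaptive_resample(merged_ids, third_rate, threshold):
    """Return (updated historical data, its sampling rate) for step S103."""
    if len(merged_ids) <= threshold:
        # Within the threshold: keep the merged data and its rate unchanged.
        return list(merged_ids), third_rate
    fourth_rate = third_rate * ADAPTIVE_PARAMETER
    modulus = fourth_rate.denominator  # e.g. 16 when the fourth rate is 1/16
    kept = [i for i in merged_ids if hash_value(i) % modulus == 0]
    return kept, fourth_rate


ids = [f"record-{n}" for n in range(10)]
kept, fourth_rate = adaptive_resample(ids, Fraction(1, 8), threshold=5)
print(fourth_rate, kept)
```

Because the data that reached the merged set at 1/8 already satisfies `hash % 8 == 0` under this rule, testing `hash % 16 == 0` retains roughly half of it for a uniformly distributed hash.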

更多的，上述所得到的更新历史采样数据可以是第一时间窗口对应的采样结果，若第二时间窗口为第一时间窗口的下一个时间窗口，则可以将更新历史采样数据中在第一时间窗口与第二时间窗口之间的交集时间窗口中采样得到的采样数据，作为第二时间窗口对应的采样结果中的部分采样数据。因此，对于第二时间窗口内的采样过程而言，第二时间窗口中的历史采样数据可以是第一时间窗口中，在第一时间窗口和第二时间窗口之间的交集时间窗口内所采样得到的。第二时间窗口中的目标采样数据可以是，在第二时间窗口中除了与第一时间窗口之间的交集时间窗口之外的时间窗口内所采样得到的目标类型的采样数据。因此，在第二时间窗口中除了与第一时间窗口之间的交集时间窗口之外的时间窗口内读取新的目标类型的数据，并对读取到的目标类型的数据进行采样时，可以直接用第四采样率(第一时间窗口中的更新历史采样数据的采样率)作为第二时间窗口内的第一采样率对读取到的新的目标类型的数据进行采样，以得到目标采样数据。通过此种方式，在第二时间窗口中，由于读取采样所得到的目标采样数据与第二时间窗口中的历史采样数据的采样率相同，因此，可以将它们直接合并起来，得到第二时间窗口内的合并采样数据，提高了对第二时间窗口中目标采样数据和历史采样数据的合并效率，同时节省了系统资源，因为不用再对目标采样数据或者历史采样数据进行采样之后再合并，并且以这种方式实现了采样率的自适应调整。之后，可以通过第二时间窗口内的合并采样数据得到第二时间窗口内的更新历史采样数据，第二时间窗口内的更新历史采样数据可以是第二时间窗口对应的采样结果。Furthermore, the updated historical sampling data obtained above can be the sampling result corresponding to the first time window. If the second time window is the next time window after the first time window, the sampling data obtained from the intersection time window between the first and second time windows in the updated historical sampling data can be used as part of the sampling data in the sampling result corresponding to the second time window. Therefore, for the sampling process within the second time window, the historical sampling data in the second time window can be the sampling data obtained in the first time window within the intersection time window between the first and second time windows. The target sampling data in the second time window can be the target type sampling data obtained in the second time window excluding the intersection time window with the first time window. 
Therefore, when reading new target type data in the second time window excluding the intersection time window with the first time window, and sampling the read target type data, the fourth sampling rate (the sampling rate of the updated historical sampling data in the first time window) can be directly used as the first sampling rate in the second time window to sample the read new target type data to obtain the target sampling data. In this way, within the second time window, since the target sampled data obtained from the sampling has the same sampling rate as the historical sampled data within the second time window, they can be directly merged to obtain the merged sampled data within the second time window. This improves the merging efficiency of the target and historical sampled data within the second time window, while saving system resources because it eliminates the need to sample the target or historical sampled data before merging. Furthermore, this method achieves adaptive adjustment of the sampling rate. Subsequently, the updated historical sampled data within the second time window can be obtained from the merged sampled data within the second time window; the updated historical sampled data within the second time window can be the sampling results corresponding to the second time window.
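The way the sampling rate carries over from one time window to the next can be sketched as below. The sketch assumes rates of the form 1/n, uses plain integers in place of hash values, and assumes the `history` argument has already been restricted to the intersection time window; `process_window` is a hypothetical name.

```python
from fractions import Fraction


def process_window(history, rate, new_hashes, threshold):
    """Sample new data at the inherited rate, merge directly, halve the rate if needed."""
    # New data is sampled at the rate of the previous window's updated historical data,
    # so it can be merged with the history without any further rate unification.
    target = [h for h in new_hashes if h % rate.denominator == 0]
    merged = history + target
    if len(merged) > threshold:
        rate = rate * Fraction(1, 2)
        merged = [h for h in merged if h % rate.denominator == 0]
    return merged, rate


history = [2, 4]  # retained from the previous window at rate 1/2
merged, rate = process_window(history, Fraction(1, 2), range(6, 15), threshold=4)
print(merged, rate)
```

In this made-up run the merged set has seven items, which exceeds the threshold of four, so the rate drops from 1/2 to 1/4 and the next window would start sampling at 1/4.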

步骤S104，将采样数据库中的历史采样数据替换为更新历史采样数据；Step S104: Replace the historical sampling data in the sampling database with the updated historical sampling data;

具体的，可以将上述采样数据库中的历史采样数据替换为上述所得到的更新历史采样数据。此时采样数据库中的更新历史采样数据即为对目标采样数据和历史采样数据进行合并之后所得到的最终的采样数据。Specifically, the historical sampling data in the aforementioned sampling database can be replaced with the updated historical sampling data obtained above. The updated historical sampling data in the sampling database at this point is the final sampling data obtained after merging the target sampling data and the historical sampling data.

其中，若上述合并采样数据的数量小于或者等于采样数量阈值时，则无需再次对合并采样数据进行采样，可以将采样数据库中的历史采样数据直接替换为合并采样数据。此时采样数据库中的合并采样数据即为将目标采样数据和历史采样数据进行合并所得到的最终的采样数据。因此通过本申请所提供的方法，可以通过采样数量阈值和自适应采样参数，保证将采样数据库中最终所存储的采样数据的数量控制在采样数量阈值内，且在此件过程中，采样数据的采样率也是根据对应的采样数量阈值以及自适应采样参数进行自适应变化的。If the number of merged sampled data is less than or equal to the sampling quantity threshold, there is no need to sample the merged sampled data again. The historical sampled data in the sampling database can be directly replaced with the merged sampled data. In this case, the merged sampled data in the sampling database is the final sampled data obtained by merging the target sampled data and the historical sampled data. Therefore, the method provided in this application can ensure that the number of sampled data ultimately stored in the sampling database is controlled within the sampling quantity threshold by using the sampling quantity threshold and adaptive sampling parameters. Furthermore, the sampling rate of the sampled data also adaptively changes according to the corresponding sampling quantity threshold and adaptive sampling parameters during this process.

请参见图6，是本申请提供的一种获取更新历史采样数据的流程示意图。如图6所示，当目标采样数据对应的第一采样率小于历史采样数据对应的第二采样率时，可以执行步骤100b，即可以使用第一采样率对历史采样数据进行采样，得到采样之后的历史采样数据。之后，可执行步骤101b，即将采样之后的历史采样数据和目标采样数据进行合并，得到合并采样数据。当合并采样数据的总数量大于采样数量阈值时，可以执行步骤102b(容量把控，即通过采样数量阈值控制采样数据库中所存储的数据的数量)，即可以通过采样率递增系数对合并采样数据进行采样。此申请中，采样率递增系数为1/2，即可以对合并采样数据再次进行1/2的采样，得到更新历史采样数据。请参见图7，是本申请提供的另一种获取更新历史采样数据的流程示意图。如图7所示，当目标采样数据对应的第一采样率大于历史采样数据对应的第二采样率时，可以执行步骤100c，即可以使用第二采样率对目标采样数据进行采样，得到采样之后的目标采样数据。之后，可执行步骤101c，即将采样之后的目标采样数据和历史采样数据进行合并，得到合并采样数据。当合并采样数据的总数量小于或者等于采样数量阈值时，可以执行步骤102b，即直接将合并采样数据作为更新历史采样数据。Please refer to Figure 6, which is a flowchart illustrating one method for obtaining updated historical sampling data provided in this application. As shown in Figure 6, when the first sampling rate corresponding to the target sampling data is less than the second sampling rate corresponding to the historical sampling data, step 100b can be executed, that is, the historical sampling data can be sampled using the first sampling rate to obtain the sampled historical sampling data. Then, step 101b can be executed, that is, the sampled historical sampling data and the target sampling data are merged to obtain merged sampling data. When the total number of merged sampling data exceeds the sampling quantity threshold, step 102b (capacity control, i.e., controlling the amount of data stored in the sampling database through the sampling quantity threshold) can be executed, that is, the merged sampling data can be sampled using a sampling rate increment coefficient. In this application, the sampling rate increment coefficient is 1/2, that is, the merged sampling data can be sampled again by 1/2 to obtain updated historical sampling data. Please refer to Figure 7, which is another flowchart illustrating another method for obtaining updated historical sampling data provided in this application. 
As shown in Figure 7, when the first sampling rate corresponding to the target sampling data is greater than the second sampling rate corresponding to the historical sampling data, step 100c can be executed, that is, the target sampling data can be sampled using the second sampling rate to obtain the sampled target sampling data. Then, step 101c can be executed, that is, the sampled target sampling data and the historical sampling data are merged to obtain merged sampling data. When the total number of merged sampling data is less than or equal to the sampling quantity threshold, step 102b can be executed, that is, the merged sampling data is directly used as the updated historical sampling data.

通过上述过程可以知道，对于每一个时间窗口而言，其都会对应有一个采样结果，若时间窗口的窗口长度为60分钟，则每个时间窗口对应的采样结果即是对该时间窗口对应的60分钟的时间段内产生的数据进行采样所得到的结果。后一个时间窗口均可以直接使用在前一个时间窗口中获取到的交集时间窗口内所采样得到的数据，该数据加上该后一个时间窗口在对应的非交集时间窗口内所获取到的采样结果所包含的数据，即为该后一个时间窗口对应的采样结果。可以将每一个时间窗口所得到的采样结果都输出来，在用户终端上进行显示，每一个时间窗口对应于一个采样状态(通过采样结果表征)，通过此种方式，可以直观地查看到随着时间窗口的推移，前后时间窗口所采样得到的采样结果的变化。输出的每一个时间窗口所对应的采样结果中可以包括多种不同类型的数据分别对应的采样结果，以及对每种数据进行采样所得到的数据的数量和对每种数据进行采样的采样率。不同类型的数据可以对应有不同的采样率和采样数量阈值，但是同一类型的数据一定是被等概率(即等采样率，也就是相同的采样率)进行采样的。As described above, each time window corresponds to a sampling result. If the window length is 60 minutes, the sampling result for each time window is the result obtained by sampling the data generated within that 60-minute time period. Subsequent time windows can directly use the data sampled from the intersection time windows of the previous time window. This data, plus the data included in the sampling results obtained from the corresponding non-intersection time windows of the subsequent time window, constitutes the sampling result for that subsequent time window. The sampling results obtained from each time window can be output and displayed on the user terminal. Each time window corresponds to a sampling state (represented by the sampling results). In this way, the changes in the sampling results obtained from previous and subsequent time windows can be visually observed as the time window progresses. The output sampling results for each time window can include sampling results corresponding to various data types, as well as the number of data points obtained from sampling each type of data and the sampling rate for each type of data. Different types of data can correspond to different sampling rates and sampling number thresholds, but data of the same type must be sampled with equal probability (i.e., equal sampling rate, that is, the same sampling rate).

请参见图8，是本申请提供的另一种数据采样的场景示意图。如图8所示，计算层中包括多个采样设备，每个采样设备中均可以包括多个线程。该多个采样设备可以读取数据层中的原始业务数据，即可以由数据层提供用于采样的数据。数据层中可以提供若干种类型的需要被采样的数据，具体可以包括点击日志、曝光日志、效果日志、广告词表、订单词表、资源词表等类型的采样数据。上述多个采样设备可以按照每个时间窗口(包括时间窗口1、时间窗口2、时间窗口3、……、时间窗口n)的时间先后顺序，依次对数据层中的数据进行读取和采样。在输出了上一个时间窗口对应的采样结果之后，可以进行下一个时间窗口对应的采样过程。时间窗口之间可以相互有交集，例如，当时间窗口的窗口长度为60分钟(即1小时)，时间窗口平移的时间为10分钟，即每隔10分钟进行一次全局的采样，则时间窗口1对应的采样时间段可以是12：00-13:00(即参与采样的数据是在12：00-13:00之间所生成的)，时间窗口2对应的采样时间段可以是12:10-13:10，时间窗口3对应的采样时间段可以是12:20-13:20，以此类推。上述时间窗口1与时间窗口2之间的交集时间段即是12:10-13:00。通过对每个时间窗口内所产生的数据进行采样可以得到每个时间窗口对应的采样结果。其中，采样设备对数据层中的原始业务数据进行读取和采样的过程大致可以是：首先，采样设备可以对从数据层中所读取到的原始业务数据进行格式解析(即数据接入格式解析，也就是解析出原始业务数据中的字段信息)。接着，通过过滤器可以对所解析出的原始业务数据中的多个字段信息进行过滤，即将不需要的字段信息过滤掉(即数据过滤字段挑选，可以通过预设的过滤规则进行过滤)。接着，通过聚合函数对所读取到的数据进行预聚合，即将属于同一类型的数据归类到一起。接着，可以通过词表关联操作在解析出的原始业务数据中的多个字段信息中添加需要添加的字段信息(可以根据预设添加规则进行添加)。接着，就是进行逻辑计算，逻辑计算就是对所读取到的经过字段过滤和词表关联的数据进行采样的过程。结果输出即是输出采样所得到的采样结果，每个时间窗口对应于一个采样结果。可以将最终所得到的采样结果通过存储层进行存储。其中，存储层中用于存储数据的结构可以包括分布文件系统(例如HDFS)、日志型开源数据库(例如Redis)、关系型数据库(例如MySql)以及结构化数据库(例如Hbase)。Please refer to Figure 8, which is a schematic diagram of another data sampling scenario provided in this application. As shown in Figure 8, the computing layer includes multiple sampling devices, each of which can include multiple threads. These multiple sampling devices can read the raw business data in the data layer, that is, the data provided by the data layer for sampling. The data layer can provide several types of data to be sampled, specifically including click logs, exposure logs, performance logs, advertising keyword lists, order keyword lists, resource keyword lists, etc. The aforementioned multiple sampling devices can sequentially read and sample the data in the data layer according to the time sequence of each time window (including time window 1, time window 2, time window 3, ..., time window n). After outputting the sampling results corresponding to the previous time window, the sampling process corresponding to the next time window can be performed. 
Time windows can overlap. For example, if the window length is 60 minutes (1 hour) and the time window shift is 10 minutes (meaning global sampling occurs every 10 minutes), then the sampling period for time window 1 could be 12:00-13:00 (i.e., the data sampled was generated between 12:00 and 13:00), the sampling period for time window 2 could be 12:10-13:10, the sampling period for time window 3 could be 12:20-13:20, and so on. The overlap between time window 1 and time window 2 is 12:10-13:00. By sampling the data generated within each time window, the sampling result for each time window can be obtained. The process of the sampling device reading and sampling the raw business data from the data layer can be roughly as follows: First, the sampling device can perform format parsing on the raw business data read from the data layer (i.e., data access format parsing, which means parsing out the field information in the raw business data). Next, filters can be used to filter multiple fields in the parsed raw business data, removing unnecessary fields (i.e., data filtering field selection, which can be done using preset filtering rules). Then, aggregation functions are used to pre-aggregate the read data, grouping data of the same type together. Next, lexicon join operations can be used to add necessary fields to the multiple fields in the parsed raw business data (addition can be done according to preset rules). Following this, logical calculations are performed, which involves sampling the data after field filtering and lexicon joins. The output is the sampling result, with each time window corresponding to one sampling result. The final sampling results can be stored in a storage layer. The storage layer can contain structures such as distributed file systems (e.g., HDFS), log-based open-source databases (e.g., Redis), relational databases (e.g., MySQL), and structured databases (e.g., HBase).
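The sliding time windows in the example above (60-minute length, 10-minute shift) can be reproduced with a short sketch. The calendar date and the helper names are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def sliding_windows(start, length, shift, count):
    """Return `count` (start, end) windows, each shifted by `shift` from the previous one."""
    return [(start + i * shift, start + i * shift + length) for i in range(count)]


def overlap(window_a, window_b):
    """Return the intersection of two windows, or None if they do not overlap."""
    begin = max(window_a[0], window_b[0])
    end = min(window_a[1], window_b[1])
    return (begin, end) if begin < end else None


start = datetime(2021, 1, 1, 12, 0)  # arbitrary date, 12:00
windows = sliding_windows(start, timedelta(minutes=60), timedelta(minutes=10), 3)
shared = overlap(windows[0], windows[1])
print(shared[0].strftime("%H:%M"), shared[1].strftime("%H:%M"))
```

The intersection of time window 1 and time window 2 comes out as 12:10-13:00, so only the data generated in 13:00-13:10 needs to be newly read for time window 2.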

请参见图9，是本申请提供的一种数据采样装置的结构示意图。该数据采样装置可以是运行于计算机设备中的一个计算机程序(包括程序代码)，例如该数据采样装置为一个应用软件，该数据采样装置可以用于执行本申请实施例提供的方法中的相应步骤。如图9所示，该数据采样装置1可以包括：采样模块11、合并模块12、自适应采样模块13和替换模块14；Please refer to Figure 9, which is a schematic diagram of the structure of a data sampling device provided in this application. This data sampling device can be a computer program (including program code) running on a computer device, for example, the data sampling device is an application software, and the data sampling device can be used to execute the corresponding steps in the method provided in the embodiments of this application. As shown in Figure 9, the data sampling device 1 may include: a sampling module 11, a merging module 12, an adaptive sampling module 13, and a replacement module 14;

采样模块11，用于在第一时间窗口以第一采样率对目标类型的数据进行采样，得到目标采样数据；Sampling module 11 is used to sample data of the target type at a first sampling rate in a first time window to obtain target sampled data;

合并模块12，用于对目标采样数据和采样数据库中的历史采样数据进行合并，得到合并采样数据，其中，合并采样数据的第三采样率是根据第一采样率和历史采样数据的第二采样率所确定的，历史采样数据为在得到目标采样数据之前对目标类型的数据采样获取到的目标类型的采样数据，第二采样率是历史采样数据相对于目标类型的数据的采样率，第三采样率是合并采样数据相对于目标类型的数据的采样率；The merging module 12 is used to merge the target sampling data and the historical sampling data in the sampling database to obtain merged sampling data. The third sampling rate of the merged sampling data is determined based on the first sampling rate and the second sampling rate of the historical sampling data. The historical sampling data is the target type sampling data obtained by sampling the target type data before obtaining the target sampling data. The second sampling rate is the sampling rate of the historical sampling data relative to the target type data. The third sampling rate is the sampling rate of the merged sampling data relative to the target type data.

自适应采样模块13，用于当合并采样数据的数量大于采样数量阈值时，基于自适应采样参数对合并采样数据进行采样，得到更新历史采样数据，采样数量阈值与目标类型的数据相关，自适应采样参数用于将合并采样数据的数量控制在采样数量阈值内；The adaptive sampling module 13 is used to sample the merged sampled data based on the adaptive sampling parameters when the number of merged sampled data is greater than the sampling number threshold, so as to obtain updated historical sampled data. The sampling number threshold is related to the target type of data, and the adaptive sampling parameters are used to control the number of merged sampled data within the sampling number threshold.

替换模块14，用于将采样数据库中的历史采样数据替换为更新历史采样数据。Replacement module 14 is used to replace historical sampling data in the sampling database with updated historical sampling data.

其中，采样模块11、合并模块12、自适应采样模块13和替换模块14的具体功能实现方式请参见图3对应的实施例中的步骤S101-步骤S104，这里不再进行赘述。The specific functional implementation of the sampling module 11, merging module 12, adaptive sampling module 13 and replacement module 14 can be found in steps S101-S104 of the embodiment corresponding to Figure 3, and will not be repeated here.

其中，数据采样装置1，还用于：The data sampling device 1 is also used for:

其中，合并模块12，包括：第一比值确定单元121、第一比值采样单元122和第一合并单元123；The merging module 12 includes: a first ratio determination unit 121, a first ratio sampling unit 122, and a first merging unit 123;

第一比值确定单元121，用于当第一采样率大于第二采样率时，将第二采样率与第一采样率之间的比值确定为第一比值采样率；The first ratio determination unit 121 is used to determine the ratio between the second sampling rate and the first sampling rate as the first ratio sampling rate when the first sampling rate is greater than the second sampling rate;

第一比值采样单元122，用于根据第一比值采样率，对目标采样数据进行采样，得到采样后的目标采样数据；The first ratio sampling unit 122 is used to sample the target sampling data according to the first ratio sampling rate to obtain the sampled target sampling data.

第一合并单元123，用于将采样后的目标采样数据和历史采样数据，确定为合并采样数据。The first merging unit 123 is used to determine the sampled target sampling data together with the historical sampling data as the merged sampling data.

其中，第一比值确定单元121、第一比值采样单元122和第一合并单元123的具体功能实现方式请参见图3对应的实施例中的步骤S101-步骤S104，这里不再进行赘述。The specific functional implementation of the first ratio determination unit 121, the first ratio sampling unit 122, and the first merging unit 123 can be found in steps S101-S104 of the embodiment corresponding to Figure 3, and will not be repeated here.

其中，合并模块12，包括：第二比值确定单元124、第二比值采样单元125、第二合并单元126和第三合并单元127；The merging module 12 includes: a second ratio determination unit 124, a second ratio sampling unit 125, a second merging unit 126, and a third merging unit 127.

第二比值确定单元124，用于当第一采样率小于第二采样率时，将第一采样率与第二采样率之间的比值确定为第二比值采样率；The second ratio determination unit 124 is used to determine the ratio between the first sampling rate and the second sampling rate as the second ratio sampling rate when the first sampling rate is less than the second sampling rate.

第二比值采样单元125，用于根据第二比值采样率，对历史采样数据进行采样，得到采样后的历史采样数据；The second ratio sampling unit 125 is used to sample historical sampling data according to the second ratio sampling rate to obtain the sampled historical sampling data.

第二合并单元126，用于将采样后的历史采样数据和目标采样数据，确定为合并采样数据；The second merging unit 126 is used to determine the sampled historical sampling data and the target sampling data as the merged sampling data;

第三合并单元127，用于当第一采样率等于第二采样率时，将目标采样数据和历史采样数据确定为合并采样数据。The third merging unit 127 is used to determine the target sampling data and historical sampling data as merged sampling data when the first sampling rate is equal to the second sampling rate.

其中，第二比值确定单元124、第二比值采样单元125、第二合并单元126和第三合并单元127的具体功能实现方式请参见图3对应的实施例中的步骤S102，这里不再进行赘述。The specific functional implementation of the second ratio determination unit 124, the second ratio sampling unit 125, the second merging unit 126, and the third merging unit 127 can be found in step S102 of the embodiment corresponding to Figure 3, and will not be repeated here.
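The ratio sampling rates handled by the merging units can be sketched as follows. The function name and the returned tuple layout are hypothetical; the arithmetic follows the unit descriptions above.

```python
from fractions import Fraction


def ratio_sampling_rate(first_rate, second_rate):
    """Return (data set to resample, ratio sampling rate, resulting third sampling rate)."""
    if first_rate > second_rate:
        # First ratio sampling rate: thin the target sampled data.
        return "target", second_rate / first_rate, second_rate
    if first_rate < second_rate:
        # Second ratio sampling rate: thin the historical sampled data.
        return "historical", first_rate / second_rate, first_rate
    # Equal rates: merge directly, nothing to resample.
    return None, Fraction(1), first_rate


print(ratio_sampling_rate(Fraction(1, 4), Fraction(1, 16)))
```

For a first sampling rate of 1/4 and a second sampling rate of 1/16, the target sampled data is resampled at 1/4, which brings its overall rate to 1/4 × 1/4 = 1/16 and matches the historical sampled data.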

其中，采样模块11，包括：采样率确定单元111和窗口采样单元112；The sampling module 11 includes: a sampling rate determination unit 111 and a window sampling unit 112;

采样率确定单元111，用于将历史采样数据的第二采样率确定为第一采样率；The sampling rate determination unit 111 is used to determine the second sampling rate of historical sampling data as the first sampling rate;

窗口采样单元112，用于在第一时间窗口内，采用第一采样率对目标类型的数据进行采样，得到目标采样数据；The window sampling unit 112 is used to sample the target type data using a first sampling rate within a first time window to obtain target sampled data;

其中，采样率确定单元111和窗口采样单元112的具体功能实现方式请参见图3对应的实施例中的步骤S101，这里不再进行赘述。The specific functional implementation of the sampling rate determination unit 111 and the window sampling unit 112 can be found in step S101 of the embodiment corresponding to Figure 3, and will not be repeated here.

其中，自适应采样模块13，包括：字符串获取单元131、映射单元132和合并采样单元133；The adaptive sampling module 13 includes: a string acquisition unit 131, a mapping unit 132, and a merging sampling unit 133;

字符串获取单元131，用于当合并采样数据的数量大于采样数量阈值时，获取合并采样数据对应的数据标识字符串；The string acquisition unit 131 is used to acquire the data identifier string corresponding to the merged sampled data when the number of merged sampled data is greater than the sampling number threshold.

映射单元132，用于将数据标识字符串映射到均匀采样空间中，得到数据标识字符串对应的哈希值；The mapping unit 132 is used to map the data identifier string to the uniform sampling space to obtain the hash value corresponding to the data identifier string;

合并采样单元133，用于根据第三采样率、自适应采样参数以及哈希值，对合并采样数据进行采样，得到更新历史采样数据。The merging sampling unit 133 is used to sample the merged sampling data according to the third sampling rate, the adaptive sampling parameter and the hash value to obtain updated historical sampling data.

其中，字符串获取单元131、映射单元132和合并采样单元133的具体功能实现方式请参见图3对应的实施例中的步骤S103，这里不再进行赘述。The specific functional implementation of the string acquisition unit 131, the mapping unit 132 and the merging sampling unit 133 can be found in step S103 of the embodiment corresponding to Figure 3, and will not be repeated here.

其中，合并采样单元133，包括：自适应采样率子单元1331和合并采样子单元1332；The merging sampling unit 133 includes: an adaptive sampling rate subunit 1331 and a merging sampling subunit 1332;

自适应采样率子单元1331，用于根据第三采样率和自适应采样参数，得到第四采样率；The adaptive sampling rate subunit 1331 is used to obtain the fourth sampling rate based on the third sampling rate and the adaptive sampling parameters;

合并采样子单元1332，用于基于第四采样率和哈希值，对合并采样数据进行采样，得到更新历史采样数据；第四采样率是更新历史采样数据相对于目标类型的数据的采样率。The merged sampling subunit 1332 is used to sample the merged sampled data based on the fourth sampling rate and the hash value to obtain updated historical sampled data; the fourth sampling rate is the sampling rate of the updated historical sampled data relative to the target type of data.

其中，自适应采样率子单元1331和合并采样子单元1332的具体功能实现方式请参见图3对应的实施例中的步骤S103，这里不再进行赘述。The specific functional implementation of the adaptive sampling rate subunit 1331 and the merging sampling subunit 1332 can be found in step S103 of the embodiment corresponding to Figure 3, and will not be repeated here.

其中，数据采样装置1，具体还用于：Specifically, the data sampling device 1 is also used for:

采样模块12，具体还用于：Sampling module 12 is also specifically used for:

则，合并模块12，具体还用于：Then, the merging module 12 is specifically further used for:

其中，采样模块11，包括：数据读取单元113、过滤单元114、关联单元115和目标采样单元116；The sampling module 11 includes: a data reading unit 113, a filtering unit 114, an association unit 115, and a target sampling unit 116.

数据读取单元113，用于在第一时间窗口从数据流中读取目标类型的数据，对读取到的目标类型的数据进行字段解析，得到初始解析数据；Data reading unit 113 is used to read data of target type from the data stream in the first time window, and to parse the fields of the read target type data to obtain initial parsed data;

过滤单元114，用于基于过滤机制对初始解析数据中的多个字段信息进行过滤，得到过滤解析数据；Filtering unit 114 is used to filter multiple fields of information in the initial parsed data based on the filtering mechanism to obtain filtered parsed data;

关联单元115，用于基于词表关联机制在过滤解析数据中添加关联字段信息，得到采样业务数据；The association unit 115 is used to add association field information to the filtered parsed data based on the vocabulary association mechanism to obtain the sampled business data;

目标采样单元116，用于以第一采样率对采样业务数据进行采样，得到目标采样数据。The target sampling unit 116 is used to sample the sampling service data at a first sampling rate to obtain target sampling data.

其中，数据读取单元113、过滤单元114、关联单元115和目标采样单元116的具体功能实现方式请参见图3对应的实施例中的步骤S101，这里不再进行赘述。The specific functional implementation of the data reading unit 113, the filtering unit 114, the association unit 115 and the target sampling unit 116 can be found in step S101 of the embodiment corresponding to Figure 3, and will not be repeated here.
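The chain of units 113 to 116 (field parsing, field filtering, lexicon association, sampling) can be sketched as below. The `key=value;...` record format, the field names, the lexicon contents, and the use of the numeric `id` field as a stand-in for a hash value are all assumptions made for this example.

```python
def parse(line):
    """Field parsing: split a 'key=value;key=value' record into a dict."""
    return dict(pair.split("=", 1) for pair in line.split(";"))


def filter_fields(record, wanted):
    """Field filtering: drop every field that is not in `wanted`."""
    return {k: v for k, v in record.items() if k in wanted}


def join_lexicon(record, lexicon, key):
    """Lexicon association: add the fields looked up by record[key]."""
    enriched = dict(record)
    enriched.update(lexicon.get(record[key], {}))
    return enriched


lines = ["id=8;ad=a1;ua=x", "id=3;ad=a2;ua=y", "id=16;ad=a1;ua=z"]
lexicon = {"a1": {"advertiser": "Acme"}, "a2": {"advertiser": "Globex"}}

records = [
    join_lexicon(filter_fields(parse(line), {"id", "ad"}), lexicon, "ad")
    for line in lines
]
# Sampling at an assumed first sampling rate of 1/8; the numeric id stands in for a hash.
target = [r for r in records if int(r["id"]) % 8 == 0]
print(target)
```

In a real deployment the retention test would presumably use the hash of a data identifier string rather than the raw `id`, as described for step S103.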

其中，采样模块11，具体还用于：The sampling module 11 is further used for:

请参见图10，是本申请提供的一种计算机设备的结构示意图。如图10所示，计算机设备1000可以包括：处理器1001，网络接口1004和存储器1005，此外，计算机设备1000还可以包括：用户接口1003，和至少一个通信总线1002。其中，通信总线1002用于实现这些组件之间的连接通信。其中，用户接口1003可以包括显示屏(Display)、键盘(Keyboard)，可选用户接口1003还可以包括标准的有线接口、无线接口。网络接口1004可选的可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器1005可以是高速RAM存储器，也可以是非不稳定的存储器(non-volatile memory)，例如至少一个磁盘存储器。存储器1005可选的还可以是至少一个位于远离前述处理器1001的存储装置。如图10所示，作为一种计算机存储介质的存储器1005中可以包括操作系统、网络通信模块、用户接口模块以及设备控制应用程序。Please refer to Figure 10, which is a schematic diagram of the structure of a computer device provided in this application. As shown in Figure 10, the computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005. Furthermore, the computer device 1000 may also include: a user interface 1003, and at least one communication bus 1002. The communication bus 1002 is used to realize communication between these components. The user interface 1003 may include a display screen and a keyboard; optionally, the user interface 1003 may also include a standard wired interface or a wireless interface. The network interface 1004 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk storage device. Optionally, the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001. As shown in Figure 10, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.

In the computer device 1000 shown in Figure 10, the network interface 1004 provides network communication functions; the user interface 1003 mainly provides an input interface for the user; and the processor 1001 may be used to invoke the device control application stored in the memory 1005 to carry out the data sampling method described in the embodiment corresponding to Figure 3 above. It should be understood that the computer device 1000 described in this application can also perform the functions of the data sampling apparatus 1 described in the embodiment corresponding to Figure 9 above; these are not repeated here. The beneficial effects, which are the same as those of the method, are likewise not repeated.

It should further be noted that this application also provides a computer-readable storage medium that stores the computer program executed by the aforementioned data sampling apparatus 1. The computer program includes program instructions which, when executed by a processor, carry out the data sampling method described in the embodiment corresponding to Figure 3 above; the description is therefore not repeated here. The beneficial effects, which are the same as those of the method, are likewise not repeated. For technical details not disclosed in the computer storage medium embodiments of this application, refer to the description of the method embodiments of this application.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The above discloses only preferred embodiments of this application and is not intended to limit the scope of its claims. Equivalent variations made in accordance with the claims of this application therefore still fall within the scope covered by this application.

Claims

1. A data sampling method, characterized in that it comprises:

Data of a target type is sampled at a first sampling rate within a first time window to obtain target sampled data;

The target sampled data is merged with historical sampled data in a sampling database to obtain merged sampled data, wherein a third sampling rate of the merged sampled data is determined based on the first sampling rate and a second sampling rate of the historical sampled data; the historical sampled data is sampled data of the target type obtained by sampling data of the target type before the target sampled data is obtained; the second sampling rate is the sampling rate of the historical sampled data relative to the data of the target type; and the third sampling rate is the sampling rate of the merged sampled data relative to the data of the target type;

When the quantity of the merged sampled data exceeds a sampling quantity threshold, the merged sampled data is sampled based on an adaptive sampling parameter to obtain updated historical sampled data, wherein the sampling quantity threshold is related to the data of the target type, and the adaptive sampling parameter is used to keep the quantity of the merged sampled data within the sampling quantity threshold;

The historical sampled data in the sampling database is replaced with the updated historical sampled data.

2. The method according to claim 1, characterized in that the method further comprises:

A fourth sampling rate is determined based on the adaptive sampling parameters and the third sampling rate, so that the fourth sampling rate is used as the first sampling rate to sample the target type data in a second time window, wherein the fourth sampling rate is the sampling rate of the updated historical sampling data relative to the target type data, and the second time window is the next time window after the first time window.

3. The method according to claim 1, characterized in that, merging the target sampling data and historical sampling data in the sampling database to obtain merged sampling data includes:

When the first sampling rate is greater than the second sampling rate, the ratio between the second sampling rate and the first sampling rate is determined as the first ratio sampling rate;

Based on the first ratio sampling rate, the target sampling data is sampled to obtain the sampled target sampling data;

The sampled target sampling data and the historical sampling data are determined as the merged sampling data.

4. The method according to claim 1, characterized in that, merging the target sampling data and historical sampling data in the sampling database to obtain merged sampling data includes:

When the first sampling rate is less than the second sampling rate, the ratio between the first sampling rate and the second sampling rate is determined as the second ratio sampling rate;

Based on the second ratio sampling rate, the historical sampling data is sampled to obtain the sampled historical sampling data;

The sampled historical data and the target sampled data are determined as the merged sampled data;

When the first sampling rate is equal to the second sampling rate, the target sampling data and the historical sampling data are determined as the merged sampling data.
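The rate alignment in claims 3 and 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The names `thin` and `merge_samples` are hypothetical, thinning is done by independent random draws (the claims do not fix the mechanism), and the merged (third) sampling rate is assumed to equal the lower of the two input rates, which follows from thinning the higher-rate set down to the lower rate.

```python
import random


def thin(records, keep_prob, rng):
    """Keep each record independently with probability keep_prob."""
    return [r for r in records if rng.random() < keep_prob]


def merge_samples(target, first_rate, history, second_rate, rng=None):
    """Merge two sample sets after bringing them to a common sampling rate.

    Assumption: the third sampling rate is min(first_rate, second_rate).
    """
    rng = rng or random.Random()
    if first_rate > second_rate:
        # Claim 3 case: thin the new samples by second_rate / first_rate.
        target = thin(target, second_rate / first_rate, rng)
    elif first_rate < second_rate:
        # Claim 4 case: thin the historical samples by first_rate / second_rate.
        history = thin(history, first_rate / second_rate, rng)
    # Equal rates: both sets are used unchanged.
    third_rate = min(first_rate, second_rate)
    return history + target, third_rate
```

For example, merging new samples taken at rate 0.4 with history kept at rate 0.1 thins the new samples with probability 0.25, so that every record in the merged set represents the raw data at the same rate of 0.1.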

5. The method according to claim 1, characterized in that, the step of sampling the target type data at a first sampling rate within a first time window to obtain target sampled data includes:

The second sampling rate of the historical sampling data is determined as the first sampling rate;

Within the first time window, the target type of data is sampled using the first sampling rate to obtain the target sampled data;

Then, merging the target sampling data and the historical sampling data in the sampling database to obtain merged sampling data includes:

6. The method according to claim 1, characterized in that, when the number of merged sampled data is greater than a sampling quantity threshold, sampling the merged sampled data based on adaptive sampling parameters to obtain updated historical sampled data includes:

When the number of merged sampled data exceeds the sampling quantity threshold, the data identifier string corresponding to the merged sampled data is obtained;

The data identifier string is mapped to a uniform sampling space to obtain the hash value corresponding to the data identifier string;

The merged sampled data is sampled according to the third sampling rate, the adaptive sampling parameter, and the hash value to obtain the updated historical sampled data.

7. The method according to claim 6, characterized in that, the step of sampling the merged sampled data according to the third sampling rate, the adaptive sampling parameter, and the hash value to obtain the updated historical sampled data includes:

The fourth sampling rate is obtained based on the third sampling rate and the adaptive sampling parameters;

Based on the fourth sampling rate and the hash value, the merged sampled data is sampled to obtain the updated historical sampled data; the fourth sampling rate is the sampling rate of the updated historical sampled data relative to the target type of data.
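The hash-based step in claims 6 and 7 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the uniform sampling space is taken to be the interval [0, 1), the hash is SHA-256 truncated to 53 bits, the fourth sampling rate is assumed to be the product of the third sampling rate and the adaptive sampling parameter, and every earlier sampling step is assumed to have used the same rule (keep a record when its hash is below the current rate). The names `uniform_hash` and `downsample` and the `"id"` field are hypothetical.

```python
import hashlib


def uniform_hash(identifier: str) -> float:
    """Map an identifier string to a reproducible value in [0, 1)."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # Use 53 bits so the quotient is an exact float strictly below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def downsample(merged, third_rate, adaptive_param):
    """Reduce merged samples from third_rate to an assumed fourth rate."""
    fourth_rate = third_rate * adaptive_param  # assumption, see lead-in
    kept = [rec for rec in merged if uniform_hash(rec["id"]) < fourth_rate]
    return kept, fourth_rate
```

Because the decision depends only on the identifier string, the same record is always kept or dropped at a given rate, regardless of which time window it arrived in.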

8. The method according to claim 1, characterized in that the method further comprises:

When the number of merged sampled data is less than or equal to the sampling quantity threshold, the historical sampled data in the sampling database is replaced with the merged sampled data.
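The threshold branch described in claims 1, 2, and 8 can be sketched in Python as below. The class `SamplingDatabase` and its `store` method are hypothetical names. The sketch assumes the adaptive sampling parameter is the ratio of the threshold to the merged quantity and that the fourth sampling rate is the third sampling rate multiplied by that parameter; the claims only state that these values are "based on" one another. Thinning here uses random draws, so the stored quantity is close to the threshold on average rather than strictly bounded by it.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SamplingDatabase:
    history: list = field(default_factory=list)
    rate: float = 1.0  # sampling rate of `history` relative to the raw data

    def store(self, merged, third_rate, threshold, rng=None):
        """Replace the history and return the rate to use in the next window."""
        rng = rng or random.Random()
        if len(merged) <= threshold:
            # Claim 8 case: keep the merged samples as they are.
            self.history, self.rate = list(merged), third_rate
        else:
            alpha = threshold / len(merged)  # assumed adaptive parameter
            self.history = [r for r in merged if rng.random() < alpha]
            self.rate = third_rate * alpha   # assumed fourth sampling rate
        return self.rate
```

The returned rate can serve as the first sampling rate of the next time window, as claim 2 describes.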

9. The method according to claim 1, wherein the time window preceding the first time window is a third time window; the first time window and the third time window overlap in an intersection time window; and the historical sampling data is sampling data obtained by sampling data of the target type within the part of the third time window that falls in the intersection time window;

The step of sampling data of the target type at a first sampling rate within a first time window to obtain target sampled data includes:

Within the part of the first time window outside the intersection time window, the data of the target type is sampled at the first sampling rate to obtain the target sampled data;

When the number of historical sampled data is less than the merging quantity threshold, and the ratio between the first sampling rate and the second sampling rate is less than the ratio threshold, the historical sampled data is deleted, and the target sampled data is determined as the merged sampled data.

10. The method according to claim 1, wherein sampling the target type data at a first sampling rate within a first time window to obtain target sampled data comprises:

In the first time window, data of the target type is read from the data stream, and the fields of the read data of the target type are parsed to obtain initial parsed data;

The initial parsed data is filtered based on a filtering mechanism to obtain filtered parsed data;

Based on the vocabulary association mechanism, associated field information is added to the filtered parsed data to obtain sampling business data;

The target sampled data is obtained by sampling the sampling business data at the first sampling rate.
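The read, parse, filter, associate, and sample sequence of claim 10 can be illustrated with a short Python sketch. Everything concrete in it is a hypothetical stand-in: the comma-separated record layout, the rule that only `click` events pass the filter, and the `REGION_NAMES` lookup table used for field association are not taken from the application.

```python
import random

REGION_NAMES = {"01": "north", "02": "south"}  # hypothetical lookup table


def parse(line):
    """Split one raw line into named fields (assumed layout)."""
    user_id, event, region_code = line.strip().split(",")
    return {"user_id": user_id, "event": event, "region_code": region_code}


def sample_window(lines, first_rate, rng=None):
    """Parse, filter, enrich, and sample the lines read in one time window."""
    rng = rng or random.Random()
    sampled = []
    for line in lines:
        rec = parse(line)
        if rec["event"] != "click":  # hypothetical filtering rule
            continue
        # Field association: add a related field from the lookup table.
        rec["region"] = REGION_NAMES.get(rec["region_code"], "unknown")
        if rng.random() < first_rate:
            sampled.append(rec)
    return sampled
```

Each record is processed as it is read, so the full window never has to be held in memory before sampling.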

11. The method according to claim 1, wherein sampling the target type data at a first sampling rate within a first time window to obtain target sampled data comprises:

Within the first time window, the target type of data is sampled by the first thread at the first sampling rate to obtain the target sampled data;

The target sampling data and the historical sampling data in the sampling database are merged by a second thread to obtain the merged sampling data.
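The division of work between two threads in claim 11 can be sketched with Python's standard `threading` and `queue` modules. This is only an illustration: the function name `run_window` and the use of a queue with a `None` sentinel as the hand-off are assumptions, and the sketch covers only the simple case where the first and second sampling rates are equal, so that merging is plain concatenation.

```python
import queue
import random
import threading


def run_window(records, first_rate, history, seed=0):
    """Sample in one thread and merge into the history in another."""
    handoff = queue.Queue()
    merged = list(history)
    rng = random.Random(seed)

    def sampler():  # first thread: sample the incoming records
        for rec in records:
            if rng.random() < first_rate:
                handoff.put(rec)
        handoff.put(None)  # sentinel: the window is finished

    def merger():  # second thread: merge samples as they arrive
        while (rec := handoff.get()) is not None:
            merged.append(rec)

    threads = [threading.Thread(target=sampler), threading.Thread(target=merger)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return merged
```

Separating the two roles lets sampling continue while earlier samples are being merged; records equal to `None` are not supported because `None` marks the end of the window.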

12. A data sampling device, characterized in that it comprises:

A sampling module, configured to sample data of a target type at a first sampling rate within a first time window to obtain target sampled data;

A merging module, configured to merge the target sampled data with historical sampled data in a sampling database to obtain merged sampled data, wherein a third sampling rate of the merged sampled data is determined based on the first sampling rate and a second sampling rate of the historical sampled data; the historical sampled data is sampled data of the target type obtained by sampling data of the target type before the target sampled data is obtained; the second sampling rate is the sampling rate of the historical sampled data relative to the data of the target type; and the third sampling rate is the sampling rate of the merged sampled data relative to the data of the target type;

An adaptive sampling module, configured to sample the merged sampled data based on an adaptive sampling parameter when the quantity of the merged sampled data exceeds a sampling quantity threshold, so as to obtain updated historical sampled data, wherein the sampling quantity threshold is related to the data of the target type, and the adaptive sampling parameter is used to keep the quantity of the merged sampled data within the sampling quantity threshold;

A replacement module, configured to replace the historical sampled data in the sampling database with the updated historical sampled data.

13. The apparatus according to claim 12, wherein the apparatus is further configured to:

14. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method as claimed in any one of claims 1-11.

15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprising program instructions, which, when executed by a processor, perform the method as described in any one of claims 1-11.