
CN112131303A - Large-scale data lineage method based on neural network model - Google Patents


Info

Publication number
CN112131303A
CN112131303A (Application CN202010988710.5A)
Authority
CN
China
Prior art keywords
data
training
neural network
lineage
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010988710.5A
Other languages
Chinese (zh)
Inventor
李�杰
叶一舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202010988710.5A
Publication of CN112131303A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/26 Visual data mining; Browsing structured data
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/061 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Neurology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a large-scale data lineage method based on a neural network model, comprising the following steps: (1) generating a network training set, which includes array sorting, per-dimension partition-criterion selection, and training-subset division: the data are sorted by their values in each dimension of the dataset; a partition criterion is determined for each dimension so that the samples are exhaustive; and the training set is divided into a number of smaller training subsets; (2) training the neural network model: a hierarchical network structure, consisting of a network selector and subnets, replaces the traditional monolithic neural network in order to control the error caused by large variation across samples; (3) visual interaction and lineage: the results are presented through a spatial scatter plot, a spatio-temporal projection view, and a pattern comparison view, supporting interactive visual exploration of the dataset, making it convenient for users to explore the results visually, and allowing users to trace data provenance in an intuitive manner.

Description

Large-scale data lineage method based on a neural network model

Technical Field

This patent relates to the fields of machine learning and data visualization, and in particular to methods for real-time interaction with large-scale datasets and for optimizing neural network models.

Background

In recent years, the volume of data that researchers face has grown exponentially [4], which complicates interactive visual exploration and lineage. Recently proposed techniques allow analysts to explore large-scale datasets interactively in real time [5], but they ignore the fact that users may care about the real data hidden behind statistical distributions [10]. We implement the inverse generation of data from a visualization: a visual view is no longer limited to showing statistics of the data, but can also serve as the data for generating more complex visual views, or for exploring the detailed distribution of data within a subset of a view.

Data lineage has been studied in the database community for some time [7]. Traditional methods capture source information by extending the underlying data model [9]; the resulting drawback is obvious: the provenance must be stored in a model different from the actual data. Miles et al. [8] observed that data products and their descriptions may hide where results came from and how they were produced, and studied how provenance can help scientists conduct experiments. Boris Glavic et al. [6] proposed annotating result tuples with their source tuples via query rewriting and demonstrated its feasibility in a database. K. Dursun et al. [1] proposed a new intermediate-reuse model that caches the internal physical data structures materialized during query processing, accelerating analytical queries by reusing intermediates. Panda, by R. Ikeda et al. [2], implements provenance capture, storage, operators, and queries, applying data lineage to tasks such as debugging, auditing, data integration, security, iterative analysis, and cleaning. Building on this work, Fotis Psallidas et al. [3] proposed Smoke, an in-memory database engine that captures lineage without sacrificing performance: Smoke pre-stores lineage in hash tables to save the time otherwise spent on lineage queries, meeting the requirements of real-time visual interaction.

The works above target relatively large datasets, but they share several shortcomings. First, some create a hash index for every input to speed up lineage queries; as the data grows, the hash table grows with it, which can lead to problems such as memory exhaustion. Second, recent work builds hash tables in memory on the fly to speed up queries, but even with optimized construction time this still incurs unavoidable storage overhead and extra query time. Moreover, these systems cannot regenerate visualizations from the queried data; they can only establish connections between multiple visualization views.

Summary of the Invention

The purpose of the present invention is to solve the following problems in the prior art. 1. A neural network model replaces the traditional index structure, reducing the time and storage overhead of queries. 2. For large amounts of data, a single neural network cannot fit the relationship between queries and indices well, so a hierarchical structure is used: a first-layer network selector finds the subnet corresponding to a query, and second-layer subnets compute and output the query result. 3. Large-scale datasets often contain multiple dimensions, and users may constrain more than one of them; to answer lineage queries that satisfy multi-dimensional constraints simultaneously, a separate partition criterion is defined and a separate neural network model is trained for each dimension. The present invention therefore proposes a framework based on neural network models for lineage exploration of large-scale datasets. First, the framework adopts an index structure based on a neural network model to support real-time interactive lineage queries. Second, it integrates the hierarchical network model with a hash table to handle erroneous predictions. Finally, a visual interface is designed to support fast querying of, and interaction with, the results.

The purpose of the invention is achieved through the following technical solution:

A large-scale data lineage method based on a neural network model, comprising the following steps:

(1) Generate the network training set, which includes array sorting, per-dimension partition-criterion selection, and training-subset division: sort and store the data by their values in each dimension of the dataset; determine a partition criterion for each dimension so that the samples are exhaustive; and divide the training set into several subsets, the training subsets.

(2) Train the neural network model. A hierarchical network structure replaces the traditional monolithic neural network to control the error caused by large variation across samples. The hierarchy consists of two parts, a network selector and subnets: the network selector forms the first layer and finds the correct subnet for each query, and each subnet is trained separately on its training subset.

(3) Visual interaction and lineage. The outputs of the neural network are mapped to several views, namely a spatial scatter plot, a spatio-temporal projection view, and a pattern comparison view, which support interactive visual exploration of the dataset, make it convenient for users to explore the results visually, and allow users to trace data provenance through lineage.

Compared with the prior art, the technical solution of the present invention has the following beneficial effects:

1. Visual lineage: a new visualization-based method to query data, focused on finding the real data behind the regions of a visual view that interest the user. To our knowledge, no existing method supports real-time interactive retrieval of the real data, especially its details. The retrieved data can then be used to generate new visual views that help users analyze and investigate further.

2. A hierarchical neural network structure. It effectively reduces the range of values each network must predict and helps control the number of neurons; fewer neurons make a network easier to observe and tune, and the structure also solves the network-update problem. Error is effectively controlled: a network that only needs to fit the relationship between a few thousand samples and labels can simply keep the maximum allowed error at a reasonable value.

3. A neural-network-based method that uses a composite structure of hash tables and hierarchical neural networks to support lineage queries over large data at interactive rates. It incurs little additional storage overhead while enabling fine-grained lineage queries; it achieves real-time interactive exploration with less storage overhead than existing techniques; and it supports updates, solving a common problem of neural network structures.

Brief Description of the Drawings

Figure 1 is the overall flowchart of the proposed method.

Figure 2 illustrates the generation of the network training set.

Figure 3 shows the structure of the neural network. In the figure, a denotes the query input by the user, b denotes the network selector outputting the subnet corresponding to the query, and c denotes that subnet outputting the location of the corresponding data.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention, not to limit it.

The present invention proposes a framework based on a neural network model for lineage exploration of large-scale datasets. First, the framework adopts an index structure based on a neural network model: for each dimension of the dataset, all query conditions serve as samples and the query results as labels, and a neural network model is trained to support real-time interactive lineage queries. Second, the hierarchical network model and hash table: splitting the network into a network-selector layer and a subnet layer effectively reduces the range of values each network must predict, and a hash table records the data with large errors, handling erroneous predictions. Specifically, as shown in Figure 1, the method comprises the following steps:

Step 1: generate the network training set (Figure 2). The operations are array sorting, per-dimension partition-criterion selection, and training-subset division: sort the data by their values in each dimension of the dataset and store all of these orders; determine a partition criterion for each dimension so that the samples are exhaustive; and divide the training samples into many smaller subsets, because neural networks do not adapt well to small errors.

Sorting the arrays is an essential part of the data structure. In short, it means ordering the data in the dataset by their values in each dimension and then storing all of these orders. Training a neural network requires some functional relationship between input and output, and sequential ordering establishes an approximately linear relationship between a query and its output. For the dimensions of the dataset (longitude, latitude, hour, day), one sorted array is stored per dimension, ordered by each record's value in that dimension. As in a database, every record is assigned a unique primary key, so each sorted array only needs to store primary keys, which greatly reduces storage overhead: each entry costs only the smallest integer type, 4 bytes.
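As a non-authoritative sketch of the pre-sorting step just described (the record layout and values are illustrative assumptions, not data from the patent):

```python
# Illustrative records; "pk" is the unique primary key assigned per record.
records = [
    {"pk": 0, "lon": 117.2, "lat": 39.1, "hour": 10, "day": 3},
    {"pk": 1, "lon": 116.4, "lat": 39.9, "hour": 22, "day": 1},
    {"pk": 2, "lon": 121.5, "lat": 31.2, "hour": 10, "day": 5},
]

def build_sorted_arrays(records, dims):
    """For each dimension, keep only the primary keys, ordered by that
    dimension's value -- one 4-byte entry per record, as the text notes."""
    return {
        d: [r["pk"] for r in sorted(records, key=lambda r: r[d])]
        for d in dims
    }

sorted_arrays = build_sorted_arrays(records, ["lon", "lat", "hour", "day"])
# sorted_arrays["lon"] orders primary keys by longitude: [1, 0, 2]
```

Storing only primary keys (rather than full records) in each per-dimension array is what keeps the pre-sorting overhead at roughly 4 bytes per record per queried dimension.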

Training samples are generated separately for each dimension. To make the samples exhaustive, a partition criterion is determined per dimension: for dimensions with a natural unit (such as days), that unit is the partition criterion; dimensions without one (such as longitude or latitude) are divided as finely as possible. Each sample consists of an input and an output (label). The input is a query condition: a value that increases from zero, in steps of the chosen resolution, up to the maximum of the dimension. The output is the index of the corresponding query result in that dimension's sorted array, i.e., the index of the last element whose value is smaller than the input. Input and output therefore increase monotonically, so an approximately linear relationship may exist between them. Because neural networks do not adapt well to small errors, this problem grows as the range of the training samples widens and eventually produces unacceptable error. The training samples are therefore divided into many smaller subsets, each used to train a different neural network, and each subset is normalized to simplify training.
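The sample-generation and subset-division procedure above can be sketched as follows; the resolution, subset size, and min-max normalization scheme are illustrative assumptions:

```python
import bisect

import numpy as np

def make_training_samples(sorted_values, resolution):
    """Sweep query thresholds from zero to the dimension maximum at the
    given resolution; the label is the index of the last element whose
    value is smaller than the threshold (-1 when no such element exists)."""
    inputs, labels = [], []
    q = 0.0
    while q <= sorted_values[-1]:
        inputs.append(q)
        labels.append(bisect.bisect_left(sorted_values, q) - 1)
        q += resolution
    return np.array(inputs), np.array(labels)

def divide_and_normalize(inputs, labels, subset_size):
    """Divide the samples into smaller subsets and min-max normalize each
    subset's inputs, so every subnet faces an easy local fit."""
    subsets = []
    for i in range(0, len(inputs), subset_size):
        x, y = inputs[i:i + subset_size], labels[i:i + subset_size]
        span = max(float(x.max() - x.min()), 1e-9)
        subsets.append(((x - x.min()) / span, y))
    return subsets
```

Note the monotone query-to-index mapping: both sequences increase together, which is the near-linear relationship the subnets are trained to fit.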

Step 2: train the neural network model (Figure 3). A hierarchical network structure replaces the traditional monolithic neural network to control the error caused by large variation across samples. The hierarchy consists of two parts, a network selector and subnets: the network selector forms the first layer and finds the correct subnet for each query, and each subnet is trained separately on its training subset.

Given the training samples, the traditional approach would be to train a single neural network and save it. In this embodiment, however, the wide sample range amplifies the network's small deviations many times over, making the error unacceptable. A hierarchical structure is therefore used instead of a single network. It consists of two parts: the network selector and the subnets. The network selector forms the first layer; its role is to find the correct subnet for a query. Because the training samples are distributed evenly by count, the subnet a query belongs to can be determined in a single calculation. The query is then normalized and fed into that subnet, which outputs the index, in the sorted array, of the data corresponding to the query. Note that this index is not exact: by the nature of neural networks, the model cannot fit every record perfectly unless the number of neurons is very large. An error bound is therefore set on the network's output, and a prediction is considered successful as long as it falls within this bound. The benefits of the hierarchical structure over a single network are as follows:
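A minimal sketch of the two-layer structure, assuming evenly partitioned subsets. Each "subnet" is stubbed here as a fitted linear model standing in for the small neural network the text describes, purely to keep the example self-contained:

```python
import numpy as np

class HierarchicalIndex:
    """Two-layer sketch: a closed-form selector picks the subnet, and the
    chosen subnet predicts the index in the sorted array."""

    def __init__(self, subsets):
        # subsets: list of (query_values, index_labels), evenly partitioned
        self.bounds = [(float(x[0]), float(x[-1])) for x, _ in subsets]
        # One fitted linear model per subset, standing in for a small MLP.
        self.subnets = [np.polyfit(x, y, 1) for x, y in subsets]

    def select(self, q):
        """First layer: with an even partition, the right subnet is found
        in a single pass (one division when the partition width is fixed)."""
        for i, (lo, hi) in enumerate(self.bounds):
            if lo <= q <= hi:
                return i
        return len(self.bounds) - 1

    def predict_index(self, q):
        """Second layer: the selected subnet outputs the approximate index;
        the prediction only has to land within the allowed error bound."""
        a, b = self.subnets[self.select(q)]
        return int(round(float(a * q + b)))

# Two toy subsets with a locally linear query -> index relation:
idx = HierarchicalIndex([
    (np.array([0., 1., 2., 3.]), np.array([0., 1., 2., 3.])),
    (np.array([4., 5., 6., 7.]), np.array([8., 10., 12., 14.])),
])
# idx.predict_index(5.0) routes through subnet 1 and returns 10
```

The design point carried over from the text: each second-layer model only has to fit a small, locally regular slice of the query-to-index relation, which is what keeps the per-model error controllable.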

(1) It effectively reduces the range of values each network must predict. Over a large dataset the relationship between query and index is roughly linear overall, but zooming in on individual records reveals more and more irregularity. If a single network only has to fit the relationship between a few thousand queries and indices, however, it performs essentially perfectly (rare cases excepted).

(2) It helps control the number of neurons. For a training set with tens of millions of samples, it is difficult to choose the number of neurons of a single network, whereas a training set of only a few thousand samples needs just a few neurons for very good results. In addition, networks with fewer neurons split one long training run into small independent parts, which are easier to observe and tune.

(3) It helps set the error bound. If a single network must fit the relationship between millions or even tens of millions of samples and labels, the allowed error between predicted and true values is hard to control. By contrast, if each network only fits a few thousand samples, the maximum allowed error can simply be kept at a reasonable value.

Even after the data is divided into many small parts, a few records remain for which the network cannot keep the error between predicted and true values within the allowed maximum. These records, "anomalous" from the network's point of view, are stored in a hash table. Because they are few and a hash lookup has O(1) time complexity, the time and storage overhead of the hash table is almost negligible. The final index structure therefore consists of the pre-sorted data, the adaptive single-layer neural networks, and the auxiliary hash tables. At query time, in each dimension, the output of the network or the hash table locates the position in the original data corresponding to the query; the per-dimension results are then combined by conjunction according to the conditions the user supplies, and the combined result is passed to the front-end interface for visual presentation.
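The composite lookup path (network prediction, hash-table fallback for anomalous inputs, then conjunction across dimensions) might be sketched as follows. The short local scan that turns a bounded-error prediction into an exact index is one plausible reading, not a detail stated in the patent:

```python
def query_dimension(q_range, predict_index, outliers, sorted_vals, sorted_pks):
    """Return primary keys whose value in this dimension lies in the
    half-open range [lo_q, hi_q). `predict_index` stands in for the
    trained hierarchical model; `outliers` is the hash table holding the
    few inputs whose prediction error exceeded the allowed bound."""
    lo_q, hi_q = q_range

    def boundary(q):
        idx = outliers.get(q, predict_index(q))      # O(1) fallback
        idx = max(0, min(idx, len(sorted_vals) - 1))
        # The prediction error is bounded, so a short local scan suffices
        # to reach the first index whose value is >= q.
        while idx > 0 and sorted_vals[idx - 1] >= q:
            idx -= 1
        while idx < len(sorted_vals) and sorted_vals[idx] < q:
            idx += 1
        return idx

    lo, hi = boundary(lo_q), boundary(hi_q)
    return set(sorted_pks[lo:hi])

def lineage_query(per_dim_results):
    """Conjoin the per-dimension result sets, as described for lineage
    queries with simultaneous multi-dimensional constraints."""
    return set.intersection(*per_dim_results)
```

Because each dimension returns a set of primary keys, the multi-dimensional constraint reduces to a plain set intersection before the result is handed to the front end.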

Step 3: visual interaction and lineage. Interactive visual lineage is provided over the dataset through a spatial scatter plot, a spatio-temporal projection view, and a pattern comparison view, which support interactive visual exploration of the dataset, make it convenient for users to explore the results visually, and allow users to trace data provenance through lineage.

Map scatter plot.

The visual interface reflects the distribution of the data on the map mainly through a scatter plot. Users can judge how much data falls in a region from the prominence of each point on the map. The scatter plot can also constrain the other four visual views: users can draw a selection box on it to explore and analyze data in a local region rather than the whole map. The selection box is used as a query, the lineage results are returned and stored in memory, and the other four views are updated by on-the-fly computation. The scatter plot supports zooming, so users can inspect individual records in a very small region (for example, a street containing only a little data). It also responds to user actions in the other four visual views, displaying the distribution of data on the map for the specific time period of interest.

View components

The visual interface uses line charts, bar charts, and heat maps to reflect the temporal distribution of the data. Users perform lineage queries by constraining the scatter plot; the query results are plotted on the fly as heat maps, line charts, and bar charts, so users can easily explore and analyze the temporal distribution of local data. If no constraint is applied, the views reflect the time-period information of the entire dataset. In the three view components other than the heat map, users can draw selection boxes to constrain the data shown by all other views, including the scatter plot, and can thus constrain the views by any time period of interest: for example, to analyze weekend data a user box-selects on the week-projection histogram, while to analyze nighttime data a user box-selects on the hour histogram. The hour-projection histogram reflects the distribution over the 24 hours of a day, the week-projection histogram over the 7 days of a week, and the day-distribution histogram over the span between the start and end dates. The day-distribution histogram carries an additional summary view, drawn as a thin line chart, showing the distribution of the whole dataset over time; users can select a time range of interest on the summary view and use the day-distribution histogram as a detailed view of the query result.

Resource usage.

The storage overhead of the data structure comes mainly from the neural networks and the pre-sorting; details are shown in Table 1. The space taken by pre-sorting depends on the number of dimensions of lineage queries: each queried dimension makes the data structure store one more sorted array with as many entries as the original data, each entry being a minimal 4 bytes. The other part of the overhead comes from the network parameters. The total parameter count depends on the number of networks and the number of neurons per network, and the number of networks is determined by the resolution of each dimension. In the time dimension, for example, the lineage-query resolution is one hour, so two records with timestamps 10h15m and 10h20m receive the same label, 10h. The finer the resolution, the more networks, or the more neurons per network, are needed. When implementing the data structure we prefer to increase the number of networks, because the data a network can fit well has a certain regularity and should not span more than a limited range.

During training, increasing the number of networks and reducing the amount of data each must fit effectively improves accuracy. The storage taken by the networks is constant, because inserting new data does not change the resolution; the structure is updated quickly by changing the index values at the corresponding resolution. For each dimension of a new record, an index value is added corresponding to its position in that dimension's sorted array, again 4 bytes in size. The number of input objects, the training time, and the storage overhead are reported in the first three columns of Table 1. The number of input objects is the number of valid records in the dataset, and the training time is the total time spent training all networks that need training. The Networks column gives the number of subnets used for each dataset, and the Objects (per network) column the number of data objects per subnet. Because the data structure discards some information from the original data, its storage overhead is not necessarily larger than the dataset itself.
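A back-of-the-envelope sketch of the storage accounting described above; all counts are assumed, illustrative values rather than figures from Table 1:

```python
def index_storage_bytes(n_records, n_dims, n_subnets, params_per_subnet,
                        bytes_per_entry=4, bytes_per_param=4):
    """Pre-sorting: one 4-byte primary-key entry per record per queried
    dimension. Networks: a constant parameter cost (counts assumed)."""
    presort = n_records * n_dims * bytes_per_entry
    networks = n_subnets * params_per_subnet * bytes_per_param
    return presort + networks

# 10 million records over 4 dimensions -> 160 MB of sorted arrays, while
# 5,000 small subnets of ~50 parameters each add only ~1 MB on top.
total = index_storage_bytes(10_000_000, 4, 5_000, 50)
```

This makes the claim in the text concrete: the pre-sorted arrays dominate the overhead and grow with the data, while the network parameters stay effectively constant as new records are inserted.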

This data structure can handle very fine-grained data lineage, for example single-item lineage, with a precision of 1.0. At the same time, it still performs well when handling large-scale data lineage, with low time and storage overhead. For datasets containing millions or tens of millions of items, it is difficult to locate just a few of them among so much data; this data structure enables fine-grained visualization of lineage queries on large-scale datasets with good performance.

Table 1 Resource occupation

Figure BDA0002690093760000071

The present invention is not limited to the embodiments described above. The description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention; the specific embodiments are merely illustrative, not restrictive. Without departing from the spirit of the present invention and the scope protected by the claims, those of ordinary skill in the art may, inspired by the present invention, make many specific variations, all of which fall within the protection scope of the present invention.

Claims (1)

1. A large-scale data lineage method based on a neural network model, characterized by comprising the following steps:
(1) generating a network training set, comprising array sorting, dimension-standard division, and training-subset division: sorting and storing the data in the dataset according to their values in the different dimensions; determining a division standard for each dimension to solve the sample-exhaustion problem; and dividing the training set into a plurality of subsets serving as training subsets;
(2) training the neural network model: using a hierarchical network structure in place of a traditional neural network structure to solve the error problem caused by large differences in the sample data; the hierarchical network structure specifically comprises a network selector and subnets, wherein the network selector serves as the first layer and finds the correct corresponding subnet for a query, and each subnet is trained separately with its training subset;
(3) visual interaction and lineage: mapping the output of the neural network into a plurality of views, specifically comprising a spatial scatter plot, a spatio-temporal projection view, and a pattern-comparison view; performing interactive visual exploration of the dataset, the visual presentation making it convenient for users to explore the results; and allowing users to explore data lineage in an intuitive manner.
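The two-level structure of step (2) can be sketched as follows. This is a simplifying assumption for illustration: the selector here is a plain lookup by resolution label, and the "subnets" are toy callables standing in for the patent's trained networks:

```python
class HierarchicalLineageModel:
    """Sketch of the hierarchical structure in claim 1, step (2): a
    network selector routes each query to the subnet that was trained
    on the matching training subset."""

    def __init__(self, subnets, resolution):
        self.subnets = subnets        # label -> trained subnet (any callable)
        self.resolution = resolution  # division standard for this dimension

    def query(self, value):
        label = value // self.resolution  # selector: find the division
        subnet = self.subnets[label]      # route to the correct subnet
        return subnet(value)              # subnet predicts the position

# Two toy "subnets" standing in for small trained networks
model = HierarchicalLineageModel(
    {10: lambda t: t - 600, 11: lambda t: t - 660},
    resolution=60,
)
pos = model.query(615)  # falls in label 10, handled by that subnet
```

The design point the claim makes is that each subnet only ever sees queries from its own division, so the data it must fit stays regular and narrow in range.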
CN202010988710.5A 2020-09-18 2020-09-18 Large-scale data lineage method based on neural network model Pending CN112131303A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010988710.5A CN112131303A (en) 2020-09-18 2020-09-18 Large-scale data lineage method based on neural network model

Publications (1)

Publication Number Publication Date
CN112131303A true CN112131303A (en) 2020-12-25

Family

ID=73843029

Country Status (1)

Country Link
CN (1) CN112131303A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101826166A (en) * 2010-04-27 2010-09-08 青岛大学 Novel recognition method of neural network patterns
CN106796513A (en) * 2014-07-18 2017-05-31 起元科技有限公司 Management follows information
CN107810500A (en) * 2015-06-12 2018-03-16 起元技术有限责任公司 Data quality analysis
CN108073686A (en) * 2016-11-18 2018-05-25 埃森哲环球解决方案有限公司 Closed loop unified metadata framework with versatile metadata repository
CN110929870A (en) * 2020-02-17 2020-03-27 支付宝(杭州)信息技术有限公司 Graph neural network model training method, device and system
CN111105332A (en) * 2019-12-19 2020-05-05 河北工业大学 A method and system for intelligent pre-maintenance of expressway based on artificial neural network

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022160335A1 (en) * 2021-02-01 2022-08-04 Paypal, Inc. Graphical user interface to depict data lineage information in levels
US12050571B2 (en) 2021-02-01 2024-07-30 Paypal, Inc. Graphical user interface to depict data lineage information in levels
CN114116881A (en) * 2021-12-01 2022-03-01 天津大学 Multidimensional data visualization control method based on characterization learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201225