
CN111444162A - Big data initialization method and device, electronic equipment and storage medium - Google Patents

Big data initialization method and device, electronic equipment and storage medium

Info

Publication number
CN111444162A
Authority
CN
China
Prior art keywords
data
data table
table set
historical
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010151374.9A
Other languages
Chinese (zh)
Inventor
李广翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202010151374.9A
Publication of CN111444162A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a big data initialization method and device, an electronic device, and a storage medium. The method maps a historical data set imported into a distributed file system into a data table set, which improves the fault tolerance of the data. It then performs missing value detection on the data in the data table set according to the historical data set to obtain a standard data table set, which ensures the accuracy and integrity of the data. Next, it analyzes the data in the standard data table set through a preset association degree dependency to obtain a key field set of the data, and randomly distributes the key field set into the standard data table set to generate a random landing data table set, which improves the speed of data processing. Finally, it merges the random landing data tables in the random landing data table set according to preset conditions to obtain an initialization data table set, thereby initializing the big data.

Description

Big data initialization method, device, electronic device and storage medium

Technical Field

The present invention relates to the technical field of data processing, and in particular to a big data initialization method, device, electronic device and storage medium.

Background

At present, big data has been applied in a wide range of application systems, and these systems are being upgraded and converted to big data platforms. For a system architecture upgrade, for example, the development work only requires replacing the data processing module with big data processing technology, but the historical data must be converted accordingly.

A big data upgrade usually requires system refactoring, and the historical data must be initialized according to the new rules so that it conforms to the new system. Simply stacking statements causes serious performance problems and cannot complete the initialization.

Summary of the Invention

In view of the above, it is necessary to provide a big data initialization method, device, electronic device and storage medium that can initialize big data quickly while ensuring the accuracy and integrity of the data and improving its fault tolerance.

A big data initialization method, the method comprising:

acquiring a historical data set from a pre-built database, and importing the historical data set into a distributed file system;

mapping the imported historical data set into a data table set;

performing missing value detection on the data in the data table set according to the historical data set to obtain a standard data table set;

analyzing the data in the standard data table set through a preset association degree dependency to obtain a key field set of the data;

randomly distributing the key field set into the standard data table set to generate a random landing data table set;

merging the random landing data tables in the random landing data table set according to preset conditions to obtain an initialization data table set.

According to a preferred embodiment of the present invention, importing the historical data set into the distributed file system comprises:

acquiring the IP address and SID number of the server where the database is located;

logging in to the database according to the IP address and the SID number;

acquiring, from the database, the absolute path of the historical data set on the distributed file system;

importing the historical data set into the distributed file system according to the absolute path.

According to a preferred embodiment of the present invention, importing the historical data set into the distributed file system comprises:

acquiring attribute information of the data in the historical data set;

determining the priority of the data in the historical data set according to the attribute information;

importing the historical data set into the distributed file system according to the priority.

According to a preferred embodiment of the present invention, mapping the imported historical data set into a data table set comprises:

mapping the historical data set into a data table using a configuration tool;

checking, according to the absolute path of the historical data set on the distributed file system, whether the historical data set has been loaded into the data table;

when the historical data set has been loaded into the data table, constructing the data table set from the data in the data table.

According to a preferred embodiment of the present invention, performing missing value detection on the data in the data table set according to the historical data set to obtain a standard data table set comprises:

performing missing value detection on the data in the data table set using the missmap function;

when no missing value is detected in the data table set, determining the data table set as the standard data table set; or

when missing values are detected in the data table set, filling in the missing values using a maximum likelihood estimation algorithm to obtain the standard data table set.

According to a preferred embodiment of the present invention, when the maximum likelihood estimation algorithm is used to fill in the missing values to obtain the standard data table set, the following formula is used:

$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$

where L(θ) denotes the filled missing value, θ denotes the probability parameter corresponding to the missing value, n denotes the quantity of the historical data set, and p(x_i | θ) denotes the probability of the missing value.

According to a preferred embodiment of the present invention, randomly distributing the key field set into the standard data table set to generate a random landing data table set comprises:

determining the number of key fields in the key field set;

generating a plurality of numerical values according to the number of key fields, the number of values being equal to the number of key fields;

establishing a mapping relationship between the plurality of numerical values and the standard data table set;

randomly matching the plurality of numerical values with the key fields to obtain a matching result;

distributing the key fields into the standard data table set according to the mapping relationship and the matching result to obtain the random landing data table set.

A big data initialization device, the device comprising:

an importing unit, configured to acquire a historical data set from a pre-built database and import the historical data set into a distributed file system;

a mapping unit, configured to map the imported historical data set into a data table set;

a detection unit, configured to perform missing value detection on the data in the data table set according to the historical data set to obtain a standard data table set;

an analysis unit, configured to analyze the data in the standard data table set through a preset association degree dependency to obtain a key field set of the data;

a distribution unit, configured to randomly distribute the key field set into the standard data table set to generate a random landing data table set;

a merging unit, configured to merge the random landing data tables in the random landing data table set according to preset conditions to obtain an initialization data table set.

According to a preferred embodiment of the present invention, the importing unit is specifically configured to:

acquire the IP address and SID number of the server where the database is located;

log in to the database according to the IP address and the SID number;

acquire, from the database, the absolute path of the historical data set on the distributed file system;

import the historical data set into the distributed file system according to the absolute path.

According to a preferred embodiment of the present invention, the device further comprises:

an acquiring unit, configured to acquire attribute information of the data in the historical data set when the historical data set is imported into the distributed file system;

a determining unit, configured to determine the priority of the data in the historical data set according to the attribute information;

the importing unit being further configured to import the historical data set into the distributed file system according to the priority.

According to a preferred embodiment of the present invention, the mapping unit is specifically configured to:

map the historical data set into a data table using a configuration tool;

check, according to the absolute path of the historical data set on the distributed file system, whether the historical data set has been loaded into the data table;

when the historical data set has been loaded into the data table, construct the data table set from the data in the data table.

According to a preferred embodiment of the present invention, the detection unit is specifically configured to:

perform missing value detection on the data in the data table set using the missmap function;

when no missing value is detected in the data table set, determine the data table set as the standard data table set; or

when missing values are detected in the data table set, fill in the missing values using a maximum likelihood estimation algorithm to obtain the standard data table set.

According to a preferred embodiment of the present invention, when the detection unit uses the maximum likelihood estimation algorithm to fill in the missing values to obtain the standard data table set, the following formula is used:

$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$

where L(θ) denotes the filled missing value, θ denotes the probability parameter corresponding to the missing value, n denotes the quantity of the historical data set, and p(x_i | θ) denotes the probability of the missing value.

According to a preferred embodiment of the present invention, the generating unit is specifically configured to:

determine the number of key fields in the key field set;

generate a plurality of numerical values according to the number of key fields, the number of values being equal to the number of key fields;

establish a mapping relationship between the plurality of numerical values and the standard data table set;

randomly match the plurality of numerical values with the key fields to obtain a matching result;

distribute the key fields into the standard data table set according to the mapping relationship and the matching result to obtain the random landing data table set.

An electronic device, the electronic device comprising:

a memory storing at least one instruction; and

a processor that executes the instructions stored in the memory to implement the big data initialization method.

A computer-readable storage medium storing at least one instruction, the at least one instruction being executed by a processor in an electronic device to implement the big data initialization method.

As can be seen from the above technical solutions, the present invention maps the historical data set imported into the distributed file system into a data table set, which improves the fault tolerance of the data. It then performs missing value detection on the data in the data table set according to the historical data set to obtain a standard data table set, which ensures the accuracy and integrity of the data. Next, it analyzes the data in the standard data table set through a preset association degree dependency to obtain a key field set of the data, and randomly distributes the key field set into the standard data table set to generate a random landing data table set, which improves the speed of data processing. Finally, it merges the random landing data tables in the random landing data table set according to preset conditions to obtain an initialization data table set, thereby initializing the big data.

Description of Drawings

FIG. 1 is a flowchart of a preferred embodiment of the big data initialization method of the present invention.

FIG. 2 is a functional block diagram of a preferred embodiment of the big data initialization device of the present invention.

FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the big data initialization method of the present invention.

Detailed Description

To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

FIG. 1 is a flowchart of a preferred embodiment of the big data initialization method of the present invention. Depending on requirements, the order of the steps in the flowchart may be changed and some steps may be omitted.

The big data initialization method is applied to one or more electronic devices. An electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), and an embedded device.

The electronic device may be any electronic product capable of human-computer interaction with a user, such as a personal computer, a tablet computer, a smartphone, a Personal Digital Assistant (PDA), a game console, an Internet Protocol Television (IPTV), or a smart wearable device.

The electronic device may also include a network device and/or user equipment. The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.

The network in which the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, and a Virtual Private Network (VPN).

S10: Acquire a historical data set from a pre-built database, and import the historical data set into a distributed file system.

In at least one embodiment of the present invention, the pre-built database is a traditional database, also known as a relational database, used to process permanent, stable data.

For example, the database may be an Oracle database, a MySQL database, a graph database, or the like.

In at least one embodiment of the present invention, the historical data set is formed by combining data generated by users' historical behavior.

In at least one embodiment of the present invention, the distributed file system may be the Hadoop Distributed File System (HDFS). The distributed file system is deployed on a cluster, and data must be transferred over the network.

In at least one embodiment of the present invention, the electronic device importing the historical data set into the distributed file system includes:

The electronic device acquires the IP (Internet Protocol) address and SID (Security Identifiers) number of the server where the database is located, and logs in to the database according to the IP address and the SID number. The electronic device then acquires, from the database, the absolute path of the historical data set on the distributed file system, and imports the historical data set into the distributed file system according to the absolute path.

Specifically, the electronic device may use the Sqoop transfer tool to import the historical data set from the database into HDFS.

Sqoop is an open-source tool mainly used to transfer data between Hadoop and traditional databases. It can import data from a relational database into the HDFS cluster of Hadoop, and can also export data from HDFS into a relational database.
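The import step above can be sketched as a small helper that assembles a `sqoop import` command line. This is only an illustration under stated assumptions: the database is taken to be Oracle, the SID is used as the Oracle instance identifier in the JDBC URL, and the host, table name, user, paths and the helper name `build_sqoop_import` are all hypothetical. The flags themselves (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop import arguments.

```python
from typing import List


def build_sqoop_import(host: str, sid: str, table: str, target_dir: str,
                       username: str, password_file: str,
                       port: int = 1521, mappers: int = 4) -> List[str]:
    """Return the argv list for a `sqoop import` of one table into HDFS."""
    # Oracle thin-driver JDBC URL built from the server IP address and the SID
    # (assumption: the source database is Oracle).
    jdbc_url = f"jdbc:oracle:thin:@{host}:{port}:{sid}"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,        # absolute HDFS path for the imported data
        "--num-mappers", str(mappers),
    ]


# Hypothetical values, for illustration only.
cmd = build_sqoop_import("10.0.0.5", "ORCL", "ORDERS_HIST",
                         "/data/history/orders", "etl_user",
                         "/user/etl/.oracle_pw")
print(" ".join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` on a host where Sqoop is installed; building the list separately keeps the command easy to inspect before it is run.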

In at least one embodiment of the present invention, the electronic device importing the historical data set into the distributed file system includes:

The electronic device acquires attribute information of the data in the historical data set, determines the priority of the data in the historical data set according to the attribute information, and imports the historical data set into the distributed file system according to the priority.

The attribute information includes, but is not limited to, one or a combination of the following:

the size of the tables in the historical data set, whether a table has a primary key, whether it has a field containing a time series or a numeric sequence, and so on.
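The text does not say how the attributes translate into a priority, so the following is a minimal sketch under an assumed rule: tables that have a primary key or a sequence field (and so can be split cleanly for parallel import) go first, and smaller tables go before larger ones within each group. The `TableInfo` type, the rule itself and the sample tables are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TableInfo:
    name: str
    size_bytes: int
    has_primary_key: bool
    has_sequence_field: bool  # time-series or numeric-sequence column


def import_priority(t: TableInfo) -> Tuple[int, int]:
    # Hypothetical rule: splittable tables first, then smaller tables first.
    splittable = t.has_primary_key or t.has_sequence_field
    return (0 if splittable else 1, t.size_bytes)


def order_for_import(tables: List[TableInfo]) -> List[str]:
    """Return table names in the order they would be imported."""
    return [t.name for t in sorted(tables, key=import_priority)]


tables = [
    TableInfo("audit_log", 9_000_000, False, False),
    TableInfo("orders", 5_000_000, True, True),
    TableInfo("customers", 1_000_000, True, False),
]
print(order_for_import(tables))
```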

S11: Map the imported historical data set into a data table set.

In at least one embodiment of the present invention, the electronic device mapping the imported historical data set into a data table set includes:

The electronic device maps the historical data set into a data table using a configuration tool, and checks, according to the absolute path of the historical data set on the distributed file system, whether the historical data set has been loaded into the data table. When the historical data set has been loaded into the data table, the electronic device constructs the data table set from the data in the data table.

Specifically, the configuration tool may be Hive, and the data table may include a Hive data table. Hive is a data warehouse tool based on Hadoop that can map structured data files to a database table and provides simple SQL (Structured Query Language) query functions.
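One common way to map files already on HDFS to a Hive table is an external table whose `LOCATION` points at the import directory. The sketch below only generates the HiveQL text; the table name, columns, path and the helper name `hive_external_table_ddl` are hypothetical, and the comma delimiter assumes the files were written as comma-separated text.

```python
from typing import Dict


def hive_external_table_ddl(table: str, columns: Dict[str, str],
                            hdfs_path: str, delimiter: str = ",") -> str:
    """Build a CREATE EXTERNAL TABLE statement over an existing HDFS directory."""
    cols = ",\n  ".join(f"`{name}` {hive_type}"
                        for name, hive_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_path}'"
    )


# Hypothetical table definition, for illustration only.
ddl = hive_external_table_ddl(
    "orders_hist",
    {"order_id": "BIGINT", "customer_id": "BIGINT", "created_at": "STRING"},
    "/data/history/orders",
)
print(ddl)
```

Because the table is external, Hive reads the files in place; a simple row-count query against the new table is one way to check that the data at the absolute path is visible through it.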

Through the above implementation, the data in the historical data set is converted into data tables by mapping, which improves the fault tolerance of the data.

S12: Perform missing value detection on the data in the data table set according to the historical data set to obtain a standard data table set.

It can be understood that data may be missing because of developer operation errors and/or a failure of the import tool. The electronic device therefore performs missing value detection on the data in the data table set to obtain a standard data table set.

Specifically, the missing values include values missing completely at random, values missing at random, and values missing not at random.

Missing completely at random means that the absence of a variable's value does not depend on any other factor. Missing at random means that the absence of a variable's value is related to other variables but not to the value of the variable itself. Missing not at random means that the absence of a variable's value is related to the value of the variable itself.

In at least one embodiment of the present invention, the electronic device performing missing value detection on the data in the data table set according to the historical data set to obtain a standard data table set includes:

The electronic device performs missing value detection on the data in the data table set using the missmap function, specifically:

(1) When no missing value is detected in the data table set, the electronic device determines the data table set as the standard data table set.

(2) When missing values are detected in the data table set, the electronic device fills in the missing values using a maximum likelihood estimation algorithm to obtain the standard data table set.
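The detection branch above can be illustrated with a plain-Python stand-in. The `missmap` function named in the text appears to be the missingness-map plot from R's Amelia package; the sketch below does not call it and only counts missing cells per column. The row format (a list of dictionaries with `None` for a missing cell) and the function names are assumptions made for the example.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def missing_counts(rows: List[Row], columns: List[str]) -> Dict[str, int]:
    """Count missing cells (None or absent key) per column."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            if row.get(c) is None:
                counts[c] += 1
    return counts


def is_standard(rows: List[Row], columns: List[str]) -> bool:
    """True when the table has no missing values and can be used as-is."""
    return all(n == 0 for n in missing_counts(rows, columns).values())


rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 3},
]
print(missing_counts(rows, ["id", "amount"]))
```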

Further, when the electronic device uses the maximum likelihood estimation algorithm to fill in the missing values to obtain the standard data table set, the following formula is used:

$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$

where L(θ) denotes the filled missing value, θ denotes the probability parameter corresponding to the missing value, n denotes the quantity of the historical data set, and p(x_i | θ) denotes the probability of the missing value.
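As a worked illustration of maximum likelihood filling, the sketch below assumes the observed values of a numeric column follow a Gaussian distribution; the text does not name a distribution, so this is an assumption. Under it, the value of the mean that maximizes the likelihood of the observed data is the sample mean, and each missing cell is filled with that estimate. The function names are hypothetical, and this is a simplification rather than the patent's exact procedure.

```python
import math
from typing import List, Optional


def gaussian_log_likelihood(xs: List[float], mu: float, sigma: float) -> float:
    """Log of the product of Gaussian densities p(x_i | mu, sigma)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in xs
    )


def mle_fill(values: List[Optional[float]]) -> List[float]:
    """Fill None entries with the maximum likelihood estimate of the mean."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to estimate from")
    mu_hat = sum(observed) / len(observed)  # Gaussian MLE of the mean
    return [mu_hat if v is None else v for v in values]


filled = mle_fill([1.0, None, 3.0])
print(filled)
```

The helper `gaussian_log_likelihood` is included so the claim can be checked numerically: for the observed values 1.0 and 3.0, the log-likelihood at a mean of 2.0 is higher than at any other candidate mean with the same sigma.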

Through the above implementation, missing value detection ensures the accuracy and integrity of the data.

S13: Analyze the data in the standard data table set through a preset association degree dependency to obtain a key field set of the data.

In at least one embodiment of the present invention, because a distributed system, unlike a relational database, cannot analyze data directly through indexes, the electronic device analyzes the data in the standard data table set according to the preset association degree dependency to obtain the key field set that causes data skew.

Data skew means that the values of a field occur in very different proportions. For example, a school has 10,000 male students and 100 female students.

The preset association degree dependency includes unequal-value key association rules and multi-table joint detection rules.

Further, the electronic device obtains the distribution of keys according to the association degree dependency, and thereby obtains the key field set.
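The idea of reading the key distribution to find skewed fields can be sketched as follows. The association rules named above are not specified in enough detail to implement, so this example substitutes a simple frequency check: a field is flagged when its most common value occurs at least `ratio_threshold` times as often as its least common value. The threshold, the row format and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def key_field_set(rows: List[Dict[str, object]], fields: List[str],
                  ratio_threshold: float = 10.0) -> List[str]:
    """Return the fields whose value distribution is heavily skewed."""
    skewed = []
    for field in fields:
        counts = Counter(row[field] for row in rows)
        if len(counts) < 2:
            continue  # a single distinct value cannot be compared
        if max(counts.values()) / min(counts.values()) >= ratio_threshold:
            skewed.append(field)
    return skewed


# 1,000 male and 10 female students, evenly split across two classes.
rows = (
    [{"gender": "M", "cls": "A" if i % 2 else "B"} for i in range(1000)]
    + [{"gender": "F", "cls": "A" if i % 2 else "B"} for i in range(10)]
)
print(key_field_set(rows, ["gender", "cls"]))
```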

S14: Randomly distribute the key field set into the standard data table set to generate a random landing data table set.

In at least one embodiment of the present invention, the electronic device randomly distributing the key field set into the standard data table set to generate the random landing data table set includes:

The electronic device determines the number of key fields in the key field set and generates a plurality of numerical values according to that number, the number of numerical values being equal to the number of key fields. The electronic device establishes a mapping relationship between the numerical values and the standard data table set, and randomly matches the numerical values with the key fields to obtain a matching result. According to the mapping relationship and the matching result, the electronic device distributes the key fields into the standard data table set to obtain the random landing data table set.
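The steps above can be sketched as below. The round-robin mapping from values to tables, the field names and the table names are assumptions for illustration; the patent only requires that some mapping exists and that the matching is random.

```python
import random

def random_landing(key_fields, table_names, seed=None):
    """Distribute key fields over tables via randomly matched values."""
    rng = random.Random(seed)
    # One numeric value per key field.
    values = list(range(len(key_fields)))
    # Mapping relationship: value -> table (round-robin is an assumption).
    value_to_table = {v: table_names[v % len(table_names)] for v in values}
    # Random matching: shuffle the values and pair them with the key fields.
    shuffled = values[:]
    rng.shuffle(shuffled)
    field_to_value = dict(zip(key_fields, shuffled))
    # Distribute each key field to the table its matched value maps to.
    landing = {name: [] for name in table_names}
    for field, value in field_to_value.items():
        landing[value_to_table[value]].append(field)
    return landing

# Hypothetical key fields and tables.
fields = ["user_id", "region", "gender", "channel"]
landing = random_landing(fields, ["t0", "t1"], seed=7)
```

Because the values are spread evenly over the tables before the random matching, each table receives the same number of key fields no matter which fields land where.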

Through the above embodiment, randomly landing the key fields increases the speed of data processing.

S15: Merge the random landing data tables in the random landing data table set according to preset conditions to obtain an initialization data table set.

In at least one embodiment of the present invention, the preset conditions may be configured by developers according to the different requirements of the random landing data tables.

The preset conditions include, but are not limited to, single-table merging, multi-table merging, and adjacent-table merging.
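A configurable merge step could look like the sketch below. The rule names and the reading of "adjacent" as pairwise neighbours are assumptions; the patent does not define the merge conditions in detail, and single-table merging is left out because its meaning is not specified.

```python
from itertools import chain

def merge_adjacent(tables):
    """Merge neighbouring tables pairwise: (0, 1), (2, 3), ..."""
    return [
        list(chain.from_iterable(tables[i:i + 2]))
        for i in range(0, len(tables), 2)
    ]

def merge_all(tables):
    """Merge every table into a single one."""
    return [list(chain.from_iterable(tables))]

# Hypothetical registry of preset conditions a developer could configure.
MERGE_RULES = {"adjacent": merge_adjacent, "multi": merge_all}

def merge_tables(tables, condition):
    return MERGE_RULES[condition](tables)

# Each table is a list of rows; integers stand in for rows here.
tables = [[1], [2], [3]]
adjacent = merge_tables(tables, "adjacent")
merged = merge_tables(tables, "multi")
```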

Through the above embodiment, merging the random landing data tables according to the preset conditions reduces the difficulty of data initialization, indirectly increases the data processing speed, and provides strong support for project switching.

As can be seen from the above technical solutions, the present invention maps the historical data set imported into the distributed file system into a data table set, which improves the fault tolerance of the data. It then performs missing-value detection on the data in the data table set according to the historical data set to obtain a standard data table set, which ensures the accuracy and completeness of the data. It analyzes the data in the standard data table set according to the preset association dependency relationship to obtain the key field set of the data, and randomly distributes the key field set into the standard data table set to generate a random landing data table set, which increases the speed of data processing. Finally, it merges the random landing data tables in the random landing data table set according to preset conditions to obtain an initialization data table set, thereby initializing the big data.

FIG. 2 is a functional block diagram of a preferred embodiment of the big data initialization apparatus of the present invention. The big data initialization apparatus 11 includes an import unit 110, a mapping unit 111, a detection unit 112, an analysis unit 113, a distribution unit 114, a merging unit 115, an acquisition unit 116 and a determination unit 117. A module/unit as referred to in the present invention is a series of computer program segments that can be executed by the processor 13, perform a fixed function, and are stored in the memory 12. The functions of each module/unit are described in detail in the following embodiments.

The import unit 110 acquires a historical data set from a pre-built database and imports the historical data set into a distributed file system.

In at least one embodiment of the present invention, the pre-built database is a traditional database, also called a relational database, used to process permanent, stable data.

For example, the database may be an Oracle database, a MySQL database, a graph database, or the like.

In at least one embodiment of the present invention, the historical data set is formed by combining data generated by users' historical behavior.

In at least one embodiment of the present invention, the distributed file system may be the Hadoop Distributed File System (HDFS). The distributed file system is deployed on a cluster and transfers data over a network.

In at least one embodiment of the present invention, the import unit 110 importing the historical data set into the distributed file system includes:

The import unit 110 obtains the IP (Internet Protocol) address and the SID (Security Identifier) number of the server hosting the database, and logs in to the database using the IP address and the SID number. The import unit 110 obtains from the database the absolute path of the historical data set on the distributed file system, and then imports the historical data set into the distributed file system according to the absolute path.

Specifically, the import unit 110 may use the Sqoop transfer tool to import the historical data set from the database into HDFS.

Sqoop is an open-source tool mainly used to transfer data between Hadoop and traditional databases. Sqoop can import data from a relational database into the HDFS cluster of Hadoop, and can also export data from HDFS into a relational database.
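As a rough illustration of such an import, the sketch below assembles a `sqoop import` command line for an Oracle source. The host, port, SID, user, password file, table name and target directory are all hypothetical placeholders, and the Oracle thin JDBC URL is an assumption about the source database.

```python
def build_sqoop_import(host, port, sid, username, password_file,
                       table, target_dir, mappers=4):
    """Build the argument list for a Sqoop import into HDFS."""
    return [
        "sqoop", "import",
        # JDBC URL built from the server address and the SID.
        "--connect", f"jdbc:oracle:thin:@{host}:{port}:{sid}",
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        # Absolute HDFS path the data set is written to.
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]

# Hypothetical connection details and paths.
cmd = build_sqoop_import(
    host="10.0.0.1", port=1521, sid="ORCL",
    username="etl_user", password_file="/user/etl/.pw",
    table="USER_HISTORY", target_dir="/data/history/user_history",
)
```

On a machine with Sqoop installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`.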

In at least one embodiment of the present invention, the import unit 110 importing the historical data set into the distributed file system includes:

The acquisition unit 116 acquires attribute information of the data in the historical data set, the determination unit 117 determines the priority of the data in the historical data set according to the attribute information, and the import unit 110 imports the historical data set into the distributed file system according to the priority.

The attribute information includes, but is not limited to, one or a combination of the following:

the size of the tables in the historical data set, whether a table has a primary key, whether it has a field containing a time series or a numeric sequence, and so on.
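One way to turn these attributes into an import order is sketched below. The scoring rule (tables with a primary key first, then tables with a sequence field, then smaller tables) is purely hypothetical, as are the class and table names; the patent does not state how the priority is derived.

```python
from dataclasses import dataclass

@dataclass
class TableInfo:
    name: str
    size_bytes: int
    has_primary_key: bool
    has_sequence_field: bool  # time-series or numeric-sequence column

def priority(info):
    """Sort key: lower tuples are imported first (hypothetical rule)."""
    return (not info.has_primary_key,
            not info.has_sequence_field,
            info.size_bytes)

def import_order(tables):
    """Return table names in the order they would be imported."""
    return [t.name for t in sorted(tables, key=priority)]

# Hypothetical tables in the historical data set.
tables = [
    TableInfo("logs", 5000, False, True),
    TableInfo("users", 100, True, False),
    TableInfo("orders", 2000, True, True),
]
order = import_order(tables)
```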

The mapping unit 111 maps the imported historical data set into a data table set.

In at least one embodiment of the present invention, the mapping unit 111 mapping the imported historical data set into the data table set includes:

The mapping unit 111 uses a configuration tool to map the historical data set to data tables, and checks, according to the absolute path of the historical data set on the distributed file system, whether the historical data set has been loaded into the data tables. When the historical data set has been loaded into the data tables, the mapping unit 111 builds the data table set from the data in the data tables.

Specifically, the configuration tool may be Hive, and the data tables may include Hive data tables. Hive is a Hadoop-based data warehouse tool that can map structured data files to database tables and provides simple SQL (Structured Query Language) query functionality.
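Mapping files already on HDFS to a Hive table is typically done with an external table whose `LOCATION` is the absolute HDFS path. The sketch below only generates such a statement; the table name, columns, delimiter and path are assumed for illustration.

```python
def hive_external_table_ddl(table, columns, hdfs_path, delimiter=","):
    """Build a HiveQL statement mapping delimited files to a table."""
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_path}'"
    )

# Hypothetical schema and HDFS path.
ddl = hive_external_table_ddl(
    table="user_history",
    columns=[("user_id", "BIGINT"), ("event_time", "TIMESTAMP"),
             ("action", "STRING")],
    hdfs_path="/data/history/user_history",
)
```

Running a simple `SELECT COUNT(*)` against the new table afterwards is one way to check that the data under the path was actually picked up.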

Through the above embodiment, the data in the historical data set can be converted into data tables by mapping, which improves the fault tolerance of the data.

The detection unit 112 performs missing-value detection on the data in the data table set according to the historical data set to obtain a standard data table set.

It can be understood that data may go missing because of developer error and/or failure of the import tool. The detection unit 112 therefore performs missing-value detection on the data in the data table set to obtain the standard data table set.

Specifically, the missing values include values missing completely at random, values missing at random, and values missing not at random.

Missing completely at random means that the missingness of a variable does not depend on any other factor. Missing at random means that the missingness of a variable is related to other variables but not to the value of the variable itself. Missing not at random means that the missingness of a variable is related to the value of the variable itself.
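The three mechanisms can be made concrete with a small simulation. The columns `age` and `income`, the cut-offs and the missing rate are invented for the example.

```python
import random

rows = [
    {"age": 22, "income": 30_000},
    {"age": 35, "income": 120_000},
    {"age": 24, "income": 150_000},
    {"age": 50, "income": 60_000},
]

def with_missing(row):
    """Return a copy of the row with its income removed."""
    return dict(row, income=None)

def mcar(rows, rate, rng):
    """Completely at random: missingness depends on nothing in the data."""
    return [with_missing(r) if rng.random() < rate else r for r in rows]

def mar(rows, age_cutoff=25):
    """At random: income is missing depending on age, not on income."""
    return [with_missing(r) if r["age"] < age_cutoff else r for r in rows]

def mnar(rows, income_cutoff=100_000):
    """Not at random: income is missing depending on income itself."""
    return [with_missing(r) if r["income"] > income_cutoff else r for r in rows]

mcar_rows = mcar(rows, rate=0.5, rng=random.Random(0))
mar_rows = mar(rows)
mnar_rows = mnar(rows)
```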

In at least one embodiment of the present invention, the detection unit 112 performing missing-value detection on the data in the data table set according to the historical data set to obtain the standard data table set includes:

The detection unit 112 uses the missmap function to detect missing values in the data in the data table set. Specifically:

(1) When no missing value is detected in the data table set, the detection unit 112 determines the data table set to be the standard data table set.

(2) When missing values are detected in the data table set, the detection unit 112 fills the missing values using the maximum likelihood estimation algorithm to obtain the standard data table set.
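The two cases above can be sketched in plain Python as follows. This is a stand-in for illustration, not the missmap function named in the text; the table layout (a dict of row lists) and the `estimate` callback are assumptions.

```python
def missing_map(tables):
    """Return {table_name: [(row_index, column), ...]} for missing cells."""
    return {
        name: [
            (i, col)
            for i, row in enumerate(rows)
            for col, val in row.items()
            if val is None
        ]
        for name, rows in tables.items()
    }

def standardize(tables, estimate):
    """Case (1): no gaps, return the set as is. Case (2): fill the gaps."""
    gaps = missing_map(tables)
    if not any(gaps.values()):
        return tables
    result = {name: [dict(row) for row in rows] for name, rows in tables.items()}
    for name, cells in gaps.items():
        for i, col in cells:
            observed = [r[col] for r in tables[name] if r[col] is not None]
            result[name][i][col] = estimate(observed)
    return result

def mean(xs):
    """Sample mean, the maximum-likelihood estimate of a Gaussian mean."""
    return sum(xs) / len(xs)

# Hypothetical table sets.
tables = {"users": [{"age": 30}, {"age": None}, {"age": 40}]}
clean = {"t": [{"a": 1}]}
standard = standardize(tables, mean)
```

The input tables are copied before filling, so the original data table set stays untouched.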

Further, when the detection unit 112 fills the missing values using the maximum likelihood estimation algorithm to obtain the standard data table set, it uses the following formula:

$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$

where L(θ) denotes the filled missing value, θ denotes the probability parameter corresponding to the missing value, n denotes the number of historical data sets, and p(x_i|θ) denotes the probability of the missing value.

Through the above embodiment, missing-value detection ensures the accuracy and completeness of the data.

The analysis unit 113 analyzes the data in the standard data table set according to a preset association dependency relationship to obtain a key field set of the data.

In at least one embodiment of the present invention, because a distributed system, unlike a relational database, cannot analyze data directly by index, the analysis unit 113 needs to analyze the data in the standard data table set according to the preset association dependency relationship to obtain the key field set that causes data skew.

Data skew means that the values of a field occur in widely differing proportions; for example, a school has 10,000 male students and 100 female students.

The preset association dependency relationship includes a non-equi key association rule and a multi-table joint detection rule.

Further, the analysis unit 113 obtains the distribution of keys according to the association dependency relationship, and thereby obtains the key field set.

The distribution unit 114 randomly distributes the key field set into the standard data table set to generate a random landing data table set.

In at least one embodiment of the present invention, the distribution unit 114 randomly distributing the key field set into the standard data table set to generate the random landing data table set includes:

The distribution unit 114 determines the number of key fields in the key field set and generates a plurality of numerical values according to that number, the number of numerical values being equal to the number of key fields. The distribution unit 114 establishes a mapping relationship between the numerical values and the standard data table set, and randomly matches the numerical values with the key fields to obtain a matching result. According to the mapping relationship and the matching result, the distribution unit 114 distributes the key fields into the standard data table set to obtain the random landing data table set.

Through the above embodiment, randomly landing the key fields increases the speed of data processing.

The merging unit 115 merges the random landing data tables in the random landing data table set according to preset conditions to obtain an initialization data table set.

In at least one embodiment of the present invention, the preset conditions may be configured by developers according to the different requirements of the random landing data tables.

The preset conditions include, but are not limited to, single-table merging, multi-table merging, and adjacent-table merging.

Through the above embodiment, merging the random landing data tables according to the preset conditions reduces the difficulty of data initialization, indirectly increases the data processing speed, and provides strong support for project switching.

As can be seen from the above technical solutions, the present invention maps the historical data set imported into the distributed file system into a data table set, which improves the fault tolerance of the data. It then performs missing-value detection on the data in the data table set according to the historical data set to obtain a standard data table set, which ensures the accuracy and completeness of the data. It analyzes the data in the standard data table set according to the preset association dependency relationship to obtain the key field set of the data, and randomly distributes the key field set into the standard data table set to generate a random landing data table set, which increases the speed of data processing. Finally, it merges the random landing data tables in the random landing data table set according to preset conditions to obtain an initialization data table set, thereby initializing the big data.

FIG. 3 is a schematic structural diagram of an electronic device of a preferred embodiment that implements the big data initialization method of the present invention.

The electronic device 1 may include a memory 12, a processor 13 and a bus, and may further include a computer program stored in the memory 12 and executable on the processor 13, such as a big data initialization program.

Those skilled in the art will understand that the schematic diagram is merely an example of the electronic device 1 and does not limit it. The electronic device 1 may have either a bus structure or a star structure, and may include more or fewer hardware or software components than shown, or a different arrangement of components; for example, the electronic device 1 may further include input/output devices, network access devices, and the like.

It should be noted that the electronic device 1 is only an example. Other existing or future electronic products that are applicable to the present invention should also fall within the protection scope of the present invention and are incorporated herein by reference.

The memory 12 includes at least one type of readable storage medium, including flash memory, removable hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc, and the like. In some embodiments, the memory 12 may be an internal storage unit of the electronic device 1, for example a removable hard disk of the electronic device 1. In other embodiments, the memory 12 may be an external storage device of the electronic device 1, for example a plug-in removable hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 1. Further, the memory 12 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 12 can be used not only to store application software installed on the electronic device 1 and various types of data, such as the code of the big data initialization program, but also to temporarily store data that has been output or is about to be output.

In some embodiments, the processor 13 may consist of integrated circuits, for example a single packaged integrated circuit or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 13 is the control unit of the electronic device 1. It connects the components of the entire electronic device 1 through various interfaces and lines, and performs the functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 12 (for example, the big data initialization program) and calling the data stored in the memory 12.

The processor 13 executes the operating system of the electronic device 1 and the various installed application programs. The processor 13 executes the application programs to implement the steps in each of the above big data initialization method embodiments, for example steps S10, S11, S12, S13, S14 and S15 shown in FIG. 1.

Alternatively, when the processor 13 executes the computer program, it implements the functions of the modules/units in the above apparatus embodiments, for example:

acquiring a historical data set from a pre-built database, and importing the historical data set into a distributed file system;

mapping the imported historical data set into a data table set;

performing missing-value detection on the data in the data table set according to the historical data set to obtain a standard data table set;

analyzing the data in the standard data table set according to a preset association dependency relationship to obtain a key field set of the data;

randomly distributing the key field set into the standard data table set to generate a random landing data table set;

merging the random landing data tables in the random landing data table set according to preset conditions to obtain an initialization data table set.

For example, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program in the electronic device 1. For example, the computer program may be divided into the import unit 110, the mapping unit 111, the detection unit 112, the analysis unit 113, the distribution unit 114, the merging unit 115, the acquisition unit 116 and the determination unit 117.

The above integrated units implemented in the form of software functional modules may be stored in a computer-readable storage medium. The software functional modules are stored in a storage medium and include several instructions that cause a computer device (which may be a personal computer, a computer device, a network device, or the like) or a processor to execute part of the methods described in the embodiments of the present invention.

If the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the above method embodiments.

The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM).

The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one arrow is shown in FIG. 3, but this does not mean that there is only one bus or one type of bus. The bus is arranged to enable connection and communication between the memory 12 and at least one processor 13, among others.

Although not shown, the electronic device 1 may further include a power supply (such as a battery) that powers the components. Preferably, the power supply is logically connected to the at least one processor 13 through a power management device, so that charging management, discharging management, power consumption management and similar functions are implemented through the power management device. The power supply may also include one or more DC or AC power sources, a recharging device, a power failure detection circuit, a power converter or inverter, a power status indicator, and other such components. The electronic device 1 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described here.

Further, the electronic device 1 may include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), and is generally used to establish a communication connection between the electronic device 1 and other electronic devices.

Optionally, the electronic device 1 may further include a user interface, which may be a display or an input unit (such as a keyboard). Optionally, the user interface may also be a standard wired or wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be called a display screen or a display unit, and is used to display the information processed in the electronic device 1 and a visual user interface.

It should be understood that the embodiments are for illustration only, and the scope of the patent application is not limited by this structure.

FIG. 3 shows only the electronic device 1 with components 12-13. Those skilled in the art will understand that the structure shown in FIG. 3 does not limit the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.

With reference to FIG. 1, the memory 12 in the electronic device 1 stores a plurality of instructions to implement a big data initialization method, and the processor 13 can execute the plurality of instructions to implement:

acquiring a historical data set from a pre-built database, and importing the historical data set into a distributed file system;

mapping the imported historical data set into a data table set;

performing missing-value detection on the data in the data table set according to the historical data set to obtain a standard data table set;

analyzing the data in the standard data table set according to a preset association dependency relationship to obtain a key field set of the data;

randomly distributing the key field set into the standard data table set to generate a random landing data table set;

merging the random landing data tables in the random landing data table set according to preset conditions to obtain an initialization data table set.

Specifically, for how the processor 13 implements the above instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here.

在本发明所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. For example, the division of the modules is only a logical function division, and there may be other division manners in actual implementation.

The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.

It will be apparent to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be embodied in other specific forms without departing from its spirit or essential characteristics.

Therefore, the embodiments are to be regarded in all respects as illustrative and not restrictive. The scope of the present invention is defined by the appended claims rather than by the foregoing description, and all changes that fall within the meaning and range of equivalents of the claims are therefore intended to be embraced in the present invention. No reference sign in the claims shall be construed as limiting the claim concerned.

Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or apparatuses recited in a system claim may also be implemented by a single unit or apparatus through software or hardware. Words such as "second" are used to denote names and do not denote any particular order.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention may be modified or equivalently substituted without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A big data initialization method, characterized in that the method comprises:
obtaining a historical data set from a pre-built database, and importing the historical data set into a distributed file system;
mapping the imported historical data set into a data table set;
performing, according to the historical data set, missing value detection on the data in the data table set to obtain a standard data table set;
performing data analysis on the data in the standard data table set by means of a preset association-degree dependency relationship to obtain a key field set of the data;
randomly distributing the key field set into the standard data table set to generate a random landing data table set; and
merging the random landing data tables in the random landing data table set according to preset conditions to obtain an initialization data table set.

2. The big data initialization method according to claim 1, characterized in that importing the historical data set into the distributed file system comprises:
obtaining the IP address and the SID number of the server where the database is located;
logging in to the database according to the IP address and the SID number;
obtaining, from the database, the absolute path of the historical data set on the distributed file system; and
importing the historical data set into the distributed file system according to the absolute path.

3. The big data initialization method according to claim 1, characterized in that importing the historical data set into the distributed file system comprises:
obtaining attribute information of the data in the historical data set;
determining the priority of the data in the historical data set according to the attribute information; and
importing the historical data set into the distributed file system according to the priority.

4. The big data initialization method according to claim 2, characterized in that mapping the imported historical data set into a data table set comprises:
mapping the historical data set into a data table using a configuration tool;
checking, according to the absolute path of the historical data set on the distributed file system, whether the historical data set has been loaded into the data table; and
when the historical data set has been loaded into the data table, constructing the data table set using the data in the data table.

5. The big data initialization method according to claim 1, characterized in that performing, according to the historical data set, missing value detection on the data in the data table set to obtain a standard data table set comprises:
performing missing value detection on the data in the data table set using the missmap function;
when no missing value is detected in the data table set, determining the data table set as the standard data table set; or
when a missing value is detected in the data table set, filling the missing value using a maximum likelihood estimation algorithm to obtain the standard data table set.
6. The big data initialization method according to claim 5, characterized in that, when the missing values are filled using the maximum likelihood estimation algorithm to obtain the standard data table set, the following formula is used:
Figure FDA0002402557130000021
wherein L(θ) denotes the filled missing value, θ denotes the probability parameter corresponding to the missing value, n denotes the number of the historical data sets, and p(x_i|θ) denotes the probability of the missing value.
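The formula of claim 6 is available here only as an image reference, so the following Python sketch rests on two assumptions that are not confirmed by the text: that the formula is the standard likelihood L(θ) = ∏ p(x_i|θ) taken over i = 1..n, and that each column follows a Gaussian model with θ = (mean, variance). Under those assumptions the maximum likelihood estimate of the mean is the sample mean of the observed values, which is then used as the fill value. All function names are hypothetical.

```python
import math


def find_missing(column):
    """Return the indices of missing entries (None or NaN) in a column."""
    return [i for i, x in enumerate(column)
            if x is None or (isinstance(x, float) and math.isnan(x))]


def gaussian_mle(observed):
    """Maximum likelihood estimate of (mean, variance) for i.i.d. Gaussian samples."""
    n = len(observed)
    mean = sum(observed) / n
    var = sum((x - mean) ** 2 for x in observed) / n  # MLE divides by n, not n - 1
    return mean, var


def log_likelihood(observed, mean, var):
    """log L(theta) = sum_i log p(x_i | theta) for a Gaussian with theta = (mean, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
               for x in observed)


def fill_missing(column):
    """Fill missing entries with the Gaussian MLE of the mean (an assumed model)."""
    missing = set(find_missing(column))
    if not missing:
        return list(column)  # no missing values: the column is already "standard"
    observed = [x for i, x in enumerate(column) if i not in missing]
    if not observed:
        raise ValueError("cannot estimate parameters from an empty column")
    mean, _ = gaussian_mle(observed)
    return [mean if i in missing else x for i, x in enumerate(column)]


filled = fill_missing([1.0, None, 3.0])
```

For other data types, such as categorical columns, a different distribution would be needed; for a categorical model, for example, the maximum likelihood fill would be the most frequent observed value.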
7. The big data initialization method according to claim 1, characterized in that randomly distributing the key field set into the standard data table set to generate a random landing data table set comprises:
determining the number of key fields in the key field set;
generating a plurality of numerical values according to the number of key fields, the number of numerical values being equal to the number of key fields;
establishing a mapping relationship between the plurality of numerical values and the standard data table set;
randomly matching the plurality of numerical values with the key fields to obtain a matching result; and
distributing the key fields into the standard data table set according to the mapping relationship and the matching result to obtain the random landing data table set.

8. A big data initialization apparatus, characterized in that the apparatus comprises:
an importing unit, configured to obtain a historical data set from a pre-built database and import the historical data set into a distributed file system;
a mapping unit, configured to map the imported historical data set into a data table set;
a detection unit, configured to perform, according to the historical data set, missing value detection on the data in the data table set to obtain a standard data table set;
an analysis unit, configured to perform data analysis on the data in the standard data table set by means of a preset association-degree dependency relationship to obtain a key field set of the data;
a distribution unit, configured to randomly distribute the key field set into the standard data table set to generate a random landing data table set; and
a merging unit, configured to merge the random landing data tables in the random landing data table set according to preset conditions to obtain an initialization data table set.

9. An electronic device, characterized in that the electronic device comprises:
a memory storing at least one instruction; and
a processor executing the instructions stored in the memory to implement the big data initialization method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores at least one instruction, and the at least one instruction is executed by a processor in an electronic device to implement the big data initialization method according to any one of claims 1 to 7.
CN202010151374.9A 2020-03-06 2020-03-06 Big data initialization method and device, electronic equipment and storage medium Pending CN111444162A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010151374.9A CN111444162A (en) 2020-03-06 2020-03-06 Big data initialization method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010151374.9A CN111444162A (en) 2020-03-06 2020-03-06 Big data initialization method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111444162A true CN111444162A (en) 2020-07-24

Family

ID=71627349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010151374.9A Pending CN111444162A (en) 2020-03-06 2020-03-06 Big data initialization method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111444162A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544036A (en) * 2022-09-26 2022-12-30 网易(杭州)网络有限公司 Data updating method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951191A (en) * 2017-03-22 2017-07-14 江苏金易达供应链管理有限公司 Towards the big data storage method of auto service platform
CN110442647A (en) * 2019-07-29 2019-11-12 招商局金融科技有限公司 Data consistency synchronous method, device and computer readable storage medium
CN110633318A (en) * 2019-09-23 2019-12-31 北京锐安科技有限公司 Data extraction processing method, device, equipment and storage medium
CN110765119A (en) * 2019-10-21 2020-02-07 招商局金融科技有限公司 Client trend change presentation method and device and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
董树明等: "数据集成中的一种数据合并技术", 现代计算机, no. 11, 30 November 2003 (2003-11-30) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination