
CN109815295A - Distributed cluster data import method and device - Google Patents

Distributed cluster data import method and device

Info

Publication number
CN109815295A
Authority
CN
China
Prior art keywords
data
load
file
foreigntablescan
operator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910119281.5A
Other languages
Chinese (zh)
Inventor
刘欣然
张鸿
惠榛
吕雁飞
马秉楠
冷健全
王鸿翔
高峰
李恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN201910119281.5A
Publication of CN109815295A


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed cluster data import method and device. The method includes: a data node receives a data load command issued by the Master node and starts a ForeignTableScan operator to connect to a file load process; based on a preconfigured external table, the ForeignTableScan operator sends the data to be requested and the related external file information to the file load process, where the file load process is deployed on a third-party ETL server; the file load process reads the data files sequentially according to the information sent by the data node and sends the data to the data node; after the ForeignTableScan operator of the data node receives the data, the data is stored locally.

Description

Distributed cluster data import method and device
Technical field
The present invention relates to the field of computers, and in particular to a distributed cluster data import method and device.
Background
The main features of a distributed cluster database are fast loading of massive data and fast response to complex queries. Fast data loading is therefore of great importance to a distributed database. The distributed database KingbaseAnalyticsDB stores and processes large volumes of data by distributing data and processing work across multiple servers or hosts. KingbaseAnalyticsDB is built on multiple single-node databases that cooperate and present themselves to the user as a single database.
Fig. 1 shows the components that make up a KingbaseAnalyticsDB database system. The Master node is the entry point of the system: it is the database instance to which clients connect and submit SQL statements. The Master coordinates its own work with that of the other database instances in the system. These instances are called data nodes (Segment nodes) and store and process the actual data. Each KingbaseAnalyticsDB Segment instance is an independent database; each Segment node stores a portion of the data and performs most of the query processing. When a user connects to the database and issues a query through the Master node, each Segment node creates processes to handle that query. User-defined tables and their indexes are distributed across all available Segment nodes in the system, and each Segment stores a different portion of the data. Users interact with the Segment nodes through the Master node. The Master node is also called the management node, and a Segment node is also called a data node or compute node.
The Copy command loads data from files in the file system into the database. On the Master node, the Copy command first parses the data file row by row, assembles each row into a tuple in the database's internal storage format, computes from the table's distribution key the data node to which the tuple should be sent, and finally has that data node store the data.
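The Master-side routing described above can be illustrated with a small sketch. This is not the KingbaseAnalyticsDB implementation: the CRC32-modulo rule, the CSV row format, and the function names `target_segment` and `copy_on_master` are all assumptions chosen for illustration. The point it shows is that every row is parsed and routed by one serial loop on the Master.

```python
import csv
import io
import zlib
from collections import defaultdict


def target_segment(dist_key: str, num_segments: int) -> int:
    """Map a distribution-key value to a segment index (hypothetical hash rule)."""
    return zlib.crc32(dist_key.encode("utf-8")) % num_segments


def copy_on_master(text: str, key_column: int, num_segments: int) -> dict:
    """Serially parse every row on the Master and group rows by target segment."""
    routed = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        segment = target_segment(row[key_column], num_segments)
        routed[segment].append(tuple(row))  # in a real system: send to that node
    return dict(routed)
```

Rows that share a distribution-key value always land on the same segment, but the parsing itself never leaves the Master, which is the serial step the drawbacks below refer to.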
This is the conventional way to load external data, and it can also be used in a distributed database. In practice, however, it has the following drawbacks:
1. The Copy command requires the Master node to parse the data first and to compute, from the table's distribution policy, which data node each row is sent to. Copy processes rows serially, so it cannot make full use of data node resources; the data nodes sit idle much of the time, which keeps load performance low.
2. The Master node easily becomes a bottleneck. The Master node is the entry point of the distributed database, and every query passes through it. A single connection executing a Copy command to load massive data can occupy substantial hardware resources, and under high concurrency the Master node becomes the bottleneck of the distributed database. This degrades data load performance and also increases the response time of other SQL statements.
3. The data files read by the Copy command must reside on the Master node's host. To use the Copy command, the user must first upload the data files to that host, which adds to its storage load and hurts usability.
Summary of the invention
Embodiments of the present invention provide a distributed cluster data import method and device to solve the prior-art problem of slow data loading in distributed database cluster systems.
An embodiment of the present invention provides a distributed cluster data import method, comprising:
A data node receives a data load command issued by the Master node and starts a ForeignTableScan operator to connect to a file load process; based on a preconfigured external table, the ForeignTableScan operator sends the data to be requested and the related external file information to the file load process, where the file load process is deployed on a third-party ETL server;
The file load process reads the data files sequentially according to the information sent by the data node and sends the data to the data node;
After the ForeignTableScan operator of the data node receives the data, the data is stored locally.
Preferably, the external table stores information about the load process, specifically including: the load process port number, the IP address, and the list of external files to be loaded.
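As a rough illustration of what such an external table might record, the sketch below models the three items named above (port number, IP address, file list) as plain data classes. The class names, field names, and the `locations` helper are hypothetical; the patent does not specify a catalog layout or a URL scheme.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class LoaderEndpoint:
    """Address of one file load process (hypothetical structure)."""
    ip: str
    port: int


@dataclass
class ExternalTable:
    """Metadata an external table could hold about its load processes."""
    name: str
    loaders: List[LoaderEndpoint]
    files: List[str] = field(default_factory=list)

    def locations(self) -> List[str]:
        """List one URL per (loader, file) pair; the URL form is an assumption."""
        return [
            f"http://{ep.ip}:{ep.port}/{name}"
            for ep in self.loaders
            for name in self.files
        ]
```

For example, a table with one loader at 10.0.0.5:8080 and two files yields two locations, one per file.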
Preferably, starting the ForeignTableScan operator to connect to the file load process specifically comprises:
Starting the ForeignTableScan operator;
The ForeignTableScan operator connects to the file load processes in round-robin (polling) fashion according to its own node ID.
Preferably, storing the data locally after the ForeignTableScan operator of the data node receives the data specifically comprises:
The ForeignTableScan operator of the data node interprets the data returned by the file load process, converts it into internal tuples, stores them locally, and carries out the next SQL operation.
An embodiment of the present invention also provides a distributed cluster data import device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the above method.
With the embodiments of the present invention, when massive data is loaded, the data bypasses the Master node and is inserted directly into the Segment nodes. All Segment nodes can receive and process data at the same time, which makes full use of data node hardware resources; the concurrency of the file load processes can also be increased, substantially improving data loading speed.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features, and advantages of the invention are easier to understand, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a diagram of the components that make up a KingbaseAnalyticsDB database system in the prior art;
Fig. 2 is a schematic diagram of the overall structure of external data loading in a distributed database according to an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the invention provide a distributed cluster data import method and device that enable fast loading of massive user data. Within the cluster, the file load processes make full use of data node resources, and the data flow bypasses the Master and is loaded directly into the data nodes.
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.
The overall structure of external data loading in the distributed database is shown in Fig. 2. The embodiment of the present invention provides two parts:
1. File load process: the file load process is an HTTP service process, and each compute node (data node) acts as an HTTP client. The Master node issues the data load command to the compute nodes. After receiving the load command, a compute node sends the data to be requested and the related external file information to the file load process. After receiving the HTTP request, the load process reads the data files sequentially according to the information sent by the client and sends the data to the compute node. Once the compute node has received the data, the remaining steps are similar to those of the Copy command executed on the Master node, except that the data is stored locally and is not forwarded to other nodes.
2. An external table mechanism in the cluster. The external table records information about the load processes, including the load process port number, the IP address, and the list of external files to be loaded. With the external table mechanism, the cluster can start a ForeignTableScan operator on each compute node, and the ForeignTableScan operators connect to the file load processes in parallel and load data concurrently.
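The HTTP relationship between the two parts can be sketched with Python's standard library. This is a minimal illustration under stated assumptions, not the patented implementation: the in-memory `DATA` mapping stands in for files on the ETL server, the request path doubles as the file name, and `LoaderHandler`, `start_loader`, and `fetch_from_loader` are hypothetical names.

```python
import http.client
import http.server
import threading

# Stand-in for data files on the third-party ETL server (hypothetical content).
DATA = {"/orders.csv": b"1,apple\n2,pear\n3,plum\n"}


class LoaderHandler(http.server.BaseHTTPRequestHandler):
    """File load process: answers a compute node's GET with the file data."""

    def do_GET(self):
        body = DATA.get(self.path)
        if body is None:
            self.send_error(404, "unknown file")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def start_loader() -> http.server.ThreadingHTTPServer:
    """Start the load service on a free local port in a background thread."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), LoaderHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_from_loader(host: str, port: int, path: str) -> bytes:
    """Compute node side: request a file from the load process over HTTP."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        if resp.status != 200:
            raise IOError(f"loader returned HTTP {resp.status}")
        return resp.read()
    finally:
        conn.close()
```

Because each compute node opens its own connection, several nodes can pull data at once without any of it passing through the Master.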
The file loader program (process) and the external table access module of the distributed database are described below.
File loader program:
The file load process runs on a third-party ETL server. The file loader program executes as follows:
1) Start a service on a specified network port
2) The compute nodes of the cluster connect to the network service
3) The compute nodes of the cluster send read commands
4) The file load process sends a fixed amount of data to the compute node (the size is configurable, for example 4 MB)
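Step 4 can be pictured as a dispenser that hands out consecutive, bounded chunks of a file to whichever compute node asks next. The sketch below is an assumption-laden illustration: the `ChunkDispenser` class, the lock-based sequencing, and the choice to extend each chunk to the next newline (so that no row is split between nodes) are not stated in the patent.

```python
import io
import threading


class ChunkDispenser:
    """Hands out consecutive chunks of a data stream, each ending on a row boundary."""

    def __init__(self, stream, chunk_size: int = 4 * 1024 * 1024):
        self._stream = stream          # binary file-like object
        self._chunk_size = chunk_size  # target size; 4 MB mirrors the example above
        self._lock = threading.Lock()  # serialize reads across concurrent requests

    def next_chunk(self) -> bytes:
        """Return the next chunk, or b"" once the stream is exhausted."""
        with self._lock:
            chunk = self._stream.read(self._chunk_size)
            if chunk and not chunk.endswith(b"\n"):
                chunk += self._stream.readline()  # finish the partial row
            return chunk
```

A chunk can therefore be slightly larger than `chunk_size`, by at most one row. With `ChunkDispenser(io.BytesIO(b"1,a\n2,bb\n3,ccc\n"), chunk_size=5)`, successive calls return two whole rows, then the last row, then an empty byte string.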
External table access:
The cluster system implements the ForeignTableScan operator. In the optimizer, if the table being accessed is an external table, this operator is used to access its data.
The ForeignTableScan operator works as follows.
1) In ForeignTableScan, connect to the data load service processes in round-robin (polling) fashion according to the node's own ID
2) Send the read-data command
3) Interpret the returned data, convert it into internal tuples, and carry out the next SQL operation
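The three operator steps can be sketched as a small generator. Everything named here is hypothetical: `pick_loader` assumes that "round-robin by node ID" means node ID modulo the number of load processes, `read_chunk` is a caller-supplied stand-in for the read command, and tuples of strings stand in for the database's internal tuple format.

```python
from typing import Callable, Iterator, List, Tuple


def pick_loader(node_id: int, loaders: List[str]) -> str:
    """Step 1: choose a load process by node ID (assumed modulo rule)."""
    return loaders[node_id % len(loaders)]


def parse_rows(chunk: bytes, delimiter: str = ",") -> List[Tuple[str, ...]]:
    """Step 3: turn delimited text rows into tuples (stand-in for internal tuples)."""
    lines = chunk.decode("utf-8").splitlines()
    return [tuple(line.split(delimiter)) for line in lines if line]


def foreign_table_scan(
    node_id: int,
    loaders: List[str],
    read_chunk: Callable[[str], bytes],
) -> Iterator[Tuple[str, ...]]:
    """Yield tuples for the next SQL operation until the loader has no more data."""
    loader = pick_loader(node_id, loaders)
    while True:
        chunk = read_chunk(loader)  # step 2: send the read command
        if not chunk:
            break
        yield from parse_rows(chunk)
```

Writing the operator as a generator mirrors how a scan node feeds rows to the operator above it in a query plan: the consumer pulls tuples and the scan fetches another chunk only when needed.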
In conclusion the embodiment of the present invention loads process by third party's file, make data flow without Master node It is directly entered in back end, takes full advantage of back end resource, greatly improve data loading.Meanwhile increasing When multiple file load processes raising data load concurrent, Master node load can't be made excessively high, become entire database The bottleneck of system.External data file can be placed on third party's load machine, also reduce storage pressure.
Obviously, those skilled in the art should be understood that each module of the above invention or each step can be with general Computing device realize that they can be concentrated on a single computing device, or be distributed in multiple computing devices and formed Network on, optionally, they can be realized with the program code that computing device can perform, it is thus possible to which they are stored It is performed by computing device in the storage device, and in some cases, it can be to be different from shown in sequence execution herein Out or description the step of, perhaps they are fabricated to each integrated circuit modules or by them multiple modules or Step is fabricated to single integrated circuit module to realize.In this way, the present invention is not limited to any specific hardware and softwares to combine.
The foregoing is only a preferred embodiment of the present invention, is not intended to restrict the invention, for the skill of this field For art personnel, the invention may be variously modified and varied.All within the spirits and principles of the present invention, made any to repair Change, equivalent replacement, improvement etc., should all be included in the protection scope of the present invention.

Claims (5)

1. A distributed cluster data import method, characterized by comprising:
A data node receives a data load command issued by a Master node and starts a ForeignTableScan operator to connect to a file load process; based on a preconfigured external table, the ForeignTableScan operator sends the data to be requested and the related external file information to the file load process, wherein the file load process is deployed on a third-party ETL server;
The file load process reads the data files sequentially according to the information sent by the data node and sends the data to the data node;
After the ForeignTableScan operator of the data node receives the data, the data is stored locally.
2. The method of claim 1, characterized in that the external table stores information about the load process, specifically including: the load process port number, the IP address, and the list of external files to be loaded.
3. The method of claim 1, characterized in that starting the ForeignTableScan operator to connect to the file load process specifically comprises:
Starting the ForeignTableScan operator;
The ForeignTableScan operator connects to the file load processes in round-robin (polling) fashion according to its own node ID.
4. The method of claim 1, characterized in that storing the data locally after the ForeignTableScan operator of the data node receives the data specifically comprises:
The ForeignTableScan operator of the data node interprets the data returned by the file load process, converts it into internal tuples, stores them locally, and carries out the next SQL operation.
5. A distributed cluster data import device, characterized by comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the method of any one of claims 1 to 4.
CN201910119281.5A 2019-02-18 2019-02-18 Distributed cluster data import method and device Pending CN109815295A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910119281.5A CN109815295A (en) 2019-02-18 2019-02-18 Distributed cluster data import method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910119281.5A CN109815295A (en) 2019-02-18 2019-02-18 Distributed cluster data import method and device

Publications (1)

Publication Number Publication Date
CN109815295A true CN109815295A (en) 2019-05-28

Family

ID=66606853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910119281.5A Pending CN109815295A (en) 2019-02-18 2019-02-18 Distributed cluster data import method and device

Country Status (1)

Country Link
CN (1) CN109815295A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342885A (en) * 2021-06-15 2021-09-03 深圳前海微众银行股份有限公司 Data import method, device, equipment and computer program product
CN118820355A (en) * 2024-04-22 2024-10-22 中国移动通信集团设计院有限公司 Method, device, medium and product for loading external data of distributed database

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106790489A (en) * 2016-12-14 2017-05-31 成都华为技术有限公司 Parallel data loading method and system
US9898469B1 (en) * 2014-02-28 2018-02-20 Pivotal Software, Inc. Parallel streaming of external data
CN107885780A (en) * 2017-10-12 2018-04-06 北京人大金仓信息技术股份有限公司 A performance data collection method for distributed query execution
CN107885460A (en) * 2017-10-12 2018-04-06 北京人大金仓信息技术股份有限公司 A data access method for a cluster

Similar Documents

Publication Publication Date Title
US12105703B1 (en) System and method for interacting with a plurality of data sources
US9721116B2 (en) Test sandbox in production systems during productive use
CN102375837B (en) Data acquiring system and method
US12229119B2 (en) Multiple index scans
US7966349B2 (en) Moving records between partitions
US20080189252A1 (en) Hardware accelerated reconfigurable processor for accelerating database operations and queries
US20100030995A1 (en) Method and apparatus for applying database partitioning in a multi-tenancy scenario
WO2015030767A1 (en) Queries involving multiple databases and execution engines
CN103455526A (en) ETL (extract-transform-load) data processing method, device and system
CN107783985A (en) A kind of distributed networks database query method, apparatus and management system
CN105740264A (en) Distributed XML database sorting method and apparatus
US9613129B2 (en) Localized data affinity system and hybrid method
US9672231B2 (en) Concurrent access for hierarchical data storage
US20140095508A1 (en) Efficient selection of queries matching a record using a cache
CN102737061B (en) Distributed ticket query management system and method
CN113157692A (en) Relational memory database system
EP1808779B1 (en) Bundling database
CN113868267B (en) Method for injecting time sequence data, method for inquiring time sequence data and database system
CN109815295A (en) Distributed cluster data import method and device
CN111488323B (en) Data processing method and device and electronic equipment
US9129037B2 (en) Disappearing index for more efficient processing of a database query
CN107783728A (en) Date storage method, device and equipment
JP5464017B2 (en) Distributed memory database system, database server, data processing method and program thereof
US7392359B2 (en) Non-blocking distinct grouping of database entries with overflow
CN116483892A (en) Serial number generation method, device, computer equipment and readable storage medium

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20190528)