
CN115391333A - Data acquisition method and device - Google Patents


Info

Publication number
CN115391333A
Authority
CN
China
Prior art keywords
data
acquisition
incremental
various heterogeneous
sources
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210837084.9A
Other languages
Chinese (zh)
Inventor
安西平
徐辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Singularity Of Life Beijing Technology Co ltd
Original Assignee
Singularity Of Life Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Singularity Of Life Beijing Technology Co ltd
Priority to CN202210837084.9A
Publication of CN115391333A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data acquisition method and a data acquisition device. The method comprises the following steps: configuring an acquisition task and issuing the configured acquisition task; based on the issued acquisition task, incrementally acquiring data from various heterogeneous data sources using corresponding data acquisition tools; and storing the acquired data in a service database.

Description

A data acquisition method and device

Technical field

The present invention relates to the technical field of data processing, and more specifically to a data collection method and device.

Background

A data gateway is a class of device that interconnects multiple business data systems and enables the integration, sharing, and management of data among them. The intelligent medical and health data gateway (Intelligent Clinical&Health DataGateWay, hereinafter IDCHGW) is a device that provides data governance, reproduction, and data services for medical data, offers visual, intelligent, interactive, and programmable modes of operation, and provides secure data exchange capabilities.

Data collection is an indispensable part of building the intelligent medical and health data gateway. The gateway connects to the data sources of multiple medical information systems, and how to achieve full collection and aggregation of a hospital's multi-source heterogeneous data has become an urgent problem in the prior art.

Summary of the invention

In view of the deficiencies of the prior art, the present invention provides a data collection method and device.

According to one aspect of the present invention, a data collection method is provided, comprising:

configuring a collection task and publishing the configured collection task;

based on the published collection task, incrementally collecting data from various heterogeneous data sources using the corresponding data collection tool;

storing the collected data in a business database.

Optionally, incrementally collecting data from various heterogeneous data sources using the corresponding data collection tool comprises: incrementally collecting data from various heterogeneous data sources using DataX middleware.

Optionally, incrementally collecting data from various heterogeneous data sources using the corresponding data collection tool comprises:

collecting the incremental data of the various heterogeneous data sources into a temporary table using DataX middleware;

creating an update table, a new table, and a delete table;

inserting the data filtered from the temporary table into the update table, the new table, and the delete table respectively, by comparing the primary key of the temporary table with the md5 string;

dropping the update table, the new table, and the delete table when no data is inserted or when insertion has finished.

Optionally, collecting the incremental data of the various heterogeneous data sources into a temporary table using DataX middleware comprises:

based on OGG (Oracle GoldenGate) collection technology, identifying the incremental data of the various heterogeneous data sources from logs, and collecting the identified incremental data into the temporary table;

based on CDC (Change Data Capture) collection technology, identifying the incremental data of the various heterogeneous data sources from logs, and collecting the identified incremental data into the temporary table;

based on ETL (Extraction-Transformation-Loading) collection technology, identifying the incremental data of the various heterogeneous data sources by timestamp and primary key, and collecting the identified incremental data into the temporary table.

According to another aspect of the present invention, a data collection device is provided, comprising:

a configuration and publishing module, configured to configure a collection task and publish the configured collection task;

a data collection module, configured to incrementally collect data from various heterogeneous data sources using the corresponding data collection tool, based on the published collection task;

a data storage module, configured to store the collected data in a business database.

Optionally, the data collection module is specifically configured to: incrementally collect data from various heterogeneous data sources using DataX middleware.

Optionally, the data collection module is specifically configured to:

collect the incremental data of the various heterogeneous data sources into a temporary table using DataX middleware;

create an update table, a new table, and a delete table;

insert the data filtered from the temporary table into the update table, the new table, and the delete table respectively, by comparing the primary key of the temporary table with the md5 string;

drop the update table, the new table, and the delete table when no data is inserted or when insertion has finished.

Optionally, the data collection module is further specifically configured to:

based on OGG (Oracle GoldenGate) collection technology, identify the incremental data of the various heterogeneous data sources from logs, and collect the identified incremental data into the temporary table;

based on CDC (Change Data Capture) collection technology, identify the incremental data of the various heterogeneous data sources from logs, and collect the identified incremental data into the temporary table;

based on ETL (Extraction-Transformation-Loading) collection technology, identify the incremental data of the various heterogeneous data sources by timestamp and primary key, and collect the identified incremental data into the temporary table.

The present invention first configures a collection task and publishes the configured collection task. Then, based on the published collection task, it incrementally collects data from various heterogeneous data sources using the corresponding data collection tool. Finally, it stores the collected data in a business database. The present invention uses database synchronization technology and corresponding data collection tools to extract, synchronize, and aggregate data from various heterogeneous data sources, thereby achieving the collection and aggregation of a hospital's multi-source heterogeneous data.

Brief description of the drawings

Exemplary embodiments of the present invention can be understood more completely by referring to the following drawings:

Fig. 1 is a schematic flowchart of a data collection method provided by an exemplary embodiment of the present invention;

Fig. 2 is a framework diagram of a data collection service module to which the data collection method is applied, provided by an exemplary embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a data collection device provided by an exemplary embodiment of the present invention.

Detailed description

Hereinafter, exemplary embodiments of the present invention are described in detail with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention, and it should be understood that the present invention is not limited by the exemplary embodiments described here.

It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present invention.

Those skilled in the art will understand that terms such as "first" and "second" in the embodiments of the present invention are used only to distinguish different steps, devices, or modules; they neither carry any specific technical meaning nor indicate any necessary logical order among them.

It should also be understood that, in the embodiments of the present invention, "a plurality" may mean two or more, and "at least one" may mean one, two, or more.

It should also be understood that any component, data, or structure mentioned in the embodiments of the present invention can generally be understood as one or more, unless explicitly limited or unless the context suggests otherwise.

In addition, the term "and/or" in the present invention merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" in the present invention generally indicates an "or" relationship between the objects before and after it.

It should also be understood that the description of the embodiments emphasizes the differences between them; for their identical or similar parts, the embodiments may be referred to one another, and for brevity these parts are not repeated.

It should also be understood that, for ease of description, the dimensions of the parts shown in the drawings are not drawn to actual scale.

The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present invention, its application, or its uses.

Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be regarded as part of the description.

It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be discussed further in subsequent drawings.

Embodiments of the present invention can be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with such electronic devices include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.

Electronic devices such as terminal devices, computer systems, and servers may be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, object programs, components, logic, data structures, and so on, which perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media, including storage devices.

Exemplary method

Fig. 1 is a schematic flowchart of a data collection method provided by an exemplary embodiment of the present invention. This embodiment can be applied to an electronic device, for example but not limited to a data collection service system.

As shown in Fig. 1, the data collection method 100 includes the following steps:

Step 101: configure a collection task and publish the configured collection task.

In the embodiment of the present invention, the data to be collected is, for example but not limited to, hospital data. The hospital data can be obtained from multiple business data systems, each of which corresponds to one data source. Therefore, as shown in Fig. 2, before data is collected, the business unit can use the data source configuration management module to configure the data sources according to actual needs, that is, to determine which data sources to connect to. After configuration is complete, data is collected from the corresponding data sources according to the configuration information, for subsequent synchronization and aggregation.

Further, as shown in Fig. 2, the business unit can also use the collection task configuration management module to configure the corresponding collection tasks according to actual needs, and then publish the configured collection tasks through the collection task publishing module.
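
By way of non-limiting illustration, the configure-then-publish flow of step 101 could be sketched as follows. The class names `CollectionTask` and `TaskRegistry`, their fields, and the example source and table names are hypothetical and do not come from the embodiment; the sketch only shows a task being configured and each published configuration being retained as a numbered version.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class CollectionTask:
    """One configured collection task (all field names are illustrative)."""
    name: str
    source: str                 # identifier of a configured data source
    table: str                  # table or view to read from that source
    mode: str = "incremental"   # "full" or "incremental"


@dataclass
class TaskRegistry:
    """Hypothetical publishing module that keeps every published version."""
    published: Dict[str, List[CollectionTask]] = field(default_factory=dict)

    def publish(self, task: CollectionTask) -> int:
        """Publish a configured task and return its 1-based version number."""
        if task.mode not in ("full", "incremental"):
            raise ValueError(f"unknown mode: {task.mode}")
        versions = self.published.setdefault(task.name, [])
        versions.append(task)
        return len(versions)

    def latest(self, name: str) -> CollectionTask:
        """Return the most recently published version of a task."""
        return self.published[name][-1]


registry = TaskRegistry()
v1 = registry.publish(CollectionTask("lab_results", source="lis", table="lab_result"))
v2 = registry.publish(CollectionTask("lab_results", source="lis", table="v_lab_result"))
```

Keeping earlier versions rather than overwriting them is one simple way to let a later task reuse or roll back to a previous configuration.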

Step 102: based on the published collection task, incrementally collect data from various heterogeneous data sources using the corresponding data collection tool.

Optionally, incrementally collecting data from various heterogeneous data sources using the corresponding data collection tool comprises: incrementally collecting data from various heterogeneous data sources using DataX middleware.

In the embodiment of the present invention, the data collection tool is DataX, an offline synchronization tool for heterogeneous data sources that is designed to provide stable and efficient data synchronization among heterogeneous data sources including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.

In the embodiment of the present invention, data storage and computation use the CDH (Cloudera's Distribution Including Apache Hadoop) big data processing solution, which provides a scalable, flexible, and integrated enterprise-level big data management platform. The platform can be used to conveniently manage the hospital's rapidly growing and varied data, while providing security protection and integration with hardware and software solutions. It supports the storage of both structured and unstructured data in the medical industry.

In the embodiment of the present invention, data is accessed through tables or views: tables or views that already exist in the original business system are read directly, and SQL is used to filter and import the data. This is currently the most convenient and mature approach. The system uses DataX middleware as the data import component, which supports relational databases such as MySQL, Oracle, PostgreSQL, and SQL Server, as well as non-relational stores such as Hive and HBase, enabling interconnection among dozens of database types.
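
As a hedged illustration of the table-or-view access with SQL filtering described above, the following sketch builds a job description in the general shape used by DataX job files (a `job` object with `setting` and `content`, the latter pairing a reader with a writer). The connection URLs, credentials, table and column names, and the `_tmp` suffix for the temporary table are assumptions for illustration only, and the exact parameters accepted by each reader and writer plugin should be checked against the DataX plugin documentation.

```python
import json


def build_datax_job(src_url, dst_url, table, columns, where, user, password):
    """Return a DataX-style job description as a plain dict (illustrative)."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": user,
            "password": password,
            "column": columns,
            "where": where,  # SQL filter applied when reading the source
            "connection": [{"table": [table], "jdbcUrl": [src_url]}],
        },
    }
    writer = {
        "name": "mysqlwriter",
        "parameter": {
            "username": user,
            "password": password,
            "column": columns,
            "writeMode": "insert",
            # Assumed naming convention: stage rows in "<table>_tmp".
            "connection": [{"table": [table + "_tmp"], "jdbcUrl": dst_url}],
        },
    }
    return {
        "job": {
            "setting": {"speed": {"channel": 1}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }


job = build_datax_job(
    src_url="jdbc:mysql://his.example.invalid:3306/his",
    dst_url="jdbc:mysql://gateway.example.invalid:3306/staging",
    table="patient_visit",
    columns=["id", "patient_id", "visit_time", "updated_at"],
    where="updated_at > '2022-07-01 00:00:00'",
    user="collector",
    password="change-me",
)
job_json = json.dumps(job, indent=2)
```

The resulting JSON text would typically be written to a file and passed to the DataX launcher; how the job is scheduled and run is outside this sketch.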

Optionally, incrementally collecting data from various heterogeneous data sources using the corresponding data collection tool comprises: collecting the incremental data of the various heterogeneous data sources into a temporary table using DataX middleware; creating an update table, a new table, and a delete table; inserting the data filtered from the temporary table into the update table, the new table, and the delete table respectively, by comparing the primary key of the temporary table with the md5 string; and dropping the update table, the new table, and the delete table when no data is inserted or when insertion has finished.

Optionally, collecting the incremental data of the various heterogeneous data sources into a temporary table using DataX middleware comprises: based on OGG (Oracle GoldenGate) collection technology, identifying the incremental data of the various heterogeneous data sources from logs, and collecting the identified incremental data into the temporary table; based on CDC (Change Data Capture) collection technology, identifying the incremental data of the various heterogeneous data sources from logs, and collecting the identified incremental data into the temporary table; based on ETL (Extraction-Transformation-Loading) collection technology, identifying the incremental data of the various heterogeneous data sources by timestamp and primary key, and collecting the identified incremental data into the temporary table.
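
The timestamp-and-primary-key identification used by the ETL approach can be illustrated with a minimal sketch. It uses an in-memory SQLite database in place of a real business system, and the table names `source_table` and `tmp_increment`, the column names, and the idea of storing a (timestamp, primary key) watermark from the previous run are assumptions made for the example, not details given by the embodiment.

```python
import sqlite3


def fetch_increment(conn, last_ts, last_id):
    """Return rows changed after the (timestamp, primary key) watermark."""
    cur = conn.execute(
        "SELECT id, payload, updated_at FROM source_table "
        "WHERE updated_at > ? OR (updated_at = ? AND id > ?) "
        "ORDER BY updated_at, id",
        (last_ts, last_ts, last_id),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_table (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO source_table VALUES (?, ?, ?)",
    [
        (1, "a", "2022-07-01 08:00:00"),
        (2, "b", "2022-07-01 09:00:00"),
        (3, "c", "2022-07-01 09:00:00"),
        (4, "d", "2022-07-01 10:00:00"),
    ],
)
conn.execute(
    "CREATE TABLE tmp_increment (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
)

# Watermark left by the previous run: everything up to (09:00:00, id 2) was collected.
watermark = ("2022-07-01 09:00:00", 2)
rows = fetch_increment(conn, *watermark)
conn.executemany("INSERT INTO tmp_increment VALUES (?, ?, ?)", rows)

# Advance the watermark to the last row collected in this run.
new_watermark = (rows[-1][2], rows[-1][0]) if rows else watermark
```

Using the primary key as a tie-breaker means rows that share the watermark timestamp but were not yet collected (row 3 here) are still picked up, which a timestamp-only comparison would miss.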

In the embodiment of the present invention, incremental data collection is built on three approaches, OGG, CDC, and ETL, which makes it possible both to separate reads from writes in the business system and to back up data in real time. The incremental data model is built on the hospital's full data model; that is, the incremental data model is created to be consistent with the full data model, so that newly added data can be collected quickly into the existing data platform. The main steps of incremental data collection are as follows:

1) Create a temporary table.

2) Collect the incremental data into the temporary table through DataX (the data collection tool). OGG, CDC, and ETL identify incremental data in different ways: CDC and OGG use logs, whereas ETL uses timestamps and primary keys.

3) Create the update, new, and delete tables.

4) By comparing the primary key with the md5 string, insert the filtered data into the update, new, and delete tables respectively.

5) Insert the data from the three tables; drop the tables if there is no data or once insertion has finished.
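
One possible reading of steps 3) and 4) is sketched below, using in-memory dictionaries in place of database tables. The sketch assumes that each staged row is matched to the target by primary key, that an md5 digest of the remaining column values detects changed content, and that the temporary table holds every key still present at the source so that missing keys can be treated as deletions. These assumptions, along with the function names `row_md5` and `classify`, are illustrative and are not stated by the embodiment.

```python
import hashlib


def row_md5(row):
    """md5 digest over a row's non-key values in a fixed column order."""
    joined = "\x1f".join("" if v is None else str(v) for v in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def classify(temp_rows, target_rows):
    """Split staged rows into new, update, and delete sets.

    Both arguments map primary key -> tuple of non-key column values.
    """
    target_hash = {pk: row_md5(row) for pk, row in target_rows.items()}
    new, update = {}, {}
    for pk, row in temp_rows.items():
        if pk not in target_hash:
            new[pk] = row                      # key unknown to the target
        elif row_md5(row) != target_hash[pk]:
            update[pk] = row                   # same key, different content
    # Keys present in the target but absent from the staged data.
    delete = sorted(pk for pk in target_rows if pk not in temp_rows)
    return new, update, delete


target = {1: ("Zhang", "A"), 2: ("Li", "B"), 3: ("Wang", "O")}
temp = {1: ("Zhang", "A"), 2: ("Li", "AB"), 4: ("Zhao", "B")}
new, update, delete = classify(temp, target)
```

Comparing one digest per row avoids a column-by-column comparison; with log-based sources such as OGG or CDC, deletions would more likely arrive as explicit events rather than being inferred from missing keys.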

Step 103: store the collected data in the business database.

The embodiment of the present invention supports one-stop configuration of full and incremental tasks (by timestamp, auto-increment primary key, expression, and so on), which greatly improves configuration efficiency and reduces the configuration error rate. Adaptation and extraction tools are used to import the data tables relevant to clinical research from hospital systems such as HIS, LIS, RIS/PACS, surgical anesthesia, pathology, ECG, and intensive care (ICU/CCU) into the original master database of the medical data platform. The embodiment supports flexible configuration of multiple data sources and database connectivity testing, and supports flexibly adding a timestamp field to the collection target model, which forms a complete closed loop with incremental tasks. It supports publishing collection tasks in different versions, which improves reuse when new tasks are created later, and supports directly configuring scheduled tasks to cover multi-task scenarios with scheduling strategies such as manual execution and timed execution.

In the embodiment of the present invention, database synchronization technology, ETL, and related technologies are used to extract, synchronize, and aggregate data, achieving the collection and aggregation of a hospital's multi-source heterogeneous data. The embodiment supports seamless integration with more than 150 mainstream medical information system vendors and intelligent mapping of more than 300 kinds of medical information system data sources. It integrates with hospital production systems, performing full collection of historical data from clinical systems such as the HIS system, EMR system, PACS, RIS system, laboratory information system, ultrasound information management system, and pathology system, and it can integrate new data sources promptly.

Thus, the present invention first configures a collection task and publishes the configured collection task. Then, based on the published collection task, it incrementally collects data from various heterogeneous data sources using the corresponding data collection tool. Finally, it stores the collected data in the business database. The present invention uses database synchronization technology and corresponding data collection tools to extract, synchronize, and aggregate data from various heterogeneous data sources, thereby achieving the collection and aggregation of a hospital's multi-source heterogeneous data.

Exemplary device

Fig. 3 is a schematic structural diagram of a data collection device provided by an exemplary embodiment of the present invention. As shown in Fig. 3, the data collection device 300 includes:

a configuration and publishing module 310, configured to configure a collection task and publish the configured collection task;

a data collection module 320, configured to incrementally collect data from various heterogeneous data sources using the corresponding data collection tool, based on the published collection task;

a data storage module 330, configured to store the collected data in a business database.

Optionally, the data collection module 320 is specifically configured to: incrementally collect data from various heterogeneous data sources using DataX middleware.

Optionally, the data collection module 320 is specifically configured to:

collect the incremental data of the various heterogeneous data sources into a temporary table using DataX middleware;

create an update table, a new table, and a delete table;

insert the data filtered from the temporary table into the update table, the new table, and the delete table respectively, by comparing the primary key of the temporary table with the md5 string;

drop the update table, the new table, and the delete table when no data is inserted or when insertion has finished.

Optionally, the data collection module 320 is further specifically configured to:

based on OGG (Oracle GoldenGate) collection technology, identify the incremental data of the various heterogeneous data sources from logs, and collect the identified incremental data into the temporary table;

based on CDC (Change Data Capture) collection technology, identify the incremental data of the various heterogeneous data sources from logs, and collect the identified incremental data into the temporary table;

based on ETL (Extraction-Transformation-Loading) collection technology, identify the incremental data of the various heterogeneous data sources by timestamp and primary key, and collect the identified incremental data into the temporary table.

The data collection device 300 of this embodiment corresponds to the data collection method 100 of another embodiment of the present invention, and details are not repeated here.

Claims (8)

1. A method of data acquisition, comprising:
configuring the acquisition task and issuing the configured acquisition task;
based on the issued acquisition task, acquiring data from various heterogeneous data sources in an incremental manner by using corresponding data acquisition tools;
and storing the collected data in a service database.
2. The method of claim 1, wherein incrementally collecting data from a variety of disparate data sources using a corresponding data collection tool comprises: data is incrementally acquired from various heterogeneous data sources using DataX middleware.
3. The method of claim 1, wherein incrementally acquiring data from various heterogeneous data sources by using corresponding data acquisition tools comprises:
collecting incremental data of the various heterogeneous data sources into a temporary table by using DataX middleware;
creating an update table, a new table and a delete table;
inserting the screened data in the temporary table into the update table, the new table and the delete table, respectively, by comparing the primary keys and MD5 strings of the temporary table;
deleting the update table, the new table and the delete table when no data is inserted or the insertion has finished.
4. The method of claim 3, wherein collecting incremental data of the various heterogeneous data sources into a temporary table by using DataX middleware comprises:
identifying incremental data of the various heterogeneous data sources by means of logs, based on the OGG (Oracle GoldenGate) acquisition technology, and collecting the identified incremental data into the temporary table;
identifying incremental data of the various heterogeneous data sources by means of logs, based on the CDC (Change Data Capture) acquisition technology, and collecting the identified incremental data into the temporary table; or
identifying incremental data of the various heterogeneous data sources by means of timestamps and primary keys, based on the ETL (Extraction-Transformation-Loading) acquisition technology, and collecting the identified incremental data into the temporary table.
5. A data acquisition device, comprising:
a configuration and release module, configured to configure an acquisition task and issue the configured acquisition task;
a data acquisition module, configured to incrementally acquire data from various heterogeneous data sources by using corresponding data acquisition tools, based on the issued acquisition task;
and a data storage module, configured to store the acquired data in a service database.
6. The device of claim 5, wherein the data acquisition module is specifically configured for: incrementally acquiring data from the various heterogeneous data sources by using DataX middleware.
7. The device of claim 5, wherein the data acquisition module is specifically configured for:
collecting incremental data of the various heterogeneous data sources into a temporary table by using DataX middleware;
creating an update table, a new table and a delete table;
inserting the screened data in the temporary table into the update table, the new table and the delete table, respectively, by comparing the primary keys and MD5 strings of the temporary table;
deleting the update table, the new table and the delete table when no data is inserted or the insertion has finished.
8. The device of claim 7, wherein the data acquisition module is further specifically configured for:
identifying incremental data of the various heterogeneous data sources by means of logs, based on the OGG (Oracle GoldenGate) acquisition technology, and collecting the identified incremental data into the temporary table;
identifying incremental data of the various heterogeneous data sources by means of logs, based on the CDC (Change Data Capture) acquisition technology, and collecting the identified incremental data into the temporary table;
identifying incremental data of the various heterogeneous data sources by means of timestamps and primary keys, based on the ETL (Extraction-Transformation-Loading) acquisition technology, and collecting the identified incremental data into the temporary table.
CN202210837084.9A 2022-07-15 2022-07-15 Data acquisition method and device Pending CN115391333A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210837084.9A CN115391333A (en) 2022-07-15 2022-07-15 Data acquisition method and device

Publications (1)

Publication Number Publication Date
CN115391333A true CN115391333A (en) 2022-11-25

Family

ID=84116712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210837084.9A Pending CN115391333A (en) 2022-07-15 2022-07-15 Data acquisition method and device

Country Status (1)

Country Link
CN (1) CN115391333A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105488187A (en) * 2015-12-02 2016-04-13 北京四达时代软件技术股份有限公司 Method and device for extracting multi-source heterogeneous data increment
CN111881136A (en) * 2020-07-29 2020-11-03 山东健康医疗大数据有限公司 An Approach to Incremental Data Governance in the Healthcare Industry
CN112433998A (en) * 2020-11-20 2021-03-02 广东电网有限责任公司佛山供电局 Multisource heterogeneous data acquisition and convergence system and method based on power system
CN113407538A (en) * 2021-06-17 2021-09-17 北京计算机技术及应用研究所 Incremental acquisition method for data of multi-source heterogeneous relational database
KR20220013108A (en) * 2020-07-24 2022-02-04 주식회사 레드우드케이 System for providing intergration platform for collecting, processing and storaging of bigdata

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination