CN106503457B

CN106503457B - Data import method for clinical data integration based on a translational medicine analysis platform

Info

Publication number: CN106503457B
Application number: CN201610947256.2A
Authority: CN
Inventors: 金涛; 吕成; 王建民
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2016-10-26
Filing date: 2016-10-26
Publication date: 2018-12-11
Anticipated expiration: 2036-10-26
Also published as: CN106503457A

Abstract

The invention relates to a data import method for clinical data integration based on a translational medicine analysis platform, and belongs to the field of data integration technology. The method is used to import text-type data and data held in source databases. It comprises: preparing a data file, a column mapping file, a text replacement file, and a file used to build a concept tree; building the concept tree; verifying whether the data in the column mapping file satisfies the set rules; reading and processing the data a first time, adding the identifiers of custom variables, and constructing the concept tree again; and importing patient ID, visit ID, and patient information data into the platform database while handling data anomalies, network anomalies, and platform database system anomalies. The method builds on tranSMART Batch by constructing a concept tree in memory, adding custom variables, reading data directly from a database, and adding an exception handling mechanism, which gives it a high degree of automation and strong adaptability.

Description

Data import method for clinical data integration based on a translational medicine analysis platform

Technical field

The invention belongs to the technical field of data integration, and in particular relates to a tranSMART clinical data integration tool with a high degree of automation and strong adaptability.

Background art

In recent years, with the introduction of the concept of precision medicine and the rapid development of related industries, the clinical data and genetic data generated during medical care have steadily accumulated, and effectively analyzing the medical information contained in these data has become increasingly important in the medical field.

tranSMART (a translational medicine analysis platform) was originally built to give pharmaceutical companies and drug research and development organizations a collaborative data management platform. It was later released as open source software, and the tranSMART Foundation took over its ongoing development and maintenance. The platform is built on the I2B2 clinical data warehouse, so tranSMART can manage structured clinical data such as demographic information and clinicopathological data.

tranSMART lets different groups use the same data warehouse to integrate internal and external data resources, and provides system-level tools to search, view, and analyze the data in the warehouse. tranSMART includes the following features:

Search function. Using the Solr search service, researchers can search for resources of interest much as they would with a search engine such as Google or Baidu. Searches can target a single category, such as disease, gene, or clinical pathway. Complex search criteria can also be defined for the categories being searched, and saved for later use.

Dataset management function. This is based on the I2B2 design. Users can define their own subsets of a study, and can define criteria and compare the differences between two subsets. The dataset manager can save the criteria used to build a subset, and supports analyzing the data with a variety of algorithms; for example, principal component analysis can be used to reduce the dimensionality of a dataset.

Gene signature wizard function. tranSMART provides a wizard to help create and define gene signatures and gene lists. Saved gene signatures can be used in the analysis functions of the dataset manager. A gene signature can be private, so that only its owner can access it, or be made public so that all users can use it. An existing gene signature can also be cloned.

tranSMART is developed in Groovy, which can be regarded as a scripting language that runs on the JVM. Groovy and Java can call each other and integrate well, while Groovy keeps the convenience and flexibility of a scripting language. Many Groovy features greatly reduce the amount of code; the most important is the closure, which keeps programs short without hurting readability and is used widely in Groovy projects. Groovy also allows parentheses and semicolons to be omitted. Together these features can considerably improve programming efficiency. Groovy can also call existing Java frameworks and toolkits seamlessly, and it works particularly well with the Spring framework.

tranSMART has a typical N-tier architecture, divided into a data layer, a business layer, and a presentation layer. The Grails framework used by tranSMART follows the principle of convention over configuration. A few lines of configuration are enough to distinguish the development, production, and test environments, and the environment can be chosen when the system starts. Grails ships with proven defaults and a large API, so development proceeds smoothly without much attention to technical details that are already mature. Web applications produced by Grails are compatible with current J2EE containers. Grails is best known for its rich set of plug-ins, backed by a dedicated and growing plug-in community. Grails also supports ORM and NoSQL.

The existing data integration technology for tranSMART has several implementations, including the Kettle-based tranSMART Data tool, the tranSMART ICE tool, script-based approaches, and approaches that use stored procedures. These tools were designed only for one-time data import and cannot add data incrementally, that is, they do not support updating data. They are also difficult to extend through secondary development. The tranSMART Batch project, which supports data updates and is built on Spring Batch, was therefore chosen as the basis for building an in-house data integration tool.

tranSMART Batch is a data import program built on the Spring Batch framework. It can import different types of data, such as clinical research data imported once for an ad hoc study, genetic data, dimension data, and I2B2 data. Importing data in I2B2 form supports data updates and solves the problem of distributing data across the dimension tables.

tranSMART Batch is implemented in Groovy. For build automation it uses Gradle, a build tool based on the concepts of Ant and Maven. Gradle declares project settings in a Groovy-based domain-specific language (DSL), which replaces the cumbersome XML configuration of Ant and Maven. Gradle has many other features: it treats project dependencies as a first-class concern, supports partial builds, and can resolve dependencies from several sources, including Maven repositories, Nexus private servers, Ivy repositories, and jars or directories on the local file system. Managing a project with Gradle therefore saves considerable effort. After the database connection information has been configured, the command ./gradlew setupSchema must be run to create the schema that Spring Batch needs in the database; this schema records batch processing information for operations such as rollback and logging. tranSMART Batch imports data by running a packaged jar, which is produced by executing the predefined capsule task. The database connection is configured with ordinary JDBC settings, including the driver, the location of the service, the user name, and the password.

The data import method using the tranSMART Batch integration technology is described as follows:

Step 1: Prepare the files. The files to prepare include:

11) Data file: the file to be imported into tranSMART (the translational medicine analysis platform, hereinafter "the platform"). The first line of the file is a header row that describes the data. From the second line on, each line represents one record, and the values within a line are separated by a tab character ('\t').

12) Column mapping file: the column mapping file has a fixed set of six columns: the file name of the data file, the data column number, whether the value may be null, the data variable, the data type (numeric, text, or natural language), and the data unit (such as g or ml). The data variable describes the meaning of the data. Its value is either a predefined symbol or a concept path that represents a concept. The predefined symbols include PAT_ID, PAT_SEX, VISIT_ID, and others, which describe attributes such as the patient's ID, age, and sex and the ID of the visit. A concept consists of a concept path and a concept type. A concept path is a text string in which the levels of the concept are separated by backslashes, such as '\Open Research\Heart Disease Research\'.

13) Text replacement file: the text replacement file is used to clean data that does not conform to the predefined specification. It therefore specifies which column of which data file is to be replaced, from which value to which value. The file has four columns: the file name, the column number of the data column, the original value, and the replacement value. For example, the row "patient basic examination file, 1, male, m" means that every value "male" in the first column of the patient basic examination file is replaced with "m".

Step 2: Manually insert the path and type of each concept that appears in the column mapping file into the concept table of the tranSMART database. The related content must also be inserted into the I2B2 table.

Step 3: Read in the column mapping file and verify that its data satisfies the set rules, for example that each concept given in the mapping file already exists in the concept table of the database. This ensures the correctness of the data.
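Reading the six-column mapping file can be sketched as follows. This is a minimal illustration, not the tranSMART Batch implementation: the tab-separated layout, the Y/N encoding of the nullable column, the type labels, and the names `ColumnMapping` and `parse_column_mapping` are all assumptions made for the example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class ColumnMapping:
    filename: str    # data file the column belongs to
    column: int      # 1-based column number in that file
    nullable: bool   # whether empty values are allowed (Y/N encoding is assumed)
    variable: str    # predefined symbol (e.g. PAT_ID) or a concept path
    data_type: str   # e.g. "numeric" or "text" (labels assumed here)
    unit: str        # e.g. "g", "ml"; may be empty

    @property
    def is_concept(self) -> bool:
        # In the examples given in the text, concept paths start with a backslash.
        return self.variable.startswith("\\")


def parse_column_mapping(text: str) -> list[ColumnMapping]:
    """Parse tab-separated mapping rows into ColumnMapping records."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    return [
        ColumnMapping(r[0], int(r[1]), r[2].upper() == "Y", r[3], r[4], r[5])
        for r in rows
        if r
    ]


# Hypothetical sample content with one predefined symbol and one concept path.
sample = (
    "basic_exam.txt\t1\tN\tPAT_ID\ttext\t\n"
    "basic_exam.txt\t2\tY\t\\Open Research\\Heart Disease Research\\Weight\\\tnumeric\tg\n"
)
mappings = parse_column_mapping(sample)
```

Keeping each row as a typed record makes the later verification and both data passes independent of the file's physical layout.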

Step 4: Read and process the data for the first time. Read the data in the data file; first replace values with their standardized forms according to the definitions in the text replacement file, then verify the correctness of the data and fill in missing information. During this traversal, extract the patient ID information, that is, the data in the column whose data variable in the column mapping file is 'PAT_ID', and the visit ID data, that is, the data in the column whose data variable in the column mapping file is 'VIS_ID'.

Step 5: Import the patient ID and visit ID data into the platform database, then read the data in the data file a second time, line by line. According to the definitions in the text replacement file, replace nonconforming values in the specified columns with conforming values. According to the column numbers and data variables defined in the column mapping file, identify the meaning of each value in each record. Insert the patient information data, keyed by patient ID, into the patient information table of the platform database, and insert the visit data, keyed by visit ID, into the visit information table. Insert the data identified by concept-type data variables, together with the concept path itself and the related information defined in the column mapping file, such as data unit and data type, into the observation fact table of the platform database. This completes the data import.

The above method has the following problems:

1. The degree of automation is low. Step 2 requires a large amount of data to be edited by hand, which is tedious and error-prone.

2. The only supported data layout is one in which every column holds data for a single concept. Given the design of the existing column mapping file and its variable types, data in which the concept and its value appear in the same row cannot be integrated.

3. Only a single kind of data source is supported. Step 1 requires the data file to be prepared in text form; the method cannot extract data directly from a hospital database.

4. The second traversal of the data lacks an exception handling mechanism. Exceptions that could safely be skipped are not skipped and instead crash the program.

Summary of the invention

The purpose of the present invention is to overcome the deficiencies of the prior art by providing a data import method for clinical data integration based on the translational medicine analysis platform (tranSMART). The method has a high degree of automation and strong adaptability.

The data import method for clinical data integration based on a translational medicine analysis platform proposed by the present invention is characterized in that it is used to import text-type data and specifically comprises the following steps:

Step 1: Prepare the files. The files to prepare include:

11) Data file: the file to be imported into the platform. The first line of the file is a header row that describes the data; from the second line on, each line represents one record, and the values within a line are separated by a tab character ('\t');

The data file may also contain data in which a single column of values corresponds to different concepts;

12) Column mapping file: the column mapping file has a fixed set of six columns: the file name of the data file, the data column number, whether the value may be null, the data variable, the data type, and the data unit. The data variable describes the meaning of the data. Its value is either a predefined symbol or a concept path that represents a concept; a concept consists of a concept path and a concept type; a concept path is a text string in which the levels of the concept are separated by backslashes;

A custom identifier is also added to the data variable to express the hierarchical relationship among the data variables when one row contains several concepts;

13) Text replacement file: used to clean data that does not conform to the predefined specification. The file has four columns: the file name, the column number of the data column, the original value, and the replacement value;

14) File used to build the concept tree: the file has two columns, concept path and concept category. The concept path divides a concept into levels, which are separated by a backslash "\"; the concept category is either categorical or numeric;

Step 2: Build the concept tree from the levels in the concept paths and the concept categories of the concept tree file prepared in Step 1. Insert the concept path and concept type of each node of the tree into the concept table of the platform database, and insert the concept path, the concept type, and the node's relative position in the tree into the I2B2 table of the platform database;

Step 3: Read in the column mapping file and verify that its data satisfies the set rules, to ensure the correctness of the data;

Step 4: Read and process the data for the first time: first replace values with their standardized forms according to the definitions in the text replacement file, then verify the correctness of the data and fill in missing information; extract the patient ID information, that is, the data in the column whose data variable in the column mapping file is 'PAT_ID', and the visit ID data, that is, the data in the column whose data variable in the column mapping file is 'VIS_ID';

Add the identifiers of the custom variables. Construct the concept path from the variables marked as concept path variables, and construct the concept type from the variables marked as concept type variables. Construct the concept tree again and insert it into the corresponding tables of the platform database;

Step 5: Import the patient ID and visit ID data into the platform database, then read the data in the data file a second time, line by line. According to the definitions in the text replacement file, replace nonconforming values in the specified columns with conforming values. According to the column numbers and data variables defined in the column mapping file, identify the meaning of each value in each record. Insert the patient information data, keyed by patient ID, into the patient information table of the platform database, and insert the visit data, keyed by visit ID, into the visit information table. Insert the data identified by concept-type data variables, together with the concept path itself and the related information defined in the column mapping file, such as data unit and data type, into the observation fact table of the platform database. If a data anomaly occurs during import, only record it for later review; if a network anomaly or platform database system anomaly occurs, roll back the transaction and record the current import status. This completes the data import.

The invention also covers the import of data from a source database.

Features and beneficial effects of the present invention:

The technical features of the present invention, built on tranSMART Batch, are: 1. a concept tree file is supplied, a concept tree is built in memory with a suitable data structure, and the contents of the concept table and the I2B2 table in the tranSMART database are filled in automatically; 2. four custom variables are added to the column mapping file, and data is processed according to them, so that data layouts the original method could not integrate can now be integrated; 3. a reader that takes its input from a database is added, so that data can be read directly from a database; 4. an exception handling mechanism is added to the second traversal of the data, so that exceptions that can be skipped are skipped.

The beneficial effects of the present invention are:

The method can correctly and completely import clinical data in text form into the tranSMART analysis platform, and can also import data from a source database, after which the various algorithms of tranSMART can be used to analyze the data.

The method loses no medically critical information during import, and every clinical research record can be traced back to its source. When a special situation such as a network anomaly or a database system anomaly occurs, the success or failure of the import task is recorded and the transaction is either committed or rolled back, which ensures the correctness of the import.

The data import of this method is highly automated.

The method can take its data from different kinds of source, such as text files or a database, has a thorough exception handling mechanism, and can import data in which a single column of values corresponds to different concepts, so it is highly adaptable.

Description of the drawings

Fig. 1 is an overall flow chart of the method of the present invention.

Fig. 2 is a schematic diagram of concept tree construction in the present invention.

Detailed description of the embodiments

The clinical data integration method based on the tranSMART platform proposed by the present invention is described below with reference to the drawings and embodiments:

The method of the present invention is implemented on the basis of tranSMART Batch, with Spring Batch as the framework. The specific implementation is shown in Fig. 1. It is used to import text-type data and comprises the following steps:

11) Data file: the file to be imported into the platform. The first line of the file is a header row that describes the data. From the second line on, each line represents one record, and the values within a line are separated by a tab character ('\t').

The data file of this method may also contain data in which a single column of values corresponds to different concepts. For example, in Table 1 the value 24.6 corresponds to the concept prothrombin time activity, while the value 30.2 corresponds to the concept prothrombin time.

Table 1. Sample data that the invention can integrate

The method adds a custom identifier to the data variable of the column mapping file. The identifier expresses the hierarchical relationship among the data variables when one row contains several concepts, which solves the prior-art problem that data in which a single column of values corresponds to different concepts cannot be imported. In this embodiment, the data variable identifiers in the column mapping file are divided into identifiers for data variables that represent a concept type, identifiers for data variables that represent a value, and identifiers for data variables that represent the unit of a value.

13) Text replacement file: the text replacement file is used to clean data that does not conform to the predefined specification. It therefore specifies which column of which data file is to be replaced, from which value to which value. The file has four columns: the file name, the column number of the data column, the original value, and the replacement value. For example, the row "patient basic examination file, 1, male, m" means that every value "male" in the first column of the patient basic examination file is replaced with "m".

14) File used to build the concept tree: the file has two columns, concept path and concept category. The concept path divides a concept into levels, which are separated by a backslash "\". The concept category is either categorical or numeric. Table 2 shows an example of a concept tree file.

Table 2. Example of a concept tree file

Step 2: Build the concept tree from the levels in the concept paths and the concept categories of the concept tree file prepared in Step 1. Fig. 2 shows an example of a concept tree built from the data in Table 2. Insert the concept path and concept type of each node of the tree into the concept table of the platform database, and insert the concept path, the concept type, and the node's relative position in the tree into the I2B2 table of the platform database.
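Building the concept tree in memory can be sketched as below. This is an illustration under stated assumptions rather than the patented implementation: the class `ConceptNode`, the "folder" label for intermediate nodes, the level numbering, and the sample paths are hypothetical. Only the two concept names come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptNode:
    name: str
    path: str                      # full backslash-delimited path of this node
    level: int                     # depth below the root (root is 0 in this sketch)
    concept_type: str = "folder"   # hypothetical label for intermediate nodes
    children: dict = field(default_factory=dict)


def build_concept_tree(rows):
    """Build a tree from (concept_path, concept_category) pairs."""
    root = ConceptNode(name="", path="\\", level=0)
    for path, category in rows:
        node = root
        for name in [part for part in path.split("\\") if part]:
            if name not in node.children:
                node.children[name] = ConceptNode(
                    name, node.path + name + "\\", node.level + 1
                )
            node = node.children[name]
        # Only the last node of a path carries the declared category.
        node.concept_type = category
    return root


def walk(node):
    """Yield every node below `node`, parents before children."""
    for child in node.children.values():
        yield child
        yield from walk(child)


rows = [
    ("\\Open Research\\Coagulation\\Prothrombin Time\\", "numeric"),
    ("\\Open Research\\Coagulation\\Prothrombin Time Activity\\", "numeric"),
]
tree = build_concept_tree(rows)
# One (path, type, level) record per node, ready to be written to the concept tables.
records = [(n.path, n.concept_type, n.level) for n in walk(tree)]
```

Walking parents before children means intermediate folders are emitted once, even when several leaf concepts share them, which is what removes the manual editing of the prior-art Step 2.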

Step 3: Read in the column mapping file and verify that its data satisfies the set rules, for example that each concept given in the mapping file already exists in the concept table of the database. This ensures the correctness of the data.
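The kind of check Step 3 describes might look like the following sketch. The rule set shown (a variable must be a predefined symbol or a known concept path) is only the example named in the text; the function name, the error messages, and the abbreviated symbol set are assumptions.

```python
# Symbols named in the text; the full predefined set is not given there.
PREDEFINED = {"PAT_ID", "PAT_SEX", "VISIT_ID"}


def validate_mapping(variables, known_concept_paths):
    """Return a list of error messages; an empty list means the mapping passed."""
    errors = []
    for line_no, variable in enumerate(variables, start=1):
        if variable in PREDEFINED:
            continue
        if not (variable.startswith("\\") and variable.endswith("\\")):
            errors.append(
                f"line {line_no}: '{variable}' is neither a predefined symbol nor a concept path"
            )
        elif variable not in known_concept_paths:
            errors.append(
                f"line {line_no}: concept '{variable}' is not in the concept table"
            )
    return errors


# Stand-in for the paths already present in the platform's concept table.
known = {"\\Open Research\\Heart Disease Research\\"}
errors = validate_mapping(
    [
        "PAT_ID",
        "\\Open Research\\Heart Disease Research\\",
        "\\Open Research\\Unknown\\",
        "weight",
    ],
    known,
)
```

Collecting all errors before stopping lets the operator fix the mapping file in one round instead of one failure at a time.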

Step 4: Read and process the data for the first time. First replace values with their standardized forms according to the definitions in the text replacement file, then verify the correctness of the data and fill in missing information. Extract the patient ID information, that is, the data in the column whose data variable in the column mapping file is 'PAT_ID', and the visit ID data, that is, the data in the column whose data variable in the column mapping file is 'VIS_ID'.
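The first pass can be sketched as follows: apply the text replacement rules, then collect the distinct patient and visit IDs. The file name, the tab-separated rule format, and the function names are assumptions for illustration; validation and the filling of missing values are omitted.

```python
def load_replacements(lines):
    """Build {(filename, column): {old: new}} from four-column, tab-separated rows."""
    table = {}
    for line in lines:
        filename, column, old, new = line.rstrip("\n").split("\t")
        table.setdefault((filename, int(column)), {})[old] = new
    return table


def first_pass(filename, rows, rules, pat_col, visit_col):
    """Clean each row, then collect distinct IDs in first-seen order.

    Column numbers are 1-based, as in the column mapping file.
    """
    patients, visits, cleaned = {}, {}, []
    for row in rows:
        row = [
            rules.get((filename, i), {}).get(value, value)
            for i, value in enumerate(row, start=1)
        ]
        patients.setdefault(row[pat_col - 1])
        visits.setdefault(row[visit_col - 1])
        cleaned.append(row)
    return cleaned, list(patients), list(visits)


# Hypothetical rule: in column 3 of basic_exam.txt, replace "male" with "m".
rules = load_replacements(["basic_exam.txt\t3\tmale\tm"])
rows = [["P1", "V1", "male"], ["P1", "V2", "male"], ["P2", "V3", "f"]]
cleaned, patients, visits = first_pass(
    "basic_exam.txt", rows, rules, pat_col=1, visit_col=2
)
```

Dictionaries are used instead of sets so that the IDs keep the order in which they first appear in the file.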

While reading, add the identifiers of the custom variables. Construct the concept path from the variables marked as concept path variables, and construct the concept type from the variables marked as concept type variables. Then construct a concept tree of the form shown in Fig. 2 and insert it into the corresponding tables of the platform database.
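To make the idea of custom-marked columns concrete, the sketch below turns one row of "long" data into a concept path, a value, and a unit. The role names (`CONCEPT:<level>`, `VALUE`, `UNIT`), the row layout, the base path, and the units are hypothetical, since the patent's Table 1 and its actual identifier syntax are not reproduced in the text. Inferring the concept type from the value is also a simplification of this sketch; the text derives it from a custom-marked variable.

```python
def row_to_observation(row, roles, base_path="\\"):
    """Map one row to (concept_path, concept_type, value, unit).

    `roles` maps a 0-based column index to a role string. "CONCEPT:<n>" marks
    a column holding the concept name at hierarchy level n.
    """
    levels, value, unit = {}, None, ""
    for index, role in roles.items():
        if role.startswith("CONCEPT:"):
            levels[int(role.split(":")[1])] = row[index]
        elif role == "VALUE":
            value = row[index]
        elif role == "UNIT":
            unit = row[index]
    # Join the concept names in level order to form the path.
    path = base_path + "".join(levels[k] + "\\" for k in sorted(levels))
    try:
        float(value)
        concept_type = "numeric"
    except (TypeError, ValueError):
        concept_type = "categorical"
    return path, concept_type, value, unit


roles = {1: "CONCEPT:1", 2: "CONCEPT:2", 3: "VALUE", 4: "UNIT"}
rows = [
    ["P1", "Coagulation", "Prothrombin Time", "30.2", "s"],
    ["P1", "Coagulation", "Prothrombin Time Activity", "24.6", "%"],
]
observations = [row_to_observation(r, roles, "\\Open Research\\") for r in rows]
```

The paths produced here are the input for rebuilding the concept tree, so concepts that appear only in the data rows still end up in the concept tables.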

Step 5: Import the patient ID and visit ID data into the platform database, then read the data in the data file a second time, line by line. According to the definitions in the text replacement file, replace nonconforming values in the specified columns with conforming values. According to the column numbers and data variables defined in the column mapping file, identify the meaning of each value in each record. Insert the patient information data, keyed by patient ID, into the patient information table of the platform database, and insert the visit data, keyed by visit ID, into the visit information table. Insert the data identified by concept-type data variables, together with the concept path itself and the related information defined in the column mapping file, such as data unit and data type, into the observation fact table of the platform database. If a data anomaly occurs during import, such as a missing or duplicated data item, only record it for later review. If a network anomaly or platform database system anomaly occurs, roll back the transaction and record the current import status, including the configuration information, the time, and the anomaly. This completes the data import.
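The two-tier exception handling of Step 5 can be illustrated with the sketch below. It uses Python's built-in sqlite3 as a stand-in for the platform database, whereas the actual tool builds on Spring Batch; the table layout, the `DataAnomaly` class, and the status strings are assumptions made for the example.

```python
import logging
import sqlite3

log = logging.getLogger("import")


class DataAnomaly(Exception):
    """A skippable problem with one row (missing or duplicated item)."""


def insert_row(conn, seen, row):
    patient_id, value = row
    if not value:
        raise DataAnomaly(f"missing value for {patient_id}")
    if row in seen:
        raise DataAnomaly(f"duplicate row {row}")
    seen.add(row)
    conn.execute("INSERT INTO observation_fact VALUES (?, ?)", row)


def second_pass(conn, rows):
    """Return (status, skipped_rows).

    Data anomalies are logged and skipped; database errors roll back everything.
    """
    skipped, seen = [], set()
    try:
        for row in rows:
            try:
                insert_row(conn, seen, row)
            except DataAnomaly as exc:
                log.warning("skipped: %s", exc)
                skipped.append(row)
        conn.commit()
        return "COMPLETED", skipped
    except sqlite3.Error as exc:
        conn.rollback()
        log.error("import failed, rolled back: %s", exc)
        return "FAILED", skipped


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observation_fact (patient_id TEXT, value TEXT)")
status, skipped = second_pass(conn, [("P1", "30.2"), ("P2", ""), ("P1", "30.2")])
stored = conn.execute("SELECT COUNT(*) FROM observation_fact").fetchone()[0]
```

Separating the two exception classes is the design point: a bad row costs one log entry, while an infrastructure failure leaves the database exactly as it was before the pass began.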

The invention also covers the import of data held in a source database:

When the data to be imported resides in tables of a source database, a stored procedure is first written in that source database; the stored procedure returns a cursor over the data to be imported. The text data reading unit is replaced with a reading unit that reads data from the stored procedure, and its output is guaranteed to be identical to that of the text reading unit. The import steps are as follows:
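The idea of swapping the reading unit while keeping its output identical could look roughly like the sketch below. It is an assumption-laden illustration: the function names are hypothetical, the text format is taken to be tab-separated with one header line, and any DB-API cursor stands in for the cursor returned by the stored procedure (the demonstration database used in testing, SQLite, has no stored procedures).

```python
import csv


def read_text_rows(stream):
    """Yield data rows from a tab-separated text stream, skipping the header."""
    reader = csv.reader(stream, delimiter="\t")
    next(reader, None)  # the first line only describes the columns
    for row in reader:
        yield list(row)


def read_cursor_rows(cursor):
    """Yield rows from a DB-API cursor in the same shape as read_text_rows.

    Every cell is converted to a string and NULL becomes an empty string,
    so downstream processing cannot tell the two sources apart.
    """
    for row in cursor:
        yield ["" if cell is None else str(cell) for cell in row]
```

Because both generators yield lists of strings, the later steps (replacement, validation, insertion) can be written once and fed from either source.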

Step 1: Prepare files. The prepared files include:

11) Column mapping file: the column mapping file has six fixed columns, namely the name of the predefined stored procedure in the source database, the data column number, whether the data may be null, the data variable, the data type (numerical, text, or natural language), and the data unit (such as g or ml). The data variable describes the meaning of the data. Its value is either one of a set of predefined symbols or a concept path representing a concept. The predefined symbols include PAT_ID, PAT_SEX, VISIT_ID, and so on, which describe attributes such as the patient's ID, age, and sex, and the visit ID. A concept consists of a concept path and a concept type. A concept path is a backslash-separated text, such as '\Open Research\Heart Disease Research\'.
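One possible way to load a row of the six-column mapping file is sketched below. The patent does not specify the on-disk encoding, so the tab separator, the `Y`/`N` null flag, and the names `ColumnMapping`, `RESERVED`, and `parse_mapping_line` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Predefined symbols named in the text; the full list is not given there.
RESERVED = {"PAT_ID", "PAT_SEX", "VISIT_ID"}


@dataclass
class ColumnMapping:
    source: str      # stored procedure name (or data file name for text import)
    column: int      # 1-based data column number
    nullable: bool   # whether the value may be null
    variable: str    # predefined symbol or concept path
    data_type: str   # e.g. numeric or text
    unit: str        # e.g. g or ml; may be empty

    @property
    def is_concept(self):
        """True when the data variable is a concept path, not a reserved symbol."""
        return self.variable not in RESERVED


def parse_mapping_line(line):
    """Parse one tab-separated mapping line into a ColumnMapping."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    source, column, nullable, variable, data_type, unit = fields
    return ColumnMapping(source, int(column), nullable.upper() == "Y",
                         variable, data_type, unit)
```

Rejecting any line that does not have exactly six fields is a simple instance of the rule check that the method applies to the column mapping file.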

12) Text replacement file: the purpose of the text replacement file is to clean data that does not conform to the predefined specification. The file therefore specifies which column of which data file is to be changed, and from what value to what value. It has four columns: the file name, the column number of the data column, the original value, and the replacement value. For example, the entry "patient basic examination file, 1, male, m" means that every "male" value in the first column of the patient basic examination file is replaced with "m".
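Applying such replacement rules to a row could be sketched as below. The comma-separated rule format follows the example entry above; the function names and the file identifier `patient_basic_exam` are hypothetical.

```python
def load_rules(lines):
    """Build a lookup from (file name, column number, old value) to new value."""
    rules = {}
    for line in lines:
        file_name, column, old, new = [part.strip() for part in line.split(",")]
        rules[(file_name, int(column), old)] = new
    return rules


def apply_rules(rules, file_name, row):
    """Return a copy of row with matching cells replaced; columns are 1-based."""
    return [rules.get((file_name, idx, cell), cell)
            for idx, cell in enumerate(row, start=1)]


rules = load_rules(["patient_basic_exam, 1, male, m"])
cleaned = apply_rules(rules, "patient_basic_exam", ["male", "42"])
```

Keying the lookup on the file, the column, and the old value means a rule never touches the same text when it appears in another column or another file.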

13) Source database connection configuration file: contains the user name and password for the source database, whether to reconnect automatically, and similar information.

14) File used to build the concept tree: this file has two columns, concept path and concept category. A concept path is divided into levels, and the levels are separated by the backslash "\". Concept categories are of two kinds: categorical and numerical. Table 2 shows an example of a concept tree file.

Table 2: Example of a concept tree file
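Building an in-memory concept tree from (concept path, concept category) pairs might be sketched as follows. The node layout, the function names, and the `Weight` leaf are illustrative assumptions; only the backslash-separated path format and the two categories come from the text.

```python
def build_concept_tree(entries):
    """Build a nested tree from (concept path, concept category) pairs."""
    root = {"path": "\\", "level": 0, "category": None, "children": {}}
    for path, category in entries:
        node = root
        parts = [part for part in path.split("\\") if part]
        for depth, part in enumerate(parts, start=1):
            child = node["children"].get(part)
            if child is None:
                # Intermediate levels are created on demand.
                child = {"path": node["path"] + part + "\\", "level": depth,
                         "category": None, "children": {}}
                node["children"][part] = child
            node = child
        node["category"] = category  # the category belongs to the full path
    return root


def flatten(node):
    """List (path, level, category) for every node below node, parents first."""
    rows = []
    for child in node["children"].values():
        rows.append((child["path"], child["level"], child["category"]))
        rows.extend(flatten(child))
    return rows


tree = build_concept_tree([
    ("\\Open Research\\Heart Disease Research\\", "categorical"),
    ("\\Open Research\\Heart Disease Research\\Weight\\", "numeric"),
])
```

The flattened rows carry each node's path, category, and depth, which is roughly the information the method writes to the concept table and the I2B2 table.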

Step 4: Traverse the data in the source database for the first time. First, each value is replaced with its normalized value according to the definitions in the text replacement file; the correctness of the data is then verified, and missing information is filled in. During the traversal, the patient ID information is extracted, that is, the data in the column whose data variable in the column mapping file is 'PAT_ID', together with the visit ID information, that is, the data in the column whose data variable in the column mapping file is 'VIS_ID'.

During reading, the custom-variable identifiers are added. The concept path is constructed from the variable marked as a custom concept path, and the concept type is constructed from the variable marked as a custom concept type. A concept tree of the form shown in FIG. 2 is then constructed and inserted into the corresponding tables of the platform database.
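The ID extraction performed during this first traversal could be sketched as below. The function name and the `variable_by_column` dictionary (1-based column number to data variable) are hypothetical; the symbols 'PAT_ID' and 'VIS_ID' are the ones named in Step 4.

```python
def collect_ids(rows, variable_by_column):
    """Return the distinct patient IDs and visit IDs found in rows."""
    patient_ids, visit_ids = set(), set()
    for row in rows:
        for column, variable in variable_by_column.items():
            value = row[column - 1]  # mapping columns are 1-based
            if not value:
                continue  # an empty ID cell contributes nothing
            if variable == "PAT_ID":
                patient_ids.add(value)
            elif variable == "VIS_ID":
                visit_ids.add(value)
    return patient_ids, visit_ids


patients, visits = collect_ids(
    [["P1", "V1", "m"], ["P1", "V2", "m"], ["P2", "V3", "f"]],
    {1: "PAT_ID", 2: "VIS_ID", 3: "PAT_SEX"},
)
```

Collecting the IDs into sets before the second traversal lets each patient and visit be registered once, however many records refer to it.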

Step 5: Import the patient ID and visit ID data into the platform database and traverse the data in the source database a second time. The data in the source database is read row by row and, according to the definitions in the text replacement file, non-conforming values in the specified columns are replaced with conforming values. Using the column numbers and data variables defined in the column mapping file, the meaning of each value in each record is identified. Patient information is inserted, keyed by patient ID, into the patient information table of the platform database; visit data is inserted, keyed by visit ID, into the visit information table; and data identified by a concept-type data variable is inserted into the observation fact table together with the concept path itself and the related attributes defined in the column mapping file, such as data unit and data type. If a data anomaly occurs during import, such as a missing or duplicated data item, it is only recorded for later review. If a network exception or a platform database system exception occurs, the transaction is rolled back and the current import status is recorded, including configuration information, time, and the exception. This completes the data import.
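How one record is split among the three target tables during the second traversal might be sketched as follows. The routing rule is an assumption made for illustration: variables that start with a backslash are treated as concept paths, variables that start with `VIS` as visit attributes, and the remaining predefined symbols as patient attributes. The function name and the `Weight` concept are hypothetical.

```python
def route_row(row, variable_by_column):
    """Split one record into patient fields, visit fields, and observation facts."""
    patient, visit, facts = {}, {}, []
    for column, variable in sorted(variable_by_column.items()):
        value = row[column - 1]  # mapping columns are 1-based
        if variable.startswith("\\"):
            # Concept path: the value goes to the observation fact table.
            facts.append((variable, value))
        elif variable.startswith("VIS"):
            visit[variable] = value
        else:
            patient[variable] = value
    return patient, visit, facts


patient, visit, facts = route_row(
    ["P1", "V1", "m", "72.5"],
    {1: "PAT_ID", 2: "VIS_ID", 3: "PAT_SEX",
     4: "\\Open Research\\Heart Disease Research\\Weight\\"},
)
```

Each of the three results would then be written to its own table, with the patient ID and visit ID serving as the keys that tie the observation facts back to the patient and the visit.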

Claims

1. A data import method for clinical data integration based on a translational medicine analysis platform, characterized in that the method is used for importing text-type data and specifically comprises the following steps:

Step 1: Prepare files; the prepared files include:

11) Data file: the data file to be imported into the platform. The first line of the file is a header line that describes the data; from the second line onward, each line represents one record, and the individual values within a line are separated by the tab character '\t';

The data file may also include a single column of data values in which the values correspond to data of different concepts;

12) Column mapping file: the column mapping file has six fixed columns, namely the file name of the data file, the data column number, whether the data may be null, the data variable, the data type, and the data unit; the data variable describes the meaning of the data; the value of a data variable is either a predefined symbol or a concept path representing a concept; a concept consists of a concept path and a concept type; a concept path is a text in which concepts are separated by backslashes;

A custom identifier is also added to the data variable to express the hierarchical relationship among data variables when one row holds multiple concepts;

13) Text replacement file: used to clean data that does not conform to the predefined specification; the file has four columns, namely the file name, the column number of the data column, the original value, and the replacement value;

14) File used to build the concept tree: the file has two columns, concept path and concept category; a concept path divides concepts into levels, and the levels are separated by the backslash "\"; concept categories are of two kinds, categorical and numerical;

Step 2: Construct a concept tree from the hierarchy of concept paths and concept categories in the concept tree file prepared in Step 1; insert the concept path and concept type of each node of the concept tree into the concept table of the platform database, and insert the concept path, the concept type, and the relative position of the node in the concept tree into the I2B2 table of the platform database;

Step 3: Read the column mapping file, and verify whether the data in the column mapping file meets the set rules to ensure the correctness of the data;

Step 4: Read and process the data for the first time: first replace each value with its normalized value according to the definitions in the text replacement file; then verify the correctness of the data and fill in missing information; extract the patient ID information, that is, the data in the column whose data variable in the column mapping file is 'PAT_ID', and the visit ID information, that is, the data in the column whose data variable in the column mapping file is 'VIS_ID';

Add the custom-variable identifiers; construct the concept path from the variable marked as a custom concept path; construct the concept type from the variable marked as a custom concept type; construct the concept tree again, and insert the concept tree into the corresponding tables of the platform database;

Step 5: Import the patient ID and visit ID data into the platform database and read the data file a second time: read the data row by row and, according to the definitions in the text replacement file, replace non-conforming values in the specified columns with conforming values; using the column numbers and data variables defined in the column mapping file, identify the meaning of each value in each record; insert the patient information, keyed by patient ID, into the patient information table of the platform database; insert the visit data, keyed by visit ID, into the visit information table of the platform database; insert the data identified by a concept-type data variable into the observation fact table of the platform database together with the concept path itself and the related data unit and data type defined in the column mapping file; if a data anomaly occurs during import, only record it for later review; if a network exception or a platform database system exception occurs, roll back the transaction and record the current import status; this completes the data import.

2. The data import method according to claim 1, characterized in that it is also used for importing data held in a source database: a stored procedure is first written in the source database, and the stored procedure returns a cursor over the data to be imported; the text data reading unit is replaced with a reading unit that reads data from the stored procedure in the database, and its output is guaranteed to be identical to that of the text reading unit; the method specifically comprises the following steps:

Step 1: Prepare files; the prepared files include:

11) Column mapping file: the column mapping file has six fixed columns, namely the name of the predefined stored procedure in the source database, the data column number, whether the data may be null, the data variable, the data type, and the data unit; the data variable describes the meaning of the data; the value of a data variable is either a predefined symbol or a concept path representing a concept; a concept consists of a concept path and a concept type; a concept path is a text separated by backslashes;

12) Text replacement file: used to clean data that does not conform to the predefined specification; the file has four columns, namely the file name, the column number of the data column, the original value, and the replacement value;

13) Source database connection configuration file: contains the user name and password for the source database and whether to reconnect automatically;

14) File used to build the concept tree: the file has two columns, concept path and concept category; concepts are divided into levels, and the levels are separated by the backslash "\"; concept categories are of two kinds, categorical and numerical;

Step 2: Construct a concept tree from the hierarchy of concept paths and concept categories in the concept tree file prepared in Step 1; insert the concept path and concept type of each node of the concept tree into the concept table of the platform database, and insert the concept path, the concept type, and the relative position of the node in the concept tree into the I2B2 table of the platform database;

Step 4: Traverse the data in the source database for the first time: first replace each value with its normalized value according to the definitions in the text replacement file; then verify the correctness of the data and fill in missing information; extract the patient ID information, that is, the data in the column whose data variable in the column mapping file is 'PAT_ID', and the visit ID information, that is, the data in the column whose data variable in the column mapping file is 'VIS_ID';

Add the custom-variable identifiers; construct the concept path from the variable marked as a custom concept path; construct the concept type from the variable marked as a custom concept type; then construct the concept tree, and insert the concept tree into the corresponding tables of the platform database;

Step 5: Import the patient ID and visit ID data into the platform database and traverse the data in the source database a second time: read the data in the source database row by row and, according to the definitions in the text replacement file, replace non-conforming values in the specified columns with conforming values; using the column numbers and data variables defined in the column mapping file, identify the meaning of each value in each record; insert the patient information, keyed by patient ID, into the patient information table of the platform database; insert the visit data, keyed by visit ID, into the visit information table of the platform database; insert the data identified by a concept-type data variable into the observation fact table of the platform database together with the concept path itself and the related data unit and data type defined in the column mapping file; if a data anomaly occurs during import, only record it for later review; if a network exception or a platform database system exception occurs, roll back the transaction and record the current import status; this completes the data import.