CN111710365A - An ontology-based method for building protein/gene synonyms - Google Patents
- Publication number
- CN111710365A CN202010525374.0A
- Authority
- CN
- China
- Prior art keywords
- ontology
- protein
- synonym
- gene
- biogrid
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Bioethics (AREA)
- Public Health (AREA)
- Biophysics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Animal Behavior & Ethology (AREA)
- Software Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an ontology-based method for constructing a protein/gene synonym table, comprising: a) acquiring the data sources Uniprot, BioGRID and NCBI Gene; b) splitting the data files; c) establishing an upper-level ontology; d) mapping and fusing Uniprot-Swissprot into the upper-level ontology; e) mapping and fusing BioGRID into the upper-level ontology; f) mapping and fusing NCBI Gene into the upper-level ontology; g) deduplicating the synonyms. The method yields a protein/gene synonym table that is more comprehensive in scale, more reliable in accuracy and more finely classified. It provides a prerequisite for more efficient and accurate literature data mining and is a powerful aid to biomedical experts in making research discoveries.
Description
Technical field
The present invention relates to an ontology-based method for constructing a protein/gene synonym table, and more particularly to such a method that uses Uniprot-Swissprot, BioGRID and NCBI Gene as its data sources.
Background art
Synonymy, or in more academic terms "same meaning, different form", is a phenomenon found in every language. It refers to words that express the same or a similar meaning in different forms. Decades of biological research practice have led to inconsistent use of some terms, and datasets contain a large number of "synonymous" terms. Using different vocabulary to describe the same gene function hinders the search for common features across proteins and species. Worse still, the same term is used by scholars in different branches of biology to describe different molecules. Data accumulated and retrieved in such an unorganized way pose a great challenge to downstream users of the data flow and business flow. To ensure the interoperability of systems and data, scattered heterogeneous data need a unified protein/gene synonym table. This is especially true in the protein/gene field. On the one hand, because of the complexity of protein structure and function and the diversity of source organisms, the same protein may be named differently in different functional fields and in different species, and may even be named differently within the same physiological process of the same species. On the other hand, protein/gene nomenclature keeps changing as technology develops, which makes it even harder to unify.
Many biomedical synonym tables have been constructed and released in the biomedical field, and some institutions have added a synonym attribute to the protein information on their own websites. For example, the information on more than 560,000 proteins in Uniprot Swiss-Prot includes the gene synonyms of each protein. As another example, NCBI (National Center for Biotechnology Information) has built the biomedical subject thesaurus MeSH Terms (Medical Subject Headings Terms), which also includes synonym information for some proteins/genes; NCBI Gene additionally collects and publishes the possible synonyms of each gene.
Besides the experts that major institutions organize to collect and curate synonyms, many individual researchers have tried to identify and extract synonyms from the literature automatically. Their aim is to avoid the high professional threshold of manual methods and the large amount of manpower and resources needed to read massive volumes of literature. For example, Morin and Jacquemin used term variants to detect synonyms, but many synonyms are not simple variants, and the two words that form a variant pair are not necessarily synonyms. This situation is very common in the protein/gene field: ABFB and AO090023000001 describe the same protein yet show no variant relationship at all, whereas ABFB and ABFC are obvious variants of each other yet denote different proteins. Snow et al. used automatically constructed pattern extractors to obtain synonyms, but this approach relies on a syntactic parser. Noise is introduced at word segmentation and is then compounded and amplified through dependency parsing, pattern construction and the final synonym decision, so the resulting accuracy and coverage are unsatisfactory.
In the final analysis, both the synonym collections released by institutions and the published synonym extraction methods have the following shortcomings:
(1) The synonym collections are not comprehensive enough.
The vocabularies curated by individual institutions are highly authoritative but still lack comprehensiveness. For the protein "P2R3B_HUMAN", for example, "NYREN8" is a synonym that is recorded only in BioGRID, and "PPP2R3LY" is a synonym that is recorded only in NCBI Gene. Each institution compiles its synonym table with an emphasis on the research direction it is strongest in, so the data sources differ in focus. Uniprot, a data source focused on protein sequence and structure, mostly mines and records synonyms from sequence-related protein literature and pays less attention to synonyms relating to protein function. BioGRID, a data source for PPI (Protein-Protein Interactions) information, favours the synonyms used in literature that describes protein interactions. Because the data sources differ in focus, the synonyms that any single one of them curates and records are inevitably incomplete.
(2) The synonym collections are not authoritative enough.
Synonym tables obtained by automated means are generally somewhat more comprehensive than the existing public vocabularies, but their accuracy can never match that of manually verified public vocabularies. They contain deviations and errors, so their authority cannot be guaranteed.
(3) No distinction is made by species.
Proteins/genes need to be distinguished by species. When a protein with the same name occurs in different species, its synonyms may differ.
One kind of difference is that completely different synonyms occur in different species. For example, the protein/gene named "PUR9" has the synonyms "PURH", "IMPCHASE", "EL-S-70p", "AICARFT" and "ATIC" when it occurs in "HUMAN", but only the single synonym "purH" when it occurs in "JANSC" (a genus of the Rhodobacteraceae).
The other kind of difference is that synonyms with the same spelling follow different capitalization conventions in different species. Taking "PUR9" again, its synonym is "PURH" in "HUMAN" but "purH" in "JANSC".
Therefore, a method is needed that effectively collects, aligns and organizes the protein/gene synonyms from the various data sources according to protein/gene and species information, so as to form a reasonably complete protein/gene synonym table. Such a table is a prerequisite for more efficient and accurate literature data mining, a powerful aid to biomedical experts in making research discoveries, and work that urgently needs to be carried out in the biomedical field.
Summary of the invention
To overcome the shortcomings described above, the present invention provides an ontology-based method for constructing a protein/gene synonym table.
The present invention is implemented through the following steps:
a). Acquisition of the data sources: obtain data from Uniprot-Swissprot, BioGRID and NCBI Gene, the three most authoritative data sources in the biomedical field. The files obtained from these three sources are uniprot-sprot.xml, BIOGRID-ALL-LATSET.psi25.xml and ALL_Date.gene_info respectively;
b). Splitting of the data files: the XML files obtained in step a) are read line by line while detecting the tags that mark the start and end of an entity, and each pair of start and end tags, together with the content it encloses, is stored as a single entity file. Of the single files extracted from BIOGRID-ALL-LATSET.psi25.xml in this way, only those corresponding to "Interactor" entities are kept; files corresponding to "Interaction" and "Experiment" entities are ignored. ALL_Date.gene_info stores one entity per line, so it is not split; each single entity is obtained by reading the file directly line by line;
c). Establishment of an upper-level ontology based on the Uniprot-Swissprot data source: with reference to the actual attributes of proteins/genes in Uniprot, construct an upper-level ontology comprising the protein/gene name Name, the species type id TaxID of the protein/gene, the species type name Specie, and the protein/gene synonym list Synonym;
d). Mapping and fusion of Uniprot-Swissprot into the upper-level ontology: map the "name" field of each Uniprot protein to the "Name" field of the upper-level ontology, the "NCBI TaxonomyID" field to the "TaxID" field, and the "gene" field to the "Synonym" field, until all proteins and attributes in the Uniprot-Swissprot data source have been mapped, so that all protein instances and attributes of the Uniprot data source are fused into the upper-level ontology;
e). Mapping and fusion of BioGRID into the upper-level ontology: first look in the reference field of BioGRID for the corresponding Uniprot protein and map it to the accession field in Uniprot, so that, by way of the mapping established in step d), the attribute information of the BioGRID protein is fused into the upper-level ontology;
If no corresponding Uniprot protein can be found through the reference field of BioGRID, the upper-level ontology does not yet contain that protein instance, and the instance must be added. The upper-level ontology is extended by aligning BioGRID's "shortLabel" with the ontology's "Name", BioGRID's "ncbiTaxId" with the ontology's "TaxID", and BioGRID's "Synonym" with the ontology's "Synonym";
f). Mapping and fusion of NCBI Gene into the upper-level ontology: first convert the synonyms of the NCBI Gene proteins and of the upper-level ontology proteins uniformly to lowercase or uppercase; then compare the synonym set of every NCBI Gene protein instance with the synonym set of every protein instance in the upper-level ontology and select the pairs of instances that share a synonym; finally filter again by species information, keep the pairs that belong to the same species, and fuse the attribute information of the NCBI Gene protein instance into the upper-level ontology;
If an NCBI Gene protein shares no synonym with any protein in the upper-level ontology, the upper-level ontology built up to step e) does not yet cover that protein. The upper-level ontology is then extended by aligning NCBI Gene's "Symbol" with the ontology's "Name", NCBI Gene's "taxID" with the ontology's "TaxID", and NCBI Gene's "Synonym" with the ontology's "Synonym";
g). Deduplication of the synonyms and export of the synonym table: after the protein instances of Uniprot-Swissprot, BioGRID and NCBI Gene have been mapped and fused, extract their synonym information and deduplicate it to obtain a synonym table carrying species classification information, and store it in the synonym list of the upper-level ontology. This completes the construction of the protein/gene synonym table.
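Steps c), d) and g) can be illustrated with a minimal sketch. It is not the patented implementation: the `OntologyEntry` class, the helper names and the shape of the parsed Uniprot record (a plain dict keyed by the field names quoted in step d), plus a hypothetical `organism` key for the species name) are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class OntologyEntry:
    """One protein/gene instance in the upper-level ontology (step c)."""
    name: str                                           # Name
    tax_id: str                                         # TaxID
    species: str = ""                                   # Specie
    synonyms: List[str] = field(default_factory=list)   # Synonym


def dedupe(synonyms: List[str]) -> List[str]:
    """Drop empty strings and exact duplicates, keeping first-seen order and case."""
    seen = set()
    unique = []
    for synonym in synonyms:
        if synonym and synonym not in seen:
            seen.add(synonym)
            unique.append(synonym)
    return unique


def from_uniprot(record: dict) -> OntologyEntry:
    """Step d): map one parsed Uniprot record (hypothetical dict) onto the ontology."""
    return OntologyEntry(
        name=record["name"],                      # "name" -> Name
        tax_id=record["NCBI TaxonomyID"],         # "NCBI TaxonomyID" -> TaxID
        species=record.get("organism", ""),       # hypothetical key for Specie
        synonyms=dedupe(record.get("gene", [])),  # "gene" -> Synonym
    )


def export_synonym_table(entries: List[OntologyEntry]) -> Dict[Tuple[str, str], List[str]]:
    """Step g): one row per (Name, TaxID), each with its deduplicated synonyms."""
    return {(e.name, e.tax_id): dedupe(e.synonyms) for e in entries}
```

Keying the exported table by both Name and TaxID keeps the synonyms of same-named proteins in different species apart, and the deduplication is case-sensitive so that species-specific capitalization such as "PURH" versus "purH" is preserved.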
In the ontology-based method for constructing a protein/gene synonym table of the present invention, the data in step a) can be obtained in two ways. One is to use a crawler or an API to capture and download the protein information from each data source's website online. The other is to obtain the openly available data files from the website or an FTP server and analyse the file structure locally.
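Once the files are available locally, the line-by-line splitting of step b) might look like the sketch below. The tag names passed by the caller (for instance `entry` for the Uniprot XML and `interactor` for the BioGRID PSI-MI XML) and the column positions assumed for the tab-separated gene_info file (tax id in column 1, Symbol in column 3, `|`-separated synonyms in column 5, `-` for none) are assumptions about the file layouts and should be checked against the actual downloads. The sketch also assumes that each start tag begins its own line and that entities of the same tag are not nested.

```python
def split_entities(lines, tag):
    """Yield the text of every <tag>...</tag> block while reading line by line."""
    open_plain, open_attrs, close = f"<{tag}>", f"<{tag} ", f"</{tag}>"
    buffer = None  # None means "currently outside an entity"
    for line in lines:
        stripped = line.strip()
        # A start tag either stands alone or is followed by attributes, so
        # longer tag names such as <entryList> do not match.
        if buffer is None and stripped.startswith((open_plain, open_attrs)):
            buffer = []
        if buffer is not None:
            buffer.append(line)
            if close in stripped:
                yield "".join(buffer)
                buffer = None


def parse_gene_info_line(line):
    """Turn one tab-separated gene_info line into a small record (assumed columns)."""
    cols = line.rstrip("\n").split("\t")
    synonyms = [] if cols[4] == "-" else cols[4].split("|")
    return {"taxID": cols[0], "Symbol": cols[2], "Synonym": synonyms}
```

Passing an open file object as `lines` keeps memory use flat, because only the entity currently being read is held in memory. Keeping only "Interactor" entities then amounts to calling `split_entities` with that tag alone, so interaction and experiment blocks are never collected.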
In the ontology-based method for constructing a protein/gene synonym table of the present invention, the fusion or alignment of BioGRID into the upper-level ontology in step e) is implemented through the following steps:
e-1). Determine whether the BioGRID protein has a corresponding protein in Uniprot: check whether the information in the "primaryRef" or "secondaryRef" field of BioGRID occurs in the "accession" field of a Uniprot protein contained in the upper-level ontology. If it does, the BioGRID protein and the Uniprot protein can be fused, and step e-2) is executed; if it does not, the BioGRID protein should be added as a new instance, and step e-4) is executed;
e-2). Fusion of the shortLabel value: if the "shortLabel" of the BioGRID protein differs from the "Name" value of the upper-level ontology instance to be fused, keep the "Name" value of the upper-level ontology unchanged and add the "shortLabel" to the "Synonym" of the upper-level ontology;
e-3). Fusion of the Synonym value: if the "Synonym" of the BioGRID protein is identical to the "Synonym" in the upper-level ontology, keep the "Synonym" in the upper-level ontology unchanged; otherwise merge the two synonym sets, deduplicate them, and store the result in the upper-level ontology as the new "Synonym" field value;
e-4). Extension of the upper-level ontology with the BioGRID protein: extend the upper-level ontology by mapping BioGRID's "shortLabel" to the ontology's "Name", BioGRID's "ncbiTaxId" to the ontology's "TaxID", and BioGRID's "Synonym" to the ontology's "Synonym".
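Steps e-1) to e-4) can be sketched as a single merge function. The dict-based record layout, the `refs` list (standing for the identifiers found under "primaryRef"/"secondaryRef") and the `accession_index` lookup table are hypothetical structures introduced only for this illustration.

```python
def merge_biogrid(ontology, accession_index, interactor):
    """Merge one BioGRID interactor into the upper-level ontology.

    ontology        -- list of dicts with the keys "Name", "TaxID", "Synonym"
    accession_index -- dict mapping a Uniprot accession to its entry in `ontology`
    interactor      -- dict with "shortLabel", "ncbiTaxId", "Synonym" and "refs"
    """
    # e-1) look for a known Uniprot accession among the cross-references
    entry = next(
        (accession_index[ref] for ref in interactor["refs"] if ref in accession_index),
        None,
    )
    if entry is None:
        # e-4) no counterpart yet: add the interactor as a new instance
        entry = {
            "Name": interactor["shortLabel"],
            "TaxID": interactor["ncbiTaxId"],
            "Synonym": list(dict.fromkeys(interactor["Synonym"])),
        }
        ontology.append(entry)
        return entry

    merged = list(entry["Synonym"])
    # e-2) keep the existing Name; a differing shortLabel becomes a synonym
    if interactor["shortLabel"] != entry["Name"]:
        merged.append(interactor["shortLabel"])
    # e-3) union of both synonym lists, deduplicated in first-seen order
    merged.extend(interactor["Synonym"])
    entry["Synonym"] = list(dict.fromkeys(merged))
    return entry
```

`dict.fromkeys` is used for deduplication because it keeps the first occurrence of each synonym in its original order and case. When the two synonym lists are already identical, the merge leaves the entry unchanged, which matches the first branch of step e-3).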
In the ontology-based method for constructing a protein/gene synonym table of the present invention, the fusion or alignment of NCBI Gene into the upper-level ontology in step f) is implemented through the following steps:
f-1). Extract the species and synonym information: extract the species information TaxID and the synonym information Synonym from the NCBI Gene proteins and from the upper-level ontology proteins, and convert all synonyms to lowercase;
f-2). Match the synonyms: match the synonyms of each NCBI Gene protein against the synonyms of each upper-level ontology protein. If the two lowercased synonym sets intersect, the upper-level ontology protein becomes a candidate match for the NCBI Gene protein. If no upper-level ontology protein intersects, the NCBI Gene protein should be added to the upper-level ontology, and step f-6) is executed;
f-3). Filter the candidate matches: determine whether the species information TaxID of the NCBI Gene protein and that of each candidate match identify the same species. If they do, the NCBI Gene protein can be fused with that upper-level ontology protein, and step f-4) is executed. If they identify different species, the two cannot be fused. If all candidate matches have been checked and none can be fused, step f-6) is executed;
f-4). Fusion of the Symbol value: if the "Symbol" of the NCBI Gene protein differs from the "Name" value of the upper-level ontology instance to be fused, keep that "Name" value unchanged and add the "Symbol" to the "Synonym" of that upper-level ontology instance;
f-5). Fusion of the Synonym value: if the "Synonym" of the NCBI Gene protein is identical to the "Synonym" of the upper-level ontology instance to be fused, keep the "Synonym" in the upper-level ontology unchanged; otherwise merge the two synonym sets, deduplicate them, and store the result in the upper-level ontology as the new "Synonym" field value;
f-6). Extension with the NCBI Gene protein: extend the upper-level ontology by aligning NCBI Gene's "Symbol" with the ontology's "Name", NCBI Gene's "taxID" with the ontology's "TaxID", and NCBI Gene's "Synonym" with the ontology's "Synonym".
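The matching in steps f-1) to f-6) can be sketched as follows, using the "PUR9" example from the background section. The dict-based records and the placeholder taxon id `TAX_JANSC` are hypothetical. The pairwise comparison is written as a plain scan for clarity, so a real run over the full data sources would probably need an index from lowercased synonym to entries.

```python
def match_ncbi_gene(ontology, gene):
    """Return the ontology entries that one NCBI Gene record may be fused with."""
    # f-1) lowercase the synonyms so that "PURH" and "purH" compare equal
    gene_synonyms = {s.lower() for s in gene["Synonym"]}
    # f-2) candidates share at least one lowercased synonym
    candidates = [
        entry for entry in ontology
        if gene_synonyms & {s.lower() for s in entry["Synonym"]}
    ]
    # f-3) keep only the candidates that belong to the same species
    return [entry for entry in candidates if entry["TaxID"] == gene["taxID"]]


def merge_ncbi_gene(ontology, gene):
    """Fuse one NCBI Gene record into the ontology, or add it as a new entry."""
    matches = match_ncbi_gene(ontology, gene)
    if not matches:
        # f-6) nothing to fuse with: extend the ontology
        ontology.append({
            "Name": gene["Symbol"],
            "TaxID": gene["taxID"],
            "Synonym": list(dict.fromkeys(gene["Synonym"])),
        })
        return
    for entry in matches:
        merged = list(entry["Synonym"])
        if gene["Symbol"] != entry["Name"]:   # f-4) Symbol becomes a synonym
            merged.append(gene["Symbol"])
        merged.extend(gene["Synonym"])        # f-5) union of the synonym lists
        entry["Synonym"] = list(dict.fromkeys(merged))


ontology = [
    {"Name": "PUR9", "TaxID": "9606", "Synonym": ["PURH", "IMPCHASE"]},
    {"Name": "PUR9", "TaxID": "TAX_JANSC", "Synonym": ["purH"]},  # placeholder id
]
merge_ncbi_gene(ontology, {"Symbol": "ATIC", "taxID": "9606",
                           "Synonym": ["PURH", "AICARFT"]})
# The human entry now lists PURH, IMPCHASE, ATIC, AICARFT; the JANSC entry
# still lists only purH, because the species filter rejected it.
```

Lowercasing is applied only for the comparison. The stored synonyms keep their original capitalization, which the document identifies as species-specific information.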
The beneficial effects of the present invention are as follows. The ontology-based method for constructing a protein/gene synonym table uses Uniprot-Swissprot, BioGRID and NCBI Gene, the most authoritative sources in the biomedical field, as its data sources. First, the acquired data are split so that the single-protein file of each entity in each data source is identified. Second, an upper-level ontology is established with the attributes Name (protein/gene name), TaxID (species type id), Specie (species type name) and Synonym (synonym list). Third, the proteins in Uniprot are aligned to the upper-level ontology one by one, and the proteins in BioGRID are then fused into or aligned to it. Finally, the proteins in NCBI Gene are aligned to the upper-level ontology. The result is a protein/gene synonym table that is more comprehensive in scale, more reliable in accuracy and more finely classified. It provides more comprehensive and authoritative data for querying and reference in the standardization, storage, reasoning and application of protein/gene synonyms, provides a prerequisite for more efficient and accurate literature data mining, and is a powerful aid to biomedical experts in making research discoveries.
Description of the drawings
Fig. 1 is the overall flow chart of the ontology-based method for constructing a protein/gene synonym table of the present invention;
Fig. 2 is a mapping diagram of the key attributes of the three data sources Uniprot-Swissprot, BioGRID and NCBI Gene in the present invention;
Fig. 3 is a diagram of the attribute alignment of the three data sources to the upper-level ontology in the present invention;
Fig. 4 is a schematic diagram of the candidate matches obtained by matching the lowercased synonyms of an NCBI Gene protein against those in the upper-level ontology in the present invention;
Fig. 5 shows the final protein matching result after the candidate matches of Fig. 4 have been filtered by species.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawings and embodiments.
The problem that existing protein/gene synonym tables can hardly be both authoritative and comprehensive. When synonyms are extracted from biomedical literature by automated means, however accurate the model, it is hard to guarantee that the resulting synonym table is comprehensive, because no data benchmark exists. Likewise, it is hard to verify the authority of a synonym table that has not been used and validated by biomedical experts and researchers over a long period.
The problem of data extraction and entity alignment. The technical solution adopted here takes the three most commonly used and most authoritative data sources in the biomedical field, Uniprot-Swissprot, BioGRID and NCBI Gene, as the data basis for synonym construction. These three data sources, however, differ in domain focus, in storage format, in the protein/gene name used as the primary name, and in the position and name of the synonym field. The protein/gene entities in the data sources therefore have to be aligned and fused so that the proteins in every data source share the same semantic structure, which allows a protein mapping to be established.
The problem of classifying synonyms by species. Protein/gene names differ between species in the name itself, in capitalization and in other respects. To build an effective and usable synonym table, the question of how to partition proteins/genes and their synonyms by species must be solved.
Current protein/gene nomenclature still lacks mandatory and complete standards. Extracting synonyms from text, or simply matching synonyms, additionally faces the following difficulties:
Internationally, the naming of proteins and genes has been standardized for commonly studied species, but the remaining species still lack standards and their naming is used inconsistently. Species with established naming conventions include mouse, rat, chicken, human, non-human primates, livestock, fly (Drosophila) and fish. No explicit rules exist for naming the proteins and genes of other species.
Some academic institutions and journals use their own naming conventions. Guangdong Pharmaceutical University, for example, stipulates that gene names and protein names use opposite letter case, that at least the first letter of a protein name is capitalized, and that exceptions are permitted.
In naming conventions, letter case, italics and Greek letters are all important elements of gene/protein naming. The use of case, italics and Greek letters in the generally accepted international nomenclature is shown in Table 1:
Table 1
However, when textual information is stored electronically, usually only the letter case of a name is retained and italics, a key attribute, are lost. This makes aligning synonymous names and classifying them by species very difficult.
本发明的基于本体的蛋白质/基因同义词表构建方法,其特征在于,通过以下步骤来实现:The ontology-based protein/gene synonym table construction method of the present invention is characterized in that, it is realized by the following steps:
a).数据源的获取,获取生物医药领域最权威的Uniprot-Swissprot、BioGRID和NCBIGene三个数据源数据,从Uniprot-Swissprot、BioGRID和NCBI Gene三个数据源获取的文件分别为uniprot-sprot.xml、BIOGRID-ALL-LATSET.psi25.xml和ALL_Date.gene_info;a). Acquisition of data sources, to obtain the most authoritative data sources of Uniprot-Swissprot, BioGRID and NCBIGene in the field of biomedicine. The files obtained from the three data sources of Uniprot-Swissprot, BioGRID and NCBI Gene are respectively uniprot-sprot. xml, BIOGRID-ALL-LATSET.psi25.xml and ALL_Date.gene_info;
Sequence aspect: Uniprot-Swissprot. Uniprot is an internationally renowned protein database and is currently the non-redundant protein sequence database with the most complete sequence data and the richest annotation. Its data come mainly from protein sequences obtained after genome sequencing projects are completed. Swissprot is the expert-reviewed subset, containing checked, manually annotated entries.
Functional aspect: BioGRID. BioGRID is a long-established, classic protein interaction database that has been updated continuously over a long period. It curates, from the biomedical literature, protein interaction information, chimera information, and post-translational modification (PTM) information. Because it focuses on protein interactions, its protein entities and synonym descriptions emphasize the functions and roles of proteins and their corresponding genes.
Gene aspect: NCBI Gene. NCBI stands for the National Center for Biotechnology Information of the United States. NCBI Gene is the gene resource maintained by NCBI. Taking each gene as a unit, it integrates all related information such as pathway, variation, and phenotype, covers sequence information from thousands of different organisms, and represents chromosomes, organelles, plasmids, viruses, transcription templates, and millions of proteins.
Comprehensive and authoritative data sources are the foundation for building the synonym table. The three most authoritative data sources in the biomedical field listed above were therefore selected, so that synonyms are complemented from different perspectives.
In this step, data can be obtained in two ways. One is to use a crawler or an API to capture and download entity information from each data source website online; the other is to obtain the open data files from the website or FTP server and analyze the file structure locally. Each approach has advantages and drawbacks. Online crawling is slow, strongly affected by network conditions, and less stable, but it is not constrained by the release schedule of the data files and retrieves the latest entity information. Downloading data files locally is fast, stable, and persistent, but it is limited by how often the source files are updated, so the data obtained are often not the latest. In addition, file formats and internal organization differ between data sources, so an extraction strategy usually has to be designed for each source. The present method downloads the data files locally; the files obtained from the three data sources are uniprot-sprot.xml, BIOGRID-ALL-LATEST.psi25.xml, and ALL_Data.gene_info.
The Uniprot website provides downloads of Uniprot KB and its sub-database source files at https://www.uniprot.org/downloads. Here the expert-reviewed Swiss-Prot sub-database is downloaded in XML format, which preserves more hierarchical semantic relations; the downloaded file is named "uniprot-sprot.xml". BioGRID provides complete data file downloads at https://downloads.thebiogrid.org/BioGRID/Latest-Release/. Here the latest release archive BIOGRID-ALL-LATEST.psi25.zip is selected, which contains the data file "BIOGRID-ALL-LATEST.psi25.xml" stored in psi25 XML format.
NCBI data files can be downloaded from the official NCBI FTP server. The Gene files are located at: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/
The ALL_Data.gene_info.gz file is selected for download and decompressed to obtain ALL_Data.gene_info, which is essentially a tab-delimited text file. Except for the first line, which describes the field names, each line stores the attribute information of one gene entity.
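A minimal sketch of reading such a tab-delimited file is given below. The column names `tax_id`, `Symbol`, and `Synonyms`, the `|` separator between synonyms, and the `-` placeholder for an empty field are assumptions about the file layout and should be checked against the downloaded header line; the function name `read_gene_info` and the sample row are illustrative only.

```python
import csv
import io


def read_gene_info(lines):
    """Yield one dict per gene from tab-delimited gene_info lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # The first line holds the field names; drop a leading '#' if present.
    header = [h.lstrip("#") for h in next(reader)]
    for row in reader:
        rec = dict(zip(header, row))
        raw = rec.get("Synonyms", "-")
        # Assumption: '-' marks an empty field, '|' separates synonyms.
        synonyms = [] if raw == "-" else raw.split("|")
        yield {"tax_id": rec["tax_id"], "symbol": rec["Symbol"], "synonyms": synonyms}


# Illustrative two-line sample standing in for ALL_Data.gene_info.
sample = io.StringIO(
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n"
    "9606\t7157\tTP53\t-\tBCC7|LFS1|P53|TRP53\n"
)
genes = list(read_gene_info(sample))
```

Because the reader consumes any iterable of lines, the same function could be passed an open file handle so that the multi-gigabyte file is never loaded into memory at once.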
b). Splitting of data files: the XML files obtained in step a) are read line by line while detecting the tags that mark the start and end of an entity, and each pair of start and end tags, together with the content between them, is stored as a single entity file. Of the single files obtained in this way from BIOGRID-ALL-LATEST.psi25.xml, only those corresponding to "Interactor" entities are kept; files corresponding to "Interaction" and "Experiment" entities are ignored. ALL_Data.gene_info stores one entity per line, so it is not split; each entity is obtained by reading the file directly line by line;
Because the downloaded data files are large (6.26 GB for Uniprot-sprot, 6.46 GB for BioGRID, and 3.29 GB for NCBI Gene), they are difficult to process directly. They therefore need to be split by entity into small files that each store a single entity. XML files are read line by line while detecting the tags that mark the start and end of an entity, such as "<entry>" and "</entry>" in Uniprot, and each pair of start and end tags, together with the content between them, is stored as a single file.
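The line-by-line splitting described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it assumes entities are not nested, the function name `split_entities` is hypothetical, and the sample entry name DEMO_HUMAN is a placeholder.

```python
import io


def split_entities(lines, start_tag="<entry", end_tag="</entry>"):
    """Yield each entity as one string, scanning the input line by line."""
    buffer = None  # None means we are currently outside an entity
    for line in lines:
        if buffer is None and start_tag in line:
            buffer = []  # a start tag opens a new entity
        if buffer is not None:
            buffer.append(line)
            if end_tag in line:
                yield "".join(buffer)  # the end tag closes the entity
                buffer = None


# Tiny stand-in for uniprot-sprot.xml with two entries.
sample = io.StringIO(
    "<uniprot>\n"
    "<entry dataset=\"Swiss-Prot\">\n<name>P53_HUMAN</name>\n</entry>\n"
    "<entry dataset=\"Swiss-Prot\">\n<name>DEMO_HUMAN</name>\n</entry>\n"
    "</uniprot>\n"
)
chunks = list(split_entities(sample))
```

Each yielded chunk could then be written to its own small file. Only one entity is held in memory at a time, which is the point of scanning rather than parsing the whole document.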
c). Establishment of the upper-level ontology based on the Uniprot-Swissprot data source: with reference to the actual attributes of proteins/genes in Uniprot, an upper-level ontology is built that includes Name, the name of the protein/gene; TaxID, the species type id of the protein/gene; Specie, the species type name of the protein/gene; and Synonym, the synonym list of the protein/gene;
To fuse the synonym information of the three data sources, the proteins in the three data sources must be aligned. The approach taken is to build a higher-level ontology and fuse the different protein descriptions from the three data sources into the proteins of that ontology. The ontology schema must establish a semantic description standard for entities and for each attribute field of an entity, and must standardize how data values are stored. The requirements are:
(1) By defining classes and adding instances, the ontology can give each instance (for example, P53_HUMAN) a standardized label and name;
(2) By defining object properties (ObjectProperty) and data properties (DataProperty), the ontology can give the standardized attribute field names that an instance contains;
(3) By defining data properties, the ontology can give the standardized type, value range, and unit of attribute values, as well as the storage format specification of attribute values;
(4) The ontology can give a unified definition and format for semantic relations and a unified representation of relations.
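As a rough illustration of the four fields named in step c), an upper-level ontology instance could be modelled in memory as below. The class name `UpperEntity` and the Python attribute names are hypothetical; the patent itself stores instances as OWL individuals, not Python objects. The rule that the synonym list always contains the name follows the description of the Synonym field.

```python
from dataclasses import dataclass, field


@dataclass
class UpperEntity:
    """One protein/gene instance of the upper-level ontology (illustrative)."""
    name: str      # Name: common or first-proposed name
    tax_id: str    # TaxID: NCBI Taxonomy id, kept as a string
    specie: str    # Specie: species type name
    synonyms: list = field(default_factory=list)  # Synonym: one entry per synonym

    def __post_init__(self):
        # The synonym list should contain at least the value of the name field.
        if self.name not in self.synonyms:
            self.synonyms.insert(0, self.name)


p53 = UpperEntity("P53_HUMAN", "9606", "HUMAN", ["P53", "TP53", "LFS1"])
```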
Name, the name of the protein/gene, is its common name or the first name proposed for it in the historical literature. It takes the value of the name field in Uniprot. The format is: <pgso:name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">P53_HUMAN</pgso:name>.
TaxID, the species type id of the protein/gene, is represented uniformly by the Id used in NCBI Taxonomy. The format is: <pgso:taxId rdf:datatype="http://www.w3.org/2001/XMLSchema#string">9606</pgso:taxId>.
Specie, the species type name of the protein/gene, is represented by the species abbreviation used in Uniprot and matches NCBI Taxonomy and its taxId one to one. The format is: <pgso:specie rdf:datatype="http://www.w3.org/2001/XMLSchema#string">HUMAN</pgso:specie>.
Synonym is the synonym list of the protein/gene. It should include at least the value of the name field, and each synonym is stored in its own OWL ontology tag. The format is:
<pgso:synonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">P53</pgso:synonym>
<pgso:synonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">TP53</pgso:synonym>
<pgso:synonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">LFS1</pgso:synonym>
...
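The tag formats shown above can be produced mechanically. The sketch below emits only the property elements in the form given in the text; the enclosing OWL individual and the namespace declarations for `pgso` and `rdf` are not specified in this section and are left out. The function name `to_pgso` is hypothetical.

```python
from xml.sax.saxutils import escape

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def to_pgso(name, tax_id, specie, synonyms):
    """Return the pgso property elements for one instance, one per line."""
    def tag(prop, value):
        # escape() protects characters such as '&' and '<' in names.
        return f'<pgso:{prop} rdf:datatype="{XSD_STRING}">{escape(value)}</pgso:{prop}>'

    lines = [tag("name", name), tag("taxId", tax_id), tag("specie", specie)]
    # Each synonym gets its own element, as required by the format.
    lines += [tag("synonym", s) for s in synonyms]
    return "\n".join(lines)


fragment = to_pgso("P53_HUMAN", "9606", "HUMAN", ["P53", "TP53", "LFS1"])
```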
d). Mapping and fusion of Uniprot-Swissprot into the upper-level ontology: the "name" field of a protein in Uniprot is mapped to the "Name" field of the upper-level ontology, the "NCBI TaxonomyID" field in Uniprot is mapped to the "TaxID" field of the upper-level ontology, and the "gene" field in Uniprot is mapped to the "Synonym" field of the upper-level ontology. This continues until all proteins and attributes in the Uniprot-Swissprot data source have been mapped, so that all protein instances and attributes in the Uniprot data source are fused into the upper-level ontology;
Because the data in Uniprot are well standardized, comprehensive, and authoritative, the Uniprot schema is extracted to provide the main entities and attributes of the initial upper-level ontology. The entities and attributes of the other data sources are aligned to this model; if an entity or attribute appears that the upper-level ontology does not yet cover, the upper-level ontology is extended on this basis.
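The field mapping of step d) might look like the sketch below for a single split entry. The element layout of the sample (`accession`, `name`, `gene`/`name`, and an `organism` reference of type "NCBI Taxonomy") is an assumption about the Uniprot XML and is heavily simplified; the helper names `local` and `map_uniprot_entry` are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{uri}name' down to 'name'."""
    return tag.rsplit("}", 1)[-1]


def map_uniprot_entry(xml_text):
    """Map one Uniprot entry to the upper-level ontology fields."""
    root = ET.fromstring(xml_text)
    record = {"accessions": [], "Name": None, "TaxID": None, "Synonym": []}
    for child in root:
        kind = local(child.tag)
        if kind == "accession":
            record["accessions"].append(child.text)
        elif kind == "name":        # Uniprot "name" -> Name
            record["Name"] = child.text
        elif kind == "gene":        # Uniprot "gene" names -> Synonym
            record["Synonym"] += [n.text for n in child if local(n.tag) == "name"]
        elif kind == "organism":    # NCBI Taxonomy id -> TaxID
            for ref in child:
                if local(ref.tag) == "dbReference" and ref.get("type") == "NCBI Taxonomy":
                    record["TaxID"] = ref.get("id")
    # The synonym list should contain at least the name itself.
    if record["Name"] not in record["Synonym"]:
        record["Synonym"].insert(0, record["Name"])
    return record


entry = """<entry>
  <accession>P04637</accession>
  <name>P53_HUMAN</name>
  <gene><name type="primary">TP53</name><name type="synonym">P53</name></gene>
  <organism><dbReference type="NCBI Taxonomy" id="9606"/></organism>
</entry>"""
record = map_uniprot_entry(entry)
```

Keeping the accession list alongside the mapped fields is a design choice made here because the later BioGRID step looks proteins up by accession.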
e). Mapping and fusion of BioGRID into the upper-level ontology: first, the reference field of BioGRID is searched for the protein that corresponds to a Uniprot protein, and a mapping is established with the accession field in Uniprot. Through the mapping established in step d), the attribute information of the BioGRID protein is then fused into the upper-level ontology;
If no corresponding Uniprot protein can be found through the reference field of BioGRID, the upper-level ontology does not yet contain that protein instance, and the instance must be added. The upper-level ontology is extended by aligning the "shortLabel" of BioGRID to the "Name" of the upper-level ontology, the "ncbiTaxId" of BioGRID to "TaxID", and the "Synonym" of BioGRID to "Synonym". Figure 2 shows the attribute associations and mappings among the three data sources. As it shows, the "primaryRef" or "secondaryRef" field in BioGRID records the Uniprot "accession" values that the protein references. For example, in BioGRID a protein has the "shortLabel" value "HSPA5", and its "secondaryRef" values include the four accessions "Q9UK02", "Q2EF78", "B0QZ61", and "Q9NPF1". These accessions identify the corresponding protein in Uniprot, which establishes the fusion of this BioGRID protein instance into the Uniprot protein instance.
Note that one Uniprot protein may have several accessions, and one BioGRID protein may likewise reference several accessions. The fusion relationship between Uniprot and BioGRID is therefore many-to-many, and every pair that can be fused must be mapped once.
In this step, the fusion or alignment of BioGRID into the upper-level ontology is carried out through the following sub-steps:
e-1). Determine whether the BioGRID protein has a corresponding Uniprot protein: check whether the information in the "primaryRef" or "secondaryRef" field in BioGRID appears among the "accession" values of the Uniprot proteins contained in the upper-level ontology. If it does, the two proteins from BioGRID and Uniprot can be fused, and step e-2) is executed. If it does not, the BioGRID protein should be added by extension, and step e-4) is executed;
e-2). Fusion of the shortLabel value: if the "shortLabel" of the BioGRID protein differs from the "Name" value of the upper-level ontology instance to be fused, the "Name" value of the upper-level ontology is kept unchanged and the "shortLabel" is added to the "Synonym" of the upper-level ontology;
e-3). Fusion of the Synonym values: if the "Synonym" of the BioGRID protein is identical to the "Synonym" of the upper-level ontology, the "Synonym" of the upper-level ontology is kept unchanged. If they are not identical, the "Synonym" of the BioGRID protein is merged with the "Synonym" of the upper-level ontology, duplicates are removed, and the result is stored in the upper-level ontology as the new "Synonym" field value;
e-4). Extension with the BioGRID protein: the upper-level ontology is extended by mapping the "shortLabel" of BioGRID to the "Name" of the upper-level ontology, the "ncbiTaxId" of BioGRID to "TaxID", and the "Synonym" of BioGRID to "Synonym".
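Sub-steps e-1) to e-4) can be sketched as follows. The dictionary layout, the `refs` key (standing for the combined primaryRef and secondaryRef accessions), and the function names are assumptions made for illustration. The accessions come from the HSPA5 example above; the entry name BIP_HUMAN, the interactor DEMO1, and its reference X00000 are illustrative placeholders.

```python
def add_synonym(entity, value):
    """Append value to the Synonym list unless it is already present."""
    if value not in entity["Synonym"]:
        entity["Synonym"].append(value)


def fuse_biogrid(upper, interactor):
    """Fuse one BioGRID interactor into the upper-level ontology list."""
    refs = set(interactor["refs"])
    # e-1: any shared accession means the two proteins can be fused.
    matches = [e for e in upper if refs & e["accessions"]]
    if not matches:
        # e-4: extend the ontology with a new instance.
        new = {"accessions": set(), "Name": interactor["shortLabel"],
               "TaxID": interactor["ncbiTaxId"], "Synonym": []}
        for s in [interactor["shortLabel"]] + interactor["Synonym"]:
            add_synonym(new, s)
        upper.append(new)
        return [new]
    for e in matches:  # many-to-many: fuse into every matching instance
        # e-2: keep Name, record a differing shortLabel as a synonym.
        if interactor["shortLabel"] != e["Name"]:
            add_synonym(e, interactor["shortLabel"])
        # e-3: merge synonym lists without duplicates.
        for s in interactor["Synonym"]:
            add_synonym(e, s)
    return matches


upper = [{"accessions": {"Q9UK02", "Q2EF78"}, "Name": "BIP_HUMAN",
          "TaxID": "9606", "Synonym": ["BIP_HUMAN", "GRP78"]}]
hspa5 = {"shortLabel": "HSPA5", "ncbiTaxId": "9606",
         "refs": ["Q9UK02", "Q2EF78", "B0QZ61", "Q9NPF1"],
         "Synonym": ["GRP78", "BIP"]}
fused = fuse_biogrid(upper, hspa5)
unknown = {"shortLabel": "DEMO1", "ncbiTaxId": "559292",
           "refs": ["X00000"], "Synonym": []}
added = fuse_biogrid(upper, unknown)
```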
f). Mapping and fusion of NCBI Gene into the upper-level ontology: first, all synonyms of the proteins in NCBI Gene and of the proteins in the upper-level ontology are converted to lowercase or uppercase. Then the synonym set of each NCBI Gene protein instance is compared with the synonym set of each protein instance in the upper-level ontology, and the pairs of protein instances that share a synonym are selected. Finally, the pairs are filtered again by species information, keeping only pairs from the same species, and the attribute information of the NCBI Gene protein instance is fused into the upper-level ontology;
If a protein in NCBI Gene shares no synonym with the upper-level ontology, the upper-level ontology established through step e) does not yet cover that protein, and the upper-level ontology is extended by aligning the "Symbol" of NCBI Gene to the "Name" of the upper-level ontology, the "taxID" of NCBI Gene to "TaxID", and the "Synonym" of NCBI Gene to "Synonym". NCBI Gene cannot easily be aligned directly with Uniprot, BioGRID, or the upper-level ontology at the protein level, so matching between NCBI Gene and the upper-level ontology has to start from synonym and species information. First, all synonyms of NCBI Gene proteins and of proteins in the upper-level ontology are converted to lowercase (or uppercase). The purpose is to mask the case differences that the same gene/protein shows across species, so that NCBI Gene and the upper-level ontology can be matched without regard to species at this stage. Second, the synonym set of each protein instance in NCBI Gene is compared with the synonym set of each protein in the upper-level ontology, and proteins with co-occurring synonyms are selected. Each NCBI Gene protein thus matches several candidate proteins in the upper-level ontology (or possibly none at all). Finally, using the species information of the NCBI Gene protein, a protein of the same species is sought among the candidates. If one exists, the two proteins can be aligned.
As Figure 2 shows, there is no explicit reference or pointer relationship between NCBI Gene and either Uniprot or BioGRID, so protein alignment cannot be determined directly by aligning a key attribute field, as is done when fusing BioGRID. Instead, the alignment has to be determined by combining the attribute values already present in NCBI Gene and in the upper-level ontology, that is, by matching the two key pieces of information "TaxId" and "Synonym" to reflect the mapping between proteins.
The fusion or alignment of NCBI Gene into the upper-level ontology described in this step is carried out through the following sub-steps:
f-1). Extraction of species and synonym information: the species information TaxID and the synonym information Synonym are extracted from the NCBI Gene proteins and from the upper-level ontology proteins, and all synonyms are converted to lowercase;
f-2). Synonym matching: the synonym set of each NCBI Gene protein instance is matched against the synonym set of each protein instance in the upper-level ontology. If the two lowercase synonym sets intersect, the upper-level ontology instance is taken as a candidate match for the NCBI Gene instance. If there is no intersection with any instance, the NCBI Gene protein instance should be added to the upper-level ontology, and step f-6) is executed;
f-3). Filtering of candidate matches: determine whether the species information TaxID of the NCBI Gene protein and the TaxID of a candidate match identify the same species. If they do, the NCBI Gene instance and that upper-level ontology instance can be fused, and step f-4) is executed. If the species differ, the pair cannot be fused. If none of the candidate matches can be fused after all have been checked, step f-6) is executed;
f-4). Fusion of the Symbol value: if the "Symbol" of the NCBI Gene protein differs from the "Name" value of the upper-level ontology instance to be fused, the "Name" value of that instance is kept unchanged and the "Symbol" is added to its "Synonym";
f-5). Fusion of the Synonym values: if the "Synonym" of the NCBI Gene protein is identical to the "Synonym" of the upper-level ontology instance to be fused, the "Synonym" of the upper-level ontology is kept unchanged. If they are not identical, the "Synonym" of the NCBI Gene protein is merged with the "Synonym" of the upper-level ontology, duplicates are removed, and the result is stored in the upper-level ontology as the new "Synonym" field value;
f-6). Extension of the upper-level ontology with the NCBI Gene protein: the upper-level ontology is extended by aligning the "Symbol" of NCBI Gene to the "Name" of the upper-level ontology, the "taxID" of NCBI Gene to "TaxID", and the "Synonym" of NCBI Gene to "Synonym".
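Sub-steps f-1) to f-6) can be sketched as below. This illustration makes two assumptions that the text does not state: the gene's "Symbol" is included in the set used for matching, and duplicates are removed by exact string comparison so that case variants are retained. The dictionary layout, function names, and sample synonym values are illustrative.

```python
def lower_set(values):
    """f-1: lowercase a collection of names for species-agnostic matching."""
    return {v.lower() for v in values}


def add_synonym(entity, value):
    """Append value to the Synonym list unless the exact string is present."""
    if value not in entity["Synonym"]:
        entity["Synonym"].append(value)


def fuse_ncbi_gene(upper, gene):
    """Return (candidates, fused) after fusing one NCBI Gene record."""
    names = [gene["Symbol"]] + gene["Synonym"]  # assumption: Symbol takes part
    wanted = lower_set(names)
    # f-2: candidates share at least one lowercase synonym.
    candidates = [e for e in upper if wanted & lower_set(e["Synonym"])]
    # f-3: keep only candidates of the same species.
    matches = [e for e in candidates if e["TaxID"] == gene["taxID"]]
    if not matches:
        # f-6: no usable match, so extend the ontology.
        new = {"Name": gene["Symbol"], "TaxID": gene["taxID"], "Synonym": []}
        for s in names:
            add_synonym(new, s)
        upper.append(new)
        return candidates, [new]
    for e in matches:
        if gene["Symbol"] != e["Name"]:  # f-4: keep Name, add Symbol
            add_synonym(e, gene["Symbol"])
        for s in gene["Synonym"]:        # f-5: merge and deduplicate
            add_synonym(e, s)
    return candidates, matches


upper = [
    {"Name": "P53_HUMAN", "TaxID": "9606", "Synonym": ["P53_HUMAN", "TP53", "P53"]},
    {"Name": "P53_MOUSE", "TaxID": "10090", "Synonym": ["P53_MOUSE", "Tp53", "P53"]},
]
gene = {"Symbol": "Trp53", "taxID": "10090", "Synonym": ["p53"]}
candidates, fused = fuse_ncbi_gene(upper, gene)
```

In this sample both ontology instances become candidates through the shared lowercase synonym "p53", and the species filter then keeps only the mouse instance.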
Figure 4 is a schematic diagram of the candidate matches obtained after lowercase synonym matching between NCBI Gene proteins and the upper-level ontology. The right side shows the synonyms and species in the upper-level ontology, and the left side shows the synonyms and species from the NCBI Gene data source. Filtering for combinations in which the synonym sets of NCBI Gene and the upper-level ontology intersect yields the candidate matches of each NCBI Gene protein, shown as connecting lines in the figure. Each of the three NCBI Gene proteins has several candidate matches. Figure 5 shows the final protein matching result after species matching is applied to the candidates of Figure 4; the result of filtering the candidate matches by species information is shown by the dashed lines in Figure 5.
g). Deduplication of synonyms and export of the synonym table: after the proteins in the three data sources Uniprot-Swissprot, BioGRID, and NCBI Gene have been aligned and fused, their synonym information is extracted and deduplicated to obtain a synonym table carrying species classification information. This table is stored in the synonym list of the upper-level ontology, which completes the construction of the protein/gene synonym table.
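A minimal sketch of the export in step g) is shown below. It writes one row per synonym together with its species id. The tab-separated layout, the column headings, and the function name `export_synonyms` are assumptions chosen for illustration; the patent itself stores the result in the upper-level ontology.

```python
import csv
import io


def export_synonyms(upper, out):
    """Write a deduplicated, species-tagged synonym table to a text stream."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["TaxID", "Name", "Synonym"])
    for entity in upper:
        seen = set()
        for synonym in entity["Synonym"]:
            if synonym in seen:  # skip exact duplicates within one instance
                continue
            seen.add(synonym)
            writer.writerow([entity["TaxID"], entity["Name"], synonym])


upper = [{"Name": "P53_HUMAN", "TaxID": "9606",
          "Synonym": ["P53_HUMAN", "TP53", "P53", "TP53"]}]
buffer = io.StringIO()
export_synonyms(upper, buffer)
table = buffer.getvalue()
```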
As can be seen:
(1) The present invention proposes a method for constructing a gene/protein synonym table that effectively integrates synonym information from several authoritative data sources and classifies synonyms by species. The resulting table is broader in coverage, more reliable in accuracy, and finer in its classification information.
(2) Unlike traditional synonym construction methods, the present invention uses an ontology as the semantic specification for multi-source heterogeneous data when building the synonym table. It defines a semantically standardized construction workflow: proteins containing synonyms are first mapped semantically, protein synonyms are then fused, and the synonym table is built from the result. Semantic mapping improves the accuracy of synonym matching, and the ontology brings advantages in the specification, storage, reasoning, and application of the synonym table.
(3) The present invention adopts a fusion method that builds an upper-level ontology independent of the data sources, which establishes a clear direction of fusion. Building an upper-level ontology means that the synonym information, as knowledge in its own right, is represented, stored, and used in accordance with internationally accepted knowledge representation standards. The upper-level ontology also allows other data sources to be fused conveniently into the existing synonym table through a unified semantic standard, rather than being limited to the three data sources mentioned in the present invention, which greatly improves extensibility.
Claims (4)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010525374.0A CN111710365B (en) | 2020-06-10 | 2020-06-10 | An ontology-based method for building protein/gene synonyms |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010525374.0A CN111710365B (en) | 2020-06-10 | 2020-06-10 | An ontology-based method for building protein/gene synonyms |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111710365A true CN111710365A (en) | 2020-09-25 |
| CN111710365B CN111710365B (en) | 2022-04-08 |
Family
ID=72539520
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010525374.0A Active CN111710365B (en) | 2020-06-10 | 2020-06-10 | An ontology-based method for building protein/gene synonyms |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111710365B (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113539369A (en) * | 2021-07-14 | 2021-10-22 | 江苏先声医学诊断有限公司 | Optimized kraken2 algorithm and application thereof in second-generation sequencing |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20080082483A1 (en) * | 2006-09-29 | 2008-04-03 | Lim Joon-Ho | Method and apparatus for normalizing protein name using ontology mapping |
| CN104424399A (en) * | 2013-08-30 | 2015-03-18 | 中国科学院上海生命科学研究院 | Knowledge navigation method, device and system based on virus protein body |
| CN110688493A (en) * | 2019-09-26 | 2020-01-14 | 京东方科技集团股份有限公司 | A method, device and electronic device for establishing an association relationship |
| CN110717014A (en) * | 2019-09-12 | 2020-01-21 | 北京四海心通科技有限公司 | Ontology knowledge base dynamic construction method |
Non-Patent Citations (2)
| Title |
|---|
| MARCO PELLEGRINI 等: "Protein complex prediction for large protein protein interaction networks with the Core&Peel method", 《BMC BIOINFORMATICS》 * |
| 李满生 等: "蛋白质相互作用信息的文本挖掘研究进展", 《中国科学》 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111710365B (en) | 2022-04-08 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |