
CN110704639A - Method and device for generating acronym documents - Google Patents

Info

Publication number
CN110704639A
Authority
CN
China
Prior art keywords
term
abbreviation
target
terms
concept
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910942205.4A
Other languages
Chinese (zh)
Inventor
孙海霞
邓盼盼
李姣
钱庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Medical Information CAMS
Original Assignee
Institute of Medical Information CAMS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Medical Information CAMS
Priority to CN201910942205.4A
Publication of CN110704639A

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374: Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a method and device for generating an abbreviation document. During interoperation between different knowledge organization systems, terms are extracted from different source vocabularies and merged by synonymy into concepts to generate an integrated vocabulary; the terms take several forms, including full names and abbreviations. Based on abbreviation word-formation rules, target concepts containing abbreviations are identified in the integrated vocabulary. All terms of each target concept and the attributes of each term are extracted, and the term type of each term is determined. Based on the ISO 25964 international standard, the target concept and all of its terms are given a normalized knowledge representation, producing a semantically rich, standards-compliant abbreviation document. The abbreviations and the concepts containing them inherit the rich semantic attribute information of the source vocabularies and the integrated vocabulary, which reduces abbreviation ambiguity.


Description

Method and device for generating acronym documents

Technical Field

The present invention relates to the technical field of data processing, and more particularly to a method and device for generating an abbreviation document.

Background

An abbreviation is a simplified form of a full name. In Chinese, an abbreviation is a word formed by shortening a longer expression for ease of use; in English, an abbreviation is usually obtained by shortening a polysyllabic word. For example, “photo” is the abbreviation of “photograph”.

At present, abbreviations are generally identified and extracted by processing methods such as word segmentation and tagging combined with manual proofreading, and an abbreviation dictionary is generated to describe them. However, a traditional abbreviation dictionary contains only the abbreviation, the full name, the Chinese translation, and a note. In practice, different systems may describe the same abbreviation differently, so an abbreviation dictionary easily causes ambiguity when it is applied across multiple fields.

Summary of the Invention

In view of this, the present invention provides a method and device for generating an abbreviation document. A unified abbreviation metadata standard is established based on the ISO 25964 international standard; the result is semantically rich and standards-compliant, and reduces abbreviation ambiguity.

To solve the above technical problem, the present invention provides the following technical solutions:

A method for generating an abbreviation document, comprising:

during interoperation between different knowledge organization systems, extracting terms from different source vocabularies and merging them by synonymy into concepts to generate an integrated vocabulary, the terms taking several forms including full names and abbreviations;

identifying, based on abbreviation word-formation rules, target concepts containing abbreviations in the integrated vocabulary;

extracting the attributes of the target concept and of all its terms, and determining the term type of each term;

performing, based on the ISO 25964 international standard, normalized knowledge representation of the target concept and all terms of the target concept to generate the abbreviation document.

Optionally, identifying, based on abbreviation word-formation rules, target concepts containing abbreviations in the integrated vocabulary comprises:

comparing the terms of each concept in the integrated vocabulary with the records in an abbreviation dictionary;

according to the comparison result, determining as a target concept containing an abbreviation each concept in the integrated vocabulary that contains both an abbreviation from the abbreviation dictionary and the corresponding full name.

Optionally, extracting the attributes of the target concept and of all its terms and determining the term type of each term comprises:

extracting the attributes of the target concept containing abbreviations in the integrated vocabulary and of all its terms, where the concept attributes include the concept identifier, terms, note, status, subject classification, hierarchical and associative relationships, and the integrated vocabulary to which the concept belongs, and the term attributes include the term identifier, lexical form, usage status, note, and source vocabulary;

extracting the first letter of each non-preposition English word in a target term, the target term being a term composed of multiple English words;

concatenating the uppercase forms of the initials to obtain a spliced string;

if the spliced string exists in the target concept of the integrated vocabulary, the spliced string is the target abbreviation of the target term in the target concept, and the term type of the target abbreviation is canonical abbreviation; if the first letter of every non-preposition English word of the target term is uppercase, the term type of the target term is canonical full name; if the first letters of the non-preposition English words of the target term are lowercase or mixed case, the term type of the target term is common full name.

Optionally, the method further comprises:

extracting, based on inflection rules, the terms in the integrated vocabulary that are inflectional variants of the target term within the same concept, the term type of such terms being other full name.

Optionally, the method further comprises:

extracting the canonical full name and canonical abbreviation of the terms in the abbreviation document to obtain a term whose term type is preferred term, and adding the preferred term to the abbreviation document based on the ISO 25964 international standard.

Optionally, performing, based on the ISO 25964 international standard, normalized knowledge representation of the target concept and all terms of the target concept to generate the abbreviation document comprises:

generating, based on the ISO 25964 international standard, the basic information and relationship description of the target concept, together with the basic information description and term type identifier of every term of the target concept, this descriptive information constituting the metadata model of the abbreviation document;

generating the abbreviation document according to the metadata model of the abbreviation document, the terms of the target concept, and the term type of each term.

A device for generating an abbreviation document, comprising:

an integrated vocabulary generation unit, configured to, during interoperation between different knowledge organization systems, extract terms from different source vocabularies and merge them by synonymy into concepts to generate an integrated vocabulary, the terms taking several forms including full names and abbreviations;

a concept identification unit, configured to identify, based on abbreviation word-formation rules, target concepts containing abbreviations in the integrated vocabulary;

a term extraction unit, configured to extract the attributes of the target concept and of all its terms, and to determine the term type of each term;

an abbreviation document generation unit, configured to perform, based on the ISO 25964 international standard, normalized knowledge representation of the target concept and all terms of the target concept to generate the abbreviation document.

Optionally, the concept identification unit is specifically configured to:

compare the terms of each concept in the integrated vocabulary with the records in an abbreviation dictionary;

according to the comparison result, determine as a target concept containing an abbreviation each concept in the integrated vocabulary that contains both an abbreviation from the abbreviation dictionary and the corresponding full name.

Optionally, the term extraction unit is specifically configured to:

extract the attributes of the target concept containing abbreviations in the integrated vocabulary and of all its terms, where the concept attributes include the concept identifier, terms, note, status, subject classification, hierarchical and associative relationships, and the integrated vocabulary to which the concept belongs, and the term attributes include the term identifier, lexical form, usage status, note, and source vocabulary;

extract the first letter of each non-preposition English word in a target term, the target term being a term composed of multiple English words;

concatenate the uppercase forms of the initials to obtain a spliced string;

if the spliced string exists in the target concept of the integrated vocabulary, treat the spliced string as the target abbreviation of the target term in the target concept, the term type of the target abbreviation being canonical abbreviation; if the first letter of every non-preposition English word of the target term is uppercase, the term type of the target term is canonical full name; if the first letters of the non-preposition English words of the target term are lowercase or mixed case, the term type of the target term is common full name.

Optionally, the term extraction unit is further configured to extract, based on inflection rules, the terms in the integrated vocabulary that are inflectional variants of the target term within the same concept, the term type of such terms being other full name.

Optionally, the device further comprises:

a preferred term extraction unit, configured to extract the canonical full name and canonical abbreviation of the terms in the abbreviation document to obtain a term whose term type is preferred term, and to add the preferred term to the abbreviation document based on the ISO 25964 international standard.

Optionally, the abbreviation document generation unit is specifically configured to:

generate, based on the ISO 25964 international standard, the basic information and relationship description of the target concept, together with the basic information description and term type identifier of every term of the target concept, this descriptive information constituting the metadata model of the abbreviation document;

generate the abbreviation document according to the metadata model of the abbreviation document, the terms of the target concept, and the term type of each term.

Compared with the prior art, the present invention has the following beneficial effects:

In the method for generating an abbreviation document disclosed by the present invention, terms are extracted from different source vocabularies during interoperation between different knowledge organization systems and merged by synonymy into concepts to generate an integrated vocabulary, and target concepts containing abbreviations are identified in the integrated vocabulary based on abbreviation word-formation rules. Extracting all terms of the target concept and the attributes of each term, and determining the term type of each term, enriches the semantic information of the abbreviations and reduces abbreviation ambiguity. Representing the target concept and all of its terms in normalized form based on the ISO 25964 international standard makes the generated abbreviation document semantically rich and standards-compliant.

Brief Description of the Drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a method for generating an abbreviation document disclosed in an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a method for extracting terms belonging to the same concept in an integrated vocabulary, disclosed in an embodiment of the present invention;

FIG. 3 is a schematic flowchart of a method for generating an abbreviation document based on ISO 25964, disclosed in an embodiment of the present invention;

FIG. 4 is a schematic diagram of an abbreviation document disclosed in an embodiment of the present invention;

FIG. 5 is a schematic diagram of another abbreviation document disclosed in an embodiment of the present invention;

FIG. 6 is a schematic diagram of yet another abbreviation document disclosed in an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a device for generating an abbreviation document disclosed in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

Through research, the inventors found that the source vocabularies of different knowledge organization systems express the same abbreviation in many different ways, each carrying rich semantic relationship attributes and subject characteristics. For example, “慢性粒细胞白血病” (chronic granulocytic leukemia) has several English forms, such as “chronic granulocytic leukemias”, “Chronic Granulocytic Leukemia”, and “Granulocytic Leukemia, Chronic”, while “先天性全身性脂质营养不良” is expressed in English as “Congenital Generalized Lipodystrophy”. Both are abbreviated as CGL.

Because a traditional abbreviation dictionary contains only the abbreviation, the full name, the Chinese translation, and a note, the true meaning of “CGL” cannot be identified accurately when only “CGL” appears in a text. In the abbreviation document that the present invention extracts from the integrated vocabulary, each abbreviation and its concept carry explicit semantic relationships and subject characteristics: for example, CGL (Congenital Generalized Lipodystrophy) belongs to “metabolic skin disease”, whereas CGL (Chronic Granulocytic Leukemia) belongs to “blood disease” and “tumor”. Combined with the context in which “CGL” appears, its meaning can be determined quickly, which reduces ambiguity in multi-field applications.
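The disambiguation idea described above can be sketched in Python as follows. This is only an illustration, not part of the disclosed method: the `Concept` class, the `resolve` function, the identifiers `C1` and `C2`, and the overlap-based matching are all assumptions; only the two CGL examples and their subject categories come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Concept:
    concept_id: str  # hypothetical identifier, for illustration only
    full_name: str
    abbreviation: str
    subject_categories: Set[str] = field(default_factory=set)


# The two CGL concepts named in the description.
CONCEPTS = [
    Concept("C1", "Congenital Generalized Lipodystrophy", "CGL",
            {"metabolic skin disease"}),
    Concept("C2", "Chronic Granulocytic Leukemia", "CGL",
            {"blood disease", "tumor"}),
]


def resolve(abbreviation: str, context_categories: Set[str]) -> List[Concept]:
    """Return concepts for an abbreviation whose subject categories overlap the context."""
    candidates = [c for c in CONCEPTS if c.abbreviation == abbreviation]
    matching = [c for c in candidates if c.subject_categories & context_categories]
    # Assumed fallback: keep every candidate when the context does not discriminate.
    return matching or candidates
```

With a context tagged “tumor”, `resolve("CGL", {"tumor"})` keeps only the leukemia concept; with an empty context both candidates remain.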

This embodiment discloses a method for generating an abbreviation document, applied to a system that can interface with different knowledge organization systems. Referring to FIG. 1, the method comprises the following steps:

S101: during interoperation between different knowledge organization systems, extract terms from different source vocabularies and merge them by synonymy into concepts to generate an integrated vocabulary, the terms taking several forms including full names and abbreviations;

A knowledge organization system (KOS) is a collective term for the various schemes and methods used to organize knowledge, and is an important means of acquiring and using knowledge. In concrete applications, a knowledge organization system serves as a semantic tool. The integrated vocabulary corresponds to at least one source vocabulary; source vocabularies cover traditional subject heading lists and thesauri, as well as hierarchical structures used for website navigation and browsing, and Semantic Web ontologies.

Interoperation of knowledge organization systems means compatibility and interchange between different knowledge organization systems.

During interoperation between different knowledge organization systems, terms are extracted from different source vocabularies and merged by synonymy into concepts. For example, using semantic analysis techniques, the terms “Chronic Granulocytic Leukemia”, “CGL”, “chronic granulocytic leukemias”, “Granulocytic Leukemia, Chronic”, and “granulocytic leukemia, chronic” from different source vocabularies are merged into one concept, and the integrated vocabulary is generated. The integrated vocabulary covers the terms and attribute information of the source vocabularies of the different knowledge organization systems, which enriches the semantic information of the abbreviations and their concepts in the integrated vocabulary.

The terms take several forms, including full names and abbreviations; for example, the full name “Chronic Granulocytic Leukemia” corresponds to the abbreviation “CGL”.
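The synonym merging of S101 can be sketched as a grouping step. The text does not specify the semantic analysis technique, so this minimal sketch assumes that synonym pairs have already been produced upstream and only shows the merging itself, using a union-find structure. The function name `merge_into_concepts` and the source names `VocabA` and `VocabB` are hypothetical.

```python
from collections import defaultdict


def merge_into_concepts(terms, synonym_pairs):
    """Group terms into concepts using union-find over known synonym pairs.

    terms: list of (term, source_vocabulary) tuples.
    synonym_pairs: iterable of (term_a, term_b) judged synonymous upstream;
        both members are assumed to occur in `terms`.
    """
    parent = {term: term for term, _ in terms}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in synonym_pairs:
        parent[find(a)] = find(b)

    concepts = defaultdict(list)
    for term, source in terms:
        # Each term keeps its source vocabulary as an attribute.
        concepts[find(term)].append({"term": term, "source": source})
    return list(concepts.values())


example_terms = [
    ("Chronic Granulocytic Leukemia", "VocabA"),
    ("CGL", "VocabA"),
    ("chronic granulocytic leukemias", "VocabB"),
    ("Congenital Generalized Lipodystrophy", "VocabB"),
]
example_pairs = [
    ("Chronic Granulocytic Leukemia", "CGL"),
    ("CGL", "chronic granulocytic leukemias"),
]
example_concepts = merge_into_concepts(example_terms, example_pairs)
```

In this example the three leukemia terms end up in one concept and the lipodystrophy term forms a concept of its own, each term retaining its source vocabulary.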

S102: based on abbreviation word-formation rules, identify the target concepts containing abbreviations in the integrated vocabulary;

The terms of each concept in the integrated vocabulary are compared with the records in an abbreviation dictionary;

According to the comparison result, each concept in the integrated vocabulary that contains both an abbreviation from the abbreviation dictionary and the corresponding full name is determined to be a target concept containing an abbreviation.

For example, if a concept in the integrated vocabulary contains both the abbreviation “CGL” from the abbreviation dictionary and the corresponding full name “Chronic Granulocytic Leukemia”, it is determined to be a target concept.
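The comparison in S102 can be sketched as below. The data shapes are assumptions made for illustration: the integrated vocabulary is modeled as a mapping from a concept identifier to its set of terms, and the abbreviation dictionary as a mapping from an abbreviation to its known full names. The function name and the identifiers `C1` to `C3` are hypothetical.

```python
def find_target_concepts(integrated_vocabulary, abbreviation_dictionary):
    """Return ids of concepts holding both an abbreviation and a matching full name.

    integrated_vocabulary: dict mapping concept id -> set of terms.
    abbreviation_dictionary: dict mapping abbreviation -> set of full names.
    """
    targets = []
    for concept_id, terms in integrated_vocabulary.items():
        for abbreviation, full_names in abbreviation_dictionary.items():
            # Both the abbreviation and one of its full names must be present.
            if abbreviation in terms and terms & full_names:
                targets.append(concept_id)
                break
    return targets


vocabulary = {
    "C1": {"CGL", "Chronic Granulocytic Leukemia", "chronic granulocytic leukemias"},
    "C2": {"CGL"},  # abbreviation only, so not a target concept
    "C3": {"Photograph"},
}
dictionary = {
    "CGL": {"Chronic Granulocytic Leukemia", "Congenital Generalized Lipodystrophy"},
}
targets = find_target_concepts(vocabulary, dictionary)
```

Only `C1` qualifies here, because it is the only concept that contains the abbreviation together with one of its dictionary full names.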

S103: extract the attributes of the target concept and of all its terms, and determine the term type of each term;

Chinese and English abbreviations follow different word-formation rules: for example, “欧盟” is the Chinese abbreviation of “欧洲联盟” (European Union), whereas “EU” is the English abbreviation of “European Union”.

Taking the English abbreviation word-formation rule as an example and referring to FIG. 2, extracting the attributes of the target concept and of all its terms and determining the term type of each term comprises the following steps:

S201: extract the attributes of the target concept containing abbreviations in the integrated vocabulary and of all its terms;

The concept attributes include the concept identifier, terms, note, status (such as disabled or active), subject classification, hierarchical and associative relationships, and the integrated vocabulary to which the concept belongs. The term attributes include the term identifier, lexical form (such as English text or Chinese text), usage status (such as disabled or active), note, and source vocabulary.

The target term is the term currently being processed; it can be any term in the integrated vocabulary.

S202: extract the first letter of each non-preposition English word in the target term, the target term being a term composed of multiple English words;

Prepositions such as of, at, in, and by carry no substantive meaning, and abbreviations generally do not include their initials. Therefore, the initials of prepositions are skipped when the initials of the English words in the target term are extracted.

S203: concatenate the uppercase forms of the initials to obtain a spliced string;

S204: determine whether the spliced string exists in the target concept of the integrated vocabulary;

If it does not exist, S205: determine that the target term is a term without an abbreviation;

If it exists, S206: determine the spliced string to be the target abbreviation of the target term in the target concept, and determine the term type of the target abbreviation to be canonical abbreviation;

S207: determine whether the first letter of every non-preposition English word of the target term is uppercase;

If so, S208: determine the term type of the target term to be canonical full name;

If not, S209: determine the term type of the target term to be common full name.

That is, if the first letters of the non-preposition English words of the target term are lowercase or mixed case, the term type of the target term is common full name. Taking the target term “Chronic Granulocytic Leukemia” as an example, the uppercase initials are extracted to form the spliced string “CGL”; if the target term is “chronic granulocytic leukemias”, the extracted initials must first be converted to uppercase. If “CGL” exists in the integrated vocabulary, “CGL” is the target abbreviation of the target term “Chronic Granulocytic Leukemia” in the target concept; the term type of “CGL” is canonical abbreviation and the term type of “Chronic Granulocytic Leukemia” is canonical full name. If the target term is “chronic granulocytic leukemias”, its term type is common full name.
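Steps S202 to S209 can be sketched as a single function. This is a minimal illustration under stated assumptions: the preposition list is an illustrative sample (the text names only of, at, in, and by), words are split on whitespace, and the function name, return shape, and the label “no abbreviation” are hypothetical.

```python
# Illustrative preposition list; the description names only "of", "at", "in", "by".
PREPOSITIONS = {"of", "at", "in", "by", "on", "for", "with", "to", "from"}


def classify_term(term, concept_terms):
    """Apply S202-S209 to one target term within its concept.

    Returns None for single-word terms (not a valid target term), otherwise a
    dict mapping each classified term to its term type.
    """
    all_words = term.split()
    words = [w for w in all_words if w.lower() not in PREPOSITIONS]
    if len(all_words) < 2 or not words:
        return None
    # S202/S203: uppercase initials of the non-preposition words, concatenated.
    spliced = "".join(w[0].upper() for w in words)
    # S204/S205: the spliced string must already be a term of the concept.
    if spliced not in concept_terms:
        return {term: "no abbreviation"}
    # S207-S209: capitalization of the initials decides the full-name type.
    capitalized = all(w[0].isupper() for w in words)
    full_type = "canonical full name" if capitalized else "common full name"
    # S206: the spliced string is the canonical abbreviation.
    return {spliced: "canonical abbreviation", term: full_type}


concept = {
    "CGL",
    "Chronic Granulocytic Leukemia",
    "chronic granulocytic leukemias",
    "Granulocytic Leukemia, Chronic",
}
```

Note that the inverted form “Granulocytic Leukemia, Chronic” yields the initials “GLC”, which is not a term of the concept, so this step alone does not link it to “CGL”; the description handles such forms separately as inflectional variants.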

S104: based on the ISO 25964 international standard, perform normalized knowledge representation of the target concept and all terms of the target concept to generate the abbreviation document.

ISO 25964 is titled Thesauri and Interoperability with Other Vocabularies. ISO 25964-1, Thesauri for Information Retrieval, covers the content and compilation of monolingual and multilingual thesauri, the principles of thesaurus use, the basic principles of thesaurus construction and management, the basic functions of thesaurus management software, and the thesaurus data model. ISO 25964-2, Interoperability with Other Vocabularies, provides models and guidelines for mapping between thesauri and other types of knowledge organization systems.

Specifically, referring to FIG. 3, the method for generating an abbreviation document based on ISO 25964 comprises the following steps:

S301: based on the ISO 25964 international standard, generate the basic information and relationship description of the target concept, together with the basic information description and term type identifier of every term of the target concept; this descriptive information constitutes the metadata model of the abbreviation document;

The basic information of the target concept includes the concept identifier, terms, note, status, and subject classification described above; the relationship description of the target concept includes the hierarchical and associative relationships described above; the basic information description of a term includes the term identifier, lexical form, usage status, note, and source vocabulary.

S302: generate the abbreviation document according to the metadata model of the abbreviation document, the terms of the target concept, and the term type of each term.

Taking the abbreviation document shown in FIG. 4 as an example, ID is the concept identifier and AID is the term identifier; ID, AID, the term type, and the attribute information of the concept in the integrated vocabulary together constitute the metadata model of the abbreviation document.
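One possible shape for such a record is sketched below. Only the `ID` and `AID` labels and the attribute list come from the description; the remaining field names, the sample identifier values, the vocabulary names, and the choice of JSON as the serialization are assumptions made for illustration, not the format defined by the patent or by ISO 25964.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TermRecord:
    AID: str                 # term identifier
    lexical_value: str       # lexical form of the term
    term_type: str           # e.g. "canonical abbreviation"
    status: str = "active"   # usage status
    note: str = ""
    source_vocabulary: str = ""


@dataclass
class ConceptRecord:
    ID: str                  # concept identifier
    subject_categories: List[str] = field(default_factory=list)
    broader: List[str] = field(default_factory=list)   # hierarchical relationships
    related: List[str] = field(default_factory=list)   # associative relationships
    status: str = "active"
    note: str = ""
    integrated_vocabulary: str = ""
    terms: List[TermRecord] = field(default_factory=list)


def build_document(concepts):
    """Serialize concept records, including nested terms, to a JSON string."""
    return json.dumps([asdict(c) for c in concepts], ensure_ascii=False, indent=2)


# Hypothetical identifiers and vocabulary names, for illustration only.
example = ConceptRecord(
    ID="C0001",
    subject_categories=["blood disease", "tumor"],
    integrated_vocabulary="IntegratedVocab",
    terms=[
        TermRecord("A0001", "CGL", "canonical abbreviation", source_vocabulary="VocabA"),
        TermRecord("A0002", "Chronic Granulocytic Leukemia", "canonical full name",
                   source_vocabulary="VocabA"),
        TermRecord("A0003", "chronic granulocytic leukemias", "common full name",
                   source_vocabulary="VocabB"),
    ],
)
document = build_document([example])
```

Keeping the terms nested under their concept means each abbreviation carries the concept's subject classification and relationships with it, which is the property the description relies on to reduce ambiguity.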

The metadata standard is highly compatible: because it is a custom extension based on the vocabulary interoperability standard ISO 25964-2, it offers good compatibility and interoperability.

In addition to abbreviations, full names, Chinese translations, and notes, the abbreviation document disclosed in this embodiment integrates rich semantic information about the source concept and the concept in which each term currently resides, which reduces ambiguity.

Preferably, building on the abbreviation document disclosed in the above embodiment, the canonical full name and canonical abbreviation of a term in the abbreviation document are extracted to obtain a term whose term type is preferred term, and that term is added to the abbreviation document based on the ISO 25964 international standard.

Taking the abbreviation document shown in FIG. 5 as an example, the term whose term type is preferred term is CGL (Chronic Granulocytic Leukemia). Adding preferred terms enriches the abbreviation-related term forms in the abbreviation document and reduces ambiguity when abbreviations are used.
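As a small illustration of the preferred-term step, the sketch below combines a canonical abbreviation and a canonical full name into a single entry. The "abbreviation (full name)" layout is an assumption inferred from the FIG. 5 example, and the function name `build_preferred_term` is hypothetical.

```python
def build_preferred_term(canonical_abbreviation: str, canonical_full_name: str) -> str:
    """Combine the two canonical forms into one preferred-term string.

    Assumed layout: "<abbreviation> (<full name>)".
    """
    return f"{canonical_abbreviation} ({canonical_full_name})"


preferred = build_preferred_term("CGL", "Chronic Granulocytic Leukemia")
```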

In addition to extracting concepts containing abbreviations and all of their terms from the integrated vocabulary based on abbreviation word-formation rules, other term forms with the same meaning can be extracted from the integrated vocabulary based on morphological variation rules. These rules cover singular and plural variants, tense variants, and grammar-based variants.

Taking the abbreviation document shown in FIG. 6 as an example, "Granulocytic Leukemia, Chronic" and "granulocytic leukemia, chronic" are grammar-based morphological variants of "Chronic Granulocytic Leukemia". In essence, "Granulocytic Leukemia, Chronic", "granulocytic leukemia, chronic", and "Chronic Granulocytic Leukemia" are terms for the same concept.
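The description does not say how grammar-based variants are detected. One possible approach, shown here purely as an assumption, is to undo the comma inversion ("head, modifier" becomes "modifier head") and compare the results case-insensitively. The function names `uninvert` and `is_variant` are hypothetical, and singular/plural or tense variants are not handled.

```python
def uninvert(term: str) -> str:
    """Turn 'Granulocytic Leukemia, Chronic' into 'Chronic Granulocytic Leukemia'.

    Terms without a comma are returned unchanged.
    """
    head, sep, modifier = term.rpartition(",")
    if not sep:
        return term.strip()
    return f"{modifier.strip()} {head.strip()}".strip()


def is_variant(candidate: str, target: str) -> bool:
    """True if the two terms match after un-inverting and ignoring case."""
    return uninvert(candidate).casefold() == uninvert(target).casefold()


target = "Chronic Granulocytic Leukemia"
variants = [
    t for t in ["Granulocytic Leukemia, Chronic", "granulocytic leukemia,chronic", "Hairy Cell Leukemia"]
    if is_variant(t, target)
]
```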

Based on morphological variation rules, terms in the integrated vocabulary that are morphological variants of the target term of the above embodiment and belong to the same concept are extracted. The term type of such terms is "other full name".

As can be seen, in the method for generating an abbreviation document disclosed in this embodiment, terms are extracted from vocabularies of different sources during interoperation of different knowledge organization systems and merged by synonymy into concepts to generate an integrated vocabulary. Target concepts containing abbreviations are then identified in the integrated vocabulary based on abbreviation word-formation rules, the attributes of each target concept and all of its terms are extracted, and the term type of each term is determined. This enriches the semantic information of abbreviations and reduces abbreviation ambiguity. Performing normalized knowledge representation of the target concept and all of its terms based on the ISO 25964 international standard makes the generated abbreviation document semantically rich and standards-compliant.

Based on the method for generating an abbreviation document disclosed in the above embodiment, this embodiment correspondingly discloses an apparatus for generating an abbreviation document. Referring to FIG. 7, the apparatus includes:

an integrated vocabulary generation unit 701, configured to extract terms from vocabularies of different sources during interoperation of different knowledge organization systems and merge synonymous terms into concepts to generate an integrated vocabulary, where the terms take multiple forms, including full names and abbreviations;

a concept identification unit 702, configured to identify, based on abbreviation word-formation rules, target concepts containing abbreviations in the integrated vocabulary;

a term extraction unit 703, configured to extract the attributes of the target concept and of all of its terms, and to determine the term type of each term;

an abbreviation document generation unit 704, configured to perform, based on the ISO 25964 international standard, normalized knowledge representation of the target concept and all terms of the target concept, and to generate the abbreviation document.

Optionally, the concept identification unit 702 is specifically configured to:

compare the terms in each concept of the integrated vocabulary with the records in an abbreviation dictionary;

according to the comparison result, determine as a target concept containing an abbreviation any concept of the integrated vocabulary in which both an abbreviation from the abbreviation dictionary and its corresponding full name are present.
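The dictionary comparison performed by the concept identification unit can be sketched as follows. This is an illustrative assumption rather than the patented implementation: the integrated vocabulary is modeled as a mapping from a concept identifier to its set of terms, the abbreviation dictionary as a mapping from an abbreviation to its known full names, matching is exact and case-sensitive, and all identifiers and sample entries are made up.

```python
from typing import Dict, List, Set


def find_target_concepts(
    vocabulary: Dict[str, Set[str]],
    abbreviation_dictionary: Dict[str, Set[str]],
) -> List[str]:
    """Return IDs of concepts that hold both an abbreviation and one of its full names."""
    targets = []
    for concept_id, terms in vocabulary.items():
        for term in terms:
            full_names = abbreviation_dictionary.get(term)
            # The concept qualifies only if the abbreviation AND a matching
            # full name from the dictionary appear among its terms.
            if full_names and terms & full_names:
                targets.append(concept_id)
                break
    return targets


vocabulary = {
    "C001": {"CGL", "Chronic Granulocytic Leukemia", "Granulocytic Leukemia, Chronic"},
    "C002": {"CGL"},                      # abbreviation without its full name
    "C003": {"Hairy Cell Leukemia"},      # no abbreviation at all
}
abbreviation_dictionary = {"CGL": {"Chronic Granulocytic Leukemia"}}
targets = find_target_concepts(vocabulary, abbreviation_dictionary)
```

Requiring the full name alongside the abbreviation is what keeps a bare, possibly ambiguous abbreviation such as the one in `C002` from being treated as a target concept.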

Optionally, the term extraction unit 703 is specifically configured to:

extract the attributes of the target concept containing an abbreviation in the integrated vocabulary and of all of its terms, where the concept attributes include the concept identifier, terms, notes, status, subject classification, hierarchical and associative relationships, and the integrated vocabulary in which the concept resides, and the term attributes include the term identifier, word form, usage status, notes, and source vocabulary;

extract the initial letter of each non-preposition English word in a target term, where the target term is a term consisting of multiple English words;

concatenate the uppercase letters corresponding to the initial letters to obtain a spliced string;

if the spliced string exists in the target concept of the integrated vocabulary, treat the spliced string as the target abbreviation of the target term in the target concept, where the term type of the target abbreviation is canonical abbreviation. If the initial letter of every non-preposition English word of the target term is uppercase, the term type of the target term is canonical full name; if the initial letters are lowercase or a mix of uppercase and lowercase, the term type of the target term is ordinary full name.
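The initial-letter rule above can be sketched in a few lines. The preposition list is an assumption (the description does not enumerate which words count as prepositions), and the function names `splice_initials` and `classify_term` are hypothetical.

```python
from typing import Dict, Optional, Set

# Assumed, non-exhaustive preposition list; the source does not specify one.
PREPOSITIONS = {"of", "in", "on", "for", "with", "to", "by", "at", "from"}


def splice_initials(term: str) -> str:
    """Concatenate the uppercased initial of every non-preposition word."""
    words = [w for w in term.split() if w.lower() not in PREPOSITIONS]
    return "".join(w[0].upper() for w in words)


def classify_term(term: str, concept_terms: Set[str]) -> Optional[Dict[str, str]]:
    """Return term types for a multi-word term and its abbreviation, or None.

    None means the term is a single word or its spliced initials are not
    among the terms of the same concept.
    """
    words = term.split()
    if len(words) < 2:
        return None
    abbreviation = splice_initials(term)
    if abbreviation not in concept_terms:
        return None
    content_words = [w for w in words if w.lower() not in PREPOSITIONS]
    if all(w[0].isupper() for w in content_words):
        full_name_type = "canonical full name"
    else:
        full_name_type = "ordinary full name"  # lowercase or mixed initials
    return {abbreviation: "canonical abbreviation", term: full_name_type}


concept_terms = {"CGL", "Chronic Granulocytic Leukemia", "chronic granulocytic leukemia"}
capitalized = classify_term("Chronic Granulocytic Leukemia", concept_terms)
lowercase = classify_term("chronic granulocytic leukemia", concept_terms)
```

Checking the spliced string against the terms of the same concept, rather than against the whole vocabulary, keeps an abbreviation tied to the one concept whose full name produced it.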

Optionally, the term extraction unit 703 is further configured to extract, based on morphological variation rules, terms in the integrated vocabulary that are morphological variants of the target term and belong to the same concept, where the term type of such terms is "other full name".

Optionally, the apparatus further includes:

a preferred term extraction unit, configured to extract the canonical full name and canonical abbreviation of a term in the abbreviation document to obtain a term whose term type is preferred term, and to add that term to the abbreviation document based on the ISO 25964 international standard.

Optionally, the abbreviation document generation unit 704 is specifically configured to:

generate, based on the ISO 25964 international standard, the basic information and relationship description of the target concept, as well as the basic information description and term type identifier of every term of the target concept, where this descriptive information constitutes the metadata model of the abbreviation document;

generate the abbreviation document according to the metadata model of the abbreviation document, the terms corresponding to the target concept, and the term type of each term.

In the apparatus for generating an abbreviation document disclosed in this embodiment, terms are extracted from vocabularies of different sources during interoperation of different knowledge organization systems and merged by synonymy into concepts to generate an integrated vocabulary. Target concepts containing abbreviations are then identified in the integrated vocabulary based on abbreviation word-formation rules, the attributes of each target concept and all of its terms are extracted, and the term type of each term is determined. This enriches the semantic information of abbreviations and reduces abbreviation ambiguity. Performing normalized knowledge representation of the target concept and all of its terms based on the ISO 25964 international standard makes the generated abbreviation document semantically rich and standards-compliant.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be understood by reference to one another. Because the apparatus disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief; for related details, refer to the description of the method.

It should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above description of the disclosed embodiments enables a person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for generating an abbreviation document, the method comprising:
during interoperation of different knowledge organization systems, extracting terms from vocabularies of different sources and merging synonymous terms into concepts to generate an integrated vocabulary, wherein the terms take multiple forms, including full names and abbreviations;
identifying, based on abbreviation word-formation rules, a target concept containing an abbreviation in the integrated vocabulary;
extracting attributes of the target concept and of all of its terms, and determining a term type of each term;
and performing, based on the ISO 25964 international standard, normalized knowledge representation of the target concept and all terms of the target concept to generate the abbreviation document.
2. The method of claim 1, wherein identifying, based on abbreviation word-formation rules, the target concept containing an abbreviation in the integrated vocabulary comprises:
comparing the terms in each concept of the integrated vocabulary with records in an abbreviation dictionary;
and according to the comparison result, determining as the target concept containing an abbreviation a concept of the integrated vocabulary in which both an abbreviation from the abbreviation dictionary and its corresponding full name are present.
3. The method of claim 1, wherein extracting the attributes of the target concept and of all of its terms and determining the term type of each term comprises:
extracting the attributes of the target concept containing an abbreviation in the integrated vocabulary and of all of its terms, wherein the concept attributes comprise a concept identifier, terms, notes, status, subject classification, hierarchical and associative relationships, and the integrated vocabulary in which the concept resides, and the term attributes comprise a term identifier, word form, usage status, notes, and source vocabulary;
extracting the initial letter of each non-preposition English word in a target term, wherein the target term is a term consisting of a plurality of English words;
concatenating the uppercase letters corresponding to the initial letters to obtain a spliced string;
if the spliced string exists in the target concept of the integrated vocabulary, the spliced string is the target abbreviation of the target term in the target concept, wherein the term type of the target abbreviation is canonical abbreviation; if the initial letter of every non-preposition English word of the target term is uppercase, the term type of the target term is canonical full name; and if the initial letters of the non-preposition English words of the target term are lowercase or a mix of uppercase and lowercase, the term type of the target term is ordinary full name.
4. The method of claim 3, further comprising:
extracting, based on morphological variation rules, terms in the integrated vocabulary that are morphological variants of the target term and belong to the same concept, wherein the term type of such terms is other full name.
5. The method of claim 3, further comprising:
extracting the canonical full name and the canonical abbreviation of a term in the abbreviation document to obtain a term whose term type is preferred term, and adding the term whose term type is preferred term to the abbreviation document based on the ISO 25964 international standard.
6. The method of claim 1, wherein performing, based on the ISO 25964 international standard, normalized knowledge representation of the target concept and all terms of the target concept to generate the abbreviation document comprises:
generating, based on the ISO 25964 international standard, a basic information and relationship description of the target concept and a basic information description and term type identifier of every term of the target concept, wherein this descriptive information constitutes a metadata model of the abbreviation document;
and generating the abbreviation document according to the metadata model of the abbreviation document, the terms corresponding to the target concept, and the term type of each term.
7. An apparatus for generating an abbreviation document, comprising:
an integrated vocabulary generation unit, configured to extract terms from vocabularies of different sources during interoperation of different knowledge organization systems and merge synonymous terms into concepts to generate an integrated vocabulary, wherein the terms take multiple forms, including full names and abbreviations;
a concept identification unit, configured to identify, based on abbreviation word-formation rules, a target concept containing an abbreviation in the integrated vocabulary;
a term extraction unit, configured to extract attributes of the target concept and of all of its terms, and to determine a term type of each term;
and an abbreviation document generation unit, configured to perform, based on the ISO 25964 international standard, normalized knowledge representation of the target concept and all terms of the target concept to generate the abbreviation document.
8. The apparatus according to claim 7, wherein the concept identification unit is specifically configured to:
compare the terms in each concept of the integrated vocabulary with records in an abbreviation dictionary;
and according to the comparison result, determine as the target concept containing an abbreviation a concept of the integrated vocabulary in which both an abbreviation from the abbreviation dictionary and its corresponding full name are present.
9. The apparatus according to claim 7, wherein the term extraction unit is specifically configured to:
extract the attributes of the target concept containing an abbreviation in the integrated vocabulary and of all of its terms, wherein the concept attributes comprise a concept identifier, terms, notes, status, subject classification, hierarchical and associative relationships, and the integrated vocabulary in which the concept resides, and the term attributes comprise a term identifier, word form, usage status, notes, and source vocabulary;
extract the initial letter of each non-preposition English word in a target term, wherein the target term is a term consisting of a plurality of English words;
concatenate the uppercase letters corresponding to the initial letters to obtain a spliced string;
if the spliced string exists in the target concept of the integrated vocabulary, the spliced string is the target abbreviation of the target term in the target concept, wherein the term type of the target abbreviation is canonical abbreviation; if the initial letter of every non-preposition English word of the target term is uppercase, the term type of the target term is canonical full name; and if the initial letters of the non-preposition English words of the target term are lowercase or a mix of uppercase and lowercase, the term type of the target term is ordinary full name.
10. The apparatus according to claim 9, wherein the term extraction unit is further configured to extract, based on morphological variation rules, terms in the integrated vocabulary that are morphological variants of the target term and belong to the same concept, wherein the term type of such terms is other full name.
11. The apparatus of claim 9, further comprising:
a preferred term extraction unit, configured to extract the canonical full name and the canonical abbreviation of a term in the abbreviation document to obtain a term whose term type is preferred term, and to add the term whose term type is preferred term to the abbreviation document based on the ISO 25964 international standard.
12. The apparatus according to claim 7, wherein the abbreviation document generation unit is specifically configured to:
generate, based on the ISO 25964 international standard, a basic information and relationship description of the target concept and a basic information description and term type identifier of every term of the target concept, wherein this descriptive information constitutes a metadata model of the abbreviation document;
and generate the abbreviation document according to the metadata model of the abbreviation document, the terms corresponding to the target concept, and the term type of each term.
CN201910942205.4A 2019-09-30 2019-09-30 Method and device for generating acronym documents Pending CN110704639A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910942205.4A CN110704639A (en) 2019-09-30 2019-09-30 Method and device for generating acronym documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910942205.4A CN110704639A (en) 2019-09-30 2019-09-30 Method and device for generating acronym documents

Publications (1)

Publication Number Publication Date
CN110704639A true CN110704639A (en) 2020-01-17

Family

ID=69198114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910942205.4A Pending CN110704639A (en) 2019-09-30 2019-09-30 Method and device for generating acronym documents

Country Status (1)

Country Link
CN (1) CN110704639A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221543A (en) * 2021-05-07 2021-08-06 中国医学科学院医学信息研究所 Medical term integration method and system
CN114528828A (en) * 2022-01-21 2022-05-24 深圳市吉祥腾达科技有限公司 Method and system for extracting English abbreviation from word document
CN115062632A (en) * 2022-06-09 2022-09-16 中国电力科学研究院有限公司 Substation model data meaning English description method, device, equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030139921A1 (en) * 2002-01-22 2003-07-24 International Business Machines Corporation System and method for hybrid text mining for finding abbreviations and their definitions
CN101093478A (en) * 2007-07-25 2007-12-26 中国科学院计算技术研究所 Method and system for identifying Chinese full name based on Chinese shortened form of entity
CN103229137A (en) * 2010-09-29 2013-07-31 国际商业机器公司 Context-based disambiguation of acronyms and abbreviations
US20130246047A1 (en) * 2012-03-16 2013-09-19 Hewlett-Packard Development Company, L.P. Identification and Extraction of Acronym/Definition Pairs in Documents
US20160041990A1 (en) * 2014-08-07 2016-02-11 AT&T Interwise Ltd. Method and System to Associate Meaningful Expressions with Abbreviated Names
US20170091164A1 (en) * 2015-09-25 2017-03-30 International Business Machines Corporation Dynamic Context Aware Abbreviation Detection and Annotation
CN110263184A (en) * 2019-06-20 2019-09-20 中国医学科学院医学信息研究所 A kind of data processing method and relevant device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王刘安 等: "同义术语归并中缩略语的处理方法研究", 《图书情报工作》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221543A (en) * 2021-05-07 2021-08-06 中国医学科学院医学信息研究所 Medical term integration method and system
CN113221543B (en) * 2021-05-07 2023-10-10 中国医学科学院医学信息研究所 Medical term integration method and system
CN114528828A (en) * 2022-01-21 2022-05-24 深圳市吉祥腾达科技有限公司 Method and system for extracting English abbreviation from word document
CN115062632A (en) * 2022-06-09 2022-09-16 中国电力科学研究院有限公司 Substation model data meaning English description method, device, equipment and medium
CN115062632B (en) * 2022-06-09 2024-09-27 中国电力科学研究院有限公司 Substation model data meaning English description method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200117
