WO2021249311A1 - Named entity recognition method, recognition apparatus, and electronic apparatus - Google Patents
Named entity recognition method, recognition apparatus, and electronic apparatus
- Publication number
- WO2021249311A1 PCT/CN2021/098444
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- entity
- recognized
- entities
- text data
- named
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
Definitions
- the present disclosure relates to the field of natural language processing, and in particular to a method for identifying named entities, identification equipment, and electronic equipment.
- Named entity recognition is a basic task of natural language processing, and it is an indispensable part of various natural language processing technologies such as information extraction, information retrieval, machine translation, and question answering systems.
- named entity recognition can be divided into three types: entity type, time type, and number type. It can also be divided into seven types: person name, place name, organization name, time, date, currency, and percentage.
- embodiments of the present disclosure provide a method for identifying named entities, including:
- the identifying the named entity among the candidate entities includes:
- the named entity of the text data to be recognized is recognized from the candidate entities.
- the segmentation of the text data to be recognized according to a pre-established entity dictionary to obtain candidate entities includes:
- the to-be-recognized text data is Chinese
- the method further includes:
- the entity extracted by the pre-established entity extraction rule is removed from the candidate entities.
- the method further includes:
- an entity extraction rule for the candidate entity is determined and updated.
- the embodiments of the present disclosure also provide a named entity recognition device, including:
- the determining unit is used to determine the field of the acquired text data to be recognized
- An obtaining unit segmenting the text data to be recognized according to a pre-established entity dictionary in the field to obtain a candidate entity, and the named entity in the entity dictionary is removed from the candidate entity;
- the identification unit is used to identify the named entity among the candidate entities.
- the identification unit is specifically configured to:
- the named entity of the text data to be recognized is recognized from the candidate entities.
- the obtaining unit is specifically configured to:
- the text data to be recognized is Chinese
- the text data to be recognized is divided into multiple word segments using the Jieba word segmentation technique, and the part of speech of each word segment is determined;
- the recognition device further includes:
- the removing unit is used to remove the entities extracted by the pre-established entity extraction rules from the candidate entities.
- the recognition device further includes:
- the updating unit is used to determine and update the entity extraction rule for the candidate entity according to the context information of the text data to be recognized.
- the embodiments of the present disclosure also provide an electronic device for named entity recognition, including:
- the memory is used to store a program
- the processor is configured to execute the program in the memory and includes the following steps:
- the embodiments of the present disclosure also provide a computer-readable storage medium, including computer program instructions, which when run on a computer, cause the computer to execute the steps of the identification method described above.
- FIG. 1 is a method flowchart of a named entity identification method provided by an embodiment of the disclosure
- FIG. 2 is a method flowchart of step S102 in a method for identifying a named entity provided by an embodiment of the present disclosure
- FIG. 3 is a structural block diagram of a named entity identification device provided by an embodiment of the disclosure.
- FIG. 4 is a structural block diagram of an electronic device for named entity recognition provided by an embodiment of the disclosure.
- an embodiment of the present disclosure provides a method for identifying a named entity, including:
- S101 Determine the domain of the acquired text data to be recognized
- the text data to be recognized may be structured text data, unstructured text data, or semi-structured text data, which is not limited here.
- the field can be art, architecture, medical, transportation, etc.
- S102 Perform word segmentation on the text data to be recognized according to the pre-established entity dictionary in the field to obtain candidate entities, and the named entities in the entity dictionary are removed from the candidate entities;
- the pre-established entity dictionary for this field can include entities such as painter's name, painting name, painting genre, painting theme, and artist's birthplace.
- a pre-established entity dictionary in a certain field can be created based on structured text data in the field. Still taking the art field as an example, if the art field already has data stored in a relational database such as MySQL, the data in that relational database is structured entity data. Columns with consistent content can be extracted from the known relational database and then merged and deduplicated to build the initial entity dictionary of the art field.
- an initial dictionary of painter names can be built, in which a painter name corresponds to an id.
- the entity can also be the name of the painting, the genre of the painting, the subject of the painting, the birthplace of the artist, and so on.
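The dictionary-building step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it uses Python's built-in `sqlite3` as a stand-in for a MySQL database, and the table name `painters`, its columns, and the sample rows are all hypothetical.

```python
import sqlite3


def build_painter_dictionary(conn):
    """Merge the name-like columns into one deduplicated {surface form: id} map."""
    dictionary = {}
    rows = conn.execute(
        "SELECT id, name, alias, title FROM painters ORDER BY id"
    )
    for painter_id, *forms in rows:
        for form in forms:
            form = (form or "").strip()  # NULL columns become empty strings
            if form:
                # setdefault keeps the first id seen, which deduplicates repeats
                dictionary.setdefault(form, painter_id)
    return dictionary


# Hypothetical in-memory table standing in for the field's relational data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE painters (id INTEGER PRIMARY KEY, name TEXT, alias TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO painters VALUES (?, ?, ?, ?)",
    [
        (1, "Leonardo da Vinci", "Da Vinci", None),
        (2, "Vincent van Gogh", "Van Gogh", None),
        (3, "Da Vinci", None, None),  # duplicate surface form, dropped on merge
    ],
)
painter_dict = build_painter_dictionary(conn)
```

Mapping every surface form (name, alias, title) to one id is one possible reading of "a painter name corresponds to an id"; other layouts are equally plausible.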
- word segmentation is performed on the text data to be recognized, so as to obtain candidate entities from which the named entities in the entity dictionary are removed.
- candidate entities which are specifically determined according to actual applications, which are not limited here.
- the named entities among the candidate entities are identified.
- the named entity recognized in the candidate entity and the named entity belonging to the entity dictionary in the text data to be recognized constitute the named entity of the text data to be recognized.
- the entire named entity recognition process only needs to identify named entities among the candidate entities rather than among all entities in all fields, and entities already contained in the pre-established entity dictionary of the field to which the text data belongs do not need to be recognized again, thereby improving the efficiency of recognizing named entities.
- step S103: identifying named entities in the candidate entities includes:
- the named entity of the text data to be recognized is recognized from the candidate entities.
- the process can be based on the position of the candidate entity in the text data to be recognized, for example, the position of the word corresponding to the candidate entity in the sentence of the text data to be recognized; according to that position, and based on the experience of experts in the field, the named entity of the text data to be recognized is identified from the candidate entities.
- step S102 performing word segmentation on the text data to be recognized according to a pre-established entity dictionary in the field to obtain candidate entities, including:
- steps S201 to S203 are as follows:
- the text data to be recognized is divided into multiple word segments using the Jieba word segmentation technique, and the part of speech of each word segment is determined. For example, Jieba word segmentation in Python can be used to segment the text data to be recognized, dividing the entire text into non-repetitive words. Each word also corresponds to a part of speech, such as noun, adjective, verb, and so on.
- named entities whose part of speech is noun are extracted from the multiple word segments; then, from the named entities whose part of speech is noun, the named entities belonging to the entity dictionary are removed to obtain the candidate entities.
- the candidate entities are further narrowed down by extracting the named entities whose part of speech is noun from the multiple word segments. Since named entities are usually nouns, this improves the efficiency of extracting candidate entities and, in turn, the efficiency of named entity recognition.
- the text data to be recognized is a text introduction about the painter Da Vinci crawled from the Internet, specifically: “Da Vinci returned to Florence in 1500 and began to create the Mona Lisa. The Mona Lisa used perspective and other painting methods. After that, Da Vinci went to Milan and continued to serve in the Milan court.” After word segmentation with Jieba, the word segments whose part of speech is noun are “year”, “Da Vinci”, “Florence”, “Mona Lisa”, “perspective”, “painting”, “method”, “Milan”, and “palace”.
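The noun filtering and dictionary removal steps can be sketched as below. To stay self-contained, the sketch takes already-tagged `(word, part_of_speech)` pairs as input instead of calling a segmenter; in practice such pairs might come from a tool like `jieba.posseg`. The tag values, the sample tokens, and the dictionary contents are assumptions for illustration, including the convention that noun tags start with `n`.

```python
def candidate_entities(tagged_tokens, entity_dictionary):
    """Keep unique noun tokens that are not already in the entity dictionary."""
    seen = set()
    candidates = []
    for word, pos in tagged_tokens:
        if not pos.startswith("n"):  # assumption: noun tags begin with "n"
            continue
        if word in entity_dictionary or word in seen:
            continue  # skip known dictionary entities and repeated words
        seen.add(word)
        candidates.append(word)
    return candidates


# Hypothetical tagged output for the Da Vinci passage (English glosses).
tagged = [
    ("year", "n"), ("Da Vinci", "nr"), ("returned", "v"),
    ("Florence", "ns"), ("Mona Lisa", "nz"), ("perspective", "n"),
    ("painting", "n"), ("method", "n"), ("Da Vinci", "nr"),
    ("Milan", "ns"), ("palace", "n"),
]
# Hypothetical dictionary: assume only the two place names are already known.
art_dictionary = {"Florence", "Milan"}

candidates = candidate_entities(tagged, art_dictionary)
```

With this assumed dictionary, the place names are dropped and the remaining nouns become the candidate entities.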
- step S203 removing the named entity belonging to the entity dictionary from the named entity whose part of speech is the noun, and after obtaining the candidate entity, the method further includes:
- the entities extracted by the pre-established entity extraction rules are removed from the candidate entities.
- the candidate entities can be further screened by removing the entities extracted by the pre-established entity extraction rules, thereby further narrowing the scope of candidate entities and improving the efficiency of named entity recognition.
- after the candidate entities “year”, “Da Vinci”, “Mona Lisa”, “perspective”, “painting”, “method”, and “palace” are obtained, the pre-established time extraction rule “number + year” matches “1500”, and the pre-established painting name extraction rule “painting name” matches “Mona Lisa”. “Year” and “Mona Lisa” are then removed from the above candidate entities, and named entities are further identified from the remaining entities, which further narrows the recognition scope and improves the efficiency of named entity recognition.
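Rule-based screening of this kind can be sketched with regular expressions. The rule names, patterns, and sample sentence below are hypothetical: the English pattern `year \d+` stands in for the "number + year" time rule, and a literal title stands in for the painting name rule, whose real form the text does not specify.

```python
import re


def remove_rule_matches(text, candidates, rules):
    """Drop every candidate that is covered by a span matched by some rule."""
    matched_spans = [
        match.group(0)
        for pattern in rules.values()
        for match in re.finditer(pattern, text)
    ]
    remaining = [
        cand for cand in candidates
        if not any(cand in span for span in matched_spans)
    ]
    return remaining, matched_spans


# Hypothetical rules and input, loosely following the Da Vinci example.
rules = {
    "time": r"year \d+",           # stand-in for the "number + year" rule
    "painting_name": r"Mona Lisa",  # stand-in for the painting name rule
}
text = "Da Vinci returned to Florence in the year 1500 and began to create the Mona Lisa."
candidates = ["year", "Da Vinci", "Mona Lisa", "perspective",
              "painting", "method", "palace"]

remaining, spans = remove_rule_matches(text, candidates, rules)
```

Only the entities left in `remaining` would then be passed on to the recognition step.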
- the process of determining candidate entities is not limited to the above order, in which word segmentation is performed first and the noun entities are then screened by the pre-established entity dictionary and the pre-established entity extraction rules in sequence. After word segmentation, the noun entities can instead be screened by the pre-established entity extraction rules first and the pre-established entity dictionary second, or by the entity dictionary and the entity extraction rules at the same time. In the specific implementation process, the process of determining the candidate entities can be selected according to the actual application, which is not limited here.
- step S102 performing word segmentation on the text data to be recognized according to a pre-established entity dictionary in the field, and obtaining candidate entities
- the method further includes:
- the entity extraction rules for the candidate entities are determined and updated.
- the entity extraction rules for the candidate entity can be determined and updated according to the context information of the text data to be recognized. If the entity extraction rule has not been established previously, and the entity extraction rule is determined based on the context information of the text data to be recognized, the entity extraction rule is established. If there is a pre-established entity extraction rule before the candidate entity is obtained, and the entity extraction rule is different from the pre-established entity extraction rule after the candidate entity is obtained, that is, a new entity extraction rule, then the pre-established entity extraction rule is updated.
- the pre-established entity extraction rule may be established based on expert experience after a piece of unstructured text data is obtained. Before any unstructured text data is obtained, no entity extraction rules have been established. As named entities are subsequently recognized in more text data, the entity extraction rules are continually updated and grow in number. As the number of entity extraction rules increases, the selection range of candidate entities can be further reduced, further improving the efficiency of named entity recognition.
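The establish-or-update behaviour for extraction rules can be sketched as a small registry. How a rule is actually derived from context information is not specified in the text, so the sketch assumes the new rule arrives as a named regular-expression pattern; the function name, rule names, and patterns are hypothetical.

```python
def update_rules(rules, name, pattern):
    """Establish a missing rule, or update an existing one that differs."""
    if name not in rules:
        rules[name] = pattern
        return "established"
    if rules[name] != pattern:
        rules[name] = pattern
        return "updated"
    return "unchanged"


# Before any unstructured text has been seen, no rules exist.
rules = {}
first = update_rules(rules, "time", r"year \d+")
# A later text suggests a different pattern for the same kind of entity.
second = update_rules(rules, "time", r"(?:year )?\d{4}")
third = update_rules(rules, "time", r"(?:year )?\d{4}")
```

Keying rules by entity kind is a design choice of this sketch; a list of patterns per kind would let the rule set grow rather than replace earlier rules.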
- the pre-established entity dictionary can also be continuously updated based on more and more structured text data. As the entity dictionaries grow, the selection range of candidate entities can be further reduced, further improving the efficiency of named entity recognition.
- an embodiment of the present disclosure also provides a named entity recognition device, including:
- the determining unit 10 is used to determine the field of the acquired text data to be recognized
- the obtaining unit 20 performs word segmentation on the text data to be recognized according to the pre-established entity dictionary in the field to obtain candidate entities, and the named entities in the entity dictionary are removed from the candidate entities;
- the identification unit 30 is used to identify named entities among the candidate entities.
- the identification unit 30 is specifically configured to:
- the named entity of the text data to be recognized is recognized from the candidate entities.
- the obtaining unit 20 is specifically configured to:
- the text data to be recognized is Chinese
- the text data to be recognized is divided into multiple word segments using the Jieba word segmentation technique, and the part of speech of each word segment is determined;
- after performing word segmentation on the text data to be recognized according to a pre-established entity dictionary in the field and obtaining candidate entities, the recognition device further includes:
- the removing unit is used to remove the entities extracted by the pre-established entity extraction rules from the candidate entities.
- after performing word segmentation on the text data to be recognized according to the pre-established entity dictionary in the field and obtaining candidate entities, the recognition device further includes:
- the update unit is used to determine and update the entity extraction rule for the candidate entity according to the context information of the text data to be recognized.
- an embodiment of the present disclosure also provides an electronic device for named entity recognition, including:
- a memory 100 and a processor 200;
- the memory 100 is used to store programs
- the processor 200 is configured to execute a program in the memory, and includes the following steps:
- the text data to be recognized is segmented to obtain candidate entities, and the named entities in the entity dictionary are removed from the candidate entities;
- the embodiments of the present disclosure also provide a computer-readable storage medium, including a computer program.
- the computer program includes program instructions.
- when the program instructions are executed by an electronic device, the electronic device executes the named entity recognition method provided in the above-mentioned embodiments.
- the disclosed system, device, and method may be implemented in other ways.
- the device embodiments described above are merely illustrative. For example, the division of modules is only a logical function division, and there may be other divisions in actual implementation: multiple modules or components can be combined or integrated into another system, or some features can be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or modules, and may be in electrical, mechanical or other forms.
- modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, that is, they may be located in one place, or they may be distributed to multiple network modules. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- the functional modules in the various embodiments of the present disclosure may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
- the above-mentioned integrated modules can be implemented in the form of hardware or software function modules. If the integrated module is implemented in the form of a software function module and sold or used as an independent product, it can be stored in a computer readable storage medium.
- the computer program product includes one or more computer instructions.
- the computer can be a general-purpose computer, a dedicated computer, a computer network, or other programmable devices.
- Computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave).
- the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media.
- the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
Description
Cross-Reference to Related Applications
This disclosure claims priority to Chinese patent application No. 202010522433.9, filed with the Chinese Patent Office on June 10, 2020 and entitled "A Chinese Named Entity Recognition Method, Recognition Device, and Electronic Equipment", the entire contents of which are incorporated in this disclosure by reference.
The present disclosure relates to the field of natural language processing, and in particular to a named entity recognition method, a recognition device, and an electronic device.
Named entity recognition is a basic task of natural language processing, and it is an indispensable part of various natural language processing technologies such as information extraction, information retrieval, machine translation, and question answering systems.
Depending on the classification standard, named entities can be divided into three categories (entity, time, and number), or into seven categories: person name, place name, organization name, time, date, currency, and percentage. At present, named entities are mainly extracted from manually annotated corpora; the whole recognition process involves a heavy workload and has low recognition accuracy.
It can be seen that existing named entity recognition suffers from the technical problem of low recognition efficiency.
Summary of the Invention
In a first aspect, embodiments of the present disclosure provide a method for identifying named entities, including:
determining the field to which the acquired text data to be recognized belongs;
performing word segmentation on the text data to be recognized according to a pre-established entity dictionary of the field to obtain candidate entities, where the named entities in the entity dictionary have been removed from the candidate entities;
identifying the named entities among the candidate entities.
In a possible implementation, identifying the named entities among the candidate entities includes:
identifying the named entities of the text data to be recognized from the candidate entities according to the positions of the candidate entities in the text data to be recognized.
In a possible implementation, performing word segmentation on the text data to be recognized according to the pre-established entity dictionary to obtain candidate entities includes:
when the text data to be recognized is Chinese, dividing the text data to be recognized into multiple word segments using the Jieba word segmentation technique, and determining the part of speech of each word segment;
extracting, from the multiple word segments, the named entities whose part of speech is noun;
removing, from the named entities whose part of speech is noun, the named entities belonging to the entity dictionary, to obtain the candidate entities.
In a possible implementation, after performing word segmentation on the text data to be recognized according to the pre-established entity dictionary of the field to obtain candidate entities, the method further includes:
removing, from the candidate entities, the entities extracted by pre-established entity extraction rules.
In a possible implementation, after performing word segmentation on the text data to be recognized according to the pre-established entity dictionary of the field to obtain candidate entities, the method further includes:
determining and updating the entity extraction rules for the candidate entities according to the context information of the text data to be recognized.
In a second aspect, embodiments of the present disclosure also provide a named entity recognition device, including:
a determining unit, used to determine the field to which the acquired text data to be recognized belongs;
an obtaining unit, used to perform word segmentation on the text data to be recognized according to a pre-established entity dictionary of the field to obtain candidate entities, where the named entities in the entity dictionary have been removed from the candidate entities;
an identification unit, used to identify the named entities among the candidate entities.
In a possible implementation, the identification unit is specifically configured to:
identify the named entities of the text data to be recognized from the candidate entities according to the positions of the candidate entities in the text data to be recognized.
In a possible implementation, the obtaining unit is specifically configured to:
when the text data to be recognized is Chinese, divide the text data to be recognized into multiple word segments using the Jieba word segmentation technique, and determine the part of speech of each word segment;
extract, from the multiple word segments, the named entities whose part of speech is noun;
remove, from the named entities whose part of speech is noun, the named entities belonging to the entity dictionary, to obtain the candidate entities.
In a possible implementation, after word segmentation is performed on the text data to be recognized according to the pre-established entity dictionary of the field to obtain candidate entities, the recognition device further includes:
a removing unit, used to remove from the candidate entities the entities extracted by pre-established entity extraction rules.
In a possible implementation, after word segmentation is performed on the text data to be recognized according to the pre-established entity dictionary of the field to obtain candidate entities, the recognition device further includes:
an updating unit, used to determine and update the entity extraction rules for the candidate entities according to the context information of the text data to be recognized.
In a third aspect, embodiments of the present disclosure also provide an electronic device for named entity recognition, including:
a memory and a processor;
where the memory is used to store a program;
the processor is configured to execute the program in the memory, which includes the following steps:
determining the field to which the acquired text data to be recognized belongs;
performing word segmentation on the text data to be recognized according to a pre-established entity dictionary of the field to obtain candidate entities, where the named entities in the entity dictionary have been removed from the candidate entities;
identifying the named entities among the candidate entities.
In a fourth aspect, embodiments of the present disclosure also provide a computer-readable storage medium including computer program instructions which, when run on a computer, cause the computer to execute the steps of the recognition method described above.
FIG. 1 is a method flowchart of a named entity recognition method provided by an embodiment of the present disclosure;
FIG. 2 is a method flowchart of step S102 in a named entity recognition method provided by an embodiment of the present disclosure;
FIG. 3 is a structural block diagram of a named entity recognition device provided by an embodiment of the present disclosure;
FIG. 4 is a structural block diagram of an electronic device for named entity recognition provided by an embodiment of the present disclosure.
In order to make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are some of the embodiments of the present disclosure, rather than all of them. Where there is no conflict, the embodiments in the present disclosure and the features in the embodiments can be combined with each other. Based on the described embodiments, all other embodiments obtained by a person of ordinary skill in the art without creative labor fall within the protection scope of the present disclosure.
Unless otherwise defined, the technical or scientific terms used in the present disclosure have the usual meanings understood by those of ordinary skill in the field to which this disclosure belongs. Words such as "including" or "comprising" used in the present disclosure mean that the element or item appearing before the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items.
The application scenarios described in the embodiments of the present disclosure are intended to illustrate the technical solutions of the embodiments more clearly and do not limit those technical solutions. Those of ordinary skill in the art will know that, as new application scenarios emerge, the technical solutions provided by the embodiments of the present disclosure are equally applicable to similar technical problems.
As shown in FIG. 1, an embodiment of the present disclosure provides a named entity recognition method, including:
S101: determining the domain to which the acquired text data to be recognized belongs;
In a specific implementation, the text data to be recognized may be structured, unstructured, or semi-structured text data, which is not limited here. The domain may be, for example, art, architecture, medicine, or transportation.
S102: performing word segmentation on the text data to be recognized according to a pre-established entity dictionary of the domain to obtain candidate entities, where the named entities in the entity dictionary have been removed from the candidate entities;
In a specific implementation, taking the art domain as an example of the domain to which the text data to be recognized belongs, the pre-established entity dictionary of the domain may include entities such as painter names, painting names, painting genres, painting subjects, and painters' birthplaces. The pre-established entity dictionary of a domain may be created from structured text data in that domain. Still taking the art domain as an example, if the domain already has data stored in a relational database such as MySQL, that data is structured entity data: columns with consistent content can be extracted from the known relational database and then merged and deduplicated to build the initial entity dictionary of the art domain. For example, extracting the columns that represent painter names, aliases, and honorific titles from the relational data and merging and deduplicating them yields an initial dictionary of painter names, in which each painter name corresponds to an id. In the art domain, the entities may also be painting names, painting genres, painting subjects, painters' birthplaces, and so on.
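The dictionary-building step described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: it uses an in-memory SQLite database as a stand-in for the MySQL database mentioned in the text, and the table name `painters`, its column names, and the sample rows are all hypothetical.

```python
import sqlite3

def build_painter_dictionary(conn):
    """Merge the name-like columns of a (hypothetical) painters table
    into one deduplicated mapping: surface form -> painter id."""
    entries = {}
    rows = conn.execute("SELECT id, name, alias, title FROM painters ORDER BY id")
    for painter_id, *surface_forms in rows:
        for form in surface_forms:
            form = (form or "").strip()
            if form:
                # setdefault keeps the first id seen, so repeated forms are deduplicated
                entries.setdefault(form, painter_id)
    return entries

# Hypothetical sample data standing in for an existing relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE painters (id INTEGER PRIMARY KEY, name TEXT, alias TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO painters VALUES (?, ?, ?, ?)",
    [
        (1, "达·芬奇", "列奥纳多·达·芬奇", None),
        (2, "梵高", "文森特·梵高", None),
        (3, "梵高", None, None),  # duplicate surface form, dropped by the merge
    ],
)
painter_dict = build_painter_dictionary(conn)
print(painter_dict)
```

Keeping an id next to each surface form follows the statement that one painter name corresponds to one id; how conflicting ids for the same form should be resolved is not specified in the text, so the first-seen policy here is an assumption.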
In a specific implementation, word segmentation is performed on the text data to be recognized according to the pre-established entity dictionary of its domain, yielding candidate entities from which the named entities in the entity dictionary have been removed. Still taking the art domain as an example, if the painter names in the entity dictionary are removed, the candidate entities no longer include those painter names. There may be one or more candidate entities, depending on the actual application, which is not limited here.
S103: recognizing the named entities among the candidate entities.
In a specific implementation, after the candidate entities are obtained, the named entities among them are recognized. The named entities recognized among the candidate entities, together with the named entities in the text data that belong to the entity dictionary, constitute the named entities of the text data to be recognized. The whole recognition process only needs to recognize named entities among the candidate entities; it need not recognize all entities in all domains, nor the entities already in the pre-established entity dictionary of the domain, which improves the efficiency of named entity recognition.
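The efficiency argument can be made concrete with a small sketch. The function names and the toy classifier below are hypothetical; the classifier merely stands in for whatever recognizer is applied in S103. The point illustrated is that the recognizer is invoked only on candidates, while dictionary hits are accepted directly.

```python
def recognize(tokens, entity_dictionary, classify_candidate):
    """Return the named entities of a text: dictionary hits plus accepted candidates."""
    dictionary_hits = [t for t in tokens if t in entity_dictionary]
    candidates = [t for t in tokens if t not in entity_dictionary]
    recognized = [t for t in candidates if classify_candidate(t)]
    return dictionary_hits + recognized

calls = []  # records which tokens actually reach the recognizer

def toy_classifier(token):
    """Hypothetical stand-in for the recognizer applied to candidates."""
    calls.append(token)
    return token in {"达·芬奇"}

result = recognize(
    ["达·芬奇", "佛罗伦萨", "米兰", "宫廷"],
    {"佛罗伦萨", "米兰"},  # hypothetical place-name dictionary
    toy_classifier,
)
print(result)
print(calls)
```

Only two of the four tokens are passed to the classifier; the two dictionary entries never reach it.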
In an embodiment of the present disclosure, step S103 of recognizing the named entities among the candidate entities includes:
recognizing the named entities of the text data to be recognized from the candidate entities according to the positions of the candidate entities in the text data to be recognized.
In a specific implementation, the position of a candidate entity in the text data to be recognized may be, for example, the position of the corresponding word in the sentence of the text data. Based on that position and on the experience of professionals in the domain, the named entities of the text data to be recognized are then recognized from the candidate entities.
In an embodiment of the present disclosure, as shown in FIG. 2, step S102 of performing word segmentation on the text data to be recognized according to the pre-established entity dictionary of the domain to obtain candidate entities includes:
S201: when the text data to be recognized is in Chinese, dividing the text data to be recognized into multiple segmented words using the jieba (结巴) word segmentation tool, and determining the part of speech of each segmented word;
S202: extracting, from the segmented words, the named entities whose part of speech is noun;
S203: removing, from the named entities whose part of speech is noun, those that belong to the entity dictionary, to obtain the candidate entities.
In a specific implementation, steps S201 to S203 are carried out as follows:
First, taking Chinese text data as an example, the text data to be recognized is divided into multiple segmented words using jieba, and the part of speech of each word is determined. For instance, the jieba segmenter in Python may be used to split the whole passage into distinct words, each tagged with a part of speech such as noun, adjective, or verb. Next, the named entities whose part of speech is noun are extracted from the segmented words. Finally, those that belong to the entity dictionary are removed, leaving the candidate entities. Because named entities are usually nouns, restricting the candidates to nouns makes candidate extraction more efficient and in turn improves the efficiency of named entity recognition.
In a specific implementation, the pre-established entity dictionary may be used together with jieba during segmentation, so that the segmentation result stays closer to the complete words in the entity dictionary. This makes candidate extraction more efficient and in turn improves the efficiency of named entity recognition.
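The effect of segmenting with the entity dictionary can be illustrated without any third-party dependency. The sketch below uses plain forward maximum matching as a stand-in; it is not jieba's algorithm, and the dictionary contents are hypothetical. With jieba itself, custom words are typically supplied through `jieba.load_userdict` or `jieba.add_word`.

```python
def forward_max_match(text, dictionary, max_len=8):
    """Greedy longest-match segmentation; unknown characters become single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

sentence = "达·芬奇回到佛罗伦萨"
general_words = {"回到"}                       # hypothetical general vocabulary
entity_words = {"达·芬奇", "佛罗伦萨"}          # hypothetical entity dictionary

without_entities = forward_max_match(sentence, general_words)
with_entities = forward_max_match(sentence, general_words | entity_words)
print(without_entities)
print(with_entities)
```

Without the entity dictionary the names fall apart into single characters; with it, each name survives as one token, which is what lets the later dictionary lookup in S203 match whole words.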
As a concrete example, suppose the text data to be recognized is a passage about the painter Da Vinci crawled from the web: "1500年达·芬奇回到佛罗伦萨并开始创作《蒙娜丽莎》。《蒙娜丽莎》运用了透视法等多种绘画方法。这之后达·芬奇再去米兰，并继续服务于米兰宫廷。" ("In 1500 Da Vinci returned to Florence and began to paint the Mona Lisa. The Mona Lisa uses perspective and various other painting methods. After that, Da Vinci went to Milan again and continued to serve the Milan court.") After segmentation with jieba, the words whose part of speech is noun are "year" (年), "Da Vinci", "Florence", "Mona Lisa", "perspective", "painting", "method", "Milan", and "court". The pre-established place-name entity dictionary matches "Florence" and "Milan" in the text. Removing "Florence" and "Milan" from the nouns leaves the candidate entities "year", "Da Vinci", "Mona Lisa", "perspective", "painting", "method", and "court". Named entities are subsequently recognized only among these candidates, which narrows the scope of named entity recognition and improves its efficiency.
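The example above corresponds to steps S202 and S203 and can be traced in a few lines. The tagged tokens below are written by hand to mirror the noun list given in the text; they are not actual jieba output, and the single tag "n" for every noun is a simplifying assumption (a real tagger may use finer noun tags).

```python
def extract_candidates(tagged_tokens, entity_dictionary):
    """Keep distinct nouns (S202) that are not already in the entity dictionary (S203)."""
    seen = set()
    candidates = []
    for word, pos in tagged_tokens:
        if not pos.startswith("n"):
            continue  # not a noun
        if word in entity_dictionary or word in seen:
            continue  # already known, or already collected
        seen.add(word)
        candidates.append(word)
    return candidates

# Hand-written (word, part-of-speech) pairs for the Da Vinci passage.
tagged = [
    ("1500", "m"), ("年", "n"), ("达·芬奇", "n"), ("回到", "v"), ("佛罗伦萨", "n"),
    ("创作", "v"), ("蒙娜丽莎", "n"), ("蒙娜丽莎", "n"), ("运用", "v"), ("透视法", "n"),
    ("绘画", "n"), ("方法", "n"), ("达·芬奇", "n"), ("米兰", "n"), ("服务", "v"),
    ("米兰", "n"), ("宫廷", "n"),
]
place_dictionary = {"佛罗伦萨", "米兰"}

candidates = extract_candidates(tagged, place_dictionary)
print(candidates)
```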
In an embodiment of the present disclosure, after step S203 of removing, from the named entities whose part of speech is noun, those that belong to the entity dictionary to obtain the candidate entities, the method further includes:
removing from the candidate entities the entities extracted by pre-established entity extraction rules.
In a specific implementation, after the candidate entities are obtained by performing word segmentation on the text data to be recognized according to the pre-established entity dictionary of the domain, they may be further filtered by removing the entities extracted by pre-established entity extraction rules, which further narrows the range of candidate entities and improves the efficiency of named entity recognition. Continuing the Da Vinci example, once the candidate entities "year" (年), "Da Vinci", "Mona Lisa", "perspective", "painting", "method", and "court" have been obtained, the pre-established time extraction rule "number + 年" matches "1500年" (the year 1500), and the pre-established painting-name extraction rule "《painting name》" (a name enclosed in Chinese title marks) matches "Mona Lisa". "year" and "Mona Lisa" are removed from the candidate entities, and named entities are then recognized among the remaining entities, which further narrows the recognition scope and improves the efficiency of named entity recognition.
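The two rules in this example map naturally onto regular expressions. The sketch below is one possible reading, not the disclosed implementation: the rule table and function name are hypothetical, and treating a candidate as covered when it occurs inside a rule match (so that "年" is removed because of "1500年") is an assumption, since the text gives only the outcome.

```python
import re

# Hypothetical rule table; each pattern has one capturing group for the extracted entity.
EXTRACTION_RULES = {
    "time": re.compile(r"(\d+年)"),                 # "number + 年"
    "painting_name": re.compile(r"《([^》]+)》"),    # name inside Chinese title marks
}

def filter_candidates_by_rules(text, candidates, rules):
    """Drop candidates that occur inside any span extracted by a rule."""
    extracted = [hit for pattern in rules.values() for hit in pattern.findall(text)]
    remaining = [c for c in candidates if not any(c in hit for hit in extracted)]
    return remaining, extracted

text = (
    "1500年达·芬奇回到佛罗伦萨并开始创作《蒙娜丽莎》。"
    "《蒙娜丽莎》运用了透视法等多种绘画方法。"
    "这之后达·芬奇再去米兰，并继续服务于米兰宫廷。"
)
candidates = ["年", "达·芬奇", "蒙娜丽莎", "透视法", "绘画", "方法", "宫廷"]

remaining, extracted = filter_candidates_by_rules(text, candidates, EXTRACTION_RULES)
print(extracted)
print(remaining)
```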
In implementing the present disclosure, the candidate entities need not be determined only by segmenting first and then filtering the noun entities by the pre-established entity dictionary and then by the pre-established entity extraction rules. After segmentation, the noun entities may instead be filtered first by the entity extraction rules and then by the entity dictionary, or by the entity dictionary and the entity extraction rules at the same time. The procedure for determining the candidate entities may be chosen according to the actual application, which is not limited here.
In an embodiment of the present disclosure, after step S102 of performing word segmentation on the text data to be recognized according to the pre-established entity dictionary of the domain to obtain candidate entities, the method further includes:
determining and updating the entity extraction rules for the candidate entities according to the context information of the text data to be recognized.
In a specific implementation, after the candidate entities are obtained, the entity extraction rules for the candidate entities may be determined and updated according to the context information of the text data to be recognized. If no entity extraction rule has been established previously and a rule can be determined from the context information of the text data, that rule is established. If entity extraction rules were established before the candidate entities were obtained, and a new rule that differs from them is found afterwards, the pre-established rules are updated. Continuing the Da Vinci example, analysis of the candidate entities shows that "达·芬奇" (Da Vinci) is a painter name but does not match the "达·芬奇" entry in the pre-established painter-name entity dictionary. A new painter-name extraction rule can then be established that converts the "·" separator used in the text into the "·" separator used in the dictionary, thereby updating the pre-established painter-name extraction rules. The updated rules are then used to extract entities, which narrows the range of candidate entities and improves the efficiency of named entity recognition.
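A rule update of this kind can be sketched as adding a normalization function to a rule list. Everything specific here is an assumption: the text does not say which separator characters are involved, so the variant set below (U+2022, U+2027, U+30FB, U+FF65, U+2219) and the canonical form U+00B7 are hypothetical, as are the function names.

```python
CANONICAL_DOT = "\u00b7"                              # assumed dictionary form of the separator
DOT_VARIANTS = "\u2022\u2027\u30fb\uff65\u2219"       # hypothetical look-alike separators
_DOT_TABLE = str.maketrans({ch: CANONICAL_DOT for ch in DOT_VARIANTS})

def normalize_interpunct(name):
    """New rule: map look-alike separators onto the canonical one."""
    return name.translate(_DOT_TABLE)

def matches_dictionary(candidate, dictionary, rules):
    """A candidate matches if it, or any rule-normalized form of it, is in the dictionary."""
    if candidate in dictionary:
        return True
    return any(rule(candidate) in dictionary for rule in rules)

painter_dictionary = {"达\u00b7芬奇"}   # dictionary entry with the canonical separator
candidate = "达\u2022芬奇"              # same name as it might appear in crawled text

rules = []                                              # no rule established yet
before = matches_dictionary(candidate, painter_dictionary, rules)
rules.append(normalize_interpunct)                      # rule update
after = matches_dictionary(candidate, painter_dictionary, rules)
print(before, after)
```

Once the rule is in place, the candidate resolves to the dictionary entry and can be removed from the candidate list like any other dictionary hit.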
In an embodiment of the present disclosure, the pre-established entity extraction rules may be established from expert experience after a piece of unstructured text data has been acquired. Before any unstructured text data is acquired, no entity extraction rule exists. As named entities are recognized in further text data, the entity extraction rules are continually updated and grow in number; the more rules there are, the further the range of candidate entities is narrowed and the more efficient named entity recognition becomes.
In addition, in an embodiment of the present disclosure, the pre-established entity dictionary may be continually updated as more structured text data becomes available. As the entity dictionaries grow, the range of candidate entities is narrowed further and named entity recognition becomes more efficient.
Based on the same inventive concept, as shown in FIG. 3, an embodiment of the present disclosure further provides a named entity recognition device, including:
a determining unit 10, configured to determine the domain to which the acquired text data to be recognized belongs;
an obtaining unit 20, configured to perform word segmentation on the text data to be recognized according to the pre-established entity dictionary of the domain to obtain candidate entities, where the named entities in the entity dictionary have been removed from the candidate entities;
a recognition unit 30, configured to recognize the named entities among the candidate entities.
In an embodiment of the present disclosure, the recognition unit 30 is specifically configured to:
recognize the named entities of the text data to be recognized from the candidate entities according to the positions of the candidate entities in the text data to be recognized.
In an embodiment of the present disclosure, the obtaining unit 20 is specifically configured to:
when the text data to be recognized is in Chinese, divide the text data to be recognized into multiple segmented words using the jieba word segmentation tool, and determine the part of speech of each segmented word;
extract, from the segmented words, the named entities whose part of speech is noun;
remove, from the named entities whose part of speech is noun, those that belong to the entity dictionary, to obtain the candidate entities.
In an embodiment of the present disclosure, the recognition device further includes the following unit, which operates after the candidate entities have been obtained by performing word segmentation on the text data to be recognized according to the pre-established entity dictionary of the domain:
a removing unit, configured to remove from the candidate entities the entities extracted by the pre-established entity extraction rules.
In an embodiment of the present disclosure, the recognition device further includes the following unit, which operates after the candidate entities have been obtained by performing word segmentation on the text data to be recognized according to the pre-established entity dictionary of the domain:
an update unit, configured to determine and update the entity extraction rules for the candidate entities according to the context information of the text data to be recognized.
Based on the same inventive concept, as shown in FIG. 4, an embodiment of the present disclosure further provides an electronic device for named entity recognition, including:
a memory 100 and a processor 200;
where the memory 100 is configured to store a program;
and the processor 200 is configured to execute the program in the memory, which includes the following steps:
determining the domain to which the acquired text data to be recognized belongs;
performing word segmentation on the text data to be recognized according to the pre-established entity dictionary of the domain to obtain candidate entities, where the named entities in the entity dictionary have been removed from the candidate entities;
recognizing the named entities among the candidate entities.
Based on the same inventive concept, an embodiment of the present disclosure further provides a computer-readable storage medium including a computer program. The computer program includes program instructions which, when executed by an electronic device, cause the electronic device to perform the named entity recognition method provided in the above embodiments.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into modules is only a division by logical function, and other divisions are possible in an actual implementation: multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or modules, and may be electrical, mechanical, or in other forms.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional modules in the embodiments of the present disclosure may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present disclosure are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
The technical solutions provided by the present disclosure have been described in detail above. Specific examples have been used to explain the principles and implementations of the present disclosure, and the description of the above embodiments is intended only to help in understanding the method of the present disclosure and its core idea. A person of ordinary skill in the art may, following the idea of the present disclosure, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present disclosure.
Claims (12)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010522433.9A CN111666768A (en) | 2020-06-10 | 2020-06-10 | Chinese named entity recognition method and device and electronic equipment |
| CN202010522433.9 | 2020-06-10 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021249311A1 true WO2021249311A1 (en) | 2021-12-16 |
Family
ID=72386425
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/098444 Ceased WO2021249311A1 (en) | 2020-06-10 | 2021-06-04 | Named entity recognition method, recognition apparatus, and electronic apparatus |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN111666768A (en) |
| WO (1) | WO2021249311A1 (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114648994A (en) * | 2022-02-23 | 2022-06-21 | 厦门快商通科技股份有限公司 | Voiceprint identification comparison recommendation method and device, electronic equipment and storage medium |
| CN114692644A (en) * | 2022-03-11 | 2022-07-01 | 粤港澳大湾区数字经济研究院(福田) | Text entity labeling method, device, equipment and storage medium |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111666768A (en) * | 2020-06-10 | 2020-09-15 | 京东方科技集团股份有限公司 | Chinese named entity recognition method and device and electronic equipment |
| CN112364640B (en) * | 2020-11-09 | 2025-01-21 | 中国平安人寿保险股份有限公司 | Entity noun linking method, device, computer equipment and storage medium |
| CN112528663B (en) * | 2020-12-18 | 2024-02-20 | 中国南方电网有限责任公司 | Text error correction method and system in power grid field scheduling scene |
| CN114298045B (en) * | 2021-12-28 | 2024-12-24 | 携程旅游网络技术(上海)有限公司 | Method, electronic device and medium for automatically extracting travel diary data |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20100250598A1 (en) * | 2009-03-30 | 2010-09-30 | Falk Brauer | Graph based re-composition of document fragments for name entity recognition under exploitation of enterprise databases |
| CN109145303A (en) * | 2018-09-06 | 2019-01-04 | 腾讯科技(深圳)有限公司 | Name entity recognition method, device, medium and equipment |
| CN110516654A (en) * | 2019-09-03 | 2019-11-29 | 北京百度网讯科技有限公司 | Entity recognition method, device, electronic device and medium for video scene |
| CN111160023A (en) * | 2019-12-23 | 2020-05-15 | 华南理工大学 | A Method for Named Entity Recognition in Medical Text Based on Multiple Recall |
| CN111666768A (en) * | 2020-06-10 | 2020-09-15 | 京东方科技集团股份有限公司 | Chinese named entity recognition method and device and electronic equipment |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101118538B (en) * | 2007-09-17 | 2010-12-15 | 中国科学院计算技术研究所 | Method and system for recognizing feature lexical item in Chinese naming entity |
| CN105354199B (en) * | 2014-08-20 | 2019-10-08 | 北京羽扇智信息科技有限公司 | A kind of recognition methods of entity meaning and system based on scene information |
| CN104572625A (en) * | 2015-01-21 | 2015-04-29 | 北京云知声信息技术有限公司 | Recognition method of named entity |
| CN104933152B (en) * | 2015-06-24 | 2018-09-14 | 北京京东尚科信息技术有限公司 | Name entity recognition method and device |
| CN108491373B (en) * | 2018-02-01 | 2022-05-27 | 北京百度网讯科技有限公司 | An entity recognition method and system |
| CN108304385A (en) * | 2018-02-09 | 2018-07-20 | 叶伟 | A kind of speech recognition text error correction method and device |
- 2020-06-10: CN application CN202010522433.9A, published as CN111666768A, status: Pending
- 2021-06-04: WO application PCT/CN2021/098444, published as WO2021249311A1, status: Ceased
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114648994A (en) * | 2022-02-23 | 2022-06-21 | 厦门快商通科技股份有限公司 | Voiceprint identification comparison recommendation method and device, electronic equipment and storage medium |
| CN114692644A (en) * | 2022-03-11 | 2022-07-01 | 粤港澳大湾区数字经济研究院(福田) | Text entity labeling method, device, equipment and storage medium |
| CN114692644B (en) * | 2022-03-11 | 2024-06-11 | 粤港澳大湾区数字经济研究院(福田) | A text entity annotation method, device, equipment and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111666768A (en) | 2020-09-15 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111797226B (en) | Meeting minutes generation method, device, electronic device and readable storage medium | |
| WO2021249311A1 (en) | Named entity recognition method, recognition apparatus, and electronic apparatus | |
| JP6643555B2 (en) | Text processing method and apparatus based on ambiguous entity words | |
| US9740771B2 (en) | Information handling system and computer program product for deducing entity relationships across corpora using cluster based dictionary vocabulary lexicon | |
| CN109582799B (en) | Method, device and electronic device for determining knowledge sample data set | |
| US10956662B2 (en) | List manipulation in natural language processing | |
| US20180075013A1 (en) | Method and system for automating training of named entity recognition in natural language processing | |
| CN110162630A (en) | A kind of method, device and equipment of text duplicate removal | |
| US10108602B2 (en) | Dynamic portmanteau word semantic identification | |
| CN109977233B (en) | Idiom knowledge graph construction method and device | |
| CN111814481B (en) | Shopping intention recognition method, device, terminal equipment and storage medium | |
| JP7324058B2 (en) | SENTENCE ANALYSIS METHOD, SENTENCE ANALYSIS PROGRAM, AND SENTENCE ANALYSIS SYSTEM | |
| CN113408660B (en) | Book clustering method, device, equipment and storage medium | |
| CN114218431B (en) | Video search methods, devices, electronic devices, and storage media | |
| US20210064697A1 (en) | List-based entity name detection | |
| CN114416976A (en) | Text annotation method, device and electronic equipment | |
| CN120030172B (en) | A method, device, equipment and medium for constructing a scientific and technological literature knowledge graph | |
| Bajestan et al. | DErivCELEX: Development and evaluation of a German derivational morphology lexicon based on CELEX | |
| CN116992883B (en) | Entity alignment processing method and device | |
| CN110083817B (en) | A naming disambiguation method, device and computer-readable storage medium | |
| US10002450B2 (en) | Analyzing a document that includes a text-based visual representation | |
| WO2015177861A1 (en) | Device and method for generating training data | |
| CN110414006B (en) | Text subject annotation method, device, electronic equipment and storage medium | |
| US20220083736A1 (en) | Information processing apparatus and non-transitory computer readable medium | |
| Ranjbar-Sahraei et al. | Distant supervision of relation extraction in sparse data |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21822739; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 21822739; Country of ref document: EP; Kind code of ref document: A1 |
| | 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.06.2023) |