CN117829150A

CN117829150A - Synonym recognition method, device and network equipment

Info

Publication number: CN117829150A
Application number: CN202211200743.4A
Authority: CN
Inventors: 皇甫煜
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Shanghai ICT Co Ltd; CM Intelligent Mobility Network Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Shanghai ICT Co Ltd; CM Intelligent Mobility Network Co Ltd
Priority date: 2022-09-29
Filing date: 2022-09-29
Publication date: 2024-04-05

Abstract

The present invention provides a synonym recognition method, device and network equipment, and relates to the technical field of natural language processing. The method comprises: based on the conditional random field (CRF) algorithm, entity word recognition is performed on a first test set to obtain a target test set, wherein the first test set is text data containing a point of interest (POI), and the target test set includes at least one candidate entity word, and the candidate entity word is associated with the POI; using a synonym recognition model, identifying whether each of the candidate entity words in the target test set is a synonym, and obtaining a recognition result, wherein the synonym recognition model is constructed based on at least two target similarity algorithms. The solution of the present invention solves the problem that the accuracy of the existing synonym recognition method is not high.

Description

Synonym recognition method, device and network equipment

技术领域Technical Field

本发明涉及自然语言处理技术领域，特别是指一种同义词识别方法、装置及网络设备。The present invention relates to the technical field of natural language processing, and in particular to a synonym recognition method, device and network equipment.

背景技术Background technique

随着社会经济的高速发展，移动互联网进入了繁荣发展的时期，人们通过手机软件可以搜索自己感兴趣的兴趣点(Point of Interest，POI)。在地理信息系统中，一个POI可以是一栋房子、一个商铺、一个邮筒、一个公交站等。With the rapid development of social economy, mobile Internet has entered a period of prosperity and development. People can search for points of interest (POI) they are interested in through mobile phone software. In the geographic information system, a POI can be a house, a shop, a mailbox, a bus stop, etc.

通常，用户可以通过地图导航类软件搜索商场并导航到想去的位置，也可以搜索自己周边的POI，选取自己想去的目的地。目前，随着人们使用的次数和场景越来越多，对于POI搜索功能的要求也越来越高，人们总是希望可以获得自己想要的结果，但有时候输入的关键词对于软件系统来说是“不正规”的，导致搜索结果不准确。例如：通常情况下软件系统存储的是“交通管理局”，但是用户在输入的时候为了方便通常会输入“交管局”，而这样的词汇对软件系统来说往往是难以匹配到的。这就需要软件系统利用词语相似度计算方法，进行同义词识别处理，针对用户输入的关键词，识别出同义词，并针对这些词依次进行搜索处理，将结果返回给用户。Usually, users can use map navigation software to search for shopping malls and navigate to the location they want to go, or they can search for POIs around them and select the destination they want to go. At present, as people use it more and more times and in more and more scenarios, the requirements for POI search functions are getting higher and higher. People always hope to get the results they want, but sometimes the keywords entered are "informal" for the software system, resulting in inaccurate search results. For example: usually the software system stores "Traffic Management Bureau", but users usually enter "Traffic Management Bureau" for convenience when entering, and such words are often difficult for the software system to match. This requires the software system to use the word similarity calculation method to perform synonym recognition processing, identify synonyms for the keywords entered by the user, and search for these words in turn, and return the results to the user.

然而，现有的词语相似度计算方法中：基于词典的计算方法需要利用词典或者词汇分类体系进行词汇语义相似度计算，因而受人的主观意识影响较大，有时候难以准确反映客观事实，且易受限于词典的更新和词汇的完整性，导致对同义词识别的准确率不高；而基于上下文向量空间模型的方法则主要根据汉语同义词的构词特点，绝大多数的同义词含有相同语素(即字)，通过计算字面相似度来发现同义词，因而该方法通常只能判断出字面上相似的同义词，却难以对异形同义词进行有效识别，因此对同义词识别的准确率也不高。。However, among the existing word similarity calculation methods: the dictionary-based calculation method needs to use a dictionary or vocabulary classification system to calculate the semantic similarity of words, which is greatly affected by human subjective consciousness, and sometimes it is difficult to accurately reflect objective facts, and is easily limited by the update of the dictionary and the integrity of the vocabulary, resulting in low accuracy in synonym recognition; while the method based on the context vector space model is mainly based on the word formation characteristics of Chinese synonyms. Most synonyms contain the same morphemes (i.e., characters), and synonyms are discovered by calculating the literal similarity. Therefore, this method can usually only determine synonyms that are similar in words, but it is difficult to effectively identify synonyms with different forms, so the accuracy of synonym recognition is not high. .

发明内容Summary of the invention

本发明的目的是提供一种同义词识别方法、装置及网络设备，解决了现有的同义词识别方法准确率不高的问题。The purpose of the present invention is to provide a synonym recognition method, device and network equipment, which solves the problem that the existing synonym recognition method has low accuracy.

为达到上述目的，本发明的实施例提供一种同义词识别方法，包括：To achieve the above object, an embodiment of the present invention provides a synonym recognition method, comprising:

基于条件随机场(conditional random fields，CRF)算法，对第一测试集进行实体词识别，得到目标测试集，所述第一测试集为包含兴趣点POI的文本数据，所述目标测试集包括至少一个候选实体词，所述候选实体词与POI 关联；Based on a conditional random fields (CRF) algorithm, entity word recognition is performed on a first test set to obtain a target test set, wherein the first test set is text data containing a point of interest (POI), and the target test set includes at least one candidate entity word, and the candidate entity word is associated with the POI;

利用同义词识别模型，识别所述目标测试集中的各个所述候选实体词之间是否为同义词，得到识别结果，其中，所述同义词识别模型是基于至少两种目标相似度算法构建的。A synonym recognition model is used to identify whether each of the candidate entity words in the target test set is a synonym, and a recognition result is obtained, wherein the synonym recognition model is constructed based on at least two target similarity algorithms.

可选地，所述基于条件随机场CRF算法，对第一测试集进行实体词识别，得到目标测试集，包括：Optionally, the step of performing entity word recognition on the first test set based on the conditional random field (CRF) algorithm to obtain a target test set includes:

基于所述CRF算法，确定第一识别模型；Based on the CRF algorithm, determining a first recognition model;

使用第一训练集对所述第一识别模型进行训练，得到实体词识别模型；Using the first training set to train the first recognition model to obtain an entity word recognition model;

将所述第一测试集输入至所述实体词识别模型，得到所述目标测试集。The first test set is input into the entity word recognition model to obtain the target test set.

可选地，所述使用第一训练集对所述第一识别模型进行训练，得到实体词识别模型，包括：Optionally, the using the first training set to train the first recognition model to obtain an entity word recognition model includes:

对所述第一训练集进行第一预处理，得到至少一个第一字符串，所述第一预处理包括：分词和/或词性标注；Performing a first preprocessing on the first training set to obtain at least one first character string, wherein the first preprocessing includes: word segmentation and/or part-of-speech tagging;

利用目标标注法，对所述至少一个第一字符串进行标注，得到标注结果；Using a target labeling method, labeling the at least one first character string to obtain a labeling result;

使用所述标注结果，训练所述第一识别模型，得到所述实体词识别模型。The first recognition model is trained using the annotation results to obtain the entity word recognition model.

可选地，所述将所述第一测试集输入至所述实体词识别模型，得到所述目标测试集，包括：Optionally, inputting the first test set into the entity word recognition model to obtain the target test set includes:

对所述第一测试集进行所述第一预处理，得到至少一个待标记字符串；Performing the first preprocessing on the first test set to obtain at least one character string to be marked;

将所述至少一个待标记字符串输入至所述实体词识别模型，得到所述目标测试集。The at least one character string to be marked is input into the entity word recognition model to obtain the target test set.

可选地，所述目标相似度算法包括：同义词林算法和机器学习算法，所述利用同义词识别模型，识别所述目标测试集中的各个所述候选实体词之间是否为同义词，得到识别结果，包括：Optionally, the target similarity algorithm includes: a synonym forest algorithm and a machine learning algorithm, and the synonym recognition model is used to identify whether each of the candidate entity words in the target test set is a synonym, and the recognition result is obtained, including:

基于所述同义词词林算法，获得所述目标测试集中第一候选实体词和第二候选实体词之间的第一相似度；Based on the synonym word forest algorithm, obtaining a first similarity between a first candidate entity word and a second candidate entity word in the target test set;

基于所述机器学习算法，获得所述第一候选实体词和所述第二候选实体词之间的第二相似度；Based on the machine learning algorithm, obtaining a second similarity between the first candidate entity word and the second candidate entity word;

根据所述第一相似度和所述第二相似度，获得所述同义词识别模型；Obtaining the synonym recognition model according to the first similarity and the second similarity;

利用同义词识别模型，识别所述目标测试集中的各个所述候选实体词之间是否为同义词，得到识别结果。Using the synonym recognition model, it is identified whether each of the candidate entity words in the target test set is a synonym, and a recognition result is obtained.

可选地，所述基于所述同义词词林算法，获得所述目标测试集中第一候选实体词和第二候选实体词之间的第一相似度，包括：Optionally, obtaining a first similarity between a first candidate entity word and a second candidate entity word in the target test set based on the synonym word forest algorithm includes:

对所述第一候选实体词和所述第二候选实体词分别进行第二预处理，得到所述第一候选实体词对应的第一词语列表和所述第二候选实体词对应的第二词语列表，所述第二预处理包括：分词并删除非中文字符；Performing second preprocessing on the first candidate entity word and the second candidate entity word respectively to obtain a first word list corresponding to the first candidate entity word and a second word list corresponding to the second candidate entity word, wherein the second preprocessing includes: word segmentation and deleting non-Chinese characters;

基于所述同义词词林算法，依次计算第一词语列表中每个词和第二词语列表之间的相似度，获得所述第一候选实体词对应的第一相似度列表，以及，依次计算第二词语列表中每个词和第一词语列表之间的相似度，获得所述第二候选实体词对应的第二相似度列表；Based on the synonym word forest algorithm, sequentially calculating the similarity between each word in the first word list and the second word list to obtain a first similarity list corresponding to the first candidate entity word, and sequentially calculating the similarity between each word in the second word list and the first word list to obtain a second similarity list corresponding to the second candidate entity word;

根据所述第一相似度列表和所述第二相似度列表，获得所述第一候选实体词和所述第二候选实体词之间的第一相似度。A first similarity between the first candidate entity word and the second candidate entity word is obtained according to the first similarity list and the second similarity list.

可选地，所述基于所述机器学习算法，获得所述第一候选实体词和所述第二候选实体词之间的第二相似度，包括：Optionally, obtaining a second similarity between the first candidate entity word and the second candidate entity word based on the machine learning algorithm includes:

采用Word2Vec深度学习模型，使用第一训练集进行模型训练，获得目标词向量集合；The Word2Vec deep learning model is used to train the model using the first training set to obtain the target word vector set;

基于余弦相似度算法，根据所述第一候选实体词在所述目标词向量集合中对应的第一词向量和所述第二候选实体词在所述目标词向量集合中对应的第二词向量，获得所述第一词向量和所述第二词向量之间的相似度，并将所述第一词向量和所述第二词向量之间的相似度作为所述第一候选实体词和所述第二候选实体词之间的第二相似度。Based on the cosine similarity algorithm, the similarity between the first word vector and the second word vector is obtained according to the first word vector corresponding to the first candidate entity word in the target word vector set and the second word vector corresponding to the second candidate entity word in the target word vector set, and the similarity between the first word vector and the second word vector is used as the second similarity between the first candidate entity word and the second candidate entity word.

可选地，所述根据所述第一相似度和所述第二相似度，获得所述同义词识别模型，包括：Optionally, obtaining the synonym recognition model according to the first similarity and the second similarity includes:

使用第二训练集，对所述同义词识别模型对应的逻辑回归模型进行训练，直至所述逻辑回归模型收敛，并根据所述逻辑回归模型收敛时的回归系数向量，确定所述第一相似度和所述第二相似度分别对应的权重，所述第二训练集包括至少一组同义词和至少一组非同义词；Using a second training set, training a logistic regression model corresponding to the synonym recognition model until the logistic regression model converges, and determining weights corresponding to the first similarity and the second similarity respectively according to a regression coefficient vector when the logistic regression model converges, wherein the second training set includes at least one group of synonyms and at least one group of non-synonymous words;

根据所述第一相似度的权重和所述第二相似度的权重，构建所述同义词识别模型。The synonym recognition model is constructed according to the weight of the first similarity and the weight of the second similarity.

为达到上述目的，本发明的实施例提供一种同义词识别装置，包括：To achieve the above object, an embodiment of the present invention provides a synonym recognition device, comprising:

第一处理模块，用于基于条件随机场CRF算法，对第一测试集进行实体词识别，得到目标测试集，所述第一测试集为包含兴趣点POI的文本数据，所述目标测试集包括至少一个候选实体词，所述候选实体词与POI关联；A first processing module is used to perform entity word recognition on a first test set based on a conditional random field (CRF) algorithm to obtain a target test set, wherein the first test set is text data containing a point of interest (POI), and the target test set includes at least one candidate entity word, and the candidate entity word is associated with the POI;

第二处理模块，用于利用同义词识别模型，识别所述目标测试集中的各个所述候选实体词之间是否为同义词，得到识别结果，其中，所述同义词识别模型是基于至少两种目标相似度算法构建的。The second processing module is used to use a synonym recognition model to identify whether each of the candidate entity words in the target test set is a synonym, and obtain a recognition result, wherein the synonym recognition model is constructed based on at least two target similarity algorithms.

可选地，所述第一处理模块包括：Optionally, the first processing module includes:

第一处理子模块，用于基于所述CRF算法，确定第一识别模型；A first processing submodule, used for determining a first recognition model based on the CRF algorithm;

第二处理子模块，用于使用第一训练集对所述第一识别模型进行训练，得到实体词识别模型；A second processing submodule is used to train the first recognition model using the first training set to obtain an entity word recognition model;

第三处理子模块，用于将所述第一测试集输入至所述实体词识别模型，得到所述目标测试集。The third processing submodule is used to input the first test set into the entity word recognition model to obtain the target test set.

可选地，所述第二处理子模块包括：Optionally, the second processing submodule includes:

第三处理单元，用于对所述第一训练集进行第一预处理，得到至少一个第一字符串，所述第一预处理包括：分词和/或词性标注；A third processing unit is configured to perform a first preprocessing on the first training set to obtain at least one first character string, wherein the first preprocessing includes: word segmentation and/or part-of-speech tagging;

第四处理单元，用于利用目标标注法，对所述至少一个第一字符串进行标注，得到标注结果；A fourth processing unit, configured to label the at least one first character string by using a target labeling method to obtain a labeling result;

第五处理单元，用于使用所述标注结果，训练所述第一识别模型，得到所述实体词识别模型。The fifth processing unit is used to use the annotation results to train the first recognition model to obtain the entity word recognition model.

可选地，所述第三处理子模块包括：Optionally, the third processing submodule includes:

第一处理单元，用于对所述第一测试集进行所述第一预处理，得到至少一个待标记字符串；A first processing unit, configured to perform the first preprocessing on the first test set to obtain at least one character string to be marked;

第二处理单元，用于将所述至少一个待标记字符串输入至所述实体词识别模型，得到所述目标测试集。The second processing unit is used to input the at least one character string to be marked into the entity word recognition model to obtain the target test set.

可选地，所述目标相似度算法包括：同义词林算法和机器学习算法，所述第二处理模块包括：Optionally, the target similarity algorithm includes: a synonym forest algorithm and a machine learning algorithm, and the second processing module includes:

第四处理子模块，用于基于所述同义词词林算法，获得所述目标测试集中第一候选实体词和第二候选实体词之间的第一相似度；A fourth processing submodule is used to obtain a first similarity between the first candidate entity word and the second candidate entity word in the target test set based on the synonym word forest algorithm;

第五处理子模块，用于基于所述机器学习算法，获得所述第一候选实体词和所述第二候选实体词之间的第二相似度；a fifth processing submodule, configured to obtain a second similarity between the first candidate entity word and the second candidate entity word based on the machine learning algorithm;

第六处理子模块，用于根据所述第一相似度和所述第二相似度，获得所述同义词识别模型；a sixth processing submodule, configured to obtain the synonym recognition model according to the first similarity and the second similarity;

第七处理子模块，用于利用同义词识别模型，识别所述目标测试集中的各个所述候选实体词之间是否为同义词，得到识别结果。The seventh processing submodule is used to use a synonym recognition model to identify whether each of the candidate entity words in the target test set is a synonym, and obtain a recognition result.

可选地，所述第四处理子模块包括：Optionally, the fourth processing submodule includes:

第六处理单元，用于对所述第一候选实体词和所述第二候选实体词分别进行第二预处理，得到所述第一候选实体词对应的第一词语列表和所述第二候选实体词对应的第二词语列表，所述第二预处理包括：分词并删除非中文字符；a sixth processing unit, configured to perform second preprocessing on the first candidate entity word and the second candidate entity word respectively, to obtain a first word list corresponding to the first candidate entity word and a second word list corresponding to the second candidate entity word, wherein the second preprocessing includes: word segmentation and deletion of non-Chinese characters;

第七处理单元，用于基于所述同义词词林算法，依次计算第一词语列表中每个词和第二词语列表之间的相似度，获得所述第一候选实体词对应的第一相似度列表，以及，依次计算第二词语列表中每个词和第一词语列表之间的相似度，获得所述第二候选实体词对应的第二相似度列表；A seventh processing unit is used to sequentially calculate the similarity between each word in the first word list and the second word list based on the synonym word forest algorithm to obtain a first similarity list corresponding to the first candidate entity word, and sequentially calculate the similarity between each word in the second word list and the first word list to obtain a second similarity list corresponding to the second candidate entity word;

第八处理单元，用于根据所述第一相似度列表和所述第二相似度列表，获得所述第一候选实体词和所述第二候选实体词之间的第一相似度。An eighth processing unit is used to obtain a first similarity between the first candidate entity word and the second candidate entity word according to the first similarity list and the second similarity list.

可选地，所述第五处理子模块包括：Optionally, the fifth processing submodule includes:

第九处理单元，用于采用Word2Vec深度学习模型，使用第一训练集进行模型训练，获得目标词向量集合；A ninth processing unit is used to adopt a Word2Vec deep learning model, use the first training set to perform model training, and obtain a target word vector set;

第十处理单元，用于基于余弦相似度算法，根据所述第一候选实体词在所述目标词向量集合中对应的第一词向量和所述第二候选实体词在所述目标词向量集合中对应的第二词向量，获得所述第一词向量和所述第二词向量之间的相似度，并将所述第一词向量和所述第二词向量之间的相似度作为所述第一候选实体词和所述第二候选实体词之间的第二相似度。The tenth processing unit is used to obtain the similarity between the first word vector and the second word vector according to the first word vector corresponding to the first candidate entity word in the target word vector set and the second word vector corresponding to the second candidate entity word in the target word vector set based on the cosine similarity algorithm, and use the similarity between the first word vector and the second word vector as the second similarity between the first candidate entity word and the second candidate entity word.

可选地，所述第六处理子模块包括：Optionally, the sixth processing submodule includes:

第十一处理单元，用于使用第二训练集，对所述同义词识别模型对应的逻辑回归模型进行训练，直至所述逻辑回归模型收敛，并根据所述逻辑回归模型收敛时的回归系数向量，确定所述第一相似度和所述第二相似度分别对应的权重，所述第二训练集包括至少一组同义词和至少一组非同义词；an eleventh processing unit, configured to train a logistic regression model corresponding to the synonym recognition model using a second training set until the logistic regression model converges, and determine weights corresponding to the first similarity and the second similarity respectively according to a regression coefficient vector when the logistic regression model converges, wherein the second training set includes at least one group of synonyms and at least one group of non-synonymous words;

第十二处理单元，用于根据所述第一相似度的权重和所述第二相似度的权重，构建所述同义词识别模型。A twelfth processing unit is used to construct the synonym recognition model according to the weight of the first similarity and the weight of the second similarity.

为达到上述目的，本发明的实施例提供一种网络设备，包括处理器和收发机，其中，所述处理用于：To achieve the above object, an embodiment of the present invention provides a network device, including a processor and a transceiver, wherein the processing is used to:

基于条件随机场CRF算法，对第一测试集进行实体词识别，得到目标测试集，所述第一测试集为包含兴趣点POI的文本数据，所述目标测试集包括至少一个候选实体词，所述候选实体词与POI关联；Based on the conditional random field (CRF) algorithm, entity word recognition is performed on the first test set to obtain a target test set, where the first test set is text data containing a point of interest (POI), and the target test set includes at least one candidate entity word, and the candidate entity word is associated with the POI;

可选地，所述处理器在基于条件随机场CRF算法，对第一测试集进行实体词识别，得到目标测试集时，具体用于：Optionally, when the processor performs entity word recognition on the first test set based on the conditional random field (CRF) algorithm to obtain the target test set, the processor is specifically configured to:

可选地，所述处理器在使用第一训练集对所述第一识别模型进行训练，得到实体词识别模型时，具体用于：Optionally, when the processor uses the first training set to train the first recognition model to obtain the entity word recognition model, it is specifically used to:

可选地，所述处理器在将所述第一测试集输入至所述实体词识别模型，得到所述目标测试集时，具体用于：Optionally, when the processor inputs the first test set into the entity word recognition model to obtain the target test set, the processor is specifically configured to:

可选地，所述目标相似度算法包括：同义词林算法和机器学习算法，所述处理器在利用同义词识别模型，识别所述目标测试集中的各个所述候选实体词之间是否为同义词，得到识别结果时，具体用于：Optionally, the target similarity algorithm includes: a synonym forest algorithm and a machine learning algorithm. When the processor uses a synonym recognition model to identify whether each of the candidate entity words in the target test set is a synonym and obtains a recognition result, it is specifically used to:

可选地，所述处理器在基于所述同义词词林算法，获得所述目标测试集中第一候选实体词和第二候选实体词之间的第一相似度时，具体用于：Optionally, when the processor obtains the first similarity between the first candidate entity word and the second candidate entity word in the target test set based on the synonym word forest algorithm, it is specifically configured to:

可选地，所述处理器在基于所述机器学习算法，获得所述第一候选实体词和所述第二候选实体词之间的第二相似度时，具体用于：Optionally, when the processor obtains the second similarity between the first candidate entity word and the second candidate entity word based on the machine learning algorithm, it is specifically configured to:

可选地，所述处理器在根据所述第一相似度和所述第二相似度，获得所述同义词识别模型时，具体用于：Optionally, when the processor obtains the synonym recognition model according to the first similarity and the second similarity, it is specifically configured to:

为达到上述目的，本发明的实施例提供一种网络设备，包括收发器、处理器、存储器及存储在所述存储器上并可在所述处理器上运行的程序或指令；所述处理器执行所程序或指令时实现如上所述的同义词识别方法。To achieve the above-mentioned purpose, an embodiment of the present invention provides a network device, including a transceiver, a processor, a memory, and a program or instruction stored in the memory and executable on the processor; when the processor executes the program or instruction, the synonym recognition method as described above is implemented.

为达到上述目的，本发明的实施例提供一种可读存储介质，其上存储有程序或指令，所述程序或指令被处理器执行时实现如上所述的同义词识别方法中的步骤。To achieve the above objective, an embodiment of the present invention provides a readable storage medium having a program or instruction stored thereon, wherein the program or instruction, when executed by a processor, implements the steps in the above-mentioned synonym identification method.

本发明的上述技术方案的有益效果如下：The beneficial effects of the above technical solution of the present invention are as follows:

本发明实施例的方法，基于CRF算法，可以获得包括与POI关联的至少一个候选实体词的目标测试集，进一步可以利用同义词识别模型，识别目标测试集中的各个候选实体词之间是否为同义词，从而得到POI领域的各种同义词，可以提升POI搜索效率。这里，同义词识别模型是基于至少两种目标相似度算法构建的，因此可以结合两种目标相似度算法的优势，使得同义词识别结果更加精准，且识别过程简单，运行速度快。The method of the embodiment of the present invention can obtain a target test set including at least one candidate entity word associated with a POI based on a CRF algorithm, and can further use a synonym recognition model to identify whether each candidate entity word in the target test set is a synonym, thereby obtaining various synonyms in the POI field, which can improve the POI search efficiency. Here, the synonym recognition model is constructed based on at least two target similarity algorithms, so the advantages of the two target similarity algorithms can be combined to make the synonym recognition result more accurate, and the recognition process is simple and the running speed is fast.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1为本发明实施例的同义词识别方法的流程图；FIG1 is a flow chart of a synonym identification method according to an embodiment of the present invention;

图2为CRF算法示意图；Figure 2 is a schematic diagram of the CRF algorithm;

图3为本发明实施例的对第一识别模型进行训练的流程示意图；FIG3 is a schematic diagram of a process of training a first recognition model according to an embodiment of the present invention;

图4为本发明实施例的对第一测试集进行实体词识别的流程示意图；FIG4 is a schematic diagram of a process of performing entity word recognition on a first test set according to an embodiment of the present invention;

图5为本发明实施例的同义词识别装置的结构图；FIG5 is a structural diagram of a synonym recognition device according to an embodiment of the present invention;

图6为本发明实施例的网络设备的结构图；FIG6 is a structural diagram of a network device according to an embodiment of the present invention;

图7为本发明另一实施例的网络设备的结构图。FIG. 7 is a structural diagram of a network device according to another embodiment of the present invention.

具体实施方式Detailed ways

为使本发明要解决的技术问题、技术方案和优点更加清楚，下面将结合附图及具体实施例进行详细描述。In order to make the technical problems, technical solutions and advantages to be solved by the present invention more clear, a detailed description will be given below with reference to the accompanying drawings and specific embodiments.

应理解，说明书通篇中提到的“一个实施例”或“一实施例”意味着与实施例有关的特定特征、结构或特性包括在本发明的至少一个实施例中。因此，在整个说明书各处出现的“在一个实施例中”或“在一实施例中”未必一定指相同的实施例。此外，这些特定的特征、结构或特性可以任意适合的方式结合在一个或多个实施例中。It should be understood that the references to "one embodiment" or "an embodiment" throughout the specification mean that the specific features, structures, or characteristics associated with the embodiment are included in at least one embodiment of the present invention. Therefore, the references to "in one embodiment" or "in an embodiment" appearing throughout the specification do not necessarily refer to the same embodiment. In addition, these specific features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

在本发明的各种实施例中，应理解，下述各过程的序号的大小并不意味着执行顺序的先后，各过程的执行顺序应以其功能和内在逻辑确定，而不应对本发明实施例的实施过程构成任何限定。In various embodiments of the present invention, it should be understood that the size of the serial numbers of the following processes does not mean the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

另外，本文中术语“系统”和“网络”在本文中常可互换使用。Additionally, the terms "system" and "network" are often used interchangeably herein.

在本申请所提供的实施例中，应理解，“与A相应的B”表示B与A相关联，根据A可以确定B。但还应理解，根据A确定B并不意味着仅仅根据A 确定B，还可以根据A和/或其它信息确定B。In the embodiments provided in this application, it should be understood that "B corresponding to A" means that B is associated with A, and B can be determined according to A. However, it should also be understood that determining B according to A does not mean determining B only according to A, and B can also be determined according to A and/or other information.

目前，常用的词语相似度计算方法有两种：基于词典的计算方法和基于统计的上下文向量空间模型方法，具体说明如下：At present, there are two commonly used methods for calculating word similarity: a dictionary-based calculation method and a statistical context vector space model method, which are described as follows:

基于词典的计算方法：也叫做基于世界知识或某种分类体系的方法。在词典或词汇分类体系中，所有同类的语义项构成一个具有层次的树状结构，相应结点之间的距离(称为概念距离)即可表示词汇语义之间的相似程度。在利用词典或者词汇分类体系进行词汇语义相似度计算的时候，需要把复杂词汇概念首先转化成词典或者词汇分类体系中所收录的最小概念单元，然后利用概念层次体系结构，对各个概念单元进行语义相似度计算。常用的词典有《同义词词林》、《百科词条》等。然而，该方法得到的结果受人的主观意识影响较大，有时候并不能准确反映客观事实，且同义词通常比较偏向书面语，易受限于词典的更新和词汇的完整性。Dictionary-based calculation method: also called a method based on world knowledge or a certain classification system. In a dictionary or vocabulary classification system, all similar semantic items form a hierarchical tree structure, and the distance between corresponding nodes (called concept distance) can represent the similarity between vocabulary semantics. When using a dictionary or vocabulary classification system to calculate the similarity of vocabulary semantics, it is necessary to first convert complex vocabulary concepts into the smallest concept units included in the dictionary or vocabulary classification system, and then use the concept hierarchy system structure to calculate the semantic similarity of each concept unit. Commonly used dictionaries include "Synonym Cilin" and "Encyclopedia Encyclopedia". However, the results obtained by this method are greatly affected by human subjective consciousness, and sometimes cannot accurately reflect objective facts. In addition, synonyms are usually more inclined to written language and are easily restricted by dictionary updates and vocabulary integrity.

基于上下文向量空间模型的方法：基于文档向量空间模型的词汇相似度计算把当前词汇的上下文或者固定大小的窗口内容表示成一个词义空间向量。如果两个词的词义空间向量是相似的，这两个词就可以认为是同义词或者相关词。计算向量之间的相似度方法有很多，常见的有余弦相似度计算法，通过计算两个向量的夹角余弦值来评估他们的相似度，值越接近于1说明他们的相似性越高。该方法中，主要根据汉语同义词的构词特点，绝大多数的同义词含有相同语素(即字)，通过计算字面相似度来发现同义词。然而，该方法实现虽然简单，但通常只能判断出字面上相似的同义词，对于异形同义词却不能很好的识别，如：建设银行和CBB(ChinaConstructionBank，中国建设银行)。Method based on context vector space model: The vocabulary similarity calculation based on the document vector space model represents the context of the current vocabulary or the content of a fixed-size window as a word meaning space vector. If the word meaning space vectors of two words are similar, the two words can be considered synonyms or related words. There are many methods for calculating the similarity between vectors. The most common one is the cosine similarity calculation method, which evaluates their similarity by calculating the cosine value of the angle between two vectors. The closer the value is to 1, the higher their similarity. In this method, based on the word formation characteristics of Chinese synonyms, most synonyms contain the same morphemes (i.e., characters), and synonyms are discovered by calculating the literal similarity. However, although this method is simple to implement, it can usually only determine synonyms that are similar in literal sense, but cannot identify synonyms with different forms well, such as: Construction Bank and CBB (China Construction Bank).

如图1所示，本发明实施例的一种同义词识别方法，包括：As shown in FIG1 , a synonym recognition method according to an embodiment of the present invention includes:

步骤101，基于条件随机场CRF算法，对第一测试集进行实体词识别，得到目标测试集，所述第一测试集为包含兴趣点POI的文本数据，所述目标测试集包括至少一个候选实体词，所述候选实体词与POI关联。Step 101, based on the conditional random field (CRF) algorithm, entity word recognition is performed on the first test set to obtain a target test set, wherein the first test set is text data containing points of interest (POIs), and the target test set includes at least one candidate entity word, and the candidate entity word is associated with the POI.

该步骤中，可以通过CRF算法抽取实体，也就是从包含兴趣点POI的文本数据中，提取与POI关联的候选实体词。这里，进行实体识别，也就是将第一测试集中的与POI关联的地名、机构名等实体词查找出来。例如，在POI 领域的“住宿”类别中，酒店、宾馆、民俗等设施即为实体，“某某酒店”、“某某某宾馆”等酒店名、宾馆名即为实体词。In this step, entities can be extracted through the CRF algorithm, that is, candidate entity words associated with the POI are extracted from the text data containing the POI. Here, entity recognition is performed, that is, entity words such as place names and institution names associated with the POI in the first test set are found. For example, in the "accommodation" category in the POI field, facilities such as hotels, guesthouses, and folk customs are entities, and hotel names such as "such and such hotel" and "such and such guesthouse" are entity words.

步骤102，利用同义词识别模型，识别所述目标测试集中的各个所述候选实体词之间是否为同义词，得到识别结果，其中，所述同义词识别模型是基于至少两种目标相似度算法构建的。Step 102, using a synonym recognition model, identifying whether each of the candidate entity words in the target test set is a synonym, and obtaining a recognition result, wherein the synonym recognition model is constructed based on at least two target similarity algorithms.

该步骤中，针对目标测试集中的候选实体词进行同义词识别，得到的识别结果是多组与POI关联的同义词。In this step, synonym recognition is performed on the candidate entity words in the target test set, and the recognition results obtained are multiple groups of synonyms associated with POIs.

需要说明的是，常见的同义词识别算法往往只考虑字形和含义因素，但是在POI搜索领域中，通常涉及诸如餐饮、住宿等具体的地点名称，均为名词，而名词在语言逻辑上没有近义词、同义词之说，因此相似度计算效果并不理想。例如，用户要查询自己希望搜索的地点时，往往输入的是别名或者地名缩写，如要搜索“沈阳市公安局交通警察局车辆管理所”时，用户很有可能由于不知道全名或其他原因而不输入全名，而可能输入“车管所”或“交管局”等词，导致同义词识别效果不理想。对此，考虑到POI在地理信息系统中，每个POI都有自己的名称，POI名称根据特点可以理解为是命名实体，针对POI领域的同义词识别算法可以转化为针对命名实体的同义词识别算法。It should be noted that common synonym recognition algorithms often only consider factors such as font and meaning. However, in the field of POI search, specific place names such as restaurants and accommodation are usually involved, which are all nouns. Nouns do not have synonyms or synonyms in language logic, so the similarity calculation effect is not ideal. For example, when users want to query the place they want to search, they often enter an alias or abbreviation of the place name. For example, when searching for "Shenyang Public Security Bureau Traffic Police Department Vehicle Management Office", the user may not enter the full name because they do not know the full name or other reasons, but may enter words such as "Vehicle Management Office" or "Traffic Management Bureau", resulting in unsatisfactory synonym recognition. In this regard, considering that each POI has its own name in the geographic information system, the POI name can be understood as a named entity based on its characteristics. The synonym recognition algorithm for the POI field can be converted into a synonym recognition algorithm for named entities.

该实施例中，可以基于CRF算法，获得包括与POI关联的至少一个候选实体词的目标测试集，进一步可以利用同义词识别模型，识别目标测试集中的各个候选实体词之间是否为同义词，从而得到POI领域的各种同义词，这样，可以应用在导航等地理信息领域，由于同义词识别模型是基于至少两种目标相似度算法构建的，可以结合两种目标相似度算法的优势，因此在用户进行POI 搜索时，能够得到更为精准的同义词识别结果，且识别过程简单，运行速度快。In this embodiment, based on the CRF algorithm, a target test set including at least one candidate entity word associated with the POI can be obtained. Further, the synonym recognition model can be used to identify whether the candidate entity words in the target test set are synonyms, thereby obtaining various synonyms in the POI field. In this way, it can be applied in geographic information fields such as navigation. Since the synonym recognition model is constructed based on at least two target similarity algorithms, it can combine the advantages of the two target similarity algorithms. Therefore, when the user searches for POI, a more accurate synonym recognition result can be obtained, and the recognition process is simple and the running speed is fast.

这里，对CRF算法简单说明如下：Here, the CRF algorithm is briefly described as follows:

CRF是Lafferty借助于最大熵模型的思想提出的一个算法模型。它是一种基于标记序列的统计模型，是一种隐马尔科夫模型。由字组成的序列叫做观察序列，每个字对应的标记组成标记序列。通过计算训练集中每个字对应的标记出现的概率生成CRF模型，利用这个模型在给定观察序列的情况下，计算整个标记序列的联合概率。CRF is an algorithm model proposed by Lafferty with the help of the maximum entropy model. It is a statistical model based on the tag sequence and a hidden Markov model. The sequence composed of words is called the observation sequence, and the tags corresponding to each word constitute the tag sequence. The CRF model is generated by calculating the probability of the tag corresponding to each word in the training set, and this model is used to calculate the joint probability of the entire tag sequence given the observation sequence.

用变量X表示文本的观察序列，变量Y表示与之对应的标记序列。规定一个固定大小的有限字符集，令集合Y中所有元素Y_i的表示范围不得超过这个字符集。P(Y|X)表示在给定观察序列的条件下标记序列的概率，P(X)表示这个观察序列的概率。变量X和Y满足联合概率分布的条件。Variable X represents the observation sequence of the text, and variable Y represents the corresponding tag sequence. A fixed-size finite character set is specified, and the representation range of all elements _Yi in set Y must not exceed this character set. P(Y|X) represents the probability of the tag sequence under the condition of a given observation sequence, and P(X) represents the probability of this observation sequence. Variables X and Y satisfy the conditions of joint probability distribution.

如图2所示，令G＝(V，E)表示一个无向图，V表示图的点集合，E表示图的边集合。Y＝(Y_v)_v∈V，T_v为标记序列集合Y中的元素，它与无向图G中的顶点一一对应。当观察序列X成立时，随机变量Y_v的条件概率分布服从图的马尔可夫属性：P(Y_v|X，Y_w w≠v)＝P(Y_v|X，Y_w，w～v)，其中w～v表示(w，v)是无向图G的边。这时我们称(X，Y)是一个条件随机场。理论上讲，图G的结构为任意，然而在构建模型时，CRF采用了最简单和最重要的一阶链式结构。如图2所示，条件随机场(X，Y)以观察序列X作为全局条件，并且不对X做任何假设。这种简单结构可以被用来在标记序列上定义一个联合概率分布P(Y|X)，因此，CRF中主要关注两个子序列：X＝(X₁，X₂，...，X_n)和Y＝(Y₁，Y₂，...，Y_n)。其中，n为大于0的整数， X₁表示观察序列X中的第一个字符；X₂表示观察序列X中的第二个字符；Xn表示观察序列X中的第n个字符；Y₁表示标记序列Y中的第一个字符；Y₂表示标记序列Y中的第二个字符；Y_n表示标记序列Y中的第三个字符。As shown in Figure 2, let G = (V, E) represent an undirected graph, V represents the set of points in the graph, and E represents the set of edges in the graph. Y = (Y _v ) _v∈V , T _v is an element in the tag sequence set Y, which corresponds one-to-one to the vertices in the undirected graph G. When the observation sequence X is established, the conditional probability distribution of the random variable Y _v obeys the Markov property of the graph: P(Y _v |X, Y _w w≠v) = P(Y _v |X, Y _w , w～v), where w～v means that (w, v) is an edge of the undirected graph G. At this time, we call (X, Y) a conditional random field. In theory, the structure of the graph G is arbitrary, but when building the model, CRF uses the simplest and most important first-order chain structure. As shown in Figure 2, the conditional random field (X, Y) takes the observation sequence X as the global condition and does not make any assumptions about X. This simple structure can be used to define a joint probability distribution P(Y|X) over the tag sequence. Therefore, CRF mainly focuses on two subsequences: X = ( _X1 , _X2 , ..., _Xn ) and Y = ( _Y1 , _Y2 , ..., _Yn ). Where n is an integer greater than 0, _X1 represents the first character in the observation sequence X; _X2 represents the second character in the observation sequence X; Xn represents the nth character in the observation sequence X; _Y1 represents the first character in the tag sequence Y; _Y2 represents the second character in the tag sequence Y; _Yn represents the third character in the tag sequence Y.

本发明实施例中，是基于条件随机场CRF算法对第一测试集进行实体词识别，具体说明如下：In the embodiment of the present invention, entity word recognition is performed on the first test set based on the conditional random field CRF algorithm, which is specifically described as follows:

作为一可选实施例，所述基于条件随机场CRF算法，对第一测试集进行实体词识别，得到目标测试集，包括：As an optional embodiment, the method of performing entity word recognition on the first test set based on the conditional random field (CRF) algorithm to obtain the target test set includes:

(一)基于所述CRF算法，确定第一识别模型。(i) Based on the CRF algorithm, determine a first recognition model.

这里，第一识别模型是基于CRF算法构建的，也就是将CRF算法作为第一识别模型的模型算法。因此，对第一识别模型进行模型训练过程，也是概率统计计算的过程，即统计每个实体词出现的概率、相关字的标记出现的概率。Here, the first recognition model is constructed based on the CRF algorithm, that is, the CRF algorithm is used as the model algorithm of the first recognition model. Therefore, the model training process of the first recognition model is also a process of probability statistical calculation, that is, the probability of each entity word appearing and the probability of the tag of the related word appearing are counted.

(二)使用第一训练集对所述第一识别模型进行训练，得到实体词识别模型，所述第一训练集为包含N类POI的文本数据。(ii) Using a first training set to train the first recognition model to obtain an entity word recognition model, the first training set being text data containing N types of POIs.

这里，若POI领域内分为16个POI类别，比如：公共设施、工厂企业、工商业区、住宿、教育文化场所、新闻媒体、购物零售场所、餐饮场所、医疗卫生场所、交通机构及场所、金融机构、通信机构及场所、商业服务场所、居民服务场所、休闲娱乐场所、风景名胜和出入口，则第一训练集可针对上述 16个POI类别收集训练语料(例如POI的文字资料等)，来构建第一训练集和第一测试集，使得数据更为全面，训练效果更好。Here, if the POI field is divided into 16 POI categories, such as: public facilities, factories and enterprises, industrial and commercial areas, accommodation, educational and cultural places, news media, shopping and retail places, catering places, medical and health places, transportation institutions and places, financial institutions, communication institutions and places, commercial service places, resident service places, leisure and entertainment places, scenic spots and entrances and exits, then the first training set can collect training corpus (such as text materials of POI, etc.) for the above 16 POI categories to construct the first training set and the first test set, so that the data is more comprehensive and the training effect is better.

训练完成之后，可得到模型文件，即实体词识别模型。进而可以用实体词识别模型识别第一测试集，得到目标测试集。After the training is completed, a model file, namely, an entity word recognition model, can be obtained. Then, the entity word recognition model can be used to recognize the first test set to obtain a target test set.

(三)将所述第一测试集输入至所述实体词识别模型，得到所述目标测试集。(3) Inputting the first test set into the entity word recognition model to obtain the target test set.

其中，目标测试集中包括有多组与POI关联的同义词。The target test set includes multiple groups of synonyms associated with POI.

如图4所示，作为一可选实施例，所述将所述第一测试集输入至所述实体词识别模型，得到所述目标测试集，包括：As shown in FIG. 4 , as an optional embodiment, the step of inputting the first test set into the entity word recognition model to obtain the target test set includes:

对所述第一测试集进行所述第一预处理，得到至少一个待标记字符串，所述第一预处理包括：分词和/或词性标注；Performing the first preprocessing on the first test set to obtain at least one character string to be marked, wherein the first preprocessing includes: word segmentation and/or part-of-speech tagging;

这里，将句子分割为待标记字串，利用实体词识别模型进行字的标记。Here, the sentence is divided into strings to be marked, and the words are marked using the entity word recognition model.

该实施例中，通过Model(即实体词识别模型)，可以得到各个字出现的概率，结合这些信息，可以从至少一个待标记字符串中快速找到联合概率最大的标记序列，并作为输出结果Result(即目标测试集)。其中，CRF_Test的任务是计算联合概率达到最大的标记序列。In this embodiment, the probability of each word appearing can be obtained through Model (i.e., entity word recognition model). Combining this information, the tag sequence with the largest joint probability can be quickly found from at least one string to be tagged, and used as the output result Result (i.e., target test set). Among them, the task of CRF_Test is to calculate the tag sequence with the largest joint probability.

如图3所示，作为一可选实施例，所述使用第一训练集对所述第一识别模型进行训练，得到实体词识别模型，包括：As shown in FIG. 3 , as an optional embodiment, the first recognition model is trained using the first training set to obtain an entity word recognition model, including:

(一)对所述第一训练集进行第一预处理，得到至少一个第一字符串，所述第一预处理包括：分词和/或词性标注。(1) performing a first preprocessing on the first training set to obtain at least one first character string, wherein the first preprocessing includes: word segmentation and/or part-of-speech tagging.

这里，第一预处理的主要任务是：分词、词性标注处理，并人工标注地名、组织机构等实体信息。例如，第一训练集中一文本数据分词后的结果为：本报 /r讯/Ng记者/n孟宪励/nr：/w由/p中国业余舞蹈竞技协会/nt主办 /v的/u“/w科迪杯/nz”/w第二/m届/q中国/ns业余/b国际/n 标准舞/n大赛/vn，/w日前/t在/p北京朝阳体育馆/ns落/v下/v 帷幕/n。其中，中国业余舞蹈竞技协会、北京朝阳体育馆均为实体词。Here, the main tasks of the first preprocessing are: word segmentation, part-of-speech tagging, and manual tagging of entity information such as place names and organizations. For example, the result of word segmentation of a text data in the first training set is: This newspaper /r News / Ng Reporter / n Meng Xianli / nr: / wThe / u "/wCody Cup/nz/wThe / wSecond/m/qChina/nsAmateur/bInternational/nStandard Dance/nContest/vn hosted by / pChina Amateur Dance Competition Association/nt, / wRecently/t/t/pBeijing Chaoyang Gymnasium/ns/ns ended/v The curtain/n. Among them, / China Amateur Dance Competition Association / Beijing Chaoyang Gymnasium are both entity words.

(二)利用目标标注法，对所述至少一个第一字符串进行标注，得到标注结果。(2) Using a target labeling method, labeling the at least one first character string to obtain a labeling result.

由于CRF算法是基于序列标注的方式实现，因此基于CRF算法对第一测试集进行实体词识别时，需要进行词性标注。这里，示例性的，目标标注法可以选用5-tag标注法进行文本标注，5-tag标注法的标注集为(B1,B2,M,E,O)，也就是将实体词首标注为B1，实体词的第二个字符标记为B2，中间字符标注为M，最后一个字符标记为E，非实体词均标注为O。例如，“王厚元饺子馆”是一个餐饮实体，那么可以将其实体词标注为“王/B1厚/B2元/M饺/M子/M馆/E”。Since the CRF algorithm is implemented based on sequence labeling, part-of-speech tagging is required when the entity word recognition of the first test set is performed based on the CRF algorithm. Here, illustratively, the target labeling method can use the 5-tag labeling method for text labeling. The labeling set of the 5-tag labeling method is (B1, B2, M, E, O), that is, the first character of the entity word is marked as B1, the second character of the entity word is marked as B2, the middle character is marked as M, the last character is marked as E, and non-entity words are all marked as O. For example, "Wang Houyuan Dumpling Restaurant" is a catering entity, so its entity words can be marked as "Wang/B1 Hou/B2 Yuan/M Dumpling/M Zi/M Restaurant/E".

(三)使用所述标注结果，训练所述第一识别模型，得到所述实体词识别模型。(iii) Using the annotation results, training the first recognition model to obtain the entity word recognition model.

如图3所示，使用标注结果(即模板文件)、训练样本(即至少一个第一字符串)，训练所述第一识别模型(即CRF_Learn)，可以得到Model(即实体词识别模型)。其中，CRF_Learn的主要任务是综合考虑各个输入项的特征，计算并统计各个实体词的概率，为后续阶段提供依据。As shown in Figure 3, the first recognition model (i.e., CRF_Learn) is trained using the annotation results (i.e., template file) and training samples (i.e., at least one first string) to obtain Model (i.e., entity word recognition model). Among them, the main task of CRF_Learn is to comprehensively consider the characteristics of each input item, calculate and count the probability of each entity word, and provide a basis for the subsequent stages.

该实施例中，基于所述同义词词林算法和机器学习算法构建同义词识别模型，可以综合两种方式的优点，使得同义词识别模型的识别效果更优，准确率更高。In this embodiment, a synonym recognition model is constructed based on the synonym dictionary algorithm and the machine learning algorithm, which can combine the advantages of the two methods to make the synonym recognition model have better recognition effect and higher accuracy.

步骤1，对所述第一候选实体词和所述第二候选实体词分别进行第二预处理，得到所述第一候选实体词对应的第一词语列表和所述第二候选实体词对应的第二词语列表，所述第二预处理包括：分词并删除非中文字符。Step 1, perform second preprocessing on the first candidate entity word and the second candidate entity word respectively to obtain a first word list corresponding to the first candidate entity word and a second word list corresponding to the second candidate entity word, wherein the second preprocessing includes: word segmentation and deletion of non-Chinese characters.

该步骤中，进行第二预处理时，可以使用分词工具对两个待判断的实体词 (例如第一候选实体词和第二候选实体词)进行分词处理，并过滤掉其中的非中文字符(例如特殊符号、换行符、空格、空白字符等)，分别得到两个词语列表，例如，第一候选实体词(用W_a表示)对应的第一词语列表为 {W_a1,W_a2,...,W_an}，第二候选实体词(用W_b表示)对应的第二词语列表为 {W_b1,W_b2,...,W_bm}，其中，n表示第一词语列表中词的个数；m表示第二词语列表中词的个数。In this step, when performing the second preprocessing, a word segmentation tool can be used to perform word segmentation on two entity words to be judged (for example, the first candidate entity word and the second candidate entity word), and filter out non-Chinese characters (for example, special symbols, line breaks, spaces, blank characters, etc.) to obtain two word lists respectively. For example, the first word list corresponding to the first candidate entity word (represented by _Wa ) is {W _a1 , _Wa2 ,...,W _an }, and the second word list corresponding to the second candidate entity word (represented by W _b ) is {W _b1 , W _b2 ,...,W _bm }, where n represents the number of words in the first word list; m represents the number of words in the second word list.

步骤2，基于所述同义词词林算法，依次计算第一词语列表中每个词和第二词语列表之间的相似度，获得所述第一候选实体词对应的第一相似度列表，以及，依次计算第二词语列表中每个词和第一词语列表之间的相似度，获得所述第二候选实体词对应的第二相似度列表。Step 2, based on the synonym word forest algorithm, calculate the similarity between each word in the first word list and the second word list in turn, obtain the first similarity list corresponding to the first candidate entity word, and calculate the similarity between each word in the second word list and the first word list in turn, obtain the second similarity list corresponding to the second candidate entity word.

该步骤中，需要将W_a中的每个词依次与W_b中每个词进行相似度计算，得到第一词语列表中每个词和第二词语列表之间的相似度。例如，将W_a1分别与 W_b1、W_b2、...、W_bm进行相似度计算，得到的相似度值分别为： p₁、p₂、...、p_m，若p₁、p₂、...、p_m中，p₂的值最大，则将p₂确定为W_a1和第二词语列表之间的相似度。同理，可以依次分别求得W_a2、...、W_an和第二词语列表之间的相似度，假设分别表示为：q₃、...、h₅，则第一候选实体词对应的第一相似度列表可以表示为{p₂,q₃,...,h₅}。同理，也可以依次计算第二词语列表中每个词和第一词语列表之间的相似度，并获得第二候选实体词对应的第二相似度列表。In this step, it is necessary to calculate the similarity between each word in W _a and each word in W _b in turn, and obtain the similarity between each word in the first word list and the second word list. For example, the similarity between W _a1 and W _b1 , W _b2 , ..., W _bm is calculated respectively, and the obtained similarity values are: p ₁ , p ₂ , ..., p _m , if the value of p ₂ is the largest among p ₁ , p ₂ , ..., p _m , then p ₂ is determined as the similarity between W _a1 and the second word list. Similarly, the similarity between W _a2 , ..., _Wan and the second word list can be obtained in turn, assuming that they are expressed as: q ₃ , ..., h ₅ , respectively, then the first similarity list corresponding to the first candidate entity word can be expressed as {p ₂ ,q ₃ ,...,h ₅ }. Similarly, the similarity between each word in the second word list and the first word list can also be calculated in turn, and the second similarity list corresponding to the second candidate entity word can be obtained.

步骤3，根据所述第一相似度列表和所述第二相似度列表，获得所述第一候选实体词和所述第二候选实体词之间的第一相似度。Step 3: Obtain a first similarity between the first candidate entity word and the second candidate entity word according to the first similarity list and the second similarity list.

这里，假设第一候选实体词(W_a)对应的第一相似度列表为： {SIM_a1,...,SIM_an}，第一候选实体词(W_b)对应的第二相似度列表为： {SIM_b1,...,SIM_bm}。Here, it is assumed that the first similarity list corresponding to the first candidate entity word (W _a ) is: {SIM _a1 , ..., SIM _an }, and the second similarity list corresponding to the first candidate entity word (W _b ) is: {SIM _b1 , ..., SIM _bm }.

则，可以根据以下公式第一候选实体词和第二候选实体词之间的第一相似度：Then, the first similarity between the first candidate entity word and the second candidate entity word can be calculated according to the following formula:

其中，SIM_dict表示第一相似度；n表示所述第一词语列表中词的个数；m 表示所述第二词语列表中词的个数；SIM_ai表示所述第一相似度列表中的第i 个值(即Wa中第i个词和W_a的相似度)；SIM_bj表示所述第二相似度列表中的第j个值(Wb中第j个词和W_b的相似度)。Among them, SIM _dict represents the first similarity; n represents the number of words in the first word list; m represents the number of words in the second word list; SIM _ai represents the i-th value in the first similarity list (that is, the similarity between the i-th word in Wa and _Wa ); SIM _bj represents the j-th value in the second similarity list (the similarity between the j-th word in Wb and _Wb ).

需要说明的是，在确定W_a中的某个词)与W_b中的某个词之间的相似度时，通过以下方式确定：It should be noted that when determining the similarity between a certain word in W _a ) and a certain word in W _b , it is determined in the following way:

判断两个词的编码(即两个词在基于同义词林构造的词典中对应的编码，假设分别为C_a、C_b)是否相同；若相同，则两个词的相似度值为1；若不相同，则需要通过第一公式计算两个编码的相似度，第一公式表示为：Determine whether the codes of two words (i.e., the codes corresponding to the two words in the dictionary constructed based on the synonym forest, assuming they are _Ca and _Cb respectively) are the same; if they are the same, the similarity value of the two words is 1; if they are not the same, the similarity of the two codes needs to be calculated using the first formula, which is expressed as:

其中，LCS(C_a,C_b)表示编码C_a和编码C_b的最近公共父节点；IC(C)表示编码C的信息内容含量，IC(C)可以根据第二公式计算得到，第二公式表示为：Wherein, LCS( _Ca , _Cb ) represents the nearest common parent node of code _Ca and code _Cb ; IC(C) represents the information content of code C, and IC(C) can be calculated according to the second formula, which is expressed as:

其中，n表示本体节点总数，k表示本体中下位节点个数。Among them, n represents the total number of ontology nodes, and k represents the number of lower-level nodes in the ontology.

作为本发明一可选实施例，所述基于所述机器学习算法，获得所述第一候选实体词和所述第二候选实体词之间的第二相似度，包括：As an optional embodiment of the present invention, obtaining a second similarity between the first candidate entity word and the second candidate entity word based on the machine learning algorithm includes:

采用Word2Vec深度学习模型，使用第一训练集进行模型训练，获得目标词向量集合，所述第一训练集为包含N类POI的文本数据；Using the Word2Vec deep learning model, using the first training set to perform model training to obtain a target word vector set, where the first training set is text data containing N types of POIs;

需要说明的是，Word2vec是一群用来产生词向量的相关模型。这些模型为浅而双层的神经网络，用来训练以重新建构语言学之词文本。网络以词表现，并且需猜测相邻位置的输入词，在word2vec中词袋模型假设下，词的顺序是不重要的。训练完成之后，word2vec模型可用来映射每个词到一个向量，可用来表示词对词之间的关系，该向量为神经网络之隐藏层。Word2Ve包含两种算法，分别是skip-gram和连续词袋模型(Continuous Bag ofWords Model， CBOW)算法，它们的最大区别是skip-gram是通过中心词去预测中心词周围的词，而CBOW是通过周围的词去预测中心词。本发明实施例中可采用 skip-gram算法，使用第一训练集进行模型训练。It should be noted that Word2vec is a group of related models used to generate word vectors. These models are shallow and double-layer neural networks used for training to reconstruct the word text of linguistics. The network is represented by words, and the input words in adjacent positions need to be guessed. Under the assumption of the bag of words model in word2vec, the order of words is not important. After the training is completed, the word2vec model can be used to map each word to a vector, which can be used to represent the relationship between words. The vector is the hidden layer of the neural network. Word2Ve includes two algorithms, namely skip-gram and continuous bag of words model (Continuous Bag of Words Model, CBOW) algorithm. The biggest difference between them is that skip-gram predicts the words around the central word through the central word, while CBOW predicts the central word through the surrounding words. In the embodiment of the present invention, the skip-gram algorithm can be used to perform model training using the first training set.

在自然语言处理领域，我们研究的对象通常都是文本内容，而不同的文本通常具有不同的格式或包括不同的内容，这里，需要对测试语料(即第一训练集)进行预处理，比如，将第一训练集中的非中文字符去除掉，保留对POI 领域有用的汉字、英文字母和数字。In the field of natural language processing, the objects we study are usually text content, and different texts usually have different formats or include different contents. Here, the test corpus (i.e., the first training set) needs to be preprocessed. For example, non-Chinese characters in the first training set are removed, and Chinese characters, English letters, and numbers that are useful to the POI field are retained.

这里，还需要对第一训练集中的文本进行分词处理，比如，可使用分词器 (比如中文分词系统)进行分词处理。Here, the text in the first training set also needs to be segmented. For example, a word segmenter (such as a Chinese word segmentation system) can be used for word segmentation.

具体的，作为一可选实施例，Word2Vec深度学习模型的参数可以设置如下：Specifically, as an optional embodiment, the parameters of the Word2Vec deep learning model can be set as follows:

(1)默认sg＝0对应CBOW算法，sg＝1对应skip-gram算法，由于skip-gram 算法对低频词较敏感，所以本发明实施例可以将sg设置为1。(1) By default, sg=0 corresponds to the CBOW algorithm, and sg=1 corresponds to the skip-gram algorithm. Since the skip-gram algorithm is more sensitive to low-frequency words, sg can be set to 1 in the embodiment of the present invention.

(2)min_count是对词进行过滤，频率小于min-count的单词则会被忽视， min_count的默认值为5。(2)min_count is used to filter words. Words with a frequency less than min-count will be ignored. The default value of min_count is 5.

(3)size是输出词向量的维数，即神经网络的隐藏层的单元数。由于Size 的值太小会导致词映射因冲突而影响结果，Size的值太大则会比较耗内存且使算法计算速度变慢，另外，Size的值较大时会需要更多的训练数据,但训练效果会更好，这里，size值可设置为300维度。(3) size is the dimension of the output word vector, that is, the number of units in the hidden layer of the neural network. If the value of size is too small, the word mapping will affect the result due to conflicts. If the value of size is too large, it will consume more memory and slow down the algorithm calculation speed. In addition, when the value of size is large, more training data will be required, but the training effect will be better. Here, the size value can be set to 300 dimensions.

(4)window是句子中当前词与目标词之间的最大距离，即为窗口。可选的，窗口移动的大小可设置为5。(4) window is the maximum distance between the current word and the target word in the sentence, that is, the window. Optionally, the size of the window movement can be set to 5.

(5)negative和sample可根据训练结果进行微调，其中，negative为负采样的个数，负采样的核心思想是将多分类问题转化为二分类问题，即判断是正样本还是负样本，默认值为5；sample表示更高频率的词被随机下采样到所设置的阈值，默认值为1e-3。(5) negative and sample can be fine-tuned according to the training results, where negative is the number of negative samples. The core idea of negative sampling is to transform the multi-classification problem into a binary classification problem, that is, to determine whether it is a positive sample or a negative sample. The default value is 5; sample means that higher frequency words are randomly downsampled to the set threshold. The default value is 1e-3.

(6)hs＝1表示softmax层级将会被使用，默认hs＝0且negative不为0时，负采样将会被选择使用。(6) hs = 1 means that the softmax layer will be used. By default, hs = 0 and negative is not 0, negative sampling will be selected.

使用第一训练集对Word2Vec深度学习模型进行训练，获得训练结果(即目标词向量集合)可以用文本形式保存，如表1所示。The Word2Vec deep learning model is trained using the first training set, and the training results (i.e., the target word vector set) can be saved in text form, as shown in Table 1.

表1：Table 1:

作为一可选实施例，通过余弦相似度算法计算第一候选实体词和第二候选实体词之间的第二相似度时，可对第一候选实体词和第二候选实体词分词之后，按照Word2Vec向量模型构造出词向量(也即第一候选实体词在目标词向量集合中对应的第一词向量和第二候选实体词在目标词向量集合中对应的第二词向量)，然后，通过余弦相似度算法，计算第一候选实体词和第二候选实体词之间的第二相似度，计算公式表示如下：As an optional embodiment, when calculating the second similarity between the first candidate entity word and the second candidate entity word by the cosine similarity algorithm, the first candidate entity word and the second candidate entity word can be segmented, and then a word vector (that is, a first word vector corresponding to the first candidate entity word in the target word vector set and a second word vector corresponding to the second candidate entity word in the target word vector set) can be constructed according to the Word2Vec vector model. Then, the second similarity between the first candidate entity word and the second candidate entity word is calculated by the cosine similarity algorithm. The calculation formula is expressed as follows:

其中，SIM_word2vec表示第二相似度；X_i表示第一候选实体词(用X表示) 分词之后的第i个词(X共分为了n个词)；Yj表示第二候选实体词(用Y表示)分词之后的第j个词(Y共分为了m个词)。其中，m和n均为大于0的整数。Among them, SIM _word2vec represents the second similarity; _Xi represents the i-th word after the first candidate entity word (represented by X) is segmented (X is divided into n words in total); Yj represents the j-th word after the second candidate entity word (represented by Y) is segmented (Y is divided into m words in total). Among them, m and n are both integers greater than 0.

该实施例中，利用Word2Vec深度学习模型根据POI领域的测试语料训练出词向量模型，再通过余弦相似度算法计算得到第一候选实体词和第二候选实体词之间的第二相似度。In this embodiment, a word vector model is trained using a Word2Vec deep learning model based on a test corpus in the POI field, and then a second similarity between the first candidate entity word and the second candidate entity word is calculated using a cosine similarity algorithm.

这里，对第一相似度和第二相似度进行加权求和，可以获得同义词识别模型的表达式：score＝θ₁*SIM_dict+θ₂*SIM_word2vec；Here, by performing weighted summation on the first similarity and the second similarity, the expression of the synonym recognition model can be obtained: score = θ ₁ *SIM _dict + θ ₂ *SIM _word2vec ;

其中，score表示第一候选实体词和第二候选实体词之间的相似度； SIM_dict表示第一相似度；SIM_word2vec表示第二相似度；θ₁表示第一相似度的权重；θ₂表示第二相似度的权重。Among them, score represents the similarity between the first candidate entity word and the second candidate entity word; SIM _dict represents the first similarity; SIM _word2vec represents the second similarity; θ ₁ represents the weight of the first similarity; θ ₂ represents the weight of the second similarity.

需要说明的是，可以通过构建同义词识别模型的对应的逻辑回归模型，来确定同义词识别模型的表达式中的参数，即第一相似度和所述第二相似度分别对应的权重(θ₁、θ₂)。It should be noted that the parameters in the expression of the synonym recognition model, ie, the weights (θ ₁ , θ ₂ ) corresponding to the first similarity and the second similarity, respectively, can be determined by constructing a corresponding logistic regression model of the synonym recognition model.

其中，逻辑回归模型的表达式为：The expression of the logistic regression model is:

其中，X为特征输入量，这里，X表示2维向量(X1,X2)，X1表示第一相似度(即SIM_dict)，X2表示第二相似度(即SIM_word2vec)；θ表示回归系数向量θ₁表示第一相似度的权重；θ₂表示第二相似度的权重。Where X is the feature input, where X represents a 2D vector (X1, X2), X1 represents the first similarity (i.e., SIM _dict ), and X2 represents the second similarity (i.e., SIM _word2vec ); θ represents the regression coefficient vector θ ₁ represents the weight of the first similarity; θ ₂ represents the weight of the second similarity.

需要说明的是，h_θ(X)的值在[0,1]之间。例如，可以规定：当h_θ(X)>0.5 时，认为h_θ(X)＝1，即第一候选实体词和第二候选实体词为同义词；h_θ(X)<0.5 时，认为h_θ(X)＝0，即第一候选实体词和第二候选实体词不是同义词。It should be noted that the value of h _θ (X) is between [0,1]. For example, it can be stipulated that: when h _θ (X)>0.5, it is considered that h _θ (X)＝1, that is, the first candidate entity word and the second candidate entity word are synonyms; when h _θ (X)<0.5, it is considered that h _θ (X)＝0, that is, the first candidate entity word and the second candidate entity word are not synonyms.

这里，通过人工识别出数量相同的同义词和非同义词作为第二训练集，根据训练结果和h_θ(X)的值不断调整θ₁和θ₂的值，直到逻辑回归模型收敛，最终求SIM_dict的权重(即θ₁)和SIM_word2vec的权重(θ₂)。Here, the same number of synonyms and non-synonyms are manually identified as the second training set, and the values of θ ₁ and θ ₂ are continuously adjusted according to the training results and the value of h _θ (X) until the logistic regression model converges, and finally the weight of SIM _dict (ie θ ₁ ) and the weight of SIM _word2vec (θ ₂ ) are calculated.

该实施例中，可以通过逻辑回归模型动态赋予第一相似度和第二相似度的权重，最后由第一相似度和第二相似度加权求和得出最终的相似度的值，该值越高，说明第一候选实体词和第二候选实体词的相似性越高。这样，由于同义词识别模型中结合了两种相似度算法(基于所述同义词词林相似度计算法、基于所述机器学习的相似度计算法)进行了加权求和，得到最终的相似度，因此，可以综合两种相似度算法的优点，最终计算得到的相似度更为准确，对同义词的识别效果更好。In this embodiment, the weights of the first similarity and the second similarity can be dynamically assigned through the logistic regression model, and finally the final similarity value is obtained by weighted summing the first similarity and the second similarity. The higher the value, the higher the similarity between the first candidate entity word and the second candidate entity word. In this way, since the synonym recognition model combines two similarity algorithms (based on the synonym word forest similarity calculation method and the similarity calculation method based on the machine learning) and performs weighted summation to obtain the final similarity, the advantages of the two similarity algorithms can be combined, and the similarity finally calculated is more accurate, and the recognition effect of synonyms is better.

需要说明的是，本发明实施例提供了适用于POI领域的同义词识别方法，识别过程中结合了POI领域的特点，通过实体词识别算法进行模型训练，得出POI类别的地点名词的别名、缩略词等作为候选实体词，形成目标测试集，再在各个POI类别下计算候选实体词的匹配度(相似度)，从而选出分值高的候选实体词作为同义词。该同义词识别方法不受限于词典的更新和词汇的完整性，也不受字面含义影响；可以识别出异形同义词、缩略词等同义词，提升异形同义词识别效率；能准确反映客观事实，而不受人的主观影响。使得识别结果更加精准。It should be noted that the embodiment of the present invention provides a synonym recognition method applicable to the POI field. The characteristics of the POI field are combined in the recognition process. The model is trained through the entity word recognition algorithm to obtain the aliases, abbreviations, etc. of the place nouns of the POI category as candidate entity words to form a target test set, and then the matching degree (similarity) of the candidate entity words is calculated under each POI category, so as to select the candidate entity words with high scores as synonyms. The synonym recognition method is not limited by the update of the dictionary and the integrity of the vocabulary, nor is it affected by the literal meaning; it can identify synonyms such as variant synonyms and abbreviations, and improve the recognition efficiency of variant synonyms; it can accurately reflect objective facts without being affected by human subjectivity. Make the recognition result more accurate.

该实施例的同义词识别方法，基于CRF算法，可以获得包括与POI关联的至少一个候选实体词的目标测试集，进一步可以利用同义词识别模型，识别目标测试集中的各个候选实体词之间是否为同义词，从而得到POI领域的各种同义词，可以提升POI搜索效率。这里，同义词识别模型是基于至少两种目标相似度算法构建的，因此可以结合两种目标相似度算法的优势，降低了人为的主观因素影响，提升了异形同义词识别效率，使得识别结果更加精准，且识别过程简单，运行速度快。The synonym recognition method of this embodiment can obtain a target test set including at least one candidate entity word associated with a POI based on the CRF algorithm, and can further use the synonym recognition model to identify whether each candidate entity word in the target test set is a synonym, thereby obtaining various synonyms in the POI field, which can improve the POI search efficiency. Here, the synonym recognition model is constructed based on at least two target similarity algorithms, so it can combine the advantages of the two target similarity algorithms, reduce the influence of human subjective factors, improve the recognition efficiency of variant synonyms, make the recognition result more accurate, and the recognition process is simple and the running speed is fast.

如图5所示，本发明实施例的一种同义词识别装置，包括：As shown in FIG5 , a synonym recognition device according to an embodiment of the present invention includes:

第一处理模块510，用于基于条件随机场CRF算法，对第一测试集进行实体词识别，得到目标测试集，所述第一测试集为包含兴趣点POI的文本数据，所述目标测试集包括至少一个候选实体词，所述候选实体词与POI关联；A first processing module 510 is configured to perform entity word recognition on a first test set based on a conditional random field (CRF) algorithm to obtain a target test set, wherein the first test set is text data containing a point of interest (POI), and the target test set includes at least one candidate entity word, and the candidate entity word is associated with the POI;

第二处理模块520，用于利用同义词识别模型，识别所述目标测试集中的各个所述候选实体词之间是否为同义词，得到识别结果，其中，所述同义词识别模型是基于至少两种目标相似度算法构建的。The second processing module 520 is used to use a synonym recognition model to identify whether each of the candidate entity words in the target test set is a synonym, and obtain a recognition result, wherein the synonym recognition model is constructed based on at least two target similarity algorithms.

可选地，所述第一处理模块510包括：Optionally, the first processing module 510 includes:

第二处理子模块，用于使用第一训练集对所述第一识别模型进行训练，得到实体词识别模型，所述第一训练集为包含N类POI的文本数据；A second processing submodule is used to train the first recognition model using a first training set to obtain an entity word recognition model, wherein the first training set is text data containing N types of POIs;

可选地，所述目标相似度算法包括：同义词林算法和机器学习算法，所述第二处理模块520包括：Optionally, the target similarity algorithm includes: a synonym forest algorithm and a machine learning algorithm, and the second processing module 520 includes:

第九处理单元，用于采用Word2Vec深度学习模型，使用第一训练集进行模型训练，获得目标词向量集合，所述第一训练集为包含N类POI的文本数据；A ninth processing unit is used to adopt a Word2Vec deep learning model and use a first training set to perform model training to obtain a target word vector set, where the first training set is text data containing N types of POIs;

在此需要说明的是，本发明实施例提供的上述同义词识别装置，能够实现上述的同义词识别方法实施例所实现的所有方法步骤，且能够达到相同的技术效果，在此不再对本实施例中与方法实施例相同的部分及有益效果进行具体赘述。It should be noted here that the above-mentioned synonym recognition device provided by the embodiment of the present invention can implement all the method steps implemented by the above-mentioned synonym recognition method embodiment, and can achieve the same technical effect. The parts and beneficial effects that are the same as the method embodiment in this embodiment will not be specifically repeated here.

如图6所示，本发明实施例的一种网络设备600，包括处理器610和收发机620，其中，所述处理器用于：As shown in FIG6 , a network device 600 according to an embodiment of the present invention includes a processor 610 and a transceiver 620, wherein the processor is configured to:

使用第一训练集对所述第一识别模型进行训练，得到实体词识别模型，所述第一训练集为包含N类POI的文本数据；Using a first training set to train the first recognition model to obtain an entity word recognition model, wherein the first training set is text data containing N types of POIs;

在此需要说明的是，本发明实施例提供的上述网络设备，能够实现上述的同义词识别方法实施例所实现的所有方法步骤，且能够达到相同的技术效果，在此不再对本实施例中与方法实施例相同的部分及有益效果进行具体赘述。It should be noted here that the above-mentioned network device provided in the embodiment of the present invention can implement all the method steps implemented in the above-mentioned synonym recognition method embodiment, and can achieve the same technical effect. The parts and beneficial effects that are the same as the method embodiment in this embodiment will not be described in detail here.

本发明另一实施例的网络设备，如图7所示，包括收发器710、处理器700、存储器720及存储在所述存储器720上并可在所述处理器700上运行的程序或指令；所述处理器700执行所述程序或指令时实现上述的同义词识别方法。A network device according to another embodiment of the present invention, as shown in FIG7 , includes a transceiver 710, a processor 700, a memory 720, and a program or instruction stored in the memory 720 and executable on the processor 700; the processor 700 implements the above-mentioned synonym recognition method when executing the program or instruction.

所述收发器710，用于在处理器700的控制下接收和发送数据。The transceiver 710 is used to receive and send data under the control of the processor 700.

其中，在图7中，总线架构可以包括任意数量的互联的总线和桥，具体由处理器700代表的一个或多个处理器和存储器720代表的存储器的各种电路链接在一起。总线架构还可以将诸如外围设备、稳压器和功率管理电路等之类的各种其他电路链接在一起，这些都是本领域所公知的，因此，本文不再对其进行进一步描述。总线接口提供接口。收发器710可以是多个元件，即包括发送机和接收机，提供用于在传输介质上与各种其他装置通信的单元。处理器700 负责管理总线架构和通常的处理，存储器720可以存储处理器700在执行操作时所使用的数据。Among them, in Figure 7, the bus architecture can include any number of interconnected buses and bridges, specifically one or more processors represented by processor 700 and various circuits of memory represented by memory 720 are linked together. The bus architecture can also link various other circuits such as peripherals, regulators, and power management circuits together, which are all well known in the art, so they are not further described herein. The bus interface provides an interface. The transceiver 710 can be a plurality of components, that is, including a transmitter and a receiver, providing a unit for communicating with various other devices on a transmission medium. The processor 700 is responsible for managing the bus architecture and general processing, and the memory 720 can store data used by the processor 700 when performing operations.

本发明实施例的一种可读存储介质，其上存储有程序或指令，所述程序或指令被处理器执行时实现如上所述的同义词识别方法中的步骤，且能达到相同的技术效果，为避免重复，这里不再赘述。其中，所述的计算机可读存储介质，如只读存储器(Read-OnlyMemory，简称ROM)、随机存取存储器(Random Access Memory，简称RAM)、磁碟或者光盘等。A computer-readable storage medium according to an embodiment of the present invention stores a program or instruction thereon, and when the program or instruction is executed by a processor, the steps in the synonym recognition method described above are implemented, and the same technical effect can be achieved. To avoid repetition, it is not described here. The computer-readable storage medium is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.

进一步需要说明的是，此说明书中所描述的终端包括但不限于智能手机、平板电脑等，且所描述的许多功能部件都被称为模块，以便更加特别地强调其实现方式的独立性。It should be further explained that the terminals described in this specification include but are not limited to smart phones, tablet computers, etc., and many of the functional components described are called modules in order to more particularly emphasize the independence of their implementation methods.

本发明实施例中，模块可以用软件实现，以便由各种类型的处理器执行。举例来说，一个标识的可执行代码模块可以包括计算机指令的一个或多个物理或者逻辑块，举例来说，其可以被构建为对象、过程或函数。尽管如此，所标识模块的可执行代码无需物理地位于一起，而是可以包括存储在不同位里上的不同的指令，当这些指令逻辑上结合在一起时，其构成模块并且实现该模块的规定目的。In the embodiment of the present invention, the module can be implemented with software so that it can be executed by various types of processors. For example, an executable code module of an identification can include one or more physical or logical blocks of computer instructions, for example, it can be constructed as an object, process or function. Nevertheless, the executable code of the identified module does not need to be physically located together, but can include different instructions stored in different positions, and when these instructions are logically combined together, it constitutes a module and realizes the specified purpose of the module.

实际上，可执行代码模块可以是单条指令或者是许多条指令，并且甚至可以分布在多个不同的代码段上，分布在不同程序当中，以及跨越多个存储器设备分布。同样地，操作数据可以在模块内被识别，并且可以依照任何适当的形式实现并且被组织在任何适当类型的数据结构内。所述操作数据可以作为单个数据集被收集，或者可以分布在不同位置上(包括在不同存储设备上)，并且至少部分地可以仅作为电子信号存在于系统或网络上。In fact, executable code module can be a single instruction or many instructions, and can even be distributed on a plurality of different code segments, distributed among different programs, and distributed across a plurality of memory devices. Similarly, operating data can be identified in the module, and can be implemented and organized in the data structure of any appropriate type according to any appropriate form. The operating data can be collected as a single data set, or can be distributed in different locations (including on different storage devices), and can only be present on a system or network as an electronic signal at least in part.

在模块可以利用软件实现时，考虑到现有硬件工艺的水平，所以可以以软件实现的模块，在不考虑成本的情况下，本领域技术人员都可以搭建对应的硬件电路来实现对应的功能，所述硬件电路包括常规的超大规模集成(VLSI) 电路或者门阵列以及诸如逻辑芯片、晶体管之类的现有半导体或者是其它分立的元件。模块还可以用可编程硬件设备，诸如现场可编程门阵列、可编程阵列逻辑、可编程逻辑设备等实现。When a module can be implemented by software, considering the level of existing hardware technology, a person skilled in the art can build a corresponding hardware circuit to implement the corresponding function of the module that can be implemented by software without considering the cost. The hardware circuit includes a conventional very large scale integration (VLSI) circuit or gate array and existing semiconductors such as logic chips, transistors, or other discrete components. The module can also be implemented by a programmable hardware device, such as a field programmable gate array, a programmable array logic, a programmable logic device, etc.

上述范例性实施例是参考该些附图来描述的，许多不同的形式和实施例是可行而不偏离本发明精神及教示，因此，本发明不应被建构成为在此所提出范例性实施例的限制。更确切地说，这些范例性实施例被提供以使得本发明会是完善又完整，且会将本发明范围传达给那些熟知此项技术的人士。在该些图式中，组件尺寸及相对尺寸也许基于清晰起见而被夸大。在此所使用的术语只是基于描述特定范例性实施例目的，并无意成为限制用。如在此所使用地，除非该内文清楚地另有所指，否则该单数形式“一”、“一个”和“该”是意欲将该些多个形式也纳入。会进一步了解到该些术语“包含”及/或“包括”在使用于本说明书时，表示所述特征、整数、步骤、操作、构件及/或组件的存在，但不排除一或更多其它特征、整数、步骤、操作、构件、组件及/或其族群的存在或增加。除非另有所示，陈述时，一值范围包含该范围的上下限及其间的任何子范围。The above exemplary embodiments are described with reference to the accompanying drawings, and many different forms and embodiments are feasible without departing from the spirit and teachings of the present invention. Therefore, the present invention should not be constructed as a limitation of the exemplary embodiments proposed herein. More specifically, these exemplary embodiments are provided so that the present invention will be perfect and complete, and the scope of the present invention will be conveyed to those who are familiar with the technology. In these figures, the component sizes and relative sizes may be exaggerated for clarity. The terms used here are only based on the purpose of describing specific exemplary embodiments and are not intended to be limiting. As used herein, unless the context clearly indicates otherwise, the singular forms "one", "one" and "the" are intended to include these multiple forms. It will be further understood that the terms "including" and/or "comprising" when used in this specification indicate the presence of the features, integers, steps, operations, components and/or components, but do not exclude the presence or increase of one or more other features, integers, steps, operations, components, components and/or their groups. Unless otherwise indicated, when stated, a value range includes the upper and lower limits of the range and any sub-ranges therebetween.

以上所述是本发明的优选实施方式，应当指出，对于本技术领域的普通技术人员来说，在不脱离本发明所述原理的前提下，还可以作出若干改进和润饰，这些改进和润饰也应视为本发明的保护范围。The above is a preferred embodiment of the present invention. It should be pointed out that for ordinary technicians in this technical field, several improvements and modifications can be made without departing from the principles of the present invention. These improvements and modifications should also be regarded as the scope of protection of the present invention.

Claims

1. A method of synonym identification, comprising:

performing entity word recognition on a first test set based on a Conditional Random Field (CRF) algorithm to obtain a target test set, wherein the first test set is text data containing POI (point of interest), and the target test set comprises at least one candidate entity word which is associated with the POI;

and identifying whether each candidate entity word in the target test set is a synonym or not by using a synonym identification model to obtain an identification result, wherein the synonym identification model is constructed based on at least two target similarity algorithms.

2. The method of claim 1, wherein the performing entity word recognition on the first test set based on the conditional random field CRF algorithm to obtain the target test set comprises:

determining a first recognition model based on the CRF algorithm;

Training the first recognition model by using a first training set to obtain an entity word recognition model;

and inputting the first test set into the entity word recognition model to obtain the target test set.

3. The method of claim 2, wherein training the first recognition model using the first training set to obtain the entity word recognition model comprises:

performing first preprocessing on the first training set to obtain at least one first character string, wherein the first preprocessing comprises the following steps: word segmentation and/or part-of-speech tagging;

labeling the at least one first character string by using a target labeling method to obtain a labeling result;

and training the first recognition model by using the labeling result to obtain the entity word recognition model.

4. The method of claim 3, wherein said inputting the first test set into the entity word recognition model results in the target test set comprising:

performing first preprocessing on the first test set to obtain at least one character string to be marked;

and inputting the at least one character string to be marked into the entity word recognition model to obtain the target test set.

5. The method of claim 1, wherein the target similarity algorithm comprises: the synonym forest algorithm and the machine learning algorithm, the identifying whether the candidate entity words in the target test set are synonyms or not by using a synonym identification model, and obtaining an identification result, comprises the following steps:

obtaining a first similarity between a first candidate entity word and a second candidate entity word in the target test set based on the synonym Lin Suanfa;

obtaining a second similarity between the first candidate entity word and the second candidate entity word based on the machine learning algorithm;

obtaining the synonym recognition model according to the first similarity and the second similarity;

and identifying whether the candidate entity words in the target test set are synonyms or not by using the synonym identification model, and obtaining an identification result.

6. The method of claim 5, wherein the obtaining a first similarity between a first candidate entity word and a second candidate entity word in the target test set based on the synonym Lin Suanfa comprises:

respectively performing second preprocessing on the first candidate entity word and the second candidate entity word to obtain a first word list corresponding to the first candidate entity word and a second word list corresponding to the second candidate entity word, wherein the second preprocessing comprises the following steps: word segmentation and deletion of non-Chinese characters;

Based on the synonym word forest algorithm, sequentially calculating the similarity between each word in a first word list and a second word list to obtain a first similarity list corresponding to the first candidate entity word, and sequentially calculating the similarity between each word in the second word list and the first word list to obtain a second similarity list corresponding to the second candidate entity word;

and obtaining the first similarity between the first candidate entity word and the second candidate entity word according to the first similarity list and the second similarity list.

7. The method of claim 5, wherein the obtaining a second similarity between the first candidate entity word and the second candidate entity word based on the machine learning algorithm comprises:

adopting a Word2Vec deep learning model, and carrying out model training by using a first training set to obtain a target Word vector set;

based on a cosine similarity algorithm, according to a first word vector corresponding to the first candidate entity word in the target word vector set and a second word vector corresponding to the second candidate entity word in the target word vector set, obtaining similarity between the first word vector and the second word vector, and taking the similarity between the first word vector and the second word vector as second similarity between the first candidate entity word and the second candidate entity word.

8. The method of claim 5, wherein the obtaining the synonym recognition model from the first degree of similarity and the second degree of similarity comprises:

training a logistic regression model corresponding to the synonym recognition model by using a second training set until the logistic regression model converges, and determining weights respectively corresponding to the first similarity and the second similarity according to regression coefficient vectors when the logistic regression model converges, wherein the second training set comprises at least one group of synonyms and at least one group of non-synonyms;

and constructing the synonym recognition model according to the weight of the first similarity and the weight of the second similarity.

9. A synonym identification device, comprising:

the first processing module is used for carrying out entity word recognition on a first test set based on a Conditional Random Field (CRF) algorithm to obtain a target test set, wherein the first test set is text data containing POI (point of interest), and the target test set comprises at least one candidate entity word which is associated with the POI;

and the second processing module is used for identifying whether the candidate entity words in the target test set are synonyms or not by using a synonym identification model to obtain an identification result, wherein the synonym identification model is constructed based on at least two target similarity algorithms.

10. A network device, comprising: a transceiver and a processor; the processor is configured to:

11. A network device, comprising: a transceiver, a processor, a memory, and a program or instructions stored on the memory and executable on the processor; the synonym identification method as claimed in any one of claims 1-8, wherein said program or instructions, when executed by said processor, implement said synonym identification method.

12. A readable storage medium having stored thereon a program or instructions, which when executed by a processor, performs the steps in the synonym identification method as claimed in any one of claims 1 to 8.