CN111897924A - A Text Retrieval Method Based on Fusion and Extension of Association Rules and Word Vectors - Google Patents
- Publication number
- CN111897924A (application CN202010774138.2A)
- Authority
- CN
- China
- Prior art keywords
- word
- ret
- candidate
- formula
- pseudo
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS; G06—COMPUTING OR CALCULATING; COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval of unstructured textual data
- G06F16/33—Querying
- G06F16/3325—Reformulation based on results of preceding query
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
- G06F16/3338—Query expansion
- G06F16/334—Query execution
Abstract
The invention proposes a text retrieval method based on the fusion and extension of association rules and word vectors. The documents obtained by retrieving the original Chinese document collection with the user query are used to construct an initial retrieval document set, on which a deep learning tool performs word-vector semantic learning training to obtain a feature word vector set. The top m documents of the initial retrieval set are then extracted as a pseudo-relevance-feedback document set, from which candidate expansion terms are mined using Copulas-function-based support and confidence to build a candidate expansion term set. Finally, the vector cosine similarity between each candidate expansion term and the original query is computed to extract the final expansion terms; the final expansion terms are combined with the original query into a new query, and the original document collection is retrieved again to obtain the final result. Experimental results show that the retrieval performance of the method is better than that of existing methods: it effectively reduces query topic drift and term mismatch, improves information retrieval performance, and has good application value and promotion prospects.
Description
Technical Field
The invention relates to a text retrieval method based on the fusion and extension of association rules and word vectors, and belongs to the technical field of information retrieval.
Background Art
With the development of network technology and the rapid growth of digital resources, how network users can quickly and accurately find the information resources they need, reducing query topic drift and term mismatch so as to meet their information needs, is an important problem urgently awaiting solution in the field of information retrieval. Query expansion can address it: the weights of the original query are adjusted, or additional feature terms semantically related to the original query are added, compensating for the lack of semantic information caused by an overly simple query and thereby improving retrieval performance. Over the past decade or so, scholars have studied query-expansion-based information retrieval from different perspectives and produced a number of effective methods, for example the information retrieval method based on query expansion and classification of Yue Wen et al. (Yue Wen, Chen Zhiping, Lin Yaping. Information Retrieval Algorithm Based on Query Expansion and Classification [J]. Journal of System Simulation, 2006, 18(7): 1926-1929, 1934.) and the pseudo-relevance-feedback expansion method of Vaidyanathan et al. (Vaidyanathan R, Das S, Srivastava N. Query Expansion Strategy based on Pseudo Relevance Feedback and Term Weight Scheme for Monolingual Retrieval [J]. International Journal of Computer Applications, 2015, 105(8): 1-6.). These methods have experimentally verified their effectiveness, but they have not yet fully solved the technical problems of query topic drift and term mismatch in information retrieval.
To solve the technical problems of query topic drift and term mismatch in current information retrieval systems and to improve retrieval performance, the present invention introduces the Copulas function (Sklar A. Fonctions de répartition à n dimensions et leurs marges [J]. Publications de l'Institut de Statistique de l'Université de Paris, 1959, 8(1): 229-231.) into the field of information retrieval, fuses association pattern mining with word-vector semantic learning to realize query expansion, and proposes a text retrieval method based on the fusion and extension of association rules and word vectors. Experimental results show that the method improves information retrieval performance and has good application value and promotion prospects.
Summary of the Invention
The purpose of the present invention is to propose a text retrieval method based on the fusion and extension of association rules and word vectors. Applied in the field of information retrieval, for example in web information retrieval systems and search engines, the method can reduce query topic drift and term mismatch in information retrieval, improving and enhancing the query performance of information retrieval systems.
The specific technical scheme adopted by the present invention is as follows:
A text retrieval method based on the fusion and extension of association rules and word vectors, comprising the following steps:
Step 1. Retrieve the original Chinese document collection with the Chinese user query to obtain the initially retrieved documents, and construct the initial retrieval document set.
Step 2. Use a deep learning tool to perform word-vector semantic learning training on the initial retrieval document set to obtain the feature word vector set.
The deep learning tool of the present invention is the Skip-gram model of Google's open-source word-vector tool word2vec (see https://code.google.com/p/word2vec/ for details).
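For reference, the Skip-gram training objective (standard word2vec background, not spelled out in the patent itself) maximizes the average log-probability of the context words within a window of radius c:

```latex
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-c\le j\le c\\ j\ne 0}}\log p(w_{t+j}\mid w_t),
\qquad
p(w_O\mid w_I)=\frac{\exp\!\left({v'_{w_O}}^{\top}v_{w_I}\right)}{\sum_{w=1}^{V}\exp\!\left({v'_{w}}^{\top}v_{w_I}\right)}
```

where v and v' are the input and output vector representations of a word and V is the vocabulary size.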
Step 3. Extract the top m documents from the initial retrieval document set as pseudo-relevance-feedback documents and construct the pseudo-relevance-feedback document set; preprocess this set by Chinese word segmentation, removal of Chinese stop words and extraction of feature terms; compute the feature term weights; and finally build the pseudo-relevance-feedback Chinese document collection and the Chinese feature term vocabulary.
The present invention adopts the TF-IDF (term frequency-inverse document frequency) weighting technique (see Ricardo Baeza-Yates, Berthier Ribeiro-Neto et al., translated by Wang Zhijin et al., Modern Information Retrieval, China Machine Press, 2005: 21-22.) to compute the feature term weights.
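As an illustration of this weighting step, the following is a minimal sketch of the classic tf-idf scheme w = tf · log(N/df); the exact tf-idf variant and normalization used in the patent are not specified, so this formulation is an assumption:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists (segmented documents).
    Returns one {term: tf-idf weight} dict per document, using the
    classic w = tf * log(N / df) scheme (an assumed variant)."""
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["查询", "扩展", "检索"], ["查询", "词向量"], ["检索", "词向量", "检索"]]
w = tfidf_weights(docs)
# "扩展" appears in 1 of 3 docs -> idf = log 3; "查询" in 2 of 3 -> idf = log 1.5
```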
Step 4. Mine candidate expansion terms from the pseudo-relevance-feedback document set using Copulas-function-based support and confidence, and build the candidate expansion term set, as follows:
(4.1) Generate the 1-candidate itemsets C1: extract feature terms from the Chinese feature term vocabulary as the 1-candidate itemsets C1.
(4.2) Generate the 1-frequent itemsets L1: compute the Copulas-based support Cop_Sup(C1) of each C1, and extract every C1 whose Cop_Sup(C1) is not lower than the minimum support threshold ms as a 1-frequent itemset L1, adding it to the frequent itemset collection FIS (Frequent ItemSet).
Cop_Sup (Copulas-based Support) denotes the support based on the Copulas function. Cop_Sup(C1) of a 1-candidate itemset C1 is computed as shown in formula (1).
In formula (1), num_C1 denotes the frequency with which the 1-candidate itemset C1 occurs in the pseudo-relevance-feedback Chinese document collection, DocNum denotes the total number of documents in that collection, weight_C1 denotes the itemset weight of C1 in that collection, and ItemsWeight denotes the accumulated weight of all Chinese feature terms in the collection.
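The body of formula (1) is an image in the original publication and is not reproduced in this text. From the quantities it names, it couples the frequency-based marginal support u = num_C1/DocNum with the weight-based marginal support v = weight_C1/ItemsWeight through a Copulas function C(u, v); the following is only an illustrative sketch under that assumption, not the patent's exact formula:

```latex
\mathrm{Cop\_Sup}(C_1) \;=\; C(u, v),
\qquad u = \frac{num_{C_1}}{DocNum},
\qquad v = \frac{weight_{C_1}}{ItemsWeight},
```

where C could be, in the simplest case, the product copula C(u, v) = u·v.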
(4.3) Generate the k-candidate itemsets Ck: self-join the (k-1)-frequent itemsets Lk-1 to generate the k-candidate itemsets Ck, where k ≥ 2.
The self-join adopts the candidate itemset join method given in the Apriori algorithm (Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., USA, 1993: 207-216.).
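The Apriori candidate join named above can be sketched as follows (a standard technique; the sorted-tuple representation of itemsets is illustrative):

```python
from itertools import combinations

def apriori_join(frequent_prev):
    """Join (k-1)-frequent itemsets, given as sorted tuples, into
    k-candidates: two itemsets join when they agree on their first k-2
    items; a candidate is kept only if all of its (k-1)-subsets are
    frequent (the Apriori pruning property)."""
    prev = sorted(frequent_prev)
    prev_set = set(prev)
    k = len(prev[0]) + 1
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] == b[:-1]:            # first k-2 items agree
                cand = tuple(sorted(set(a) | set(b)))
                if all(s in prev_set for s in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

L2 = [("a", "b"), ("a", "c"), ("b", "c")]
print(apriori_join(L2))  # {('a', 'b', 'c')}
```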
(4.4) Prune the 2-candidate itemsets C2: when a 2-candidate itemset C2 is mined, delete it if it does not contain an original query term, and keep it if it does; the kept C2 then passes to step (4.5). When a k-candidate itemset Ck with k ≥ 3 is mined, Ck passes directly to step (4.5).
(4.5) Generate the k-frequent itemsets Lk: compute Cop_Sup(Ck), and extract every Ck whose Cop_Sup(Ck) is not lower than ms as a k-frequent itemset Lk, adding it to FIS.
Cop_Sup(Ck) is computed as shown in formula (2).
In formula (2), num_Ck denotes the frequency with which the k-candidate itemset Ck occurs in the pseudo-relevance-feedback Chinese document collection, and weight_Ck denotes the itemset weight of Ck in that collection; DocNum and ItemsWeight are defined as in formula (1).
(4.6) Increment k by 1 and return to step (4.3), executing the subsequent steps in order, until the generated Lk is an empty set; frequent itemset mining then ends and the procedure moves to step (4.7).
(4.7) Take any Lk from FIS, where k ≥ 2.
(4.8) Extract association rules Qi → Retj from Lk and compute the Copulas-based confidence Cop_Con(Qi → Retj) of each rule, where i ≥ 1, j ≥ 1 and Qi ∪ Retj = Lk; Retj is a proper-subset itemset containing no query terms, Qi is a proper-subset itemset containing query terms, and Q is the original query term set.
Cop_Con (Copulas-based Confidence) denotes the confidence based on the Copulas function; Cop_Con(Qi → Retj) is computed as shown in formula (3).
In formula (3), num_Lk denotes the frequency with which the k-frequent itemset Lk occurs in the pseudo-relevance-feedback Chinese document collection, weight_Lk denotes the itemset weight of Lk in that collection, num_Qi denotes the frequency with which the proper-subset itemset Qi of Lk occurs in the collection, and weight_Qi denotes the itemset weight of Qi in the collection; exp denotes the exponential function with the natural constant e as base.
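The body of formula (3) is likewise an image in the original publication and is not reproduced here. Since the surrounding text names num_Lk, weight_Lk, num_Qi, weight_Qi and an exponential with base e, one illustrative reconstruction, offered purely as an assumption, applies a Gumbel-Hougaard-type copula to the two conditional ratios:

```latex
\mathrm{Cop\_Con}(Q_i \to Ret_j)
 \;=\; \exp\!\left(-\left[(-\ln u)^{\theta} + (-\ln v)^{\theta}\right]^{1/\theta}\right),
\qquad
u = \frac{num_{L_k}}{num_{Q_i}},
\quad
v = \frac{weight_{L_k}}{weight_{Q_i}},
```

with θ ≥ 1 a dependence parameter; at θ = 1 this reduces to u·v, the product of the frequency-based and weight-based confidences.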
(4.9) Generate the association rules Qi → Retj: extract every rule Qi → Retj whose Cop_Con(Qi → Retj) is not less than the minimum confidence threshold mc and add it to the association rule set AR (Association Rule). Then return to step (4.8), extract another pair of proper-subset itemsets Retj and Qi from Lk, and carry out the subsequent steps in order, looping until every proper-subset itemset of Lk has been taken out exactly once. At that point return to step (4.7) for a new round of association rule pattern mining: take any other Lk from FIS and carry out the subsequent steps in order, looping until every k-frequent itemset Lk in FIS has been taken out exactly once; association rule pattern mining then ends, and the procedure moves to step (4.10).
(4.10) Generate the candidate expansion terms: extract the rule consequents Retj from the association rule set AR as candidate expansion terms to obtain the candidate expansion term set CETS (Candidate Expansion Term Set), compute the candidate expansion term weights wRet, and then go to step 5.
CETS is as shown in formula (4).
In formula (4), Reti denotes the i-th candidate expansion term.
The candidate expansion term weight wRet is computed by formula (5).
In formula (5), max() takes the maximum of the association rule confidences: when the same expansion term appears in several association rule patterns at the same time, the largest confidence value is taken as that term's weight.
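A minimal sketch of this max-confidence weighting, assuming the mined rules are available as (consequent term, confidence) pairs (the names are illustrative, not the patent's own data structures):

```python
def candidate_weights(rules):
    """rules: iterable of (expansion_term, confidence) pairs from AR.
    A term appearing as the consequent of several rules receives the
    maximum confidence as its weight, as formula (5) prescribes."""
    w = {}
    for term, conf in rules:
        w[term] = max(conf, w.get(term, 0.0))
    return w

rules = [("检索", 0.8), ("词向量", 0.6), ("检索", 0.9)]
print(candidate_weights(rules))  # {'检索': 0.9, '词向量': 0.6}
```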
Step 5. Compute the vector cosine similarity between the candidate expansion terms and the original query terms, and extract the candidate expansion terms not lower than the similarity threshold as the final expansion terms, as follows:
(5.1) Compute the vector similarity VecSim(Reti, Q): in the feature word vector set, compute the vector cosine similarity VecSim(Reti, qj) between the candidate expansion term Reti and each query term of the original query term set Q = (q1, q2, ..., qr), where 1 ≤ j ≤ r, and accumulate the similarities between the candidate expansion term and the individual query terms as that candidate's total vector similarity VecSim(Reti, Q).
VecSim(Reti, qj) is as shown in formula (6).
In formula (6), vReti denotes the word vector of the i-th candidate expansion term Reti, and vqj denotes the word vector of the j-th query term qj.
VecSim(Reti, Q) is computed by formula (7).
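Formulas (6) and (7) correspond to the standard vector cosine similarity and its accumulation over the query terms; a minimal sketch, with word vectors given as plain Python lists:

```python
import math

def cosine(u, v):
    """Standard vector cosine similarity (formula (6))."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def vecsim(ret_vec, query_vecs):
    """Total similarity of a candidate term to the query (formula (7)):
    the sum of its cosine similarities to each query term's vector."""
    return sum(cosine(ret_vec, q) for q in query_vecs)

print(vecsim([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 1.0
```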
(5.2) Generate the final expansion terms: extract the candidate expansion terms whose VecSim(Reti, Q) is not lower than the vector similarity threshold minVSim as the final expansion terms of the original query term set Q, obtaining the final expansion term set FETS (Final Expansion Term Set), as shown in formula (8).
In formula (8), Reti is the i-th final expansion term.
Then compute the weight wFet of each final expansion term Ret; wFet is composed of the candidate expansion term weight wRet and the vector similarity VecSim(Reti, Q), and is computed by formula (9).
Step 6. Combine the expansion terms with the original query into a new query and retrieve the original Chinese document collection again to obtain the final retrieved documents.
Step 7. Return the final retrieved document set to the user.
Compared with the prior art, the present invention has the following beneficial effects:
(1) The present invention proposes a text retrieval method based on the fusion and extension of association rules and word vectors. The method uses a deep learning tool to perform word-vector semantic learning training on the initial retrieval document set to obtain the feature word vector set, mines candidate expansion terms from the pseudo-relevance-feedback document set using Copulas-function-based support and confidence, computes the vector cosine similarity between the candidate expansion terms and the original query, and extracts the candidates not lower than the similarity threshold as the final expansion terms; the final expansion terms are combined with the original query into a new query, the original document collection is retrieved again to obtain the final result, and the final retrieved document set is returned to the user. Experimental results show that the retrieval performance of the method is better than that of existing methods: it effectively reduces query topic drift and term mismatch, improves information retrieval performance, and has good application value and promotion prospects.
(2) Four similar query expansion methods that have appeared in recent years were selected as comparison methods, with the Chinese corpus of the standard NTCIR-5 CLIR dataset as the experimental data. The results show that the P@5 values of the proposed method are all higher than those of the baseline retrieval, and that relative to the four comparison methods the P@5 values are improved in the great majority of cases, indicating that the retrieval performance of the method surpasses both the baseline retrieval and the comparison methods; it can improve information retrieval performance and reduce query drift and term mismatch, and has high application value and broad promotion prospects.
Description of Drawings
Figure 1 is a schematic overall flowchart of the text retrieval method based on the fusion and extension of association rules and word vectors according to the present invention.
Detailed Description
1. To better illustrate the technical scheme of the present invention, the related concepts involved are introduced as follows:
1. Itemset
In text mining, a text document is treated as a transaction; each feature term in the document is called an item; a collection of feature term items is called an itemset; and the number of items in an itemset is called the itemset length. A k-itemset is an itemset containing k items, k being the length of the itemset.
2. Antecedent and Consequent of an Association Rule
Let x and y be arbitrary feature term itemsets; an implication of the form x → y is called an association rule, where x is called the rule antecedent and y the rule consequent.
3. Copulas-based Support
The support based on the Copulas function (Copulas-based Support) is denoted Cop_Sup().
The Copulas-based support Cop_Sup(T1 ∪ T2) of a feature term itemset (T1 ∪ T2) is computed as shown in formula (10).
In formula (10), num_(T1∪T2) denotes the frequency with which the itemset (T1 ∪ T2) occurs in the pseudo-relevance-feedback Chinese document collection, and weight_(T1∪T2) denotes the itemset weight of (T1 ∪ T2) in that collection. DocNum denotes the total number of documents in the collection, and ItemsWeight denotes the accumulated weight of all Chinese feature terms in the collection.
4. Copulas-based Confidence
The confidence based on the Copulas function (Copulas-based Confidence) is denoted Cop_Con().
The Copulas-based confidence Cop_Con(T1 → T2) of a feature term association rule T1 → T2 is computed as shown in formula (11).
In formula (11), num_(T1∪T2) denotes the frequency with which the itemset (T1 ∪ T2) occurs in the pseudo-relevance-feedback Chinese document collection, weight_(T1∪T2) denotes the itemset weight of (T1 ∪ T2) in that collection, num_T1 denotes the frequency with which the itemset T1 occurs in the collection, and weight_T1 denotes the itemset weight of T1 in the collection. exp denotes the exponential function with the natural constant e as base.
2. The present invention is further described below with reference to the accompanying drawing and specific comparative experiments.
As shown in Figure 1, the text retrieval method based on the fusion and extension of association rules and word vectors of the present invention comprises the following steps:
Step 1. Retrieve the original Chinese document collection with the Chinese user query to obtain the initially retrieved documents, and construct the initial retrieval document set.
Step 2. Use a deep learning tool to perform word-vector semantic learning training on the initial retrieval document set to obtain the feature word vector set.
The deep learning tool of the present invention is the Skip-gram model of Google's open-source word-vector tool word2vec.
Step 3. Extract the top m documents from the initial retrieval document set as pseudo-relevance-feedback documents and construct the pseudo-relevance-feedback document set; preprocess this set by Chinese word segmentation, removal of Chinese stop words and extraction of feature terms; compute the feature term weights using the TF-IDF weighting technique; and finally build the pseudo-relevance-feedback Chinese document collection and the Chinese feature term vocabulary.
Step 4. Mine candidate expansion terms from the pseudo-relevance-feedback document set using Copulas-function-based support and confidence, and build the candidate expansion term set, as follows:
(4.1) Generate the 1-candidate itemsets C1: extract feature terms from the Chinese feature term vocabulary as the 1-candidate itemsets C1.
(4.2) Generate the 1-frequent itemsets L1: compute the Copulas-based support Cop_Sup(C1) of each C1, and extract every C1 whose Cop_Sup(C1) is not lower than the minimum support threshold ms as a 1-frequent itemset L1, adding it to the frequent itemset collection FIS (Frequent ItemSet).
Cop_Sup (Copulas-based Support) denotes the support based on the Copulas function. Cop_Sup(C1) of a 1-candidate itemset C1 is computed as shown in formula (1).
In formula (1), num_C1 denotes the frequency with which the 1-candidate itemset C1 occurs in the pseudo-relevance-feedback Chinese document collection, DocNum denotes the total number of documents in that collection, weight_C1 denotes the itemset weight of C1 in that collection, and ItemsWeight denotes the accumulated weight of all Chinese feature terms in the collection.
(4.3)产生k_候选项集Ck:将(k-1)_频繁项集Lk-1自连接生成k_候选项集Ck,所述k≥2。(4.3) Generate k_candidate item set C k : self-connect (k-1)_frequent itemset L k-1 to generate k_ candidate item set C k , where k≥2.
所述自连接方法采用Apriori算法中给出的候选项集连接方法。The self-connection method adopts the candidate itemset connection method given in the Apriori algorithm.
(4.4)2_候选项集C2剪枝:当挖掘到2_候选项集C2时,如果该C2不含有原查询词项,则删除该C2,如果该C2含有原查询词项,则留下该C2,然后,留下的C2转入步骤(4.5);当挖掘到k_候选项集Ck,所述k≥3,则Ck直接转入步骤(4.5)。(4.4) 2_candidate item set C 2 pruning: when mining to 2_ candidate item set C 2 , if the C 2 does not contain the original query term, delete the C 2 , if the C 2 contains the original query term item, then leave the C 2 , and then, go to step (4.5) with the remaining C 2 ; when the k_candidate item set C k is mined, and the k ≥ 3, then C k directly goes to step (4.5) .
(4.5) Generate the k-frequent itemsets Lk: compute Cop_Sup(Ck), extract every Ck whose Cop_Sup(Ck) is not lower than ms as a k-frequent itemset Lk, and add it to FIS.
Cop_Sup(Ck) is given by formula (2):
In formula (2), numCk denotes the frequency with which the k-candidate itemset Ck occurs in the pseudo-relevance-feedback Chinese document set, and weightCk denotes its itemset weight in that set; DocNum and ItemsWeight are defined as in formula (1).
(4.6) Increment k by 1 and return to step (4.3), executing the subsequent steps in order, until Lk is the empty set; frequent itemset mining then ends and the procedure moves to step (4.7).
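Steps (4.1)-(4.6) can be sketched as the following mining loop. The Copulas-based support of formulas (1) and (2) appears in the patent only as an image, so `cop_sup` is treated here as a black-box callable, and `join` is an Apriori-style self-join; both parameter names and the function itself are illustrative, not the patent's code.

```python
def mine_frequent_itemsets(docs, query_terms, ms, cop_sup, join):
    """Sketch of steps (4.1)-(4.6): level-wise mining of frequent itemsets
    from pseudo-relevance-feedback documents. `cop_sup(itemset, docs)` is a
    placeholder for the Copulas-based support of formulas (1)/(2)."""
    # 1-candidates are the individual feature terms of the feedback documents
    c1 = {(t,) for doc in docs for t in doc}
    prev = {c for c in c1 if cop_sup(c, docs) >= ms}
    fis = list(prev)
    k = 2
    while prev:
        cand = join(prev)
        if k == 2:
            # step (4.4): a 2-candidate must contain an original query term
            cand = {c for c in cand if any(q in c for q in query_terms)}
        prev = {c for c in cand if cop_sup(c, docs) >= ms}
        fis.extend(prev)
        k += 1
    return fis
```

With a plain relative-document-frequency stand-in for `cop_sup`, the loop behaves as expected: frequent single terms survive, infrequent ones are dropped, and only query-term-bearing 2-itemsets are kept.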
(4.7) Take any Lk, k ≥ 2, from FIS.
(4.8) Extract an association rule Qi → Retj from Lk and compute its Copulas-based confidence Cop_Con(Qi → Retj), where i ≥ 1, j ≥ 1 and Qi ∪ Retj = Lk; Retj is a proper-subset itemset containing no query terms, Qi is a proper-subset itemset containing query terms, and Q is the original query term set.
Cop_Con (Copulas-based Confidence) denotes the confidence computed with the Copulas function; Cop_Con(Qi → Retj) is given by formula (3):
In formula (3), numLk denotes the frequency with which the k-frequent itemset Lk occurs in the pseudo-relevance-feedback Chinese document set, weightLk denotes its itemset weight in that set, numQi denotes the frequency with which the proper-subset itemset Qi of Lk occurs in that set, and weightQi denotes the itemset weight of Qi in that set; exp denotes the exponential function with base e.
(4.9) Generate the association rules Qi → Retj: extract every rule Qi → Retj whose Cop_Con(Qi → Retj) is not less than the minimum confidence threshold mc and add it to the association rule set AR (Association Rule). Then return to step (4.8), extract another pair of proper-subset itemsets Retj and Qi from Lk, and carry out the subsequent steps in order; repeat until every proper-subset itemset of Lk has been taken out exactly once. Then return to step (4.7) for a new round of association rule mining: take any other Lk from FIS and repeat the subsequent steps, until every k-frequent itemset Lk in FIS has been taken out exactly once. Association rule mining then ends, and the procedure moves to step (4.10) below.
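The rule-splitting of steps (4.8)-(4.9) for a single k-frequent itemset can be sketched as follows. Since formula (3) is given as an image, `cop_con` is again a black-box placeholder for the Copulas-based confidence; the enumeration of antecedent/consequent partitions is ours.

```python
from itertools import combinations

def extract_rules(freq_itemset, query_terms, mc, cop_con):
    """Sketch of steps (4.8)-(4.9) for one k-frequent itemset (k >= 2):
    split it into Qi -> Retj where the antecedent Qi contains query terms,
    the consequent Retj contains none, and confidence >= mc.
    `cop_con(qi, retj)` stands in for formula (3)."""
    rules = []
    items = set(freq_itemset)
    for r in range(1, len(items)):              # proper, non-empty antecedents
        for qi in combinations(sorted(items), r):
            retj = tuple(sorted(items - set(qi)))
            if not any(t in query_terms for t in qi):
                continue                        # Qi must contain a query term
            if any(t in query_terms for t in retj):
                continue                        # Retj must be query-term free
            conf = cop_con(qi, retj)
            if conf >= mc:
                rules.append((qi, retj, conf))
    return rules
```

For the itemset ("a","x","y") with query term "a", exactly the partitions ("a")→("x","y"), ("a","x")→("y") and ("a","y")→("x") qualify.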
(4.10) Generate the candidate expansion terms: extract the rule consequents Retj from the association rule set AR as candidate expansion terms, giving the candidate expansion term set CETS (Candidate Expansion Term Set); compute each candidate expansion term weight wRet, then proceed to step 5.
CETS is given by formula (4):
In formula (4), Reti denotes the i-th candidate expansion term.
The candidate expansion term weight wRet is computed by formula (5):
In formula (5), max() takes the maximum confidence over the association rules: when the same expansion term appears in several association rule patterns, its largest confidence value is used as the term's weight.
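The max-confidence weighting of formula (5) can be sketched directly; the rule-tuple layout `(qi, retj, conf)` is an assumed representation, not the patent's.

```python
def candidate_expansion_weights(rules):
    """Sketch of step (4.10) / formula (5): every term of a rule consequent
    becomes a candidate expansion term, weighted by the maximum confidence
    over all rules in which it appears."""
    w_ret = {}
    for _qi, retj, conf in rules:
        for term in retj:
            w_ret[term] = max(conf, w_ret.get(term, 0.0))
    return w_ret
```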
Step 5. Compute the vector cosine similarity between each candidate expansion term and the original query, and extract the candidate expansion terms whose similarity is not below the similarity threshold as the final expansion terms. The detailed steps are as follows:
(5.1) Compute the vector similarity VecSim(Reti, Q): in the feature-term word vector set, compute the vector cosine similarity VecSim(Reti, qj), 1 ≤ j ≤ r, between the candidate expansion term Reti and each query term {q1, q2, …, qr} of the original query term set Q = (q1, q2, …, qr), and accumulate the similarities against the individual query terms as the candidate term's total vector similarity VecSim(Reti, Q).
VecSim(Reti, qj) is given by formula (6):
In formula (6), vReti denotes the word vector of the i-th candidate expansion term Reti, and vqj denotes the word vector of the j-th query term qj.
VecSim(Reti, Q) is computed by formula (7):
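Formula (6) is the standard cosine similarity between the two word vectors, and formula (7) accumulates it over the query terms, so step (5.1) can be transcribed directly; only the function and parameter names are ours.

```python
import math

def vec_sim(v_ret, query_vecs):
    """Formulas (6)-(7): cosine similarity of a candidate expansion term's
    word vector against each query-term vector, accumulated over the query."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    return sum(cos(v_ret, vq) for vq in query_vecs)
```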
(5.2) Generate the final expansion terms: extract the candidate expansion terms whose VecSim(Reti, Q) is not below the vector similarity threshold minVSim as the final expansion terms of the original query term set Q, giving the final expansion term set FETS (Final Expansion Term Set) of formula (8):
In formula (8), Reti is the i-th final expansion term.
Then compute the weight wFet of each final expansion term Ret; wFet combines the candidate expansion term weight wRet with the vector similarity VecSim(Reti, Q), as given by formula (9):
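Step (5.2) can be sketched as a filter-and-weight pass. Note the hedge: formula (9) appears only as an image, so the exact combination of wRet and VecSim is not known here; the `combine` parameter defaults to a simple sum purely as an assumed stand-in, and all names are illustrative.

```python
def final_expansion(cand_weights, sims, min_vsim, combine=lambda w, s: w + s):
    """Sketch of step (5.2) and formula (9): keep candidates whose total
    vector similarity reaches min_vsim and weight them by combining wRet
    with VecSim. The real formula (9) is not reproduced in the text, so
    `combine` is an assumed placeholder (default: plain sum)."""
    fets = {}
    for term, w in cand_weights.items():
        s = sims.get(term, 0.0)
        if s >= min_vsim:
            fets[term] = combine(w, s)
    return fets
```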
Step 6. Combine the expansion terms with the original query into a new query and retrieve the original Chinese document set again to obtain the final retrieved documents.
Step 7. Return the final retrieved document set to the user.
Experimental design and results:
We compare the method of the present invention experimentally with existing methods of the same kind to demonstrate its effectiveness.
1. Experimental environment and data:
The experimental data is the NTCIR-5 CLIR Chinese text corpus (see http://research.nii.ac.jp/ntcir/data/data-en.html), comprising 8 data sets with 901,446 Chinese documents in total; details are given in Table 1. This corpus is an internationally used standard corpus with a document set, a query set and result sets: 50 Chinese queries, 4 types of query topics and result sets under 2 evaluation criteria. The retrieval experiments use the Description (Desc) and Title query topics: a Title query is a short query that sketches the topic with nouns and noun phrases, while a Desc query is a long query that sketches the topic in sentence form. The result sets use the Rigid (highly relevant and relevant to the query) and Relax (highly relevant, relevant and partially relevant to the query) evaluation criteria.
The experimental data preprocessing consists of word segmentation and stop-word removal.
The retrieval evaluation metric for the experimental results is MAP (Mean Average Precision).
Table 1. Original experimental corpora of the present invention and their sizes
2. Baseline retrieval and comparison methods:
The basic retrieval environment of the experiments is built with Lucene.Net (see http://lucenenet.apache.org/).
The baseline retrieval result is obtained by submitting the original query to Lucene.Net for the initial retrieval.
The comparison methods are described as follows:
Comparison method 1: the weighted association pattern mining technique of (Huang Mingxuan. Vietnamese-English cross-language query expansion based on weighted association pattern mining [J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(3): 307-318.) is used to mine rule expansion terms, with parameters mc = 0.1 and mi = 0.0001; the experimental results are averaged over ms = 0.004, 0.005, 0.006 and 0.007.
Comparison method 2: the fully weighted positive and negative association pattern mining technique of (Huang Mingxuan, Jiang Caoqing. Post-translation expansion of Vietnamese-English cross-language queries based on fully weighted positive and negative association pattern mining [J]. Acta Electronica Sinica, 2018, 46(12): 3029-3036.) is used to mine rule expansion terms, with parameters mc = 0.1, α = 0.3, minPR = 0.1 and minNR = 0.01; the experimental results are averaged over ms = 0.10, 0.11, 0.12 and 0.13.
Comparison method 3: the weighted frequent pattern mining technique with multiple minimum support thresholds of (Zhang H R, Zhang J W, Wei X Y, et al. A new frequent pattern mining algorithm with weighted multiple minimum supports [J]. Intelligent Automation & Soft Computing, 2017, 23(4): 605-612.) is used to mine rule expansion terms, with parameters mc = 0.1, LMS = 0.2, HMS = 0.25 and WT = 0.1; the experimental results are averaged over ms = 0.1, 0.15, 0.2 and 0.25.
Comparison method 4: the word-vector-based query expansion method of (Xu Kan, Lin Yuan, Qu Chen, et al. Research on a word vector method for patent query expansion [J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 972-980.), with experimental parameters k = 60 and α = 0.1.
The Skip-gram model word-vector semantic training parameters used in the present invention are: batch_size=128, embedding_size=300, skip_window=2, num_skips=4, num_sampled=64.
3. Experimental method and results:
The 50 Chinese queries of the NTCIR-5 CLIR corpus were run on the 8 data sets, yielding the average P@5 values of the baseline retrieval (BLR), the comparison methods and the method of the present invention, as shown in Tables 2 and 3.
Table 2. P@5 retrieval performance of the present method, the baseline retrieval and the comparison methods (Title queries)
Table 3. P@5 retrieval performance of the present method, the baseline retrieval and the comparison methods (Desc queries)
The experimental results show that the P@5 values of the present method are all higher than those of the baseline retrieval, and most of them improve on the four comparison methods, indicating that the expanded retrieval performance of the present method exceeds both the baseline retrieval and comparable methods. The results show that the method of the present invention is effective, can indeed improve information retrieval performance, and has high application value and broad prospects for adoption.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010774138.2A CN111897924A (en) | 2020-08-04 | 2020-08-04 | A Text Retrieval Method Based on Fusion and Extension of Association Rules and Word Vectors |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111897924A true CN111897924A (en) | 2020-11-06 |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112687097A (en) * | 2020-11-16 | 2021-04-20 | 招商新智科技有限公司 | Highway highway section level data center platform system |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| 2020-11-06 | WW01 | Invention patent application withdrawn after publication | Application publication date: 20201106 |