
CN111897924A - A Text Retrieval Method Based on Fusion and Extension of Association Rules and Word Vectors - Google Patents


Info

Publication number
CN111897924A
CN111897924A
Authority
CN
China
Prior art keywords
word, ret, candidate, formula, pseudo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010774138.2A
Other languages
Chinese (zh)
Inventor
黄名选
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Finance and Economics filed Critical Guangxi University of Finance and Economics
Priority to CN202010774138.2A priority Critical patent/CN111897924A/en
Publication of CN111897924A publication Critical patent/CN111897924A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3325 Reformulation based on results of preceding query
    • G06F 16/3331 Query processing
    • G06F 16/3332 Query translation
    • G06F 16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F 16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F 16/3338 Query expansion
    • G06F 16/334 Query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention proposes a text retrieval method based on the fused expansion of association rules and word vectors. Documents initially retrieved from the original Chinese document collection by a user query are assembled into an initial retrieval document set, on which a deep learning tool performs word-embedding semantic training to obtain a set of feature term vectors. The top m documents of the initial retrieval set are then taken as a pseudo-relevance feedback document set, from which candidate expansion terms are mined using Copulas-based support and confidence measures, yielding a candidate expansion term set. Finally, the vector cosine similarity between each candidate expansion term and the original query is computed, the final expansion terms are extracted, and these terms are combined with the original query into a new query that is run against the original document collection again to obtain the final retrieval result. Experimental results show that the retrieval performance of the method is better than that of existing methods; it effectively reduces query topic drift and term mismatch, improves information retrieval performance, and has good application value and promotion prospects.

Description

A Text Retrieval Method Based on Fusion and Extension of Association Rules and Word Vectors

Technical Field

The invention relates to a text retrieval method based on the fused expansion of association rules and word vectors, and belongs to the technical field of information retrieval.

Background Art

With the development of network technology and the rapid growth of digital resources, how network users can quickly and accurately find the information they need, while reducing query topic drift and term mismatch, is an important open problem in the field of information retrieval. Query expansion can address these problems: it modifies the weights of the original query, or adds further feature terms semantically related to it, to compensate for the lack of semantic information in an overly short query and thereby improve retrieval performance. Over the past decade or so, scholars have studied query-expansion-based retrieval from different perspectives and produced a number of effective methods, for example the information retrieval method based on query expansion and classification of Yue et al. (Yue Wen, Chen Zhiping, Lin Yaping. Information retrieval algorithm based on query expansion and classification [J]. Journal of System Simulation, 2006, 18(7): 1926-1929, 1934.) and the pseudo-relevance feedback expansion method of Vaidyanathan et al. (Vaidyanathan R, Das S, Srivastava N. Query Expansion Strategy based on Pseudo Relevance Feedback and Term Weight Scheme for Monolingual Retrieval [J]. International Journal of Computer Applications, 2015, 105(8): 1-6.). These methods have been validated experimentally, but the technical problems of query topic drift and term mismatch in information retrieval remain unsolved.

To address query topic drift and term mismatch in current information retrieval systems and improve retrieval performance, the invention introduces the Copulas function (Sklar A. Fonctions de répartition à n dimensions et leurs marges [J]. Publication de l'Institut de Statistique de l'Université de Paris, 1959, 8(1): 229-231.) into the field of information retrieval, combines association pattern mining with word-embedding semantic learning to realize query expansion, and proposes a text retrieval method based on the fused expansion of association rules and word vectors. Experimental results show that the method improves information retrieval performance and has good application value and promotion prospects.

Summary of the Invention

The purpose of the invention is to propose a text retrieval method based on the fused expansion of association rules and word vectors. Applied in the field of information retrieval, for example in web retrieval systems and search engines, the method can reduce query topic drift and term mismatch and improve the query performance of information retrieval systems.

The specific technical scheme adopted by the invention is as follows:

A text retrieval method based on the fused expansion of association rules and word vectors comprises the following steps:

Step 1. A Chinese user query is run against the original Chinese document collection to obtain the initially retrieved documents, which are assembled into an initial retrieval document set.

Step 2. A deep learning tool performs word-embedding semantic training on the initial retrieval document set to obtain a set of feature term vectors.

The deep learning tool used in the invention is the Skip-gram model of word2vec, Google's open-source word embedding tool (see https://code.google.com/p/word2vec/ for details).
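The patent only names the word2vec Skip-gram model; as a toy illustration of the idea behind it, the sketch below merely enumerates the (centre word, context word) training pairs that Skip-gram fits term vectors from. The function name and window parameter are illustrative, not part of the patent; actual training is performed by the word2vec tool itself.

```python
# Sketch: Skip-gram learns a vector for each term by predicting the terms
# that appear within a fixed window around it. This helper only generates
# the (centre, context) training pairs from one segmented document.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

doc = ["信息", "检索", "查询", "扩展"]
print(skipgram_pairs(doc, window=1))
# [('信息', '检索'), ('检索', '信息'), ('检索', '查询'),
#  ('查询', '检索'), ('查询', '扩展'), ('扩展', '查询')]
```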

Step 3. The first m documents of the initial retrieval document set are extracted as pseudo-relevance feedback documents to build a pseudo-relevance feedback document set. The set is preprocessed by Chinese word segmentation, removal of Chinese stop words and extraction of feature terms, the feature term weights are computed, and finally the pseudo-relevance feedback Chinese document library and the Chinese feature term library are built.

The invention uses the TF-IDF (term frequency-inverse document frequency) weighting technique (Ricardo Baeza-Yates, Berthier Ribeiro-Neto et al., translated by Wang Zhijin et al., Modern Information Retrieval, China Machine Press, 2005: 21-22.) to compute the feature term weights.
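As a hedged illustration of the cited TF-IDF weighting (the patent does not specify the exact variant, so the common tf × log(N/df) form is assumed here; all names are illustrative):

```python
import math

def tfidf_weights(docs):
    """docs: list of documents, each a list of feature terms.
    Returns one {term: weight} dict per document using tf * log(N / df)."""
    n = len(docs)
    df = {}                       # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in set(doc):
            tf = doc.count(term)  # raw term frequency in this document
            w[term] = tf * math.log(n / df[term])
        weights.append(w)
    return weights

docs = [["检索", "查询"], ["查询", "扩展"]]
print(tfidf_weights(docs)[0]["检索"])  # 1 * log(2/1), about 0.693
```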

Step 4. Candidate expansion terms are mined from the pseudo-relevance feedback document set using Copulas-based support and confidence, and a candidate expansion term set is built. The specific steps are as follows:

(4.1) Generate the 1-candidate itemsets C1: feature terms are extracted from the Chinese feature term library as the 1-candidate itemsets C1.

(4.2) Generate the 1-frequent itemsets L1: compute the Copulas-based support Cop_Sup(C1) of each C1, and take every C1 whose Cop_Sup(C1) is not lower than the minimum support threshold ms as a 1-frequent itemset L1, adding it to the frequent itemset collection FIS (Frequent ItemSet).

Cop_Sup (Copulas-based Support) denotes the support based on the Copulas function. Cop_Sup(C1) of a 1-candidate itemset C1 is computed as in formula (1):

[Formula (1) appears as an image in the original publication and is not reproduced here.]

In formula (1), num(C1) denotes the number of documents of the pseudo-relevance feedback Chinese document library in which the 1-candidate itemset C1 occurs, DocNum denotes the total number of documents in that library, weight(C1) denotes the itemset weight of C1 in the library, and ItemsWeight denotes the sum of the weights of all Chinese feature terms in the library.
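The image carrying formula (1) is not reproduced in this text, so the exact copula combination is unknown. The sketch below computes the two ingredients the text does define, the document-frequency ratio num/DocNum and the weight ratio weight/ItemsWeight, and leaves the copula C(u, v) pluggable; the product copula used as the default is a placeholder, not the patent's formula, and all names are illustrative.

```python
def cop_sup(itemset, docs, weights, copula=lambda u, v: u * v):
    """Copulas-based support of an itemset over the feedback documents.
    docs: list of term sets; weights: one {term: weight} dict per doc.
    u = fraction of documents containing the itemset (num / DocNum),
    v = itemset weight ratio (weight(itemset) / ItemsWeight).
    The true copula of formula (1) is not reproduced in this text,
    so C(u, v) is left pluggable; the product copula is a placeholder."""
    items = set(itemset)
    num = sum(1 for d in docs if items <= d)
    items_weight = sum(sum(w.values()) for w in weights)
    itemset_weight = sum(
        w[t] for d, w in zip(docs, weights) if items <= d for t in items
    )
    u = num / len(docs)
    v = itemset_weight / items_weight
    return copula(u, v)

docs = [{"查询", "扩展"}, {"查询"}, {"检索"}]
weights = [{"查询": 2.0, "扩展": 1.0}, {"查询": 1.0}, {"检索": 1.0}]
print(cop_sup(["查询"], docs, weights))  # (2/3) * (3/5) = 0.4
```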

(4.3) Generate the k-candidate itemsets Ck: the (k-1)-frequent itemsets Lk-1 are self-joined to generate the k-candidate itemsets Ck, where k ≥ 2.

The self-join uses the candidate-itemset join method of the Apriori algorithm (Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington DC, USA, 1993: 207-216.).

(4.4) Prune the 2-candidate itemsets C2: when a 2-candidate itemset C2 is mined, it is deleted if it does not contain an original query term and retained if it does; the retained C2 then proceed to step (4.5). When a k-candidate itemset Ck with k ≥ 3 is mined, Ck proceeds directly to step (4.5).
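Steps (4.3) and (4.4) follow standard Apriori candidate generation: (k-1)-frequent itemsets that agree on all but their last item are joined into k-candidates, and at k = 2 only candidates containing an original query term are kept. A minimal sketch (itemsets represented as sorted tuples; function names are illustrative):

```python
from itertools import combinations

def apriori_join(frequent_prev):
    """Join sorted (k-1)-itemsets that share their first k-2 items."""
    out = set()
    for a, b in combinations(sorted(frequent_prev), 2):
        if a[:-1] == b[:-1]:
            out.add(tuple(sorted(set(a) | set(b))))
    return out

def prune_2_candidates(c2, query_terms):
    """Step (4.4): keep a 2-candidate only if it contains a query term."""
    q = set(query_terms)
    return {c for c in c2 if q & set(c)}

l1 = {("扩展",), ("查询",), ("检索",)}
c2 = apriori_join(l1)               # all 2-term combinations
print(sorted(prune_2_candidates(c2, ["查询"])))
```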

(4.5) Generate the k-frequent itemsets Lk: compute Cop_Sup(Ck), and take every Ck whose Cop_Sup(Ck) is not lower than ms as a k-frequent itemset Lk, adding it to FIS.

Cop_Sup(Ck) is computed as in formula (2):

[Formula (2) appears as an image in the original publication and is not reproduced here.]

In formula (2), num(Ck) denotes the number of documents of the pseudo-relevance feedback Chinese document library in which the k-candidate itemset Ck occurs, and weight(Ck) denotes the itemset weight of Ck in that library; DocNum and ItemsWeight are defined as in formula (1).

(4.6) Increment k by 1 and return to step (4.3), executing the subsequent steps in order, until the generated Lk is empty; frequent itemset mining then ends, and the procedure continues with step (4.7).

(4.7) Take any Lk from FIS, where k ≥ 2.

(4.8) Extract association rules Qi → Retj from Lk and compute the Copulas-based confidence Cop_Con(Qi → Retj) of each rule, where i ≥ 1, j ≥ 1, Qi ∩ Retj = ∅ and Qi ∪ Retj = Lk. Retj is a proper-subset itemset containing no query terms, Qi is a proper-subset itemset containing query terms, and Q is the original query term set.

Cop_Con (Copulas-based Confidence) denotes the confidence based on the Copulas function; Cop_Con(Qi → Retj) is computed as in formula (3):

[Formula (3) appears as an image in the original publication and is not reproduced here.]

In formula (3), num(Lk) denotes the number of documents of the pseudo-relevance feedback Chinese document library in which the k-frequent itemset Lk occurs, weight(Lk) denotes the itemset weight of Lk in that library, numQi denotes the number of documents in which the proper-subset itemset Qi of Lk occurs, and weightQi denotes the itemset weight of Qi in that library; exp denotes the exponential function with base e.

(4.9) Generate association rules Qi → Retj: every rule Qi → Retj whose Cop_Con(Qi → Retj) is not smaller than the minimum confidence threshold mc is added to the association rule set AR (Association Rule). The procedure then returns to step (4.8), extracts another pair of proper-subset itemsets Retj and Qi from Lk, and performs the subsequent steps in order, looping until every proper-subset itemset of Lk has been taken out exactly once. At that point a new round of association rule mining starts at step (4.7): another Lk is taken from FIS and the subsequent steps are performed in order, looping until every k-frequent itemset Lk in FIS has been taken out exactly once. Association rule mining then ends, and the procedure continues with step (4.10).
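The rule-mining loop of steps (4.7)-(4.9) can be sketched as follows. This is a simplified illustration: it splits each frequent itemset only into its query-term side Qi and query-free side Retj rather than enumerating every proper-subset pair, and the confidence function is left pluggable because the copula form of formula (3) is not reproduced in this text. All names are illustrative.

```python
def mine_rules(frequent_itemsets, query_terms, confidence, mc):
    """For each frequent itemset, split it into the part Qi made of query
    terms and the part Retj with no query terms, and keep the rule
    Qi -> Retj when confidence(Qi, Retj) >= mc."""
    q = set(query_terms)
    rules = []
    for lk in frequent_itemsets:
        items = set(lk)
        qi = items & q           # query-term side (rule antecedent)
        ret = items - q          # expansion-term side (rule consequent)
        if qi and ret:
            c = confidence(frozenset(qi), frozenset(ret))
            if c >= mc:
                rules.append((tuple(sorted(qi)), tuple(sorted(ret)), c))
    return rules

# Usage with a constant stand-in confidence function:
print(mine_rules([("a", "x"), ("y", "z")], ["a"],
                 lambda qi, ret: 0.8, mc=0.5))
# [(('a',), ('x',), 0.8)]  -- ("y", "z") has no query term, so no rule
```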

(4.10) Generate candidate expansion terms: the rule consequents Retj are extracted from the association rule set AR as candidate expansion terms, giving the candidate expansion term set CETS (Candidate Expansion Term Set); the candidate expansion term weights wRet are computed, and the procedure continues with step 5.

CETS is given by formula (4):

CETS = {Ret1, Ret2, ..., Retn}    (4)

In formula (4), Reti denotes the i-th candidate expansion term.

The candidate expansion term weight wRet is computed as in formula (5):

wRet = max(Cop_Con(Qi → Ret))    (5)

In formula (5), max() takes the maximum association rule confidence: when the same expansion term appears in several association rule patterns, the largest confidence value is taken as that term's weight.

Step 5. Compute the vector cosine similarity between each candidate expansion term and the original query terms, and extract the candidate expansion terms not lower than the similarity threshold as the final expansion terms. The specific steps are as follows:

(5.1) Compute the vector similarity VecSim(Reti, Q): in the feature term vector set, compute the vector cosine similarity VecSim(Reti, qj) between the candidate expansion term Reti and each query term {q1, q2, ..., qr} of the original query term set Q = (q1, q2, ..., qr), where 1 ≤ j ≤ r, and accumulate the similarities to the individual query terms as the candidate expansion term's total vector similarity VecSim(Reti, Q).

VecSim(Reti, qj) is given by formula (6):

VecSim(Reti, qj) = (vReti · vqj) / (|vReti| × |vqj|)    (6)

In formula (6), vReti denotes the word vector of the i-th candidate expansion term Reti, and vqj denotes the word vector of the j-th query term qj.

VecSim(Reti, Q) is computed as in formula (7):

VecSim(Reti, Q) = Σ (j = 1 to r) VecSim(Reti, qj)    (7)

(5.2) Generate the final expansion terms: the candidate expansion terms whose VecSim(Reti, Q) is not lower than the vector similarity threshold minVSim are extracted as the final expansion terms of the original query term set Q, giving the final expansion term set FETS (Final Expansion Term Set), as in formula (8):

FETS = {Reti ∈ CETS | VecSim(Reti, Q) ≥ minVSim}    (8)

In formula (8), Reti is the i-th final expansion term.
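Steps (5.1) and (5.2) are plain vector arithmetic: per-query-term cosine similarities (formula (6)) are summed (formula (7)) and thresholded (formula (8)). A stdlib-only sketch (vectors as Python lists; names are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors, formula (6)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def final_expansion_terms(candidates, query_vecs, min_vsim):
    """candidates: {term: vector}; query_vecs: list of query term vectors.
    VecSim(term, Q) is the sum of cosine similarities to each query term
    (formula (7)); terms at or above min_vsim are kept (formula (8))."""
    fets = {}
    for term, vec in candidates.items():
        sim = sum(cosine(vec, qv) for qv in query_vecs)
        if sim >= min_vsim:
            fets[term] = sim
    return fets

cands = {"ret1": [1.0, 0.0], "ret2": [0.0, 1.0]}
qvecs = [[1.0, 0.0]]
print(final_expansion_terms(cands, qvecs, 0.5))  # {'ret1': 1.0}
```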

The weight wFet of each final expansion term Ret is then computed; wFet is composed of the candidate expansion term weight wRet and the vector similarity VecSim(Reti, Q). wFet is computed as in formula (9):

[Formula (9) appears as an image in the original publication and is not reproduced here.]

Step 6. The expansion terms are combined with the original query into a new query, and the original Chinese document collection is retrieved again to obtain the final retrieved documents.

Step 7. The final retrieved document set is returned to the user.

Compared with the prior art, the invention has the following beneficial effects:

(1) The invention proposes a text retrieval method based on the fused expansion of association rules and word vectors. The method uses a deep learning tool to perform word-embedding semantic training on the initial retrieval document set to obtain a feature term vector set, mines candidate expansion terms from the pseudo-relevance feedback document set using Copulas-based support and confidence, computes the vector cosine similarity between the candidate expansion terms and the original query, extracts the candidates not lower than the similarity threshold as the final expansion terms, combines them with the original query into a new query, retrieves the original document collection again to obtain the final result, and returns the final retrieved document set to the user. Experimental results show that the retrieval performance of the method is better than that of existing methods; it effectively reduces query topic drift and term mismatch, improves information retrieval performance, and has good application value and promotion prospects.

(2) Four similar query expansion methods from recent years were selected as comparison methods, with the Chinese corpus of the standard NTCIR-5 CLIR data set as experimental data. The results show that the P@5 values of the method are all higher than the baseline retrieval, and relative to the four comparison methods the P@5 values of the method are improved in the great majority of cases, indicating that its retrieval performance is better than both the baseline retrieval and the comparison methods; it improves information retrieval performance, reduces query drift and term mismatch in information retrieval, and has high application value and broad promotion prospects.

Brief Description of the Drawings

Figure 1 is a schematic overall flow chart of the text retrieval method based on the fused expansion of association rules and word vectors according to the invention.

Detailed Description

I. To better explain the technical scheme of the invention, the relevant concepts involved are introduced as follows:

1. Itemset

In text mining, a text document is treated as a transaction, each feature term in the document is called an item, a collection of feature term items is called an itemset, and the number of items in an itemset is called its length. A k-itemset is an itemset containing k items, k being its length.

2. Antecedent and consequent of an association rule

Let x and y be arbitrary feature term itemsets; an implication of the form x → y is called an association rule, where x is called the rule antecedent and y the rule consequent.

3. Copulas-based support

The support based on the Copulas function (Copulas-based Support) is written Cop_Sup().

The Copulas-based support Cop_Sup(T1 ∪ T2) of a feature term itemset (T1 ∪ T2) is computed as in formula (10):

[Formula (10) appears as an image in the original publication and is not reproduced here.]

In formula (10), num(T1 ∪ T2) denotes the number of documents of the pseudo-relevance feedback Chinese document library in which the itemset (T1 ∪ T2) occurs, and weight(T1 ∪ T2) denotes the itemset weight of (T1 ∪ T2) in that library. DocNum denotes the total number of documents in the library, and ItemsWeight denotes the sum of the weights of all Chinese feature terms in the library.

4. Copulas-based confidence

The confidence based on the Copulas function (Copulas-based Confidence) is written Cop_Con().

The Copulas-based confidence Cop_Con(T1 → T2) of a feature term association rule T1 → T2 is computed as in formula (11):

[Formula (11) appears as an image in the original publication and is not reproduced here.]

In formula (11), num(T1 ∪ T2) denotes the number of documents of the pseudo-relevance feedback Chinese document library in which the itemset (T1 ∪ T2) occurs, weight(T1 ∪ T2) denotes the itemset weight of (T1 ∪ T2) in that library, num(T1) denotes the number of documents in which the itemset T1 occurs, and weight(T1) denotes the itemset weight of T1 in that library. exp denotes the exponential function with base e.

二、下面结合附图和具体对比实验来对本发明作进一步说明。2. The present invention will be further described below in conjunction with the accompanying drawings and specific comparative experiments.

如图1所示,本发明的基于关联规则与词向量融合扩展的文本检索方法,包括下列步骤:As shown in Figure 1, the text retrieval method based on the fusion extension of association rules and word vectors of the present invention comprises the following steps:

步骤1.中文用户查询检索原始中文文档集得到初检文档,构建初检文档集。Step 1. Chinese users query and retrieve the original Chinese document set to obtain a preliminary inspection document, and construct a preliminary inspection document set.

步骤2.用深度学习工具对初检文档集中进行词向量语义学习训练,得到特征词词向量集。Step 2. Use deep learning tools to perform word vector semantic learning training on the initial inspection document set to obtain a feature word word vector set.

本发明所述深度学习工具是指Google开源词向量工具word2vec的Skip-gram模型。The deep learning tool in the present invention refers to the Skip-gram model of Google's open source word vector tool word2vec.

步骤3.从初检文档集中提取前列m篇初检文档作为伪相关反馈文档,构建伪相关反馈文档集,对初检伪相关反馈文档集进行中文分词、去除中文停用词和提取特征词的预处理,并采用TF-IDF加权技术计算特征词权值,最后构建伪相关反馈中文文档库和中文特征词库。Step 3. Extract the first m initial inspection documents from the initial inspection document set as pseudo-relevant feedback documents, construct a pseudo-relevant feedback document set, and perform Chinese word segmentation, Chinese stop words removal, and feature word extraction on the initial inspection pseudo-relevant feedback document set. After preprocessing, the TF-IDF weighting technology is used to calculate the weights of feature words, and finally the pseudo-relevant feedback Chinese document database and Chinese feature word database are constructed.

步骤4.采用基于Copulas函数的支持度和置信度对伪相关反馈文档集挖掘候选扩展词,建立候选扩展词集,具体步骤如下:Step 4. Use the support and confidence based on the Copulas function to mine the candidate extension words from the pseudo-related feedback document set, and establish the candidate extension word set. The specific steps are as follows:

(4.1)产生1_候选项集C1:从中文特征词库中提取特征词作为1_候选项集C1(4.1) Generate 1_candidate item set C 1 : extract feature words from the Chinese feature lexicon as 1_candidate item set C 1 .

(4.2)产生1_频繁项集L1:计算C1基于Copulas函数的支持度Cop_Sup(C1),提取Cop_Sup(C1)不低于最小支持度阈值ms的C1作为1_频繁项集L1,并添加到频繁项集集合FIS(Frequent ItemSet)。(4.2) Generate the 1_frequent itemset L1: compute the Copulas-based support Cop_Sup(C1) of C1, extract each C1 whose Cop_Sup(C1) is not lower than the minimum support threshold ms as a 1_frequent itemset L1, and add it to the frequent itemset set FIS (Frequent ItemSet).

所述Cop_Sup(Copulas based Support)表示基于Copulas函数的支持度。所述1_候选项集C1的Cop_Sup(C1)的计算如式(1)所示:Cop_Sup (Copulas-based Support) denotes the support based on the Copulas function. Cop_Sup(C1) of the 1_candidate itemset C1 is computed as shown in formula (1):

Figure BDA0002617758040000071

式(1)中,

Figure BDA0002617758040000072
表示1_候选项集C1在伪相关反馈中文文档库中出现的频度,DocNum表示伪相关反馈中文文档库总文档数量,
Figure BDA0002617758040000073
表示1_候选项集C1在伪相关反馈中文文档库中的项集权重,ItemsWeight表示伪相关反馈中文文档库中全体中文特征词的权重累加和。In formula (1),
Figure BDA0002617758040000072
Represents the frequency of 1_candidate item set C 1 appearing in the pseudo-relevant feedback Chinese document library, DocNum indicates the total number of documents in the pseudo-relevant feedback Chinese document library,
Figure BDA0002617758040000073
Represents the item set weight of 1_candidate item set C 1 in the pseudo-relevant feedback Chinese document database, and ItemsWeight represents the weight accumulation of all Chinese feature words in the pseudo-relevant feedback Chinese document database.

(4.3)产生k_候选项集Ck:将(k-1)_频繁项集Lk-1自连接生成k_候选项集Ck,所述k≥2。(4.3) Generate k_candidate item set C k : self-connect (k-1)_frequent itemset L k-1 to generate k_ candidate item set C k , where k≥2.

所述自连接方法采用Apriori算法中给出的候选项集连接方法。The self-connection method adopts the candidate itemset connection method given in the Apriori algorithm.

(4.4)2_候选项集C2剪枝:当挖掘到2_候选项集C2时,如果该C2不含有原查询词项,则删除该C2,如果该C2含有原查询词项,则留下该C2,然后,留下的C2转入步骤(4.5);当挖掘到k_候选项集Ck,所述k≥3,则Ck直接转入步骤(4.5)。(4.4) Prune the 2_candidate itemsets C2: when a 2_candidate itemset C2 is mined, delete C2 if it does not contain any original query term; if it does contain an original query term, keep it, and the kept C2 proceeds to step (4.5). When a k_candidate itemset Ck with k≥3 is mined, Ck proceeds directly to step (4.5).
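Steps (4.3)-(4.4) can be sketched as the standard Apriori self-join on sorted itemset tuples, followed by the query-term pruning applied to the 2-candidates. `query_terms` below is a hypothetical query set for illustration only:

```python
from itertools import combinations

def self_join(freq_prev):
    """Apriori join: merge two (k-1)-itemsets that share their
    first k-2 items, producing the k-candidate set."""
    out = set()
    for a, b in combinations(sorted(freq_prev), 2):
        if a[:-1] == b[:-1]:
            out.add(tuple(sorted(set(a) | set(b))))
    return out

def prune_c2(c2, query_terms):
    """Step (4.4): keep only 2-candidates containing a query term."""
    return {c for c in c2 if any(t in query_terms for t in c)}

l1 = {("q1",), ("w1",), ("w2",)}
c2 = prune_c2(self_join(l1), {"q1"})   # drops ("w1", "w2")
```

Representing itemsets as sorted tuples makes the prefix test `a[:-1] == b[:-1]` exactly the join condition of the Apriori algorithm the patent cites.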

(4.5)产生k_频繁项集Lk:计算Cop_Sup(Ck),提取Cop_Sup(Ck)不低于ms的Ck作为k_频繁项集Lk,并添加到FIS。(4.5) Generating k_frequent itemsets L k : Calculate Cop_Sup(C k ), extract C k whose Cop_Sup(C k ) is not lower than ms as k_ frequent itemsets L k , and add them to FIS.

所述Cop_Sup(Ck)的计算如式(2)所示:The calculation of the Cop_Sup(C k ) is shown in formula (2):

Figure BDA0002617758040000081

式(2)中,

Figure BDA0002617758040000082
表示k_候选项集Ck在伪相关反馈中文文档库中出现的频度,
Figure BDA0002617758040000083
表示k_候选项集Ck在伪相关反馈中文文档库中的项集权重;DocNum和ItemsWeight的定义与式(1)相同。In formula (2),
Figure BDA0002617758040000082
represents the frequency of k_candidate item set C k appearing in the pseudo-relevant feedback Chinese document library,
Figure BDA0002617758040000083
Represents the item set weight of k_candidate item set C k in the pseudo-relevant feedback Chinese document library; the definitions of DocNum and ItemsWeight are the same as formula (1).

(4.6)k加1后转入步骤(4.3)继续顺序执行其后步骤,直到产生所述Lk为空集,则频繁项集挖掘结束,转入步骤(4.7)。(4.6) Increment k by 1 and return to step (4.3), continuing the subsequent steps in order until the generated Lk is empty; frequent itemset mining then ends, and the procedure moves to step (4.7).

(4.7)从FIS中任意取出Lk,所述k≥2。(4.7) Take L k arbitrarily from the FIS, where k ≥ 2.

(4.8)从Lk中提取关联规则Qi→Retj,计算关联规则Qi→Retj基于Copulas函数的置信度Cop_Con(Qi→Retj),所述i≥1,j≥1,且

Figure BDA0002617758040000084
Qi∪Retj=Lk
Figure BDA0002617758040000085
所述Retj为不含查询词项的真子集项集,所述Qi为含查询词项的真子集项集,所述Q为原查询词项集合。(4.8) Extract the association rule Q i →Ret j from L k , calculate the confidence level Cop_Con(Q i →Ret j ) of the association rule Q i →Ret j based on the Copulas function, the i≥1, j≥1, and
Figure BDA0002617758040000084
Q i ∪Ret j =L k ,
Figure BDA0002617758040000085
The Ret j is the proper subset itemset without query terms, the Q i is the proper subset itemsets with query terms, and the Q is the original query term set.

所述Cop_Con(Copulas based Confidence)表示基于Copulas函数的置信度,所述Cop_Con(Qi→Retj)的计算公式如式(3)所示:The Cop_Con (Copulas based Confidence) represents the confidence based on the Copulas function, and the calculation formula of the Cop_Con (Q i →Ret j ) is shown in formula (3):

Figure BDA0002617758040000086

式(3)中,

Figure BDA0002617758040000087
表示k_频繁项集Lk在伪相关反馈中文文档库中出现的频度,
Figure BDA0002617758040000088
表示k_频繁项集Lk在伪相关反馈中文文档库中的项集权重,numQi表示k_频繁项集Lk的真子集项集Qi在伪相关反馈中文文档库中出现的频度,weightQi表示k_频繁项集Lk的真子集项集Qi在伪相关反馈中文文档库中的项集权重;exp表示以自然常数e为底的指数函数。In formula (3),
Figure BDA0002617758040000087
represents the frequency of k_frequent itemsets L k appearing in the pseudo-relevant feedback Chinese document library,
Figure BDA0002617758040000088
Represents the itemset weight of k_frequent itemset L k in the pseudo-relevant feedback Chinese document database, and num Qi represents the frequency of the proper subset itemset Qi of k_ frequent itemset L k in the pseudo-relevant feedback Chinese document database , weight Qi represents the item set weight of the proper subset itemset Qi of k_frequent itemset L k in the pseudo-relevant feedback Chinese document database; exp represents the exponential function with the natural constant e as the base.

(4.9)产生关联规则Qi→Retj:提取Cop_Con(Qi→Retj)不小于最小置信度阈值mc的关联规则Qi→Retj加入到关联规则集AR(Association Rule),然后,转入步骤(4.8),从Lk中重新提取其他的真子集项集Retj和Qi,再顺序进行其后步骤,如此循环,直到Lk的所有真子集项集当且仅当都被取出一次为止,这时转入如步骤(4.7),进行新一轮关联规则模式挖掘,从FIS中再取出任意其他Lk,再顺序进行其后步骤,如此循环,直到FIS中所有k_频繁项集Lk当且仅当都被取出一次为止,这时关联规则模式挖掘结束,转入如下步骤(4.10)。(4.9) Generate association rules Qi→Retj: extract each association rule Qi→Retj whose Cop_Con(Qi→Retj) is not less than the minimum confidence threshold mc and add it to the association rule set AR (Association Rule); then return to step (4.8), extract another pair of proper subset itemsets Retj and Qi from Lk, and carry out the subsequent steps in order, looping until every proper subset itemset of Lk has been taken exactly once. Then return to step (4.7) for a new round of association rule mining: take any other Lk from FIS and carry out the subsequent steps in order, looping until every k_frequent itemset Lk in FIS has been taken exactly once. Association rule mining then ends, and the procedure moves to step (4.10).

(4.10)产生候选扩展词:从关联规则集AR中提取关联规则后件Retj作为候选扩展词,得到候选扩展词集CETS(Candidate Expansion Term Set),计算候选扩展词权值wRet,然后,转入步骤5。(4.10) Generate candidate expansion words: extract the association rule consequents Retj from the association rule set AR as candidate expansion words, obtaining the candidate expansion word set CETS (Candidate Expansion Term Set); compute the candidate expansion word weights wRet, then go to step 5.

所述CETS如式(4)所示:The CETS is shown in formula (4):

Figure BDA0002617758040000091

式(4)中,Reti表示第i个候选扩展词。In formula (4), Ret i represents the ith candidate extension word.

所述候选扩展词权值wRet计算公式如式(5)所示:The calculation formula of the candidate extended word weight w Ret is shown in formula (5):

Figure BDA0002617758040000092

式(5)中,max()表示关联规则置信度的最大值,当多个关联规则模式中同时出现相同的规则扩展词时,取其置信度值最大的作为该规则扩展词的权值。In formula (5), max() represents the maximum value of association rule confidence. When the same rule expansion word appears in multiple association rule patterns at the same time, the one with the largest confidence value is taken as the weight of the rule expansion word.
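Steps (4.8)-(4.10) together with formula (5) can be sketched as follows: for one frequent itemset, enumerate every admissible split into an antecedent Qi (which must hold all the query terms present, since Retj may hold none and the two parts partition Lk) and a nonempty consequent Retj, keep rules whose confidence reaches mc, and weight each consequent term by the maximum confidence it appears with. The confidence function is passed in as a parameter because formula (3) is only available here as an image:

```python
from itertools import combinations

def mine_rules(lk, query_terms, conf_fn, mc):
    """Steps (4.8)-(4.9) for one frequent itemset lk (a tuple):
    split lk into Qi (all query terms plus possibly more) and Retj
    (nonempty, no query terms), keep rules with conf_fn(...) >= mc.
    conf_fn stands in for the Copulas-based confidence of formula (3)."""
    members = set(lk)
    q_core = members & set(query_terms)    # query terms must sit in Qi
    free = sorted(members - q_core)        # non-query terms, either side
    if not q_core or not free:
        return []
    rules = []
    for j in range(1, len(free) + 1):
        for retj in combinations(free, j):           # consequent Retj
            qi = tuple(sorted(members - set(retj)))  # antecedent Qi
            c = conf_fn(qi, retj)
            if c >= mc:
                rules.append((qi, retj, c))
    return rules

# Step (4.10) + formula (5): each consequent term becomes a candidate
# expansion word, weighted by the maximum confidence it occurs with.
w_ret = {}
for qi, retj, c in mine_rules(("q1", "w1", "w2"), {"q1"},
                              lambda a, b: 0.5, 0.1):
    for t in retj:
        w_ret[t] = max(w_ret.get(t, 0.0), c)
```

The dummy `lambda a, b: 0.5` confidence and the itemset `("q1","w1","w2")` are illustrative stand-ins.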

步骤5.计算候选扩展词与原查询词的向量余弦相似度,提取不低于相似度阈值的候选扩展词作为最终扩展词,具体步骤如下:Step 5. Calculate the vector cosine similarity between the candidate expansion word and the original query word, and extract the candidate expansion word that is not lower than the similarity threshold as the final expansion word. The specific steps are as follows:

(5.1)计算向量相似度VecSim(Reti,Q):在特征词词向量集中,计算候选扩展词Reti与原查询词项集合Q(所述Q=(q1,q2,…,qr))中各个查询词{q1,q2,…,qr}的向量余弦相似度VecSim(Reti,qj),所述1≤j≤r,累加所述候选扩展词与各个查询词的向量相似度作为该候选扩展词总的向量相似度VecSim(Reti,Q)。(5.1) Compute the vector similarity VecSim(Reti,Q): in the feature word vector set, compute the cosine similarity VecSim(Reti,qj) between the candidate expansion word Reti and each query term of the original query term set Q = (q1,q2,…,qr), where 1≤j≤r, and accumulate the similarities to the individual query terms as the candidate expansion word's total vector similarity VecSim(Reti,Q).

所述VecSim(Reti,qj)如式(6)所示:The VecSim(Ret i ,q j ) is shown in formula (6):

Figure BDA0002617758040000093

式(6)中,所述vReti表示第i个候选扩展词Reti的词向量值,vqj表示第j个查询词项qj的词向量值。In formula (6), the vRet i represents the word vector value of the i-th candidate expansion word Ret i , and vq j represents the word vector value of the j-th query term q j .

所述VecSim(Reti,Q)计算公式如式(7)所示:The VecSim(Ret i , Q) calculation formula is shown in formula (7):

Figure BDA0002617758040000094
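Formulas (6) and (7) are the standard cosine similarity, summed over all query terms; a minimal sketch:

```python
import math

def cos_sim(u, v):
    """Formula (6): cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def vec_sim(ret_vec, query_vecs):
    """Formula (7): VecSim(Ret, Q) = sum of cosines to each query q_j."""
    return sum(cos_sim(ret_vec, q) for q in query_vecs)

s = vec_sim([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])  # 1.0 + 0.0
```

Candidates whose summed similarity reaches the threshold minVSim then survive as final expansion words in step (5.2).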

(5.2)产生最终扩展词:提取VecSim(Reti,Q)不低于向量相似度阈值minVSim的候选扩展词作为原查询词项集合Q的最终扩展词,得到最终扩展词集FETS(Final Expansion Term Set),如式(8)所示:(5.2) Generate the final expansion words: extract each candidate expansion word whose VecSim(Reti,Q) is not lower than the vector similarity threshold minVSim as a final expansion word of the original query term set Q, obtaining the final expansion word set FETS (Final Expansion Term Set), as shown in formula (8):

Figure BDA0002617758040000095

式(8)中,Reti为第i个最终扩展词。In formula (8), Ret i is the ith final extended word.

然后计算最终扩展词Ret的权值wFet,最终扩展词Ret的权值wFet由候选扩展词权值wRet和向量相似度VecSim(Reti,Q)组成。所述wFet计算公式如式(9)所示:Then the weight w Fet of the final extended word Ret is calculated, and the weight w Fet of the final extended word Ret is composed of the candidate extended word weight w Ret and the vector similarity VecSim(Ret i , Q). The w Fet calculation formula is shown in formula (9):

Figure BDA0002617758040000101

步骤6.扩展词与原查询组合为新查询再次检索原始中文文档集得到最终检索文档。Step 6. The expanded word is combined with the original query into a new query, and the original Chinese document set is retrieved again to obtain the final retrieved document.

步骤7.将最终检索文档集返回给用户。Step 7. Return the final retrieved document set to the user.

实验设计与结果:Experimental design and results:

我们通过和现有同类方法进行实验对比,以说明本发明方法的有效性。We carry out experimental comparison with existing similar methods to illustrate the effectiveness of the method of the present invention.

1.实验环境及实验数据:1. Experimental environment and experimental data:

本发明实验数据是NTCIR-5CLIR(详见:http://research.nii.ac.jp/ntcir/data/data-en.html.)中文文本语料,共8个数据集,合计901446篇中文文档,具体信息如表1所示。该语料是国际上通用的标准语料,有文档集、查询集和结果集,即50个中文查询,4种类型的查询主题和2种评价标准的结果集。本文采用Description(简称Desc)和Title查询主题完成检索实验,Title查询属于短查询,以名词和名词性短语简要描述查询主题,而Desc查询属于长查询,以句子形式简要描述查询主题;结果集有Rigid(与查询高度相关,相关)和Relax(与查询高度相关、相关和部分相关)评价标准。The experimental data of the present invention are the NTCIR-5 CLIR Chinese text corpora (see http://research.nii.ac.jp/ntcir/data/data-en.html), comprising 8 data sets with a total of 901,446 Chinese documents; details are given in Table 1. This corpus is an internationally used standard corpus with a document set, a query set and result sets, namely 50 Chinese queries, 4 types of query topics and result sets under 2 evaluation criteria. The retrieval experiments use the Description (Desc) and Title query topics: a Title query is a short query that briefly describes the topic with nouns and noun phrases, whereas a Desc query is a long query that briefly describes the topic in sentence form. The result sets follow the Rigid (highly relevant and relevant to the query) and Relax (highly relevant, relevant and partially relevant to the query) evaluation criteria.

实验数据预处理是:分词和去除停用词。The experimental data preprocessing is: word segmentation and removal of stop words.

实验结果检索评价指标是MAP(Mean Average Precision)。The retrieval evaluation metric for the experimental results is MAP (Mean Average Precision).

表1本发明实验原始语料集及其数量Table 1 The original corpus of the experiment of the present invention and its quantity

Figure BDA0002617758040000102

2.基准检索与对比方法:2. Benchmark retrieval and comparison methods:

实验基础检索环境采用Lucene.Net(详见:http://lucenenet.apache.org/)搭建。The basic retrieval environment of the experiment is built with Lucene.Net (see: http://lucenenet.apache.org/).

基准检索是原始查询提交到Lucene.Net进行初次检索得到的检索结果。The benchmark retrieval is the retrieval result obtained by submitting the original query to Lucene.Net for initial retrieval.

对比方法描述如下:The comparison method is described as follows:

对比方法1:采用文献(黄名选.基于加权关联模式挖掘的越-英跨语言查询扩展[J].情报学报,2017,36(3):307-318.)的加权关联模式挖掘技术挖掘规则扩展词,参数为mc=0.1,mi=0.0001,实验结果为ms分别为0.004,0.005,0.006,0.007时的平均值。Comparison method 1: rule expansion words are mined with the weighted association pattern mining technique of the literature (Huang Mingxuan. Vietnamese-English cross-language query expansion based on weighted association pattern mining [J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(3): 307-318.); parameters mc=0.1 and mi=0.0001; the reported result is the average over ms = 0.004, 0.005, 0.006 and 0.007.

对比方法2:采用文献(黄名选,蒋曹清.基于完全加权正负关联模式挖掘的越-英跨语言查询译后扩展[J].电子学报,2018,46(12):3029-3036.)的完全加权正负关联模式挖掘技术挖掘规则扩展词。参数为mc=0.1,α=0.3,minPR=0.1,minNR=0.01,以及ms分别为0.10,0.11,0.12,0.13,实验结果取平均值。Comparison method 2: rule expansion words are mined with the fully weighted positive and negative association pattern mining technique of the literature (Huang Mingxuan, Jiang Caoqing. Post-translation expansion of Vietnamese-English cross-language queries based on fully weighted positive and negative association pattern mining [J]. Acta Electronica Sinica, 2018, 46(12): 3029-3036.); parameters mc=0.1, α=0.3, minPR=0.1 and minNR=0.01, with ms = 0.10, 0.11, 0.12 and 0.13; the results are averaged.

对比方法3:采用文献(Zhang H R,Zhang J W,Wei X Y,et al.A new frequent pattern mining algorithm with weighted multiple minimum supports[J].Intelligent Automation&Soft Computing,2017,23(4):605-612.)的基于多支持度阈值的加权频繁模式挖掘技术挖掘规则扩展词,参数为mc=0.1,LMS=0.2,HMS=0.25,WT=0.1,ms分别为0.1,0.15,0.2,0.25,实验结果取平均值。Comparison method 3: rule expansion words are mined with the weighted frequent pattern mining technique based on multiple minimum supports of the literature (Zhang H R, Zhang J W, Wei X Y, et al. A new frequent pattern mining algorithm with weighted multiple minimum supports [J]. Intelligent Automation & Soft Computing, 2017, 23(4): 605-612.); parameters mc=0.1, LMS=0.2, HMS=0.25 and WT=0.1, with ms = 0.1, 0.15, 0.2 and 0.25; the results are averaged.

对比方法4:采用文献(许侃,林原,曲忱,等.专利查询扩展的词向量方法研究[J].计算机科学与探索,2018,12(6):972-980.)的基于词向量的查询扩展方法。实验参数:k=60,α=0.1。Comparison method 4: the word vector based query expansion method of the literature (Xu Kan, Lin Yuan, Qu Chen, et al. Research on word vector methods for patent query expansion [J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 972-980.). Experimental parameters: k=60, α=0.1.

本发明所用的Skip-gram模型词向量语义学习训练参数:batch_size=128,embedding_size=300,skip_window=2,num_skips=4,num_sampled=64。The Skip-gram model word vector semantic learning training parameters used in the present invention: batch_size=128, embedding_size=300, skip_window=2, num_skips=4, num_sampled=64.

3.实验方法和结果如下:3. The experimental methods and results are as follows:

NTCIR-5CLIR语料50个中文查询在8个数据集上进行本发明实验,得到基准检索BLR结果、对比方法和本发明方法的检索结果P@5平均值,如表2和表3所示。The 50 Chinese queries of the NTCIR-5 CLIR corpus were run on the 8 data sets, yielding the benchmark retrieval (BLR) results and the average P@5 retrieval results of the comparison methods and of the method of the present invention, as shown in Tables 2 and 3.

表2本发明方法与基准检索、对比方法的检索性能P@5值(Title查询)Table 2 The retrieval performance P@5 value of the method of the present invention and the benchmark retrieval and comparison method (Title query)

Figure BDA0002617758040000111

表3本发明方法与基准检索、对比方法的检索性能P@5值(Desc查询)Table 3 The retrieval performance P@5 value of the method of the present invention and the benchmark retrieval and comparison method (Desc query)

Figure BDA0002617758040000121

实验结果表明,本发明方法的实验结果P@5值都高于基准检索,相对于4种对比方法,本发明方法P@5值绝大部分都得到了改善和提高,说明本发明方法扩展检索性能高于基准检索和同类的对比方法。实验结果表明,本发明方法是有效的,确实能提高信息检索性能,具有很高的应用价值和广阔的推广前景。The experimental results show that the P@5 values of the method of the present invention are all higher than those of the benchmark retrieval and, compared with the four comparison methods, improve in the great majority of cases, indicating that the expansion retrieval performance of the method of the present invention exceeds that of the benchmark retrieval and of similar comparison methods. The results show that the method of the present invention is effective, can indeed improve information retrieval performance, and has high application value and broad promotion prospects.

Claims (1)

1.一种基于关联规则与词向量融合扩展的文本检索方法,其特征在于,包括下列步骤:1. a text retrieval method based on association rule and word vector fusion expansion, is characterized in that, comprises the following steps: 步骤1.中文用户查询检索原始中文文档集得到初检文档,构建初检文档集;Step 1. Chinese users query and retrieve the original Chinese document set to obtain the initial inspection document, and construct the initial inspection document set; 步骤2.用深度学习工具对初检文档集中进行词向量语义学习训练,得到特征词词向量集;Step 2. Use deep learning tools to perform word vector semantic learning training on the initial inspection document set to obtain a feature word word vector set; 本发明所述深度学习工具是指Google开源词向量工具word2vec的Skip-gram模型;The deep learning tool of the present invention refers to the Skip-gram model of Google's open source word vector tool word2vec; 步骤3.从初检文档集中提取前列m篇初检文档作为伪相关反馈文档,构建伪相关反馈文档集,对初检伪相关反馈文档集进行中文分词、去除中文停用词和提取特征词的预处理,并采用TF-IDF加权技术计算特征词权值,最后构建伪相关反馈中文文档库和中文特征词库;Step 3. Extract the first m initial inspection documents from the initial inspection document set as pseudo-relevant feedback documents, construct a pseudo-relevant feedback document set, and perform Chinese word segmentation, Chinese stop words removal, and feature word extraction on the initial inspection pseudo-relevant feedback document set. Preprocess, and use TF-IDF weighting technology to calculate the weight of feature words, and finally build a pseudo-relevant feedback Chinese document database and Chinese feature word database; 步骤4.采用基于Copulas函数的支持度和置信度对伪相关反馈文档集挖掘候选扩展词,建立候选扩展词集,具体步骤如下:Step 4. Use the support and confidence based on the Copulas function to mine the candidate extension words from the pseudo-related feedback document set, and establish the candidate extension word set. 
The specific steps are as follows: (4.1)产生1_候选项集C1:从中文特征词库中提取特征词作为1_候选项集C1(4.1) Generate 1_candidate item set C 1 : extract feature words from Chinese feature vocabulary as 1_ candidate item set C 1 ; (4.2)产生1_频繁项集L1:计算C1基于Copulas函数的支持度Cop_Sup(C1),提取Cop_Sup(C1)不低于最小支持度阈值ms的C1作为1_频繁项集L1,并添加到频繁项集集合FIS(Frequent ItemSet);(4.2) Generate 1_frequent itemset L 1 : calculate the support Cop_Sup(C 1 ) of C 1 based on the Copulas function, and extract C 1 whose Cop_Sup(C 1 ) is not lower than the minimum support threshold ms as 1_ frequent itemsets L 1 , and add it to the frequent itemset set FIS (Frequent ItemSet); 所述1_候选项集C1的Cop_Sup(C1)的计算如式(1)所示:The calculation of Cop_Sup(C 1 ) of the 1_candidate item set C 1 is shown in formula (1):
Figure FDA0002617758030000011
式(1)中,
Figure FDA0002617758030000012
表示1_候选项集C1在伪相关反馈中文文档库中出现的频度,DocNum表示伪相关反馈中文文档库总文档数量,
Figure FDA0002617758030000013
表示1_候选项集C1在伪相关反馈中文文档库中的项集权重,ItemsWeight表示伪相关反馈中文文档库中全体中文特征词的权重累加和;
In formula (1),
Figure FDA0002617758030000012
Represents the frequency of 1_candidate item set C 1 appearing in the pseudo-relevant feedback Chinese document library, DocNum indicates the total number of documents in the pseudo-relevant feedback Chinese document library,
Figure FDA0002617758030000013
Represents the item set weight of 1_candidate item set C 1 in the pseudo-relevant feedback Chinese document database, and ItemsWeight represents the weight accumulation of all Chinese feature words in the pseudo-relevant feedback Chinese document database;
(4.3)产生k_候选项集Ck:将(k-1)_频繁项集Lk-1自连接生成k_候选项集Ck,所述k≥2;(4.3) Generate k_candidate item set C k : self-connect (k-1)_frequent itemset L k-1 to generate k_ candidate item set C k , where k≥2; 所述自连接方法采用Apriori算法中给出的候选项集连接方法;The self-connection method adopts the candidate itemset connection method given in the Apriori algorithm; (4.4)2_候选项集C2剪枝:当挖掘到2_候选项集C2时,如果该C2不含有原查询词项,则删除该C2,如果该C2含有原查询词项,则留下该C2,然后,留下的C2转入步骤(4.5);当挖掘到k_候选项集Ck,所述k≥3,则Ck直接转入步骤(4.5);(4.4) 2_candidate item set C 2 pruning: when mining to 2_ candidate item set C 2 , if the C 2 does not contain the original query term, delete the C 2 , if the C 2 contains the original query term item, then leave the C 2 , and then, go to step (4.5) with the remaining C 2 ; when the k_candidate item set C k is mined, and the k ≥ 3, then C k directly goes to step (4.5) ; (4.5)产生k_频繁项集Lk:计算Cop_Sup(Ck),提取Cop_Sup(Ck)不低于ms的Ck作为k_频繁项集Lk,并添加到FIS;(4.5) Generate k_frequent itemset L k : calculate Cop_Sup(C k ), extract C k whose Cop_Sup(C k ) is not lower than ms as k_ frequent itemset L k , and add it to FIS; 所述Cop_Sup(Ck)的计算如式(2)所示:The calculation of the Cop_Sup(C k ) is shown in formula (2):
Figure FDA0002617758030000021
式(2)中,
Figure FDA0002617758030000025
表示k_候选项集Ck在伪相关反馈中文文档库中出现的频度,
Figure FDA0002617758030000026
表示k_候选项集Ck在伪相关反馈中文文档库中的项集权重;DocNum和ItemsWeight的定义与式(1)相同;
In formula (2),
Figure FDA0002617758030000025
represents the frequency of k_candidate item set C k appearing in the pseudo-relevant feedback Chinese document library,
Figure FDA0002617758030000026
Represents the item set weight of k_candidate item set C k in the pseudo-relevant feedback Chinese document library; the definitions of DocNum and ItemsWeight are the same as formula (1);
(4.6)k加1后转入步骤(4.3)继续顺序执行其后步骤,直到产生所述Lk为空集,则频繁项集挖掘结束,转入步骤(4.7);(4.6) After adding 1 to k, go to step (4.3) and continue to execute subsequent steps in sequence until the L k is an empty set, then the frequent itemset mining ends, and go to step (4.7); (4.7)从FIS中任意取出Lk,所述k≥2;(4.7) arbitrarily take out L k from the FIS, the k ≥ 2; (4.8)从Lk中提取关联规则Qi→Retj,计算关联规则Qi→Retj基于Copulas函数的置信度Cop_Con(Qi→Retj),所述i≥1,j≥1,且
Figure FDA0002617758030000023
Qi∪Retj=Lk
Figure FDA0002617758030000024
所述Retj为不含查询词项的真子集项集,所述Qi为含查询词项的真子集项集,所述Q为原查询词项集合;
(4.8) Extract the association rule Q i →Ret j from L k , calculate the confidence level Cop_Con(Q i →Ret j ) of the association rule Q i →Ret j based on the Copulas function, the i≥1, j≥1, and
Figure FDA0002617758030000023
Q i ∪Ret j =L k ,
Figure FDA0002617758030000024
The Ret j is the proper subset itemset without query terms, the Q i is the proper subset itemsets with query terms, and the Q is the original query term set;
所述Cop_Con(Qi→Retj)的计算公式如式(3)所示:The calculation formula of Cop_Con(Q i →Ret j ) is shown in formula (3):
Figure FDA0002617758030000022
式(3)中,
Figure FDA0002617758030000027
表示k_频繁项集Lk在伪相关反馈中文文档库中出现的频度,
Figure FDA0002617758030000028
表示k_频繁项集Lk在伪相关反馈中文文档库中的项集权重,
Figure FDA0002617758030000029
表示k_频繁项集Lk的真子集项集Qi在伪相关反馈中文文档库中出现的频度,weightQi表示k_频繁项集Lk的真子集项集Qi在伪相关反馈中文文档库中的项集权重;exp表示以自然常数e为底的指数函数;
In formula (3),
Figure FDA0002617758030000027
represents the frequency of k_frequent itemsets L k appearing in the pseudo-relevant feedback Chinese document library,
Figure FDA0002617758030000028
represents the itemset weight of k_frequent itemset L k in the pseudo-relevant feedback Chinese document database,
Figure FDA0002617758030000029
Represents the frequency of the proper subset itemset Q i of k_frequent itemset L k appearing in the pseudo-relevant feedback Chinese document database, and weight Qi represents the proper subset itemset Qi of k_ frequent itemset L k in the pseudo-relevant feedback Chinese The itemset weight in the document library; exp represents the exponential function with the natural constant e as the base;
(4.9)产生关联规则Qi→Retj:提取Cop_Con(Qi→Retj)不小于最小置信度阈值mc的关联规则Qi→Retj加入到关联规则集AR,然后,转入步骤(4.8),从Lk中重新提取其他的真子集项集Retj和Qi,再顺序进行其后步骤,如此循环,直到Lk的所有真子集项集当且仅当都被取出一次为止,这时转入如步骤(4.7),进行新一轮关联规则模式挖掘,从FIS中再取出任意其他Lk,再顺序进行其后步骤,如此循环,直到FIS中所有k_频繁项集Lk当且仅当都被取出一次为止,这时关联规则模式挖掘结束,转入如下步骤(4.10);(4.9) Generate association rules Qi → Ret j : extract the association rules Qi → Ret j whose Cop_Con(Q i Ret j ) is not less than the minimum confidence threshold mc and add them to the association rule set AR, and then go to step (4.8 ), re-extract other proper subset itemsets Ret j and Q i from L k , and then perform subsequent steps in sequence, and so on, until all proper subset itemsets of L k are taken out once and only if they are taken out once, this Then go to step (4.7), carry out a new round of association rule pattern mining, take out any other L k from the FIS, and then perform the subsequent steps in sequence, and so on, until all k_frequent itemsets L k in the FIS are when And only when all are taken out once, then the association rule pattern mining is over, and then go to the following step (4.10); (4.10)产生候选扩展词:从关联规则集AR中提取关联规则后件Retj作为候选扩展词,得到候选扩展词集CETS,计算候选扩展词权值wRet,然后,转入步骤5;(4.10) generate candidate extension words: extract association rule consequent Ret j as candidate extension word from association rule set AR, obtain candidate extension word set CETS, calculate candidate extension word weight w Ret , then, go to step 5; 所述CETS如式(4)所示:The CETS is shown in formula (4):
Figure FDA0002617758030000031
式(4)中,Reti表示第i个候选扩展词;In formula (4), Ret i represents the ith candidate extension word; 所述候选扩展词权值wRet计算公式如式(5)所示:The calculation formula of the candidate extended word weight w Ret is shown in formula (5):
Figure FDA0002617758030000032
式(5)中,max()表示关联规则置信度的最大值,当多个关联规则模式中同时出现相同的规则扩展词时,取其置信度值最大的作为该规则扩展词的权值;In formula (5), max( ) represents the maximum value of the association rule confidence. When the same rule expansion word appears in multiple association rule patterns at the same time, the one with the largest confidence value is taken as the weight of the rule expansion word; 步骤5.计算候选扩展词与原查询词的向量余弦相似度,提取不低于相似度阈值的候选扩展词作为最终扩展词,具体步骤如下:Step 5. Calculate the vector cosine similarity between the candidate expansion word and the original query word, and extract the candidate expansion word that is not lower than the similarity threshold as the final expansion word. The specific steps are as follows: (5.1)计算向量相似度VecSim(Reti,Q):在特征词词向量集中,计算候选扩展词Reti与原查询词项集合Q(所述Q=(q1,q2,…,qr))中各个查询词{q1,q2,…,qr}的向量余弦相似度VecSim(Reti,qj),所述1≤j≤r,累加所述候选扩展词与各个查询词的向量相似度作为该候选扩展词总的向量相似度VecSim(Reti,Q);(5.1) Calculate the vector similarity VecSim(Ret i , Q): In the feature word vector set, calculate the candidate extension word Ret i and the original query term set Q (the Q=(q 1 ,q 2 ,...,q r )) vector cosine similarity VecSim(Ret i ,q j ) of each query word {q 1 ,q 2 ,...,q r }, the 1≤j≤r, the candidate expansion word and each query are accumulated The vector similarity of the word is used as the total vector similarity of the candidate expanded word VecSim(Ret i ,Q); 所述VecSim(Reti,qj)如式(6)所示:The VecSim(Ret i ,q j ) is shown in formula (6):
Figure FDA0002617758030000033
式(6)中,所述vReti表示第i个候选扩展词Reti的词向量值,vqj表示第j个查询词项qj的词向量值;In formula (6), the vRet i represents the word vector value of the ith candidate expansion word Ret i , and vq j represents the word vector value of the jth query term q j ; 所述VecSim(Reti,Q)计算公式如式(7)所示:The VecSim(Ret i , Q) calculation formula is shown in formula (7):
Figure FDA0002617758030000034
(5.2)产生最终扩展词:提取VecSim(Reti,Q)不低于向量相似度阈值minVSim的候选扩展词作为原查询词项集合Q的最终扩展词,得到最终扩展词集FETS,如式(8)所示:(5.2) Generate the final expanded word: extract the candidate expanded word whose VecSim(Ret i , Q) is not lower than the vector similarity threshold minVSim as the final expanded word of the original query term set Q, and obtain the final expanded word set FETS, such as formula ( 8) shown:
Figure FDA0002617758030000035
式(8)中,Reti为第i个最终扩展词;In formula (8), Ret i is the ith final extended word; 然后计算最终扩展词Ret的权值wFet,最终扩展词Ret的权值wFet由候选扩展词权值wRet和向量相似度VecSim(Reti,Q)组成;所述wFet计算公式如式(9)所示:Then the weight w Fet of the final extended word Ret is calculated, and the weight w Fet of the final extended word Ret is composed of the candidate extended word weight w Ret and the vector similarity VecSim(Ret i , Q); the w Fet calculation formula is as follows (9) shows:
Figure FDA0002617758030000041
步骤6.扩展词与原查询组合为新查询再次检索原始中文文档集得到最终检索文档;Step 6. The expanded word and the original query are combined into a new query to retrieve the original Chinese document set again to obtain the final retrieval document; 步骤7.将最终检索文档集返回给用户。Step 7. Return the final retrieved document set to the user.
CN202010774138.2A 2020-08-04 2020-08-04 A Text Retrieval Method Based on Fusion and Extension of Association Rules and Word Vectors Withdrawn CN111897924A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010774138.2A CN111897924A (en) 2020-08-04 2020-08-04 A Text Retrieval Method Based on Fusion and Extension of Association Rules and Word Vectors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010774138.2A CN111897924A (en) 2020-08-04 2020-08-04 A Text Retrieval Method Based on Fusion and Extension of Association Rules and Word Vectors

Publications (1)

Publication Number Publication Date
CN111897924A true CN111897924A (en) 2020-11-06

Family

ID=73245723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010774138.2A Withdrawn CN111897924A (en) 2020-08-04 2020-08-04 A Text Retrieval Method Based on Fusion and Extension of Association Rules and Word Vectors

Country Status (1)

Country Link
CN (1) CN111897924A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112687097A (en) * 2020-11-16 2021-04-20 招商新智科技有限公司 Highway highway section level data center platform system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 2020-11-06