
CN101119326B - A management method and device for instant messaging session records - Google Patents

Info

Publication number
CN101119326B
Authority
CN
China
Prior art keywords
conversation
session
record
records
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2006101095396A
Other languages
Chinese (zh)
Other versions
CN101119326A (en)
Inventor
石燕伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN2006101095396A
Publication of CN101119326A
Application granted
Publication of CN101119326B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for managing instant messaging session records, which addresses the problem in the prior art that querying information in session records is both cumbersome and inefficient for instant messaging users. The method includes: obtaining a user's session records and classifying them to obtain sample sets; performing correlation analysis on each sample set to generate corresponding classification combinations, each containing the feature vectors that correspond to session records in the sample set; determining the session topic of each classification combination according to the frequency with which words occur in it, and associating that topic with the session records corresponding to the classification combination; and, according to a keyword entered by the user when querying, finding the session topics that match the keyword and presenting the session records associated with those topics to the user. The invention also discloses a device for managing instant messaging session records.

Description

A management method and device for instant messaging session records

Technical field

The invention relates to the fields of communication and computer technology, and in particular to a method and device for managing instant messaging session records.

Background

With the continuing development and spread of instant messaging (IM) technology, more and more users use IM software not only to communicate with other users over the network but also as a tool for asking other users about problems met in work or study. The session records produced by these exchanges are saved in the IM system and become material that users can later search for information of interest.

For example, user A asks user B a question and user B returns the answer. When user C later asks user A or user B the same question, user A must look up the relevant information in the session records with user B, or user B must look it up in the session records with user A, and either of them has to search the session records manually. When there are many session records, or when a long time has passed between user A's question and user C's, the prior-art approach both increases the manual search workload and is inefficient.

If user A has asked several users the same question and wants to query information across the session records with all of them, the prior art offers two options. If the instant messaging system only provides a session record viewing function, user A must manually inspect the session records of each user one by one to find the information of interest. If the system provides import and export of session record data, user A must first export the session records of the several users and then query the exported data. User A may query the exported data by keyword, but a keyword search only locates passages that contain the keyword; such a passage is not necessarily related to the information the user cares about, so the user still cannot search the session records effectively.

Summary of the invention

The invention provides a method and device for managing instant messaging session records, to solve the prior-art problem that querying information in session records is both cumbersome and inefficient for instant messaging users.

The invention provides the following technical solutions:

A method for managing instant messaging session records, comprising the following steps:

obtaining a user's session records and classifying them to obtain sample sets;

generating a feature vector for each session record in a sample set, analyzing the correlation between each feature vector and the other feature vectors, and classifying the feature vectors according to that correlation to generate classification combinations;

determining the session topic of each classification combination according to the frequency with which words occur in it, and associating that session topic with the session records corresponding to the classification combination; and

according to a keyword entered by the user when querying, finding the session topics that match the keyword and presenting the session records associated with those session topics to the user.

After the session topics are generated, the correlation between session topics is further analyzed, and session topics whose correlation is greater than a predetermined threshold are merged into a single session topic, which is associated with the session records of all the session topics that were merged.

The session records are classified by session user to generate the sample sets.

Preferably, a sample set is further divided into several sample sets according to the time interval between the session records in it.

Generating the feature vector for each session record in the sample set and analyzing the correlation between each feature vector and the other feature vectors specifically includes:

performing word segmentation on each session record, deleting the words with no practical meaning to obtain a set S, merging the synonyms in S, and vectorizing the result to generate the feature vector corresponding to the session record, $\vec{d} = (W_1, W_2, W_3, \ldots, W_n)$, where $W_i$ is the weight of the i-th element and each element is a word in S;

computing the weight of each word in the feature vector $\vec{d}$ corresponding to each session record, and computing the correlation between feature vectors from the weights of the words that make up each feature vector.

The session topic of a classification combination is determined from the words whose frequency of occurrence in that classification combination is greater than a predetermined threshold.

A device for managing instant messaging session records, comprising:

a unit for storing user session records;

a unit for classifying the session records to generate sample sets;

a unit for generating a feature vector for each session record in a sample set, analyzing the correlation between each feature vector and the other feature vectors, and classifying the feature vectors according to that correlation to generate classification combinations;

a unit for determining the session topic corresponding to each classification combination and associating that session topic with the session records corresponding to the classification combination; and

a unit for finding, according to a keyword entered by the user when querying, the session topics that match the keyword, and presenting the session records associated with those session topics to the user.

Preferably, the device further comprises:

a unit for analyzing the correlation between session topics, merging session topics whose correlation is greater than a predetermined threshold into a single session topic, and associating the merged session topic with the session records of all the session topics that were merged.

The beneficial effects of the invention are as follows:

The invention classifies user session records into sample sets, performs correlation analysis on each sample set to generate classification combinations, determines the session topic of each classification combination, and associates each session topic with the session records of its classification combination. When a user needs to query information in the session records, the user only has to enter a keyword; the system automatically finds the session topics that match the keyword and presents the session records associated with them. This avoids the cumbersome manual search and improves query efficiency.

Brief description of the drawings

FIG. 1 is a schematic structural diagram of the device for managing user session records in an embodiment of the invention;

FIG. 2 is a schematic diagram of the method for managing user session records in an embodiment of the invention;

FIG. 3 is a flowchart of classifying user session records in an embodiment of the invention;

FIG. 4 is a flowchart of performing correlation analysis on a sample set in an embodiment of the invention.

Detailed description

To solve the prior-art problem that querying information in session records is both cumbersome and inefficient for instant messaging users, this embodiment classifies user session records into sample sets, performs correlation analysis on each sample set to generate classification combinations, determines the session topic of each classification combination, associates each session topic with the session records of its classification combination, finds the session topics that match a keyword entered by the user, and presents the session records associated with those session topics to the user.

Referring to FIG. 1, the device for managing user session records in this embodiment comprises a storage unit 101, a classification unit 102, an analysis unit 103, a session topic unit 104, a merging unit 105 and a query unit 106.

The storage unit 101 stores the user's session records and session topics. The classification unit 102 obtains the session records and classifies them into sample sets. The analysis unit 103 performs correlation analysis on a sample set and generates its classification combinations. The session topic unit 104 determines the session topic of each classification combination and associates that topic with the session records corresponding to the classification combination. The merging unit 105 analyzes the correlation between session topics, merges session topics whose correlation is greater than a predetermined threshold into a single session topic, and associates the merged topic with the session records of all the topics that were merged. The query unit 106 receives the keyword the user enters when querying the session records, finds the session topics that match the keyword, and presents the session records associated with those topics to the user.

Referring to FIG. 2, the method for managing user session records in this embodiment comprises:

Step 201: obtain the user's session records and classify them to obtain sample sets.

Step 202: perform correlation analysis on each sample set to generate the corresponding classification combinations.

Step 203: determine the session topic of each classification combination according to the frequency with which words occur in it, and associate that session topic with the session records corresponding to the classification combination.

Step 204: analyze the correlation between session topics and merge session topics whose correlation is greater than a predetermined threshold into a single session topic, so that the merged topic is associated with the session records of all the topics that were merged.

Step 205: when the user queries the session records, find the session topics that match the keyword entered by the user, and present the session records associated with those topics to the user.

The classification of session records in step 201 is shown in FIG. 3 and proceeds as follows:

Step 301: determine whether a session record has already been classified. If it has, do not process it again; otherwise, go to step 302.

Step 302: classify the unclassified session records by user. For example, determine whether session record TR_i and session record TR_j are records of sessions between the same users. If they belong to sessions between different users, place TR_i and TR_j in different sample sets TS; if they belong to sessions between the same users, place TR_i and TR_j in the same sample set.

Step 303: further divide each sample set into separate sample sets according to the time interval between the session records in it. The interval is chosen according to the application and may be set, for example, to one week.

The sample sets TS produced by step 303 are the sample sets on which correlation analysis is performed.
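Steps 302 and 303 can be sketched in Python as follows. This is only an illustrative sketch: the `Record` type, the `build_sample_sets` function and the reading of "interval" as the gap between consecutive records of the same peer are assumptions, not details given in the patent.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Record:          # hypothetical record type
    peer: str          # the other party in the session
    time: datetime
    text: str


def build_sample_sets(records, max_gap=timedelta(weeks=1)):
    """Group records by peer (step 302), then split each group wherever
    two consecutive records are more than max_gap apart (step 303)."""
    by_peer = defaultdict(list)
    for record in records:
        by_peer[record.peer].append(record)

    sample_sets = []
    for peer_records in by_peer.values():
        peer_records.sort(key=lambda r: r.time)
        current = [peer_records[0]]
        for prev, nxt in zip(peer_records, peer_records[1:]):
            if nxt.time - prev.time > max_gap:
                sample_sets.append(current)   # close the current sample set
                current = []
            current.append(nxt)
        sample_sets.append(current)
    return sample_sets
```

Each returned list plays the role of one sample set TS; the one-week default mirrors the example interval mentioned in step 303.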

Referring to FIG. 4, correlation analysis of a sample set using the KNN (K Nearest Neighbor) algorithm proceeds as follows:

Step 401: generate a feature vector for each session record TR in the sample set TS. First perform word segmentation on each session record TR and remove auxiliary words, interjections and other words with no practical meaning to obtain a set S. Then merge the synonyms in S; for example, {"电脑", "计算机"} (two Chinese words for "computer") is merged into {"计算机", "计算机"}. Finally, vectorize the synonym-merged set S of each session record to generate the feature vector $\vec{d} = (W_1, W_2, W_3, \ldots, W_n)$, where $W_i$ is the weight of the i-th element and each element is a word in S.
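The preprocessing in step 401 can be illustrated with a short Python sketch. It assumes the text is already split by whitespace (real Chinese text would need a word segmenter, which is not shown), and the `STOP_WORDS` and `SYNONYMS` tables are hypothetical placeholders rather than lists defined by the patent.

```python
# Hypothetical tables; a real system would load much larger ones.
STOP_WORDS = {"the", "a", "of", "oh", "um"}
SYNONYMS = {"pc": "computer", "laptop": "computer"}


def to_terms(text):
    """Return the term list S for one session record:
    segment, drop meaningless words, then merge synonyms."""
    tokens = text.lower().split()                 # stand-in for word segmentation
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [SYNONYMS.get(t, t) for t in kept]     # map each synonym to one canonical word
```

For instance, `to_terms("Oh the PC of the laptop broke")` keeps three terms, two of which have been merged into `"computer"`, mirroring the {"计算机", "计算机"} example.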

Step 402: compute the weight W of each element in the feature vector $\vec{d}$ corresponding to each session record TR. The weights are computed with the following formula:

$$W(t, \vec{d}) = \frac{tf(t, \vec{d}) \times \log(N / n_t + 0.01)}{\sqrt{\sum_{t \in \vec{d}} \left[ tf(t, \vec{d}) \times \log(N / n_t + 0.01) \right]^2}}$$

where $W(t, \vec{d})$ is the weight of word t in feature vector $\vec{d}$, $tf(t, \vec{d})$ is the frequency of word t in feature vector $\vec{d}$, N is the total number of session records TR in the sample set TS, $n_t$ is the number of session records TR in the sample set TS in which word t occurs, and the denominator is a normalization factor.
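The weight formula of step 402 can be sketched in Python as below. The function name `tfidf_vectors` and the dictionary representation of a feature vector are assumptions made for illustration; the arithmetic follows the formula, with the square-root normalization so that each vector has unit length.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: one term list per session record in a sample set.
    Returns one {term: weight} dict per record, normalized to unit length."""
    n_docs = len(docs)                                            # N
    doc_freq = Counter(t for doc in docs for t in set(doc))      # n_t
    vectors = []
    for doc in docs:
        tf = Counter(doc)                                         # tf(t, d)
        raw = {t: tf[t] * math.log(n_docs / doc_freq[t] + 0.01) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))        # normalization factor
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors
```

A word that occurs in few records of the sample set receives a larger weight than one that occurs in most of them, which is what makes the later similarity comparison sensitive to topic-specific words.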

Step 403: compute the correlation coefficients between the feature vectors corresponding to the session records, and use them to determine the K feature vectors most similar to each feature vector.

In a specific implementation, the correlation coefficient between the feature vectors of two session records is computed with the following formula:

$$Sim(d_i, d_j) = \frac{\sum_{k=1}^{M} W_{ik} \times W_{jk}}{\sqrt{\left( \sum_{k=1}^{M} W_{ik}^2 \right) \left( \sum_{k=1}^{M} W_{jk}^2 \right)}}$$

where $Sim(d_i, d_j)$ is the correlation coefficient between feature vector $d_i$ and feature vector $d_j$, and $W_{ik}$ and $W_{jk}$ are the weights of the k-th element of $d_i$ and $d_j$ respectively.

With the correlation coefficients between all feature vectors computed, the K feature vectors most correlated with each feature vector are grouped into a set for that feature vector. The value of K is chosen according to the application.
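Step 403 can be sketched in Python as follows, again representing each feature vector as a `{term: weight}` dictionary. The names `cosine` and `k_nearest` and the tie-breaking by index are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Correlation coefficient Sim(u, v) between two {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def k_nearest(vectors, k):
    """For each vector index i, return the indices of its k most similar vectors."""
    neighbors = []
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        scored.sort(key=lambda s: (-s[0], s[1]))   # highest similarity first, ties by index
        neighbors.append([j for _, j in scored[:k]])
    return neighbors
```

The list returned by `k_nearest` corresponds to the per-vector neighbor sets that step 404 consumes.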

Step 404: assign the feature vector of each session record to one of the classes in classification C to generate the classification combinations.

Classification C is the collection built from the feature vectors corresponding to the session records in the sample set TS; each of its members is a class c, that is, a set of feature vectors.

Method 1: when classification C is empty, a vector set c is generated as follows and then added to classification C:

If the feature vectors $\vec{d}_i$ and $\vec{d}_j$ of two session records each belong to the set of the other's K most similar neighbors, then $\vec{d}_i$ and $\vec{d}_j$ belong to the same class c. Class c is generated and associated with the session records corresponding to $\vec{d}_i$ and $\vec{d}_j$, and class c is then added to classification C. The feature vectors in each class c form one classification combination.
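Method 1 can be illustrated with the Python sketch below, which takes the neighbor lists from step 403 and groups vectors that are in each other's K nearest neighbors. Merging mutual pairs transitively with a union-find structure is an assumption made here to obtain larger classes; the patent text only states that a mutual pair belongs to the same class c.

```python
def mutual_knn_classes(neighbors):
    """neighbors[i] lists the K nearest indices of vector i.
    Returns a list of classes, each a set of vector indices."""
    parent = list(range(len(neighbors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i, near in enumerate(neighbors):
        for j in near:
            if i in neighbors[j]:           # i and j are in each other's K nearest
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(neighbors)):
        groups.setdefault(find(i), set()).add(i)
    # Singletons never took part in a mutual pair, so they form no class here.
    return [g for g in groups.values() if len(g) > 1]
```

Vectors left outside every class would then be handled by Method 2.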

Method 2: when classification C is not empty, compute the weight with which the feature vector $\vec{x}$ of each session record belongs to a class c (c ∈ C), using the following formula:

$$p(\vec{x}, C_j) = \sum_{\vec{d}_i \in KNN} Sim(\vec{x}, \vec{d}_i) \, y(\vec{d}_i, C_j)$$

where $\vec{x}$ is the feature vector corresponding to a session record; $\vec{d}_i$ is a feature vector in the set of the K neighbors most similar to $\vec{x}$; $Sim(\vec{x}, \vec{d}_i)$ is the correlation coefficient between $\vec{x}$ and that neighbor $\vec{d}_i$, which can be obtained from the results of step 403; and $y(\vec{d}_i, C_j)$ is the class attribute function, whose value is 1 if feature vector $\vec{d}_i$ belongs to class $C_j$ and 0 otherwise. The weights of feature vector $\vec{x}$ in the classes $C_j$ are compared, $\vec{x}$ is assigned to the class $C_j$ with the largest weight, and that class $C_j$ is associated with the session record corresponding to $\vec{x}$.
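The weighted vote of Method 2 can be sketched in Python as follows. The function names and the `min_score` cutoff are hypothetical; the cutoff is only a placeholder for deciding that a vector fits no existing class well.

```python
def class_scores(sims_to_neighbors, neighbor_class):
    """sims_to_neighbors: {neighbor_id: Sim(x, d_i)} for the K nearest neighbors of x.
    neighbor_class: {neighbor_id: class label} for neighbors already classified.
    Returns {class label: p(x, C_j)}."""
    scores = {}
    for d_i, sim in sims_to_neighbors.items():
        label = neighbor_class.get(d_i)
        if label is not None:               # y(d_i, C_j) is 1 only for d_i's own class
            scores[label] = scores.get(label, 0.0) + sim
    return scores


def assign_class(sims_to_neighbors, neighbor_class, min_score=0.0):
    """Return the class with the largest weight, or None if no class scores above min_score."""
    scores = class_scores(sims_to_neighbors, neighbor_class)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > min_score else None
```

With neighbor similarities `{1: 0.9, 2: 0.4, 3: 0.6}` and classes `{1: "A", 2: "B", 3: "B"}`, class "B" collects 0.4 + 0.6 and outweighs class "A" at 0.9, so two moderately similar neighbors can outvote one very similar neighbor.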

When Method 2 is used and the correlation between a feature vector $\vec{x}$ and every existing class c is very small, a new class c′ can be generated in the manner of Method 1, added to classification C, and associated with the session record corresponding to feature vector $\vec{x}$.

Once all feature vectors have been processed, every feature vector has been assigned to a class, and each class forms one classification combination.

In each generated classification combination, the N most frequent words, or the words whose frequency is greater than a value a, are taken as the session topic of that classification combination. The values of N and a are chosen according to the application.
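Topic selection can be sketched in Python as below. The function name `topic_words` and the use of raw occurrence counts as the "frequency" are assumptions; the patent does not say whether frequency is absolute or relative.

```python
from collections import Counter


def topic_words(class_term_lists, top_n=None, min_count=None):
    """class_term_lists: the term lists of all session records in one classification combination.
    Returns either the words occurring more than min_count times,
    or the top_n most frequent words."""
    counts = Counter(t for terms in class_term_lists for t in terms)
    if min_count is not None:
        return [t for t, c in counts.most_common() if c > min_count]
    return [t for t, _ in counts.most_common(top_n)]
```

Either selection rule can be used on its own, matching the "N most frequent words, or words with frequency greater than a" wording.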

After every sample set TS has been processed as above, the classification combinations and their session topics have been generated. To analyze the correlation between the generated session topics, the session topics are treated as one sample set for the KNN algorithm: the weight of each word within its session topic is computed, the correlation coefficients between session topics are computed from these weights with the formula of step 403, and session topics whose correlation coefficient is greater than the set threshold are merged.
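The merging of session topics can be sketched in Python as follows. The greedy single-pass strategy, the summing of weights on merge and the tuple representation of a topic are all assumptions for illustration; the patent only requires that topics whose correlation exceeds a threshold end up merged and associated with all of their session records.

```python
import math


def cosine(u, v):
    """Correlation coefficient between two {term: weight} topic vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_topics(topics, threshold):
    """topics: list of (term_weights, record_ids) pairs.
    Returns merged (term_weights, record_ids) pairs."""
    merged = []
    for weights, records in topics:
        for m_weights, m_records in merged:
            if cosine(weights, m_weights) > threshold:
                for t, w in weights.items():        # fold the topic into the existing one
                    m_weights[t] = m_weights.get(t, 0.0) + w
                m_records.update(records)           # keep every associated session record
                break
        else:
            merged.append((dict(weights), set(records)))
    return merged
```

After merging, a single keyword lookup on the merged topic reaches the session records of every topic that was folded into it.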

When the session records are presented to the user, they can be arranged by session user, or ordered by the weight of each session record within the session topic.
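The query and presentation step can be sketched in Python as below. The `topic_index` and `records` structures, the exact-match test between keyword and topic word, and the ordering by peer name are all assumptions made for illustration.

```python
def find_records(keyword, topic_index, records):
    """topic_index: {tuple of topic words: list of record ids}.
    records: {record id: (peer, text)}.
    Returns the matching records, arranged by session user."""
    hits = []
    for words, record_ids in topic_index.items():
        if keyword in words:                 # the topic matches the keyword
            hits.extend(record_ids)
    ordered = sorted(set(hits), key=lambda rid: (records[rid][0], rid))
    return [records[rid] for rid in ordered]
```

Because the lookup goes through the session topic, a record can be returned even when its own text does not contain the keyword, which is the difference from the plain keyword search described in the background.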

The embodiment above uses the KNN algorithm for correlation analysis of the sample sets, but the invention is not limited to KNN. Correlation analysis of session records can also use other training algorithms and classification methods based on the vector space, such as vector machine algorithms, neural network algorithms and Bayesian algorithms. With a Bayesian algorithm, for example, the probability that each word in the feature vector of a session record appears in a given session is computed, the probability that the feature vector belongs to that session is then computed with the Bayes formula, and the feature vector is added to the session with the highest probability.

With the invention, when a user queries the session records for information of interest, the user only needs to enter a keyword; the system automatically finds the session topics that match the keyword and presents the session records associated with those topics. This avoids the cumbersome manual search and improves query efficiency.

Obviously, those skilled in the art can make various changes and variations to the invention without departing from its spirit and scope. If such modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims (8)

1. A management method for instant communication session records is characterized by comprising the following steps:
acquiring session records of a user and classifying the session records to obtain a sample set;
generating a feature vector corresponding to each session record in the sample set, analyzing the correlation between each feature vector and other feature vectors, and classifying the feature vectors according to the correlation to generate a classification combination;
determining conversation topics corresponding to the classification combinations according to the occurrence frequency of the words in each classification combination, and enabling the conversation topics to be associated with conversation records corresponding to the classification combinations; and
searching for a conversation topic that matches a keyword entered by the user when querying, and presenting the conversation records associated with the found conversation topic to the user.
2. The method of claim 1, wherein after the conversation topics are generated, the correlation between the conversation topics is further analyzed, and the conversation topics with the correlation larger than a predetermined threshold are combined into the same conversation topic, so that the combined conversation topic is associated with the conversation records corresponding to all the combined conversation topics.
3. The method of claim 1 or 2, wherein the set of samples is generated by classifying session records by different session users.
4. The method of claim 3, wherein a sample set is further divided into a plurality of different sample sets according to the time intervals of session records in the sample set.
5. The method according to claim 1, wherein the generating a feature vector corresponding to each session record in the sample set, and analyzing the correlation between each feature vector and other feature vectors specifically includes:
performing word segmentation on each session record, deleting the words without practical meaning from the session record to obtain a set S, merging the synonyms in the set S, and performing vectorization to generate the feature vector (W1, W2, W3, ..., Wn) corresponding to the session record, where Wi is the weight of the i-th element and each element is a word in S; and
computing the feature vector corresponding to each session record, and calculating the correlation between the feature vectors according to the weight, within each feature vector, of each word forming that feature vector.
6. The method of claim 1, wherein the conversation topic corresponding to a classification combination is determined according to the words in the classification combination whose occurrence frequency exceeds a predetermined threshold.
7. An apparatus for managing instant messaging session records, comprising:
a unit for storing session records of a user;
a unit for classifying the session records to generate sample sets;
a unit for generating a feature vector corresponding to each session record in a sample set, analyzing the correlation between each feature vector and the other feature vectors, and classifying the feature vectors according to the correlation to generate classification combinations;
a unit for determining the conversation topic corresponding to each classification combination and associating the conversation topic with the session records corresponding to the classification combination; and
a unit for searching for a conversation topic that matches a keyword entered by the user in a query, and presenting the session records associated with the found conversation topic to the user.
8. The apparatus of claim 7, further comprising:
a unit for analyzing the correlation among the conversation topics, combining the conversation topics whose correlation is larger than a preset threshold into the same conversation topic, and associating the combined conversation topic with the session records corresponding to all of the combined conversation topics.
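As an illustration of the steps recited in claims 5 and 6, the following sketch builds a weighted feature vector per session record, computes the correlation between vectors, groups correlated records, and picks topic words by frequency. Several choices here are assumptions and are not taken from the claims: whitespace splitting stands in for word segmentation, word counts serve as the weights Wi, cosine similarity serves as the correlation measure, grouping is a simple greedy pass, and the stop-word list, synonym map, thresholds, and all function names are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "to", "on", "we", "at"}  # assumed stop-word list
SYNONYMS = {"movie": "film"}                             # assumed synonym map


def feature_vector(record):
    """Segment a record, drop stop words, merge synonyms, weight words by count."""
    words = [w for w in record.lower().split() if w not in STOP_WORDS]
    return Counter(SYNONYMS.get(w, w) for w in words)


def correlation(u, v):
    """Cosine similarity between two sparse weight vectors (assumed measure)."""
    dot = sum(weight * v[word] for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_records(records, threshold=0.3):
    """Greedy grouping: a record joins the first group whose first member
    correlates with it above the threshold, otherwise it starts a new group."""
    groups = []  # each group is a list of (record, vector) pairs
    for record in records:
        vec = feature_vector(record)
        for group in groups:
            if correlation(vec, group[0][1]) > threshold:
                group.append((record, vec))
                break
        else:
            groups.append([(record, vec)])
    return groups


def topic_words(group, freq_threshold=1):
    """Words whose total frequency in the group exceeds the threshold."""
    totals = Counter()
    for _, vec in group:
        totals.update(vec)
    return sorted(w for w, count in totals.items() if count > freq_threshold)


# Example records (invented for illustration).
records = [
    "we watch the movie on friday",
    "the film starts at eight",
    "friday film tickets",
    "report deadline is monday",
    "send the report monday",
]
groups = group_records(records)
topics = [topic_words(group) for group in groups]
```

With these sample records the first three fall into one group and the last two into another, and the words that recur within each group become its topic words. A real system for Chinese text would need a proper word segmenter in place of `split()`.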
CN2006101095396A 2006-08-04 2006-08-04 A management method and device for instant messaging session records Expired - Fee Related CN101119326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2006101095396A CN101119326B (en) 2006-08-04 2006-08-04 A management method and device for instant messaging session records

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2006101095396A CN101119326B (en) 2006-08-04 2006-08-04 A management method and device for instant messaging session records

Publications (2)

Publication Number Publication Date
CN101119326A CN101119326A (en) 2008-02-06
CN101119326B true CN101119326B (en) 2010-07-28

Family

ID=39055265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2006101095396A Expired - Fee Related CN101119326B (en) 2006-08-04 2006-08-04 A management method and device for instant messaging session records

Country Status (1)

Country Link
CN (1) CN101119326B (en)

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2541900C2 (en) * 2008-12-12 2015-02-20 Конинклейке Филипс Электроникс, Н.В. Comparing records focused on statements in distributed and independent medical environment
CN101483620B (en) * 2009-02-17 2012-09-26 腾讯科技(深圳)有限公司 Session reservation method and system in instant communication tool
CN101997964A (en) * 2009-08-13 2011-03-30 中国电信股份有限公司 Processing method of mobile communication terminal and contact records thereof
CN103078781A (en) * 2011-10-25 2013-05-01 国际商业机器公司 Method for instant messaging system and instant messaging system
CN102646134A (en) * 2012-03-29 2012-08-22 百度在线网络技术(北京)有限公司 Method and device for determining message session in message record
CN103425648B (en) * 2012-05-15 2016-04-13 腾讯科技(深圳)有限公司 The disposal route of relation loop and system
CN103279465B (en) * 2012-12-18 2018-05-25 北京奇虎科技有限公司 The control method and device of communication historical data
CN103279466B (en) * 2012-12-18 2018-01-26 北京奇虎科技有限公司 Method and device for controlling communication history data
CN105024906B (en) * 2014-04-21 2018-10-02 腾讯科技(深圳)有限公司 The storage of group's message, querying method and system in social networks
CN105450497A (en) * 2014-07-31 2016-03-30 国际商业机器公司 Method and device for generating clustering model and carrying out clustering based on clustering model
CN104361003A (en) * 2014-10-10 2015-02-18 金硕澳门离岸商业服务有限公司 Method and device for classified displaying of chat records
CN104462518B (en) * 2014-12-22 2018-10-19 百度在线网络技术(北京)有限公司 Method and apparatus for being labeled to IM information
CN105141502A (en) * 2015-08-12 2015-12-09 深圳前海珩昌科技有限公司 Method and device for managing instant communication process
CN105049336A (en) * 2015-08-12 2015-11-11 深圳前海珩昌科技有限公司 Method and system for processing instant communication messages, server and client
CN106487640A (en) * 2015-08-25 2017-03-08 平安科技(深圳)有限公司 Many communication modules control method and server
CN106888236B (en) * 2015-12-15 2021-08-31 腾讯科技(深圳)有限公司 Session management method and session management device
CN105589625B (en) * 2015-12-21 2020-06-02 惠州Tcl移动通信有限公司 Processing method and device of social media message and communication terminal
CN105959205A (en) * 2016-04-29 2016-09-21 杨夫春 Chatting records keeping method
CN106599147A (en) * 2016-12-06 2017-04-26 庄爱芹 Method and device for browser browsing history management
CN106777013B (en) * 2016-12-07 2020-09-11 科大讯飞股份有限公司 Dialogue management method and device
CN108737240A (en) * 2017-04-18 2018-11-02 阿里巴巴集团控股有限公司 The method that the method, apparatus and group that chat group automatically creates create
CN111357245B (en) * 2017-11-15 2022-08-09 华为技术有限公司 Information searching method, terminal, network equipment and system
CN111698143B (en) * 2019-03-14 2022-12-16 阿里巴巴集团控股有限公司 Information processing method, information display method and device
CN110138645B (en) * 2019-03-29 2021-06-18 腾讯科技(深圳)有限公司 Session message display method, device, equipment and storage medium
CN110781930A (en) * 2019-10-14 2020-02-11 西安交通大学 A method and system for user portrait grouping and behavior analysis based on network security device log data
CN112769673A (en) * 2019-11-05 2021-05-07 钉钉控股(开曼)有限公司 Communication record generation, recommendation and display method and device
CN111327518B (en) * 2020-01-21 2022-10-11 上海掌门科技有限公司 Method and device for splicing messages
CN111708866B (en) * 2020-08-24 2020-12-11 北京世纪好未来教育科技有限公司 Session segmentation method, apparatus, electronic device and storage medium
CN111798870A (en) * 2020-09-08 2020-10-20 共道网络科技有限公司 Session link determining method, device and equipment and storage medium
CN113113017B (en) * 2021-04-08 2024-04-09 百度在线网络技术(北京)有限公司 Audio processing method and device
CN113595886A (en) * 2021-07-29 2021-11-02 北京达佳互联信息技术有限公司 Instant messaging message processing method and device, electronic equipment and storage medium
CN114691830B (en) * 2022-03-31 2022-12-20 江苏冬云云计算股份有限公司 Network security analysis method and system based on big data
WO2025260343A1 (en) * 2024-06-20 2025-12-26 抖音视界有限公司 Information processing method and apparatus, device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1584883A (en) * 2004-05-27 2005-02-23 威盛电子股份有限公司 Relational file link management system, method and recording medium
CN1609859A (en) * 2004-11-26 2005-04-27 孙斌 Search result clustering method
CN1741012A (en) * 2004-08-23 2006-03-01 富士施乐株式会社 Text retrieval device and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JP 2005-173847 A (Japanese laid-open patent publication), 2005.06.30
Same as above.

Also Published As

Publication number Publication date
CN101119326A (en) 2008-02-06

Similar Documents

Publication Publication Date Title
CN101119326B (en) A management method and device for instant messaging session records
WO2022142027A1 (en) Knowledge graph-based fuzzy matching method and apparatus, computer device, and storage medium
CN107391687B (en) A Hybrid Recommendation System for Local Chronicle Websites
CN103838833B (en) Text retrieval system based on correlation word semantic analysis
KR100544514B1 (en) Method and system for determining search query relevance
US6925433B2 (en) System and method for context-dependent probabilistic modeling of words and documents
CN104199965B (en) Semantic information retrieval method
CN108846029B (en) Information correlation analysis method based on knowledge graph
CN104077407B (en) A kind of intelligent data search system and method
US9971828B2 (en) Document tagging and retrieval using per-subject dictionaries including subject-determining-power scores for entries
CN107193883B (en) Data processing method and system
CN102591948A (en) Method and system for improving search results based on user behavior analysis
WO2019196259A1 (en) Method for identifying false message and device thereof
CN108804577B (en) Method for estimating interest degree of information tag
CN119336809A (en) A method and system for retrieving petroleum business data assets
CN113177061B (en) Searching method and device and electronic equipment
CN119577124A (en) A method and device for information retrieval and guidance based on big data software system
CN107562774A (en) Generation method, system and the answering method and system of rare foreign languages word incorporation model
CN115409019A (en) A similar case retrieval method based on domain graph perception
CN106021423A (en) Group division-based meta-search engine personalized result recommendation method
TWI446191B (en) Word matching and information query method and device
Jain et al. Building query optimizers for information extraction: the sqout project
JP5292336B2 (en) Knowledge amount estimation device, knowledge amount estimation method, and knowledge amount estimation program for each field of search system users
CN106599147A (en) Method and device for browser browsing history management
TWI483129B (en) Retrieval method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100728
