CN115114393A

CN115114393A - A Lucene-based Peer Customer Retrieval Method

Info

Publication number: CN115114393A
Application number: CN202210823146.0A
Authority: CN
Inventors: 赵亮亮
Original assignee: Focus Technology Co Ltd
Current assignee: Focus Technology Co Ltd
Priority date: 2022-07-13
Filing date: 2022-07-13
Publication date: 2022-09-27

Abstract

The invention discloses a method for retrieving peer customers based on Lucene, which mainly comprises the steps of constructing a feature index library of a first customer, identifying the features of the second customer, calculating the similarity between the features of the second customer and the features of the first customer, and recalling the results. Manual ranking. On the basis of the general Lucene-based full-text retrieval function, through feature engineering, the features that can represent the first customer and the features that can represent the second customer can be identified, and the similarity algorithm and manual ranking algorithm of the present invention can be extremely The effect of greatly improving the recall accuracy and retrieval efficiency of peer customers.

Description

A Lucene-based Peer Customer Retrieval Method

技术领域technical field

本发明涉及信息检索领域，特别是涉及一种基于Lucene的同行客户检索方法。The invention relates to the field of information retrieval, in particular to a Lucene-based peer customer retrieval method.

背景技术Background technique

在销售过程下，同行刺激是常用的销售谈判技巧。向客户推荐已经签约的同行名企，会有助于销售在谈判过程中打动客户，达到与其合作的目的。In the sales process, peer stimulation is a commonly used sales negotiation technique. Recommending well-known peer companies that have signed contracts to customers will help sales to impress customers in the negotiation process and achieve the purpose of cooperating with them.

现有的公开技术中，同行客户检索没有标准化的解决方案，常见的同行客户检索方法有利用公开渠道检索同行客户和自建同行客户检索平台两种，这些方案有以下缺点：In the existing public technology, there is no standardized solution for peer customer retrieval. Common peer customer retrieval methods include using public channels to retrieve peer customers and self-built peer customer retrieval platforms. These solutions have the following shortcomings:

1、通过公开渠道检索同行客户，信息层面依托外部渠道和客户主动行为，需要进行信息的二次加工处理，获取同行客户的效率及其低下。1. Retrieving peer customers through open channels, relying on external channels and customer active behavior at the information level, requires secondary processing of information, and the efficiency of acquiring peer customers is extremely low.

2、自建同行客户检索平台，通过构建已合作客户的信息库，通过传统关系型数据库，进行关键词匹配，存在检索效率低下的问题，无法应对规模场景下的同行客户检索。2. The self-built peer-to-peer customer retrieval platform, through the construction of the information base of the cooperative customers, through the traditional relational database, to carry out keyword matching, has the problem of low retrieval efficiency, and cannot cope with the peer-to-peer customer retrieval in large-scale scenarios.

3、基于上条很容易想到，对自建同行客户检索平台方案进行升级优化，通过采用全文检索的方案对分词匹配检索效率进行提升，但经过实践，特别是涉及全品类的行业和大量跨地域客户的平台中，单纯地依托分词匹配得到的同行客户准确性无法保证。3. Based on the previous article, it is easy to think of upgrading and optimizing the self-built peer-to-peer customer retrieval platform solution, and improving the efficiency of word segmentation matching retrieval by using the full-text retrieval solution. However, after practice, especially in industries involving all categories and a large number of cross-region In the customer's platform, the accuracy of peer customers obtained simply by word segmentation matching cannot be guaranteed.

因此，需要一种更高效便捷兼具高准确性的基于全文检索引擎的同行客户检索方法。Therefore, a more efficient, convenient and high-accuracy peer-to-peer customer retrieval method based on a full-text retrieval engine is required.

发明内容SUMMARY OF THE INVENTION

本发明所要解决的技术问题是克服现有技术的不足，提供一种高效便捷兼具高准确性的基于Lucene的同行客户检索方法。The technical problem to be solved by the present invention is to overcome the deficiencies of the prior art and provide a Lucene-based peer customer retrieval method that is efficient, convenient and highly accurate.

为解决上述技术问题，本发明提供一种基于Lucene的同行客户检索方法，主要是构建第一客户的特征索引库，识别第二客户的特征，对第二客户的特征与第一客户特征进行相似度计算，召回结果人工排名。其特征在于，包括以下步骤：In order to solve the above-mentioned technical problems, the present invention provides a method for retrieving peer customers based on Lucene, which mainly includes constructing a feature index library of the first customer, identifying the features of the second customer, and comparing the features of the second customer with the features of the first customer. Degree calculation, recall results are manually ranked. It is characterized in that, comprises the following steps:

步骤1：构建第一客户的特征索引库，具体为：利用第一客户的公司基本信息、行业信息和营销信息，提取特征词并构建基于Lucene的第一客户特征索引库；Step 1: constructing a feature index library of the first customer, specifically: using the company basic information, industry information and marketing information of the first customer to extract feature words and build a Lucene-based first customer feature index library;

步骤2：建立第二客户的行业特征词与关联目录的关联，包括：识别第二客户的行业特征词，获取第二客户的行业特征词的关联目录和对应的关联度；Step 2: establishing the association between the industry feature word of the second customer and the associated catalog, including: identifying the industry feature word of the second customer, and obtaining the associated catalog and the corresponding association degree of the industry feature word of the second customer;

步骤3：计算第二客户与每一个第一客户之间的相似度得分，所述相似度得分是对第二客户的行业特征词分别与第一客户的公司基本信息、行业信息和营销信息进行相似度计算所得的三个相似度分值的总和，包括：第二客户的行业特征词与第一客户的公司基本信息之间的相似度，第二客户的行业特征词与第一客户的行业信息之间的相似度，第二客户的行业特征词与第一客户的营销信息之间的相似度；Step 3: Calculate the similarity score between the second customer and each first customer. The similarity score is calculated by comparing the industry feature words of the second customer with the company basic information, industry information and marketing information of the first customer. The sum of the three similarity scores obtained by similarity calculation, including: the similarity between the industry feature word of the second customer and the basic company information of the first customer, the industry feature word of the second customer and the industry of the first customer The similarity between the information, the similarity between the industry feature words of the second customer and the marketing information of the first customer;

步骤4：召回结果重新排名，对于步骤3得出的相似度分值的基础上，基于预设的排名优化规则优化排名；所述排名优化方案包括根据第一客户的属性信息进行二次排序，属性信息包括客户等级和客户收到询盘数量，所述客户等级和客户收到询盘数量是数据库中的预设参数，在本步骤中作为词典数据使用。Step 4: Re-rank the recall results. On the basis of the similarity score obtained in Step 3, optimize the ranking based on a preset ranking optimization rule; the ranking optimization scheme includes performing secondary sorting according to the attribute information of the first customer, The attribute information includes the customer level and the number of inquiries received by the customer. The customer level and the number of inquiries received by the customer are preset parameters in the database and are used as dictionary data in this step.

所述步骤1具体包括：The step 1 specifically includes:

步骤11：收集第一客户的公司基本信息，所述公司基本信息包括公司关键词；Step 11: Collect basic company information of the first client, where the basic company information includes company keywords;

步骤12：对公司基本信息进行分词，整理成{词元,…}格式的文本，构建第一客户特征索引库的索引字段，其中，词元为文本分词结果的最小单元；Step 12: perform word segmentation on the basic information of the company, organize it into a text in the format of {word element,...}, and construct the index field of the first customer feature index library, wherein the word element is the smallest unit of the text word segmentation result;

步骤13：收集第一客户的行业信息，所述行业信息包括：主营产品名称、主营产品关键字、主营产品目录；Step 13: Collect industry information of the first customer, the industry information includes: main product name, main product keyword, main product catalog;

步骤14：按照主营产品目录作为统计维度，对步骤13中收集到的行业信息，对进行分词处理，整理成{主营产品目录:{词元:词频},…}格式的文本，构建第一客户的特征索引库的索引字段，其中，词元为文本分词结果的最小单元，词频为词元在对应主营产品目录统计维度中出现的次数，在一个主营产品中多次出现的词元的词频只记录1次，词频的最大值是对应主营产品目录下主营产品的数量总和；Step 14: According to the main product catalog as a statistical dimension, the industry information collected in step 13 is processed by word segmentation, and sorted into a text in the format of {main product catalog: {word element: word frequency},...} to construct the first An index field of a customer's feature index library, where the word element is the smallest unit of the text segmentation result, the word frequency is the number of times the word element appears in the statistical dimension of the corresponding main product catalog, and the word that appears multiple times in a main product The word frequency of the meta is only recorded once, and the maximum word frequency is the sum of the number of main products in the corresponding main product catalog;

步骤15：收集第一客户的营销信息，所述营销信息包括被购买的营销推广关键词；Step 15: Collect marketing information of the first customer, where the marketing information includes the purchased marketing promotion keywords;

步骤16；对步骤15收集到的营销信息，不做分词处理，整理成{营销词,…}格式的文本，构建第一客户的特征索引库的索引字段；Step 16: The marketing information collected in step 15 is processed into a {marketing word,...} format without word segmentation, and an index field of the feature index library of the first customer is constructed;

步骤17：基于索引字段，完成第一客户的特征索引库构建。Step 17: Complete the construction of the feature index library of the first customer based on the index field.

所述步骤2具体包括：The step 2 specifically includes:

步骤21：基于从预设的行业特征词词库，对第二客户的任意主营产品名称，通过逆向最大分词算法，识别其行业特征词，作为第二客户的行业特征词；Step 21: Based on the preset industry feature word lexicon, for any main product name of the second customer, through the reverse maximum word segmentation algorithm, identify its industry feature word as the industry feature word of the second customer;

步骤22：采用预训练语言模型Bert搭建第一客户的主营产品信息和目录的分类模型；Step 22: Use the pre-trained language model Bert to build a classification model of the first customer's main product information and catalog;

步骤23：向步骤23的模型输入步骤22识别出的第二客户的行业特征词，返回模型预测分类概率最高的三个目录码，将分类概率作为行业特征词与对应目录之间的关联度；Step 23: input the industry characteristic word of the second customer identified in step 22 to the model of step 23, return the three catalog codes with the highest predicted classification probability by the model, and use the classification probability as the degree of association between the industry characteristic word and the corresponding catalog;

步骤24：输出预测结果并整理成{第二客户的行业特征词:{关联目录:关联度,…},…}的文本格式。Step 24: Output the prediction result and organize it into a text format of {industry characteristic word of the second customer: {association directory: association degree,...},...}.

所述步骤3具体包括：The step 3 specifically includes:

步骤31：所述步骤3中第二客户的行业特征与第一客户的行业信息之间的相似度的计算，包括公式：

其中i∈[1,n]，n为第二客户的行业特征词的关联目录与匹配上的第一客户的主营产品目录匹配的个数，catboost为第二客户的行业特征词与主营产品目录之间的关联度，sum(fre)为第二客户的行业特征词相匹配的主营产品词元对应的词频的累加值，totalProdNum为匹配上的第一客户的主营产品总数；Step 31: Calculation of the similarity between the industry characteristics of the second customer and the industry information of the first customer in the step 3, including the formula:

where i∈[1,n], n is the number of matches between the associated catalog of the second customer's industry feature word and the matching main product catalog of the first customer, catboost is the second customer's industry feature word and the main product catalog The degree of correlation between product catalogs, sum(fre) is the cumulative value of the word frequency corresponding to the main product word that matches the second customer's industry feature word, and totalProdNum is the total number of main products of the first customer that matches;

步骤32：所述步骤3中第二客户的行业特征与第一客户的公司基本信息之间的相似度的打分公式为：score_ck＝n'*1000,其中n'为第二客户行业特征词与第一客户公司基本信息之间的匹配次数；Step 32: The scoring formula for the similarity between the industry characteristics of the second customer and the basic company information of the first customer in the step 3 is: score_ck=n'*1000, where n' is the industry characteristic word of the second customer and the The number of matches between the basic information of the first client company;

步骤33：步骤3中的第二客户的行业特征与第一客户的营销信息之间的相似度的打分公式为：score_tr＝n”*10000,其中n”为第二客户行业特征词与第一客户营销信息之间的匹配次数；Step 33: The scoring formula for the similarity between the industry characteristics of the second customer and the marketing information of the first customer in step 3 is: score_tr=n”*10000, where n” is the industry characteristic word of the second customer and the first customer’s marketing information. Number of matches between customer marketing messages;

步骤34：结合步骤31，步骤32和步骤33，第二客户特征与第一客户之间的总相似度分值为：score_final＝score_mpf+score_ck+score_tr。Step 34: Combining step 31, step 32 and step 33, the total similarity score between the second customer feature and the first customer is: score_final=score_mpf+score_ck+score_tr.

所述步骤4包括：The step 4 includes:

步骤41：所述排名优化规则为：根据步骤34的score_final分值，划分出分值区间[0～1]，[1～1000]，[1000～]，对于相似度分值落在[1～1000]，[1000～]两个区间的匹配结果，按照第一客户等级，第一客户收到询盘数量进行二次排序；Step 41: The ranking optimization rule is: according to the score_final score in step 34, divide the score interval [0～1], [1～1000], [1000～], for the similarity score falls in [1～ 1000], [1000～] The matching results of the two intervals, according to the first customer level, the number of inquiries received by the first customer is secondarily sorted;

步骤42：步骤41的排序结果为第二客户最终的同行检索结果。Step 42: The sorting result of Step 41 is the final peer search result of the second client.

本发明所达到的有益效果:在一般基于Lucene的全文检索的基础功能之上，通过特征工程，识别能够代表第一客户的特征，以及能够代表第二客户的特征，通过本发明的相似度算法和人工排名算法，能够极大提高同行客户的召回准确率和检索效率。Beneficial effects achieved by the present invention: On the basis of the general Lucene-based full-text retrieval function, through feature engineering, the features that can represent the first customer and the features that can represent the second customer are identified, and the similarity algorithm of the present invention is used. And artificial ranking algorithm, can greatly improve the recall accuracy and retrieval efficiency of peer customers.

附图说明Description of drawings

图1为本发明的一种示例性实施例的方法流程简图。FIG. 1 is a schematic flowchart of a method of an exemplary embodiment of the present invention.

具体实施方式Detailed ways

Lucene全文检索引擎是业界较为常用的文本检索方案，利用倒排索引技术，可以显著提升文档检索效率，同时，利用其自定义的打分插件，可以实现自定义打分和自定义人工排名，具有较好的灵活性。The Lucene full-text retrieval engine is a commonly used text retrieval solution in the industry. Using the inverted index technology can significantly improve the efficiency of document retrieval. At the same time, using its custom scoring plug-in, it can achieve custom scoring and custom manual ranking, with better performance. flexibility.

因此，本发明基于Lucene全文检索引擎的索引构建技术，构建客户特征索引，并通过自定义的打分算法和人工排名算法，实现高效、精确检索同行客户的效果。Therefore, the present invention builds a customer feature index based on the index construction technology of the Lucene full-text search engine, and achieves the effect of efficiently and accurately retrieving peer customers through a self-defined scoring algorithm and manual ranking algorithm.

以下结合附图和实施例，具体阐述本发明实施方案。所描述的实施例仅为示例，基于本发明技术实质所做的改变或等同变化，仍落入本发明保护范围。The embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples. The described embodiments are only examples, and changes or equivalent changes made based on the technical essence of the present invention still fall within the protection scope of the present invention.

参阅图1所示，本发明同行客户检索方法流程，具体步骤为：Referring to Fig. 1, the process flow of the peer customer retrieval method of the present invention, the specific steps are:

步骤1：构建第一客户的特征索引库，具体为：利用第一客户的公司基本信息、行业信息和营销信息，提取特征词并构建基于Lucene的第一客户特征索引库，记为collection；Step 1: Construct the feature index library of the first customer, specifically: using the company basic information, industry information and marketing information of the first customer to extract feature words and build a Lucene-based first customer feature index library, which is recorded as collection;

步骤11：收集第一客户的公司基本信息，所述公司基本信息包括公司关键词等。由于公司关键词是最能代表客户特征的信息，是客户自己对自己进行关键特征提取，将客户关键词作为客户特征具有较好的应用效果。Step 11: Collect basic company information of the first client, where the basic company information includes company keywords and the like. Since company keywords are the information that can best represent the characteristics of customers, and it is the customers themselves who extract key features for themselves, it has better application effects to use customer keywords as customer characteristics.

步骤12：对公司基本信息进行分词，整理成{词元,…}格式的文本，构建第一客户特征索引库的索引字段，其中，词元为文本分词结果的最小单元。Step 12: Perform word segmentation on the basic information of the company, organize it into a text in the format of {word element,...}, and construct an index field of the first customer feature index library, wherein the word element is the smallest unit of the text word segmentation result.

步骤13：收集第一客户的行业信息，所述行业信息包括：主营产品名称、主营产品关键字、主营产品目录；例如：{C11,C11,C22,C33}Step 13: Collect industry information of the first customer, the industry information includes: main product name, main product keyword, main product catalog; for example: {C11, C11, C22, C33}

例如，客户主营产品有两大类，分布在001，002两个目录中。For example, the customer's main products have two categories, which are distributed in the 001 and 002 catalogs.

001目录下发布了10个产品，经过分词处理后，得到C11,C12,C13三个词元，三个词元分别在10个产品中出现的次数分布如下：There are 10 products released in the 001 directory. After word segmentation, three tokens C11, C12, and C13 are obtained. The frequency distribution of the three tokens in the 10 products is as follows:

C11C11 C12C12 C13C13 产品1Product 1 22 55 1313 产品2Product 2 1010 1818 88 产品3Product 3 4040 33 66 产品4Product 4 00 22 1515 产品5Product 5 77 00 1212 产品6Product 6 00 3030 77 产品7Product 7 22 1818 33 产品8Product 8 00 99 44 产品9Product 9 1414 00 2525 产品10Product 10 00 55 1919 词频统计Word frequency statistics 66 88 1010

注：上述表格，横轴为词元，纵轴为目录产品，每一个单元格代表词元在产品中出现的次数，根据上述统计结果，可以得到如下信息汇总：{001:{C11:6,C12:8,C13:10}}。Note: In the above table, the horizontal axis is the word element, and the vertical axis is the catalog product. Each cell represents the number of times the word appears in the product. According to the above statistical results, the following information can be summarized: {001:{C11:6, C12:8, C13:10}}.

002目录下发布了15个产品，经过分词处理后，得到C11,C22,C23三个词元，三个词元分别在15个产品中出现的次数分布如下：There are 15 products released in the 002 directory. After the word segmentation process, three tokens C11, C22, and C23 are obtained. The frequency distribution of the three tokens in the 15 products is as follows:

注：上述表格，横轴为词元，纵轴为目录产品，每一个单元格代表词元在产品中出现的次数，根据上述统计结果，可以得到如下信息汇总：{002:{C11:15,C22:9,C23:11}}。Note: In the above table, the horizontal axis is the word element, and the vertical axis is the catalog product. Each cell represents the number of times the word appears in the product. According to the above statistical results, the following information can be summarized: {002:{C11:15, C22:9, C23:11}}.

经过本步骤处理后，最终整理成如下信息格式：After processing in this step, it is finally sorted into the following information format:

{001:{C11:6,C12:8,C13:10},002:{C11:15,C22:9,C23:11}}；{001:{C11:6,C12:8,C13:10},002:{C11:15,C22:9,C23:11}};

步骤15：收集第一客户的营销信息，所述营销信息包括被购买的营销推广关键词等。此信息是客户目前最推荐的特征，在所有信息中代表性最强；Step 15: Collect marketing information of the first customer, where the marketing information includes purchased marketing promotion keywords and the like. This information is currently the most recommended feature by customers and is the most representative of all information;

步骤16；对步骤15收集到的营销信息，不做分词处理，整理成{营销词,…}格式的文本，构建第一客户的特征索引库的索引字段；例如：{C11}；Step 16: The marketing information collected in Step 15 is not processed by word segmentation, and is organized into texts in the format of {marketing words,...}, and the index field of the feature index database of the first customer is constructed; for example: {C11};

步骤17：基于索引字段，完成第一客户的特征索引库构建；Step 17: Complete the construction of the feature index library of the first customer based on the index field;

步骤24：输出预测结果并整理成{第二客户的行业特征词:{关联目录:关联度,…},…}的文本格式，例如：第二客户的特征词为C11，输入上述模型后，输出{002:0.65,001:0.12}；Step 24: Output the prediction result and organize it into a text format of {industry characteristic word of the second customer: {association directory: relevance,...},...}, for example: the characteristic word of the second customer is C11, after entering the above model, output {002:0.65,001:0.12};

步骤3：计算第二客户与每一个第一客户之间的相似度得分，所述相似度得分是对第二客户的行业特征词分别与第一客户的公司基本信息、行业信息和营销信息进行相似度计算所得的三个相似度分值的总和，具体为：第二客户的行业特征词与B2B平台第一客户的公司基本信息之间的相似度，第二客户的行业特征词与B2B平台第一客户的行业信息之间的相似度，第二客户的行业特征词与B2B平台第一客户的营销信息之间的相似度；Step 3: Calculate the similarity score between the second customer and each first customer. The similarity score is calculated by comparing the industry feature words of the second customer with the company basic information, industry information and marketing information of the first customer. The sum of the three similarity scores obtained from the similarity calculation, specifically: the similarity between the industry feature word of the second customer and the basic company information of the first customer on the B2B platform, the industry feature word of the second customer and the B2B platform. The similarity between the industry information of the first customer, the similarity between the industry feature words of the second customer and the marketing information of the first customer of the B2B platform;

步骤31：所述步骤3中第二客户的行业特征与B2B平台第一客户的行业信息之间的相似度的计算，包括公式：

其中i∈[1,n]，n为第二客户的行业特征词的关联目录与匹配上的第一客户的主营产品目录匹配的个数，catboost为第二客户的行业特征词与B2B平台的主营产品目录之间的关联度，sum(fre)为第二客户的行业特征词相匹配的主营产品词元对应的词频的累加值，totalProdNum为匹配上的第一客户的主营产品总数，整体分值*10是为了控制分值区间，作为第一档；例如:对于第二客户的C1特征词和关联目录信息{001:0.65,002:0.12}，此时会通过C1命中第一客户，将特征信息统计数据带入公式，得到：score_mpf＝(0.65*(15/15)+0.12*(6/10))*10＝7.22；Step 31: Calculation of the similarity between the industry characteristics of the second customer and the industry information of the first customer of the B2B platform in the step 3, including the formula:

where i∈[1,n], n is the number of matches between the associated catalog of the second customer's industry feature word and the matching first customer's main product catalog, catboost is the second customer's industry feature word and the B2B platform The degree of association between the main product catalogs, sum(fre) is the cumulative value of the word frequency corresponding to the main product word that matches the industry feature word of the second customer, and totalProdNum is the matching main product of the first customer The total number, the overall score*10 is used to control the score range, as the first grade; for example: for the second customer's C1 feature word and associated directory information {001:0.65,002:0.12}, at this time, the C1 will hit the first A customer, put the feature information statistics into the formula, and get: score_mpf=(0.65*(15/15)+0.12*(6/10))*10=7.22;

步骤32：所述步骤3中第二客户的行业特征与B2B平台第一客户的公司基本信息之间的相似度的打分公式为：score_ck＝n'*1000,其中n'为第二客户行业特征词与第一客户公司基本信息之间的匹配次数；例如：对于第二客户的C1特征词，命中第一客户的公司基本信息，score_ck＝2*1000＝2000；Step 32: The scoring formula for the similarity between the industry characteristics of the second customer and the company basic information of the first customer on the B2B platform in the step 3 is: score_ck=n'*1000, where n' is the industry characteristic of the second customer The number of matches between the word and the company's basic information of the first customer; for example, for the C1 feature word of the second customer, if the basic company information of the first customer is hit, score_ck=2*1000=2000;

步骤33：步骤3中的第二客户的行业特征与B2B平台第一客户的营销信息之间的相似度的打分公式为：score_tr＝n”*10000,其中n”为第二客户行业特征词与第一客户营销信息之间的匹配次数；例如：对于第二客户的C1特征词，命中第一客户的营销信息，score_tr＝1*10000＝10000；Step 33: The scoring formula for the similarity between the industry characteristics of the second customer in step 3 and the marketing information of the first customer on the B2B platform is: score_tr=n”*10000, where n” is the industry characteristic word of the second customer and the The number of matches between the marketing information of the first customer; for example, for the C1 feature word of the second customer, if the marketing information of the first customer is hit, score_tr=1*10000=10000;

步骤34：结合步骤31，步骤32和步骤33，第二客户特征与第一客户之间的总相似度分值为：score_final＝score_mpf+score_ck+score_tr；例如：对于第二客户的C1特征词和关联目录信息{001:0.65,002:0.12}，命中第一客户，最后总的相似度得分为：Step 34: Combined with step 31, step 32 and step 33, the total similarity score between the second customer feature and the first customer is: score_final=score_mpf+score_ck+score_tr; for example: for the second customer’s C1 feature word and Associate directory information {001:0.65,002:0.12}, hit the first customer, and the final total similarity score is:

score_final＝7.22+2000+10000＝12007.22score_final=7.22+2000+10000=12007.22

步骤4：召回结果重新排名，对于步骤3得出的相似度分值的基础上，基于预设的排名优化规则优化排名；所述排名优化方案包括根据第一客户的属性信息进行二次排序，其中属性信息包括第一客户等级和第一客户收到询盘数量。所述客户等级和客户收到询盘数量是B2B平台数据库的客户属性信息，属于预设参数，在本步骤中作为词典数据使用。Step 4: Re-rank the recall results. On the basis of the similarity score obtained in Step 3, optimize the ranking based on a preset ranking optimization rule; the ranking optimization scheme includes performing secondary sorting according to the attribute information of the first customer, The attribute information includes the first customer level and the number of inquiries received by the first customer. The customer level and the number of inquiries received by the customer are the customer attribute information of the B2B platform database, which are preset parameters, and are used as dictionary data in this step.

步骤41：所述排名优化规则为：根据步骤34的score_final分值，划分出分值区间[0～1]，[1～1000]，[1000～]，对于相似度分值落在[1～1000]，[1000～]两个区间的匹配结果，按照第一客户等级，第一客户收到询盘数量进行二次排序；经实践对比，该区间值的设定相对于随意设定的其他值数，可以更精准高效的对检索结果进行排序，排序结果更符合预期。例如：对于第二客户的C1特征词和关联目录信息{001:0.65,002:0.12}，命中多个第一客户，分布如下：Step 41: The ranking optimization rule is: according to the score_final score in step 34, divide the score interval [0～1], [1～1000], [1000～], for the similarity score falls in [1～ 1000], [1000～] The matching results of the two intervals, according to the first customer level, the number of inquiries received by the first customer is sorted secondarily; after practical comparison, the setting of this interval value is compared with other arbitrarily set values. The number of values can sort the search results more accurately and efficiently, and the sorting results are more in line with expectations. For example, for the C1 feature word of the second customer and the associated directory information {001:0.65,002:0.12}, multiple first customers are hit, and the distribution is as follows:

score_finalscore_final 客户等级Customer level 收到询盘数量Number of inquiries received 第一客户1first customer 1 12007.2212007.22 22 100100 第一客户2first client 2 10010.5310010.53 11 6060 第一客户3first customer 3 5022.855022.85 11 8080 第一客户4first customer 4 2009.212009.21 33 9090 第一客户5first customer 5 1008.421008.42 22 4646 第一客户6first customer 6 55.3655.36 11 1414 第一客户7first customer 7 12.1512.15 22 23twenty three

经过重新排名之后，得到如下最终排名：After re-ranking, the final ranking is as follows:

score_finalscore_final 客户等级Customer level 收到询盘数量Number of inquiries received 第一客户1first customer 1 2009.212009.21 33 9090 第一客户2first client 2 12007.2212007.22 22 100100 第一客户3first customer 3 1008.421008.42 22 4646 第一客户4first customer 4 5022.855022.85 11 8080 第一客户5first customer 5 10010.5310010.53 11 6060 第一客户6first customer 6 12.1512.15 22 23twenty three 第一客户7first customer 7 55.3655.36 11 1414

本发明主要用于提供一种客户检索方法，在一般基于Lucene的全文检索的基础功能之上，通过特征工程，识别能够代表第一客户的特征，以及能够代表第二客户的特征，通过本发明的相似度算法和人工排名算法，能够极大提高同行客户的召回准确率和检索效率。The present invention is mainly used to provide a customer retrieval method. Based on the basic function of the general Lucene-based full-text retrieval, through feature engineering, the features that can represent the first customer and the features that can represent the second customer are identified. The similarity algorithm and manual ranking algorithm can greatly improve the recall accuracy and retrieval efficiency of peer customers.

以上实施例不以任何方式限定本发明，凡是对以上实施例以等效变换方式做出的其它改进与应用，都属于本发明的保护范围。The above embodiments do not limit the present invention in any way, and all other improvements and applications made in the form of equivalent transformations to the above embodiments belong to the protection scope of the present invention.

Claims

1. a kind of peer customer retrieval method based on Lucene, is characterized in that, comprises the following steps:

Step 1: constructing a feature index library of the first customer, specifically: using the company basic information, industry information and marketing information of the first customer to extract feature words and build a Lucene-based first customer feature index library;

Step 2: establishing the association between the industry feature word of the second customer and the associated catalog, including: identifying the industry feature word of the second customer, and obtaining the associated catalog and the corresponding association degree of the industry feature word of the second customer;

Step 3: Calculate the similarity score between the second customer and each first customer. The similarity score is calculated by comparing the industry feature words of the second customer with the company basic information, industry information and marketing information of the first customer. The sum of the three similarity scores obtained by similarity calculation, including: the similarity between the industry feature word of the second customer and the basic company information of the first customer, the industry feature word of the second customer and the industry of the first customer The similarity between the information, the similarity between the industry feature word of the second customer and the marketing information of the first customer;

Step 4: Re-rank the recall results. On the basis of the similarity score obtained in Step 3, optimize the ranking based on a preset ranking optimization rule; the ranking optimization scheme includes performing secondary sorting according to the attribute information of the first customer, The attribute information includes the customer level and the number of inquiries received by the customer. The customer level and the number of inquiries received by the customer are preset parameters in the database and are used as dictionary data in this step.

2. a kind of peer customer retrieval method based on Lucene as claimed in claim 1 is characterized in that: described step 1 specifically comprises:

Step 11: Collect basic company information of the first client, where the basic company information includes company keywords;

Step 12: perform word segmentation on the basic information of the company, organize it into a text in the format of {word element,...}, and construct the index field of the first customer feature index library, wherein the word element is the smallest unit of the text word segmentation result;

Step 13: Collect industry information of the first customer, the industry information includes: main product name, main product keyword, main product catalog;

Step 14: According to the main product catalog as a statistical dimension, the industry information collected in step 13 is processed by word segmentation, and sorted into a text in the format of {main product catalog: {word element: word frequency},...} to construct the first An index field of a customer's feature index library, where the word element is the smallest unit of the text segmentation result, the word frequency is the number of times the word element appears in the statistical dimension of the corresponding main product catalog, and the word that appears multiple times in a main product The word frequency of the meta is only recorded once, and the maximum word frequency is the sum of the number of main products in the corresponding main product catalog;

Step 15: Collect marketing information of the first customer, where the marketing information includes the purchased marketing promotion keywords;

Step 16: The marketing information collected in step 15 is processed into a {marketing word,...} format without word segmentation, and an index field of the feature index library of the first customer is constructed;

Step 17: Complete the construction of the feature index library of the first customer based on the index field.

3. a kind of peer customer retrieval method based on Lucene as claimed in claim 2 is characterized in that: described step 2 specifically comprises:

Step 21: Based on the preset industry feature word lexicon, for any main product name of the second customer, through the reverse maximum word segmentation algorithm, identify its industry feature word as the industry feature word of the second customer;

Step 22: Use the pre-trained language model Bert to build a classification model of the first customer's main product information and catalog;

Step 23: input the industry characteristic word of the second customer identified in step 22 to the model of step 23, return the three catalog codes with the highest predicted classification probability by the model, and use the classification probability as the degree of association between the industry characteristic word and the corresponding catalog;

Step 24: Output the prediction result and organize it into a text format of {industry characteristic word of the second customer: {association directory: association degree,...},...}.

4. a kind of peer customer retrieval method based on Lucene as claimed in claim 3 is characterized in that: described step 3 specifically comprises:

Step 31: Calculation of the similarity between the industry characteristics of the second customer and the industry information of the first customer in the step 3, including the formula:

Step 32: The scoring formula for the similarity between the industry characteristics of the second customer and the basic company information of the first customer in the step 3 is: score_ck=n'*1000, where n' is the industry characteristic word of the second customer and the The number of matches between the basic information of the first client company;

Step 33: The scoring formula for the similarity between the industry characteristics of the second customer and the marketing information of the first customer in step 3 is: score_tr=n”*10000, where n” is the industry characteristic word of the second customer and the first customer’s marketing information. Number of matches between customer marketing messages;

Step 34: Combining step 31, step 32 and step 33, the total similarity score between the second customer feature and the first customer is: score_final=score_mpf+score_ck+score_tr.

5. a kind of peer customer retrieval method based on Lucene as claimed in claim 4 is characterized in that: described step 4 comprises:

Step 41: The ranking optimization rule is: according to the score_final score in step 34, divide the score interval [0～1], [1～1000], [1000～], for the similarity score falls in [1～ 1000], [1000～] The matching results of the two intervals, according to the first customer level, the number of inquiries received by the first customer is secondarily sorted;

Step 42: The sorting result of Step 41 is the final peer search result of the second client.