CN115203502A - Business data processing method and device, electronic equipment and storage medium

Info

Publication number: CN115203502A
Application number: CN202210812190.1A
Authority: CN
Inventors: 王超; 宋双永
Original assignee: Jingdong Technology Information Technology Co Ltd
Current assignee: Jingdong Technology Information Technology Co Ltd
Priority date: 2022-07-11
Filing date: 2022-07-11
Publication date: 2022-10-18

Abstract

The disclosure provides a service data processing method and apparatus, an electronic device, and a storage medium, which can be applied to the technical field of intelligent customer service. The service data processing method comprises the following steps: acquiring service data to be processed; determining N recall cluster centers from M cluster centers obtained by clustering, wherein each cluster center represents at least one piece of service data in the same class of service data, and M is greater than or equal to N; determining, from the N recall cluster centers, a target cluster center whose first similarity meets a first preset condition, according to the first similarity between the service data to be processed and each of the N recall cluster centers; and classifying the service data to be processed into the class cluster corresponding to the target cluster center.

Description

Business data processing method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of intelligent customer service technologies, and in particular, to a method and an apparatus for processing service data, an electronic device, and a storage medium.

Background

In the customer service process, customer service personnel need to answer customers' questions. As the number of customer service events grows, processing them manually takes a great deal of time, and how to respond efficiently and accurately is a problem to be solved urgently.

In the process of implementing the present disclosure, it was found that, as the data volume of customer service events increases, current response processing methods for customer service events consume a large amount of time and computing resources. When computing resources are limited, it is difficult to cope with a large number of customer service events, and the timeliness of responses to customer service events cannot be guaranteed.

Disclosure of Invention

In view of the foregoing, the present disclosure provides a business data processing method and apparatus, an electronic device, a storage medium, and a program product.

In one aspect of the present disclosure, a method for processing service data is provided, including:

acquiring service data to be processed;

determining N recall cluster centers from M cluster centers obtained by clustering, wherein each cluster center represents at least one piece of service data in the same class of service data, and M is greater than or equal to N;

determining, from the N recall cluster centers, a target cluster center whose first similarity meets a first preset condition, according to the first similarity between the service data to be processed and each of the N recall cluster centers;

and classifying the service data to be processed into the class cluster corresponding to the target cluster center.

According to an embodiment of the present disclosure, the method further includes:

determining L pieces of recall data from K pieces of service data associated with the cluster center, wherein K is greater than or equal to L;

performing association calculation on the K pieces of service data and the L pieces of recall data, respectively, to obtain K association scores, wherein each association score is associated with one piece of the K pieces of service data;

and taking target data, namely the piece of the K pieces of service data that is associated with a target association score, as a new cluster center associated with the K pieces of service data, wherein the target association score is the association score, among the K association scores, that meets a second preset condition.

According to an embodiment of the present disclosure, performing association calculation on the K pieces of service data and the L pieces of recall data, respectively, to obtain K association scores includes:

performing similarity calculation between each piece of the K pieces of service data and the L pieces of recall data, respectively, to obtain K similarity result sets, wherein each similarity result set is associated with one piece of the K pieces of service data and comprises L second similarities;

and calculating the mean of the L second similarities in each similarity result set, respectively, to obtain the K association scores.

According to an embodiment of the present disclosure, the association score meeting the second preset condition is the largest of the K association scores.
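The cluster center adjustment described above can be illustrated with a minimal sketch. It assumes the K pieces of service data are plain strings and uses token-set Jaccard overlap as a stand-in for the second similarity, which the disclosure does not specify; the names `jaccard` and `reselect_center` are hypothetical.

```python
from statistics import mean


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard overlap; an assumed stand-in for the second similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def reselect_center(members: list[str], recalled: list[str], sim=jaccard) -> tuple[str, list[float]]:
    """Pick a new cluster center among the K members.

    Each member is compared with the L recalled pieces (L second similarities),
    the mean of those values is its association score, and the member with
    the largest score becomes the new center.
    """
    if not members or not recalled:
        raise ValueError("members and recalled must both be non-empty")
    scores = [mean(sim(m, r) for r in recalled) for m in members]
    best = max(range(len(members)), key=scores.__getitem__)
    return members[best], scores


members = ["where is my parcel", "where is my parcel now", "where is my refund"]
center, scores = reselect_center(members, recalled=members)  # here L equals K
```

When two members tie for the largest score, this sketch keeps the earlier one; the disclosure does not state a tie-breaking rule.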

According to the embodiment of the present disclosure, determining, according to a first similarity between service data to be processed and a recall cluster center of N recall cluster centers, a target cluster center of which the first similarity satisfies a first preset condition from the N recall cluster centers includes:

determining, from the N recall cluster centers, a target cluster center whose first similarity is greater than or equal to a preset similarity threshold, according to the first similarity between the service data to be processed and each of the N recall cluster centers.

According to an embodiment of the present disclosure, the method further includes:

creating a new class cluster in the case that the N recall cluster centers do not include a target cluster center whose first similarity meets the first preset condition;

and classifying the service data to be processed into the newly created class cluster, wherein the service data to be processed serves as the cluster center of the newly created class cluster.

According to an embodiment of the present disclosure, the M cluster centers are stored in an index library after indexes have been added to them, and after the service data to be processed is classified into the newly created class cluster, the method further includes:

creating an index for the cluster center of the newly created class cluster;

and adding the cluster center of the newly created class cluster and the newly created index to the index library, so that the index library can be used to determine recall cluster centers from the cluster centers obtained by clustering.
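The disclosure does not say how the index library is built. As a hedged illustration only, the sketch below assumes a small in-memory inverted index over whitespace tokens: each new cluster center is indexed when it is added, and recall returns up to N centers that share the most tokens with the query. The class `CenterIndex` and its methods are hypothetical names, not part of the disclosure.

```python
from collections import Counter, defaultdict


class CenterIndex:
    """Toy index library mapping tokens to the ids of cluster centers containing them."""

    def __init__(self) -> None:
        self.centers: list[str] = []
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, center_text: str) -> int:
        """Create index entries for a new cluster center and store it; return its id."""
        center_id = len(self.centers)
        self.centers.append(center_text)
        for token in set(center_text.lower().split()):
            self.postings[token].add(center_id)
        return center_id

    def recall(self, query: str, n: int) -> list[int]:
        """Return ids of up to n centers, ranked by number of tokens shared with the query."""
        hits: Counter = Counter()
        for token in set(query.lower().split()):
            for center_id in self.postings.get(token, ()):
                hits[center_id] += 1
        ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
        return [center_id for center_id, _ in ranked[:n]]


index = CenterIndex()
index.add("when will my parcel arrive")
index.add("refund for damaged goods")
index.add("parcel damaged on arrival")
recalled_ids = index.recall("my parcel is damaged", n=2)
```

Only centers that appear in the postings of a query token are touched, which is the property that lets recall avoid comparing the query with every stored center.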

Another aspect of the present disclosure provides a service data processing apparatus, which includes an obtaining module, a first determining module, a second determining module, and a first classifying module.

The obtaining module is used for obtaining the service data to be processed;

the first determining module is used for determining N recall cluster centers from M cluster centers obtained by clustering, wherein each cluster center represents at least one piece of service data in the same class of service data, and M is greater than or equal to N;

the second determining module is used for determining, from the N recall cluster centers, a target cluster center whose first similarity meets a first preset condition, according to the first similarity between the service data to be processed and each of the N recall cluster centers;

and the first classification module is used for classifying the service data to be processed into the class cluster corresponding to the target cluster center.

According to an embodiment of the disclosure, the apparatus further comprises a third determining module, an association module, and an execution module.

The third determining module is used for determining L pieces of recall data from K pieces of service data associated with the cluster center, wherein K is greater than or equal to L;

the association module is used for performing association calculation on the K pieces of service data and the L pieces of recall data, respectively, to obtain K association scores, wherein each association score is associated with one piece of the K pieces of service data;

the execution module is used for taking target data, namely the piece of the K pieces of service data that is associated with a target association score, as a new cluster center associated with the K pieces of service data, wherein the target association score is the association score, among the K association scores, that meets a second preset condition.

According to an embodiment of the present disclosure, the association module includes a first calculating unit and a second calculating unit.

The first calculating unit is used for performing similarity calculation between each piece of the K pieces of service data and the L pieces of recall data, respectively, to obtain K similarity result sets, wherein each similarity result set is associated with one piece of the K pieces of service data and comprises L second similarities;

and the second calculating unit is used for calculating the mean of the L second similarities in each similarity result set, respectively, to obtain the K association scores.

According to an embodiment of the disclosure, the association score meeting the second preset condition is the largest of the K association scores.

According to an embodiment of the present disclosure, the second determining module includes a determining unit, configured to determine, according to a first similarity between the service data to be processed and a recall cluster center in the N recall cluster centers, a target cluster center from the N recall cluster centers, where the first similarity is greater than or equal to a preset similarity threshold.

According to an embodiment of the disclosure, the apparatus further comprises a first creating module and a second classification module.

The first creating module is used for creating a new class cluster in the case that the N recall cluster centers do not include a target cluster center whose first similarity meets the first preset condition;

and the second classification module is used for classifying the service data to be processed into the newly created class cluster, wherein the service data to be processed serves as the cluster center of the newly created class cluster.

According to an embodiment of the disclosure, the M cluster centers are stored in an index library after indexes have been added to them, and the apparatus further comprises a second creating module and an adding module.

The second creating module is used for creating an index for the cluster center of the newly created class cluster after the service data to be processed is classified into the newly created class cluster;

and the adding module is used for adding the cluster center of the newly created class cluster and the newly created index to the index library, so that the index library can be used to determine recall cluster centers from the cluster centers obtained by clustering.

Another aspect of the present disclosure provides an electronic device including: one or more processors; a memory for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors are caused to execute the business data processing method.

Another aspect of the present disclosure also provides a computer-readable storage medium having executable instructions stored thereon, which when executed by a processor, cause the processor to perform the above-mentioned service data processing method.

Another aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the business data processing method described above.

Drawings

The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, taken in conjunction with the accompanying drawings of which:

Fig. 1 schematically illustrates an application scenario diagram of a service data processing method, apparatus, device, medium, and program product according to embodiments of the present disclosure;

Fig. 2 schematically shows a flow chart of a service data processing method according to an embodiment of the present disclosure;

Fig. 3 schematically shows a flow chart of a service data processing method according to another embodiment of the present disclosure;

Fig. 4 schematically shows a flow chart of a service data processing method according to another embodiment of the present disclosure;

Fig. 5 schematically illustrates a flow chart of a method of adjusting cluster centers according to an embodiment of the present disclosure;

Fig. 6 schematically shows a block diagram of a service data processing apparatus according to an embodiment of the present disclosure; and

Fig. 7 schematically shows a block diagram of an electronic device adapted to implement a service data processing method according to an embodiment of the present disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Where a convention analogous to "at least one of A, B, and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).

In the customer service process, customer service personnel need to answer customers' questions. As the number of customer service events grows, processing them manually takes a great deal of time, and how to respond efficiently and accurately is a problem to be solved urgently. Intelligent customer service has emerged in response; it generally uses big data processing technology to process customer service data so as to complete intelligent responses.

In the process of implementing the disclosure, it was found that big data processing technology can be applied to classify the service data of a large number of customer service events, after which a unified, targeted response can be made for each category. This can greatly reduce the response workload and improve working efficiency. For example, service data can be classified by clustering service data texts (dialogues, forum posts, comments, and the like).

However, as the data volume of customer service events grows, existing algorithms show some shortcomings. For example, with a clustering algorithm such as k-means, which requires the number of class clusters to be specified, the number of class clusters is difficult to estimate, and a large number of isolated points exist that should not be forced into any category. With a clustering algorithm such as DBSCAN, which does not require the number of class clusters to be specified, the similarity between each piece of data and every other piece of data must be calculated, so the amount of calculation is large and a large amount of time and computing resources is consumed. When computing resources are limited, it is difficult to cope with a large number of customer service events, and the timeliness of responses to customer service events cannot be ensured.

In view of this, an embodiment of the present disclosure provides a method for processing service data, including:

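acquiring service data to be processed;

determining N recall cluster centers from M cluster centers obtained by clustering, wherein each cluster center represents at least one piece of service data in the same class of service data, and M is greater than or equal to N;

determining, from the N recall cluster centers, a target cluster center whose first similarity meets a first preset condition, according to the first similarity between the service data to be processed and each of the N recall cluster centers;

and classifying the service data to be processed into the class cluster corresponding to the target cluster center.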

Fig. 1 schematically shows an application scenario diagram of a service data processing method, apparatus, device, medium, and program product according to embodiments of the present disclosure.

As shown in Fig. 1, the application scenario 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (by way of example only).

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a back-end management server (by way of example only) that supports websites browsed by users using the terminal devices 101, 102, 103. The back-end management server may analyze and otherwise process received data such as user requests, and feed back a processing result (for example, a web page, information, or data obtained or generated according to the user request) to the terminal device.

According to the embodiment of the present disclosure, in the intelligent customer service application scenario of the present disclosure, a customer may interact with the server 105 through the terminal devices 101, 102, 103. For example, the customer may ask a question about a certain business event (such as an order question or a logistics question) and the server 105 may respond, or the customer and the server 105 may carry on a dialogue about a certain event.

According to the embodiment of the present disclosure, when the amount of customer question data is large, the server 105 may process the service data representing the content of a large number of customer dialogues by executing the service data processing method of the embodiment of the present disclosure. For example, the service data may be clustered into a plurality of categories, each category representing one class of service problem; a unified, targeted response is then made for each category, and the response content is displayed to the customer through the terminal devices 101, 102, 103.

It should be noted that the service data processing method provided by the embodiment of the present disclosure may generally be executed by the server 105. Accordingly, the service data processing apparatus provided by the embodiment of the present disclosure may generally be disposed in the server 105. The service data processing method provided by the embodiment of the present disclosure may also be executed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the service data processing apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

The service data processing method of the disclosed embodiment will be described in detail below with reference to fig. 2 to 7 based on the scenario described in fig. 1.

Fig. 2 schematically shows a flow chart of a service data processing method according to an embodiment of the present disclosure.

As shown in fig. 2, the service data processing method of this embodiment includes operations S201 to S204.

In operation S201, service data to be processed is obtained;

in operation S202, N recall cluster centers are determined from the M cluster centers obtained by clustering, where each cluster center represents at least one piece of service data in the same class of service data, and M is greater than or equal to N;

in operation S203, a target cluster center whose first similarity satisfies a first preset condition is determined from the N recall cluster centers according to the first similarity between the service data to be processed and each of the N recall cluster centers;

in operation S204, the service data to be processed is classified into the class cluster corresponding to the target cluster center.
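Operations S202 to S204 for a single piece of service data can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the service data are plain strings, the first similarity is token-set Jaccard overlap, the first preset condition is "greater than or equal to a threshold", and recall ranks centers by shared tokens. The toy recall below still scans every center for brevity; a real index library would avoid that scan. All function names are hypothetical.

```python
from typing import Optional, Sequence


def token_set(text: str) -> set[str]:
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Assumed first similarity: overlap of the two token sets."""
    ta, tb = token_set(a), token_set(b)
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0


def recall_by_shared_tokens(item: str, centers: Sequence[str], n: int) -> list[int]:
    """S202 (assumed recall rule): indices of up to n centers sharing the most tokens with item."""
    overlap = [(len(token_set(item) & token_set(c)), i) for i, c in enumerate(centers)]
    ranked = sorted((p for p in overlap if p[0] > 0), key=lambda p: (-p[0], p[1]))
    return [i for _, i in ranked[:n]]


def classify_item(item: str, centers: Sequence[str], n: int, threshold: float) -> Optional[int]:
    """Return the index of the target cluster center, or None if no recalled center qualifies."""
    recalled = recall_by_shared_tokens(item, centers, n)        # S202
    scored = [(jaccard(item, centers[i]), i) for i in recalled]  # S203: N first similarities
    if not scored:
        return None
    best_sim, best_idx = max(scored, key=lambda p: p[0])
    return best_idx if best_sim >= threshold else None          # S204 when not None


centers = ["when will my parcel arrive", "refund for damaged goods"]
target = classify_item("when will the parcel arrive", centers, n=1, threshold=0.5)
```

A return value of `None` corresponds to the case in which none of the recalled centers satisfies the first preset condition.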

According to the embodiment of the disclosure, the service data processing method is applied in an intelligent customer service scenario and is used to classify the service data of customer service events, so that intelligent responses can be completed according to the classified data. For example, in this scenario, the customer may ask a question about a certain business event (e.g., an order question or a logistics question) and the server may respond, or the customer and the server may carry on a dialogue about a certain business event.

According to the embodiment of the disclosure, the server may, by executing the service data processing method of the embodiment of the disclosure, process service data representing the content of a large number of customer dialogues and comments (for example, when a shipment will be delivered, how long it takes to receive goods, and what compensation applies to damaged goods). For example, the service data may be clustered into a plurality of categories, each category representing one class of service problem, so that the server can subsequently give a unified response for each category and display the response content to the customer through the terminal device, which can greatly reduce the response workload and improve work efficiency.

According to the embodiment of the disclosure, when the service data processing method is applied to cluster a large amount of dialogue text service data, the text data (dialogues, forum posts, comments, and the like) can be combined into an original text data set and stored in a database. Each piece of data in the original text data set is then traversed, and the service data processing method according to the embodiment of the disclosure is performed on each piece of service data to be processed so as to classify it, whereby the service data is classified either into an existing class cluster or into a newly created class cluster.

According to the embodiment of the disclosure, the service data may be clustered based on the similarity between pieces of text data, using a clustering method that does not require the number of class clusters to be specified. In the related art, clustering methods such as DBSCAN calculate the similarity between each piece of data and every other piece of data, so the amount of calculation is large and a large amount of time and computing resources is consumed. The embodiment of the disclosure instead recalls a subset of the existing cluster centers, calculates the similarity between the service data to be processed and the recalled cluster centers, and classifies the data according to the result of the similarity calculation.

Specifically, in operation S201 of the clustering method, the service data to be processed is obtained. For example, the current service data to be processed may be obtained from a database, and may be a piece of data selected at random from the unprocessed data in the original text data set, or a piece of data obtained according to a processing order.

According to the embodiment of the disclosure, since the service data may be clustered using a clustering method that does not specify the number of class clusters, when the service data to be processed is the first piece of data, the first piece of data directly forms the first class cluster and serves as the cluster center of that class cluster. The second piece of service data to be processed is then clustered based on similarity (the similarity between the second piece of data and the first cluster center is calculated): according to a preset condition, if the second piece of data cannot be assigned to the first class cluster, a second class cluster is created with this data as its cluster center; otherwise, the second piece of data is assigned to the first class cluster. This operation is repeated until all data have been traversed, at which point the clustering process is finished.
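The traversal described above can be sketched as a single pass over the data set. The sketch is deliberately generic: the similarity function is passed in, the preset condition is assumed to be "similarity greater than or equal to a threshold", and no recall step is applied, so every item is compared with all existing cluster centers. The toy data are numbers standing in for service data, and `single_pass_cluster` is a hypothetical name.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def single_pass_cluster(
    items: Sequence[T],
    similarity: Callable[[T, T], float],
    threshold: float,
) -> tuple[list[T], list[list[T]]]:
    """Traverse items once; each item joins the best-matching cluster or starts a new one."""
    centers: list[T] = []         # one cluster center per class cluster
    clusters: list[list[T]] = []  # members of each class cluster
    for item in items:
        scored = [(similarity(item, c), idx) for idx, c in enumerate(centers)]
        if scored:
            best_sim, best_idx = max(scored, key=lambda p: p[0])
            if best_sim >= threshold:
                clusters[best_idx].append(item)
                continue
        # First item, or no existing center is similar enough:
        # create a new class cluster with this item as its center.
        centers.append(item)
        clusters.append([item])
    return centers, clusters


# Toy similarity on numbers: closer values are more similar.
closeness = lambda a, b: 1.0 / (1.0 + abs(a - b))
centers, clusters = single_pass_cluster([1.0, 1.2, 5.0, 5.1, 1.1], closeness, threshold=0.5)
```

Because the number of class clusters is never specified, it grows only when an item fails the preset condition against every existing center.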

According to the embodiment of the disclosure, each class cluster obtained through clustering includes a cluster center, the text data in each class cluster belongs to the same service class, and each class can be used to represent one class of service problem. For example, the service problem represented by the data of the first class cluster relates to the logistics delivery cycle, the service problem represented by the data of the second class cluster relates to the logistics compensation amount, the service problem represented by the data of the third class cluster relates to the order confirmation cycle, and so on. Each class cluster contains a cluster center, and the cluster center represents at least one piece of service data in the same class of service data; for example, the cluster center may be one piece of service data in the class cluster, or two pieces of data that are identical or very similar to each other.

According to an embodiment of the present disclosure, operation S202, in which N recall cluster centers are determined from the M cluster centers obtained by clustering, may be performed when clustering each piece of data (except the first piece of data), or the recall operation may be started only after a preset number of pieces of data have been processed. For example, when the original text data set includes ten thousand pieces of data, the recall operation described in operation S202 may be started when processing reaches the 300th piece of data.

According to the embodiment of the disclosure, when N recall cluster centers are determined from the M cluster centers obtained by clustering, the same fixed number N of recall cluster centers may be recalled from the existing M cluster centers throughout the clustering process. For example, recall starts after the 1000th piece of data: from the 1001st piece of data onward, 30 cluster centers are recalled from the existing cluster centers as recall cluster centers when clustering each piece of data, whereas before the 1001st piece of data no recall is performed and the similarity calculation is carried out directly against all existing cluster centers.

According to the embodiment of the disclosure, when N recall cluster centers are determined from the M cluster centers obtained by clustering, a non-fixed number N of recall cluster centers may alternatively be recalled from the existing M cluster centers during clustering. For example, a preset proportion of the existing M cluster centers (e.g., 0.05M or 0.03M) may be recalled each time as the recall cluster centers.

According to the embodiment of the disclosure, when N recall cluster centers are determined from the M cluster centers obtained by clustering, a subset of the existing cluster centers may be recalled. Alternatively, all cluster centers may be recalled when the number of existing cluster centers is small, and only a subset when the number is large: for example, all cluster centers are recalled when the number of existing cluster centers is smaller than a preset threshold, and a subset is recalled when the number is greater than or equal to the preset threshold.
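The recall-size options described above (a fixed N, a proportion of M, and full recall below a threshold) can be combined in one small helper. This is a sketch under assumptions: the parameter names are hypothetical, and rounding a proportional recall size up to at least one center is a choice made here, not something the disclosure specifies.

```python
import math
from typing import Optional


def recall_count(
    m: int,
    *,
    fixed_n: Optional[int] = None,
    ratio: Optional[float] = None,
    full_recall_below: int = 0,
) -> int:
    """Return N, the number of cluster centers to recall from M existing centers (N <= M)."""
    if m < 0:
        raise ValueError("m must be non-negative")
    if m < full_recall_below:
        return m                      # few centers: recall all of them
    if fixed_n is not None:
        return min(fixed_n, m)        # fixed recall size, capped at M
    if ratio is not None:
        return min(m, max(1, math.ceil(ratio * m)))  # preset proportion of M
    return m                          # no policy configured: compare with every center


n_fixed = recall_count(2000, fixed_n=30)
n_ratio = recall_count(10, ratio=0.5)
n_small = recall_count(50, fixed_n=30, full_recall_below=100)
```

In every branch the result is capped at M, which keeps the condition that M is greater than or equal to N.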

According to the embodiment of the disclosure, the original data can be divided into a plurality of data categories by clustering the service data through the service data processing method, each category is used for representing a category of service problem, and then, uniform response can be respectively carried out on each category, so that the response workload can be reduced to a greater extent, and the working efficiency can be improved.

According to the embodiment of the disclosure, in the process of clustering the service data, data classification is realized by recalling a subset of the existing cluster centers and calculating the similarity between the service data to be processed and the recalled cluster centers. Compared with the density clustering methods adopted in the related art, only the similarity between the data to be processed and a subset of the cluster centers needs to be compared, so the amount of clustering calculation is greatly reduced: from (number of pieces of data) × (number of all cluster centers, which can reach hundreds of thousands or millions when the data volume is large) to (number of pieces of data) × (number of recalled cluster centers). This improves the operation speed of the processor, avoids consuming a large amount of time and computing resources, and lowers the requirements on computer hardware. When computing resources are limited, a large number of customer service events can be handled better, the timeliness of responses to customer service events can be ensured, and the customer experience is improved.
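The size of this reduction can be illustrated with a short calculation. The figures are hypothetical (one million pieces of data, one hundred thousand cluster centers, 30 recalled centers per piece of data), and the count covers similarity computations only; the cost of the recall step itself is not included.

```python
def similarity_computations(num_items: int, compared_per_item: int) -> int:
    """Upper bound on similarity computations when each item is compared with a fixed number of centers."""
    return num_items * compared_per_item


num_items = 1_000_000     # hypothetical data volume
all_centers = 100_000     # hypothetical number of existing cluster centers
recalled_centers = 30     # hypothetical fixed recall size

without_recall = similarity_computations(num_items, all_centers)
with_recall = similarity_computations(num_items, recalled_centers)
reduction_factor = all_centers / recalled_centers
```

Under these assumed figures the number of similarity computations drops by a factor of all_centers / recalled_centers, independent of the number of pieces of data.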

According to an embodiment of the present disclosure, in a process of clustering service data to be processed, determining, according to a first similarity between the service data to be processed and a recall cluster center of N recall cluster centers, a target cluster center of the N recall cluster centers, where the first similarity satisfies a first preset condition, may include: and determining a target cluster center with the first similarity greater than or equal to a preset similarity threshold from the N recalling cluster centers according to the first similarity between the service data to be processed and the recalling cluster center in the N recalling cluster centers.

According to an embodiment of the present disclosure, the method may be: respectively calculating the first similarity between the service data to be processed and each of the N recall cluster centers to obtain N first similarity values, and selecting the largest of them as a target similarity value. When the target similarity value is greater than a preset similarity threshold, the recall cluster center associated with the target similarity value is taken as the target cluster center, and the service data to be processed is classified into the cluster corresponding to the target cluster center; this is the case in which the N recall cluster centers include the target cluster center.

According to the embodiment of the disclosure, when the maximum similarity value (target similarity value) among the N calculated first similarity values is less than or equal to the preset similarity threshold, a new cluster is created, the service data to be processed is classified into the new cluster, and the service data to be processed is used as the cluster center of the new cluster; this is the case in which the N recall cluster centers do not include a target cluster center whose first similarity satisfies the first preset condition.

According to an embodiment of the present disclosure, the method may also be: respectively calculating first similarity between the service data to be processed and each recall cluster center of the N recall cluster centers to obtain N first similarity values, directly taking the recall cluster center of which the first similarity value is greater than a preset similarity threshold value as a target cluster center, and classifying the service data to be processed into the cluster corresponding to the target cluster center under the condition that only one target cluster center is included. And under the condition that the number of the target cluster centers is more than one, randomly classifying the service data to be processed into the cluster corresponding to one of the target cluster centers, or classifying the service data to be processed into the cluster corresponding to one of the target cluster centers with the largest similarity value. And if the N first similarities are less than or equal to a preset similarity threshold (the situation is that the N recalling cluster centers do not comprise the target cluster center meeting the condition), newly building a cluster, and classifying the service data to be processed into the newly built cluster.
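The decision rule described in the preceding paragraphs can be sketched as follows. This is a minimal illustration of the "largest similarity above the threshold" variant; the function name `pick_target_center` and the convention of returning `None` to signal "create a new cluster" are assumptions made for this example.

```python
from typing import List, Optional


def pick_target_center(similarities: List[float], threshold: float) -> Optional[int]:
    """Return the position of the recalled center with the largest similarity
    if that similarity is greater than the threshold; otherwise return None,
    meaning no target cluster center exists and a new cluster is needed."""
    if not similarities:
        return None
    best = max(range(len(similarities)), key=lambda i: similarities[i])
    return best if similarities[best] > threshold else None


print(pick_target_center([0.2, 0.9, 0.5], threshold=0.8))  # position of 0.9
print(pick_target_center([0.2, 0.3], threshold=0.8))       # no target center
```

A value equal to the threshold is treated here as not satisfying the condition, matching the "less than or equal to" branch in the text above.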

According to the embodiment of the disclosure, when the first similarity between the service data to be processed and the recall cluster center is calculated, any of a plurality of semantic-based text similarity algorithms can be adopted. For example, a cosine similarity algorithm based on W2V (word2vec) can be adopted, or a trained model, such as a BERT model or another model, can be used to assist the calculation. Compared with algorithms based on non-semantic measures, a semantic-based text similarity algorithm can improve the text clustering accuracy.
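For reference, cosine similarity between two vectors can be computed as in the sketch below. It assumes each text has already been turned into a fixed-length numeric vector (for example by a word2vec or BERT encoder, which is not shown); returning 0.0 for a zero vector is a convention chosen for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
```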

According to the embodiment of the disclosure, N recall cluster centers are determined from M cluster centers obtained after clustering, the N recall cluster centers may be randomly recalled from the M cluster centers, or N data most similar to current to-be-processed service data are selected from the M cluster centers as the recall cluster centers, for example, a plurality of data with similarity between the current to-be-processed service data larger than a certain preset threshold are recalled.

When the N data most similar to the current service data to be processed (data whose similarity is larger than a preset threshold) are selected from the M cluster centers as recall cluster centers, a retrieval tool can, for example, be used to assist in recalling the data whose similarity is larger than the preset threshold.

Further, in order to realize recall by using a retrieval tool, an index is added to the existing cluster center and then the cluster center is stored in an index base of the retrieval tool, and after each new cluster is created, an index is created for the cluster center of the new cluster, the newly created cluster center of the cluster and the newly created index are added to the index base, so that the recall cluster center is determined from the clustered cluster centers by using the retrieval tool and the index base.

According to embodiments of the present disclosure, the search tool may be a word dimension recall based search tool, such as lucene, es, solr. The retrieval tool may also be a vector dimension recall based retrieval tool, such as faiss.
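To make the word-dimension recall idea concrete, the sketch below builds a tiny in-memory inverted index over cluster centers and recalls the centers sharing the most tokens with a query. It is a toy stand-in written for this illustration and does not use or imitate the API of lucene, es, solr or faiss; the class name `ToyCenterIndex`, whitespace tokenization and the shared-token ranking are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ToyCenterIndex:
    """Tiny inverted index mapping tokens to the cluster centers containing them."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[int]] = defaultdict(set)

    def add(self, center_id: int, text: str) -> None:
        """Index a (new) cluster center under each of its tokens."""
        for token in set(text.lower().split()):
            self._postings[token].add(center_id)

    def recall(self, text: str, n: int) -> List[int]:
        """Return up to n center ids, ranked by number of shared tokens."""
        overlap: Dict[int, int] = defaultdict(int)
        for token in set(text.lower().split()):
            for center_id in self._postings.get(token, ()):
                overlap[center_id] += 1
        ranked = sorted(overlap, key=lambda cid: (-overlap[cid], cid))
        return ranked[:n]


index = ToyCenterIndex()
index.add(0, "refund not received")
index.add(1, "change delivery address")
index.add(2, "refund request status")
print(index.recall("where is my refund", n=2))
```

Only centers that share at least one token with the query are touched, which is what lets recall avoid comparing the query against every center.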

According to the embodiment of the disclosure, the method for recalling part of the cluster centers from the existing cluster centers and calculating the similarity between the service data to be processed and the recalled cluster centers to realize data classification is adopted, the similarity comparison is only carried out on the data to be processed and part of the cluster centers, so that the clustering calculation amount is reduced to a greater extent, and further, the recalled data is the data which is more similar to the service data to be processed, so that the clustering accuracy loss is smaller, and the clustering calculation amount is reduced to a greater extent on the basis of ensuring the calculation accuracy.

Based on the above embodiments, fig. 3 schematically shows a flowchart of a service data processing method according to an embodiment of the present disclosure.

As shown in fig. 3, the data in the text data set to be processed is traversed, and the data processing method performed on each service data to be processed includes operations S301 to S305.

In operation S301, part of the cluster centers are recalled from the existing cluster centers obtained after clustering according to a preset recall cluster center amount, and the recalled cluster centers are used as recall cluster centers. Specifically, in the process of executing clustering, a fixed number N of recall cluster centers are recalled from the existing M cluster centers according to the preset recall cluster center amount. For example, the recall may start after the 500th item of data: from the 501st item of data onward, 30 pieces of data are recalled from the existing cluster centers as the recall cluster centers in the process of clustering each piece of data. The recall can be realized by using the retrieval tool lucene: a lucene index database contains the indexes of all existing cluster centers, and the N data most similar to the current service data to be processed are selected from the M cluster centers as recall cluster centers by using the lucene index database.

In operation S302, respectively calculating a similarity between the current service data to be processed and each of the N recall cluster centers to obtain N similarity values;

in operation S303, when the maximum similarity value of the N similarity values is greater than the preset threshold, classifying the current data into a corresponding cluster, that is, a cluster corresponding to a recall cluster center (target cluster center) associated with the maximum similarity value;

in operation S304, if the N first similarities are less than or equal to the preset similarity threshold (in this case, the N recalling cluster centers do not include the target cluster center that satisfies the condition), a cluster is created, the service data to be processed is categorized into the created cluster, and the service data to be processed is used as the cluster center of the created cluster.

In operation S305, an index is added to the newly-built cluster center and added to the lucene index library, so that in the process of clustering the next data, the lucene is used to select N data most similar to the next data from the existing cluster centers as a recall cluster center.
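Operations S301 to S305 can be sketched as a single pass over the data set. This is a simplified illustration under several assumptions: token-set Jaccard similarity stands in for a semantic similarity, a brute-force ranking over the existing centers stands in for the lucene index (a real index would avoid scanning every center), recall starts from the first item rather than after a warm-up, and the names `jaccard` and `cluster_once` are hypothetical.

```python
from typing import Callable, Dict, List


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a stand-in for a semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def cluster_once(texts: List[str], threshold: float, n_recall: int,
                 sim: Callable[[str, str], float] = jaccard) -> Dict[int, List[int]]:
    """One traversal; returns {index of center text: indices of its members}."""
    clusters: Dict[int, List[int]] = {}
    for i, text in enumerate(texts):
        # S301: recall up to n_recall centers (brute-force stand-in for an index).
        recalled = sorted(clusters, key=lambda c: -sim(text, texts[c]))[:n_recall]
        # S302: similarity between the current item and each recalled center.
        scores = [(sim(text, texts[c]), c) for c in recalled]
        best_score, best_center = max(scores, default=(0.0, -1))
        if best_center >= 0 and best_score > threshold:
            clusters[best_center].append(i)  # S303: join the best cluster.
        else:
            # S304: new cluster with this item as center; adding the key plays
            # the role of S305 (registering the new center for later recall).
            clusters[i] = [i]
    return clusters


data = ["refund not received", "refund not received yet",
        "change delivery address", "change my delivery address"]
print(cluster_once(data, threshold=0.5, n_recall=2))
```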

Fig. 4 schematically shows a flow chart of a service data processing method according to another embodiment of the present disclosure.

As shown in fig. 4, according to the service data processing method in the embodiment of the present disclosure, after each traversal of the data in the text data set to be processed, the cluster centers of the plurality of clusters obtained by that traversal may be adjusted and updated.

Specifically, according to an embodiment of the present disclosure, a specific process of the service data processing method may include:

Before the clustering operation is performed, parameters are set: the similarity thresholds (a first similarity threshold and a second similarity threshold), a recall cluster center amount c1 (default 30), a recall data amount c2 (default 30), the number of iterations, and the like can be preset. The first similarity threshold is associated with the first preset condition and is used, in the process of clustering each data according to the similarity principle, to determine from the recalled cluster centers a target cluster center whose first similarity meets the first preset condition (the first similarity is greater than or equal to the first similarity threshold). The recall cluster center amount c1 is used for recalling a fixed number c1 of recall cluster centers from the existing cluster centers in the process of clustering each data. The recall data amount c2 is used for recalling a fixed number c2 of data from each cluster in the process of adjusting and updating the cluster centers of the plurality of clusters obtained after each traversal. The second similarity threshold is used when recalling data, as follows: in the process of clustering each data, the c1 cluster centers most similar to the current data to be processed are recalled from the existing cluster centers, where the similarity between each recalled cluster center and the current data to be processed is greater than or equal to the second similarity threshold; the second similarity threshold is also used, in the process of adjusting and updating the cluster centers of the plurality of clusters, for recalling from each cluster the c2 data most similar to the current data, where the similarity between each recalled data and the current data is greater than or equal to the second similarity threshold.

Then, the data in the text data set to be processed is traversed and clustered, the cluster centers of the plurality of clusters obtained after the traversal are adjusted and updated, and data clustering is performed again based on the updated cluster centers, until the preset number of iterations is reached.
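One way to hold the preset parameters listed above is a small immutable record, sketched below. The class and field names are hypothetical; only the defaults of 30 for c1 and c2 come from the text, so the two thresholds and the iteration count are deliberately left without defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteringParams:
    """Preset parameters for one clustering run (names are illustrative)."""
    first_similarity_threshold: float   # used to pick the target cluster center
    second_similarity_threshold: float  # used when recalling similar data
    iterations: int                     # preset number of iterations
    recall_center_amount: int = 30      # c1, default stated in the text
    recall_data_amount: int = 30        # c2, default stated in the text

    def __post_init__(self) -> None:
        # Basic sanity checks; the exact valid ranges are an assumption.
        if self.iterations < 1:
            raise ValueError("iterations must be at least 1")
        if self.recall_center_amount < 1 or self.recall_data_amount < 1:
            raise ValueError("recall amounts must be at least 1")


params = ClusteringParams(first_similarity_threshold=0.8,
                          second_similarity_threshold=0.6,
                          iterations=3)
print(params)
```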

Further, adjusting and updating the cluster centers of the plurality of clusters obtained after each traversal clustering may include (the following method takes the cluster center of any one of the clusters as an example):

determining L recalling data from K service data associated with the cluster center, wherein K is more than or equal to L;

performing association calculation on the K service data and the L recalling data respectively to obtain K association scores, wherein each association score is associated with one service data in the K service data;

and taking the target data in the K service data associated with the target association score as a new cluster center associated with the K service data, wherein the target association score is as follows: and the correlation scores meeting a second preset condition in the K correlation scores.

According to an embodiment of the present disclosure, the determining of the L recalling data from the K service data associated with the cluster center may be randomly recalling the L recalling data from the K service data, or recalling L data most similar to the current data from the K service data as the recall data, for example, recalling a plurality of data with a similarity greater than a second similarity threshold with the current data.

According to the embodiment of the disclosure, L pieces of data most similar to the current data are recalled from the K pieces of business data as recall data, and for example, a retrieval tool can be used as assistance to realize a function of recalling data with a large similarity. Further, in order to implement recall by using a retrieval tool, each data in an existing cluster is stored in an index library of the retrieval tool after being indexed, and the method for implementing recall by using the retrieval tool can refer to the description of recalling a plurality of cluster centers from the existing cluster centers by using the retrieval tool in the process of performing clustering, which is not described herein again. The search tool may be lucene, es, solr, faiss, or the like.

According to an embodiment of the present disclosure, the determining L recall data from the K service data associated with the cluster center may be that, for each cluster, the same fixed number of L recall data are recalled from the K service data; for example, 30 pieces of data are recalled from each cluster as recall data.

According to an embodiment of the present disclosure, the L recall data are determined from the K service data associated with the cluster center, or for each cluster, L recall data with different quantities are recalled from the K service data respectively; for example, for each cluster, a certain preset proportion (e.g., 0.005K, 0.01K) of data is recalled from the K pieces of service data as recall data.
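The two ways of choosing L described above (a fixed amount per cluster, or a preset proportion of K) can be sketched as follows. The function name `recall_amount`, the rounding rule and the floor of one recalled item for very small clusters are assumptions for this example; capping at K reflects the stated condition that K is greater than or equal to L.

```python
from typing import Optional


def recall_amount(cluster_size: int,
                  fixed: Optional[int] = None,
                  ratio: Optional[float] = None) -> int:
    """Number L of data to recall from a cluster holding K = cluster_size data.

    Exactly one of `fixed` (same amount for every cluster) or `ratio`
    (preset proportion of K) must be given.
    """
    if (fixed is None) == (ratio is None):
        raise ValueError("give exactly one of fixed or ratio")
    if fixed is not None:
        wanted = fixed
    else:
        wanted = max(1, round(ratio * cluster_size))  # assumed rounding rule
    return min(wanted, cluster_size)  # keep L <= K


print(recall_amount(10_000, fixed=30))     # fixed amount per cluster
print(recall_amount(10_000, ratio=0.005))  # proportion 0.005K
print(recall_amount(10_000, ratio=0.01))   # proportion 0.01K
```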

According to the embodiment of the disclosure, the cluster center obtained by clustering each time is adjusted and updated, so that the clustering accuracy is higher, the clustered data aggregation is stronger, the data of each cluster basically belong to the same service problem, and subsequently, each problem can be responded in a targeted manner, so that response errors are avoided, and the customer experience is improved.

According to the embodiment of the disclosure, in the process of adjusting and updating the cluster center obtained by each clustering, the association calculation is performed by recalling part of the data in the service data associated with the cluster center, so that the calculation amount of the association calculation can be reduced to a large extent. Each piece of data only needs to undergo association calculation with a small amount of data rather than with all of the data, so the calculation amount is reduced from n × (n − 1), where n is the cluster data amount, to n × c2, where c2 is the recall data amount. The operation speed of the processor is thereby improved, a large amount of time and computing resources need not be consumed, and the requirement on computer hardware equipment is reduced. When computing resources are limited, a large number of customer service events can be better dealt with, the timeliness of customer service event responses can be ensured, and the customer experience is improved.
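To give a sense of scale, the sketch below evaluates both expressions for a cluster of ten thousand data with 30 recalled data each, the figures used in the worked example of this description. The function names are hypothetical, and the cost of the recall step itself is not counted.

```python
def all_pairs_cost(n: int) -> int:
    """Similarity computations when each datum is compared with all others."""
    return n * (n - 1)


def recalled_cost(n: int, c2: int) -> int:
    """Similarity computations when each datum is compared with c2 recalled data."""
    return n * c2


n, c2 = 10_000, 30
print(all_pairs_cost(n), recalled_cost(n, c2))
```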

Fig. 5 schematically illustrates a flowchart of a method for adjusting cluster center according to an embodiment of the present disclosure.

As shown in fig. 5, the method of adjusting the cluster center of any one of the clusters includes operations S501 to S504.

In operation S501, for each data in the cluster, a part of data (equal to a preset recall data amount) is recalled from the cluster as recall data according to the preset recall data amount, respectively. That is, for each cluster, the same fixed amount of L recall data is recalled from the K service data.

Then, performing correlation calculation on the K service data and the L recall data respectively to obtain K correlation scores, which specifically includes operation S502 and operation S503.

In operation S502, calculating a similarity between each data in the cluster and the recall data, and performing similarity calculation on each data in the K service data and L recall data to obtain K groups of similarity result sets, where each group of similarity result sets is associated with one service data in the K service data, and each group of similarity result sets includes L second similarities;

in operation S503, the mean of the L second similarities in each group of similarity result sets is calculated respectively to obtain K association scores, and a score that satisfies a second preset condition, that is, the score with the largest value among the K association scores is used as a target association score, and data associated with the target association score is used as a new cluster center.

In operation S504, after traversing the class cluster is completed, the class cluster center in the lucene index base is updated.
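Operations S501 to S503 for a single cluster can be sketched as below. As assumptions for this illustration, token-set Jaccard similarity stands in for a semantic similarity, a brute-force ranking stands in for recall through lucene, each datum is excluded from its own recall set, and the names `jaccard` and `update_center` are hypothetical. Updating the index (S504) is not shown.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a stand-in for a semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def update_center(members: List[str], n_recall: int,
                  sim: Callable[[str, str], float] = jaccard) -> int:
    """Return the position in `members` of the new cluster center."""
    best_index, best_score = 0, float("-inf")
    for i, item in enumerate(members):
        # S501/S502: similarities to the n_recall most similar other members.
        sims = sorted((sim(item, other)
                       for j, other in enumerate(members) if j != i),
                      reverse=True)[:n_recall]
        # S503: association score is the mean of those similarities.
        score = sum(sims) / len(sims) if sims else 0.0
        if score > best_score:
            best_index, best_score = i, score
    return best_index


cluster = ["refund not received", "refund not received yet",
           "my refund not received yet", "refund"]
print(update_center(cluster, n_recall=2))
```

The datum with the largest mean similarity to its nearest neighbours in the cluster becomes the new center.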

The above method for adjusting the cluster center of any one cluster is exemplified as follows:

For example, cluster 1 contains ten thousand pieces of data: data 1, data 2, data 3, …, and the cluster center of cluster 1 is data 10.

For data 1, 30 pieces of data most similar to data 1 were recalled from this ten thousand pieces of data as recall data using lucene (note that data 1 itself was excluded); then, respectively carrying out similarity calculation on the data 1 and the 30 recalling data to obtain a 1 st group of similarity result sets, wherein the first group of similarity result sets comprise 30 similarity values; then calculating the mean value of the 30 similarity values to obtain the association value of the data 1;

for data 2, 30 pieces of data most similar to data 2 were recalled from the ten thousand pieces of data as recall data using lucene (note that data 2 itself was excluded); then, respectively carrying out similarity calculation on the data 2 and the 30 recalling data to obtain a 2 nd group of similarity result sets, wherein the 2 nd group of similarity result sets comprise 30 similarity values; then, calculating the mean value of the 30 similarity values to obtain the association value of the data 2;

traversing the ten thousand pieces of data in cluster 1 in this way to obtain the association scores of the ten thousand pieces of data, for example, data 1: 0.8; data 2: 0.7; data 3: 0.9; …, where the score with the maximum value among the association scores is the score 0.9 corresponding to data 3, and data 3 is taken as the new cluster center of cluster 1.

According to the embodiment of the disclosure, in the process of adjusting and updating the cluster center obtained by each clustering, the correlation calculation is performed by recalling part of data in the service data associated with the cluster-like center, so that the calculation amount of the correlation calculation can be reduced to a large extent.

Based on the service data processing method, the disclosure also provides a service data processing device. The apparatus will be described in detail below with reference to fig. 6.

Fig. 6 schematically shows a block diagram of a service data processing apparatus according to an embodiment of the present disclosure.

As shown in fig. 6, the service data processing apparatus 600 of this embodiment includes: an acquisition module 601, a first determination module 602, a second determination module 603, and a first classification module 604.

The acquiring module 601 is configured to acquire service data to be processed;

a first determining module 602, configured to determine N recall cluster centers from M cluster centers obtained after clustering, where a cluster center represents at least one piece of service data in the same class of service data, and M is greater than or equal to N;

a second determining module 603, configured to determine, according to a first similarity between the service data to be processed and a recall cluster center of the N recall cluster centers, a target cluster center of which the first similarity meets a first preset condition from the N recall cluster centers;

the first classifying module 604 is configured to classify the service data to be processed into a class cluster corresponding to the target class cluster center.

According to the embodiment of the disclosure, the service data are clustered by the service data processing device, so that the original data can be divided into a plurality of data categories, each category is used for representing a category of service problem, and then, uniform response can be respectively carried out on each category, so that the response workload can be reduced to a greater extent, and the working efficiency can be improved.

According to the embodiment of the disclosure, further, in the clustering process, the first determining module 602 and the second determining module 603 realize data classification by recalling part of the cluster centers from the existing cluster centers and calculating the similarity between the service data to be processed and the recalled cluster centers. Compared with the density clustering method adopted in the related art, only the similarity between the data to be processed and part of the cluster centers needs to be compared, so the clustering calculation amount is reduced to a large extent: from (the number of data items) × (the number of all cluster centers, which can reach hundreds of thousands or millions when the data amount is large) to (the number of data items) × c1 (the recall cluster center amount). The operation speed of the processor is thereby increased, a large amount of time and computing resources need not be consumed, and the requirement on computer hardware equipment is reduced. When computing resources are limited, a large number of customer service events can be better coped with, the timeliness of customer service event responses can be ensured, and the customer experience is improved.

the execution module is used for taking the target data in the K business data associated with the target association score as a new cluster center associated with the K business data, wherein the target association score is as follows: and the relevance scores meeting a second preset condition in the K relevance scores.

The first calculating unit is configured to perform similarity calculation on data in the K pieces of service data and the L pieces of recall data to obtain K groups of similarity result sets, where each group of similarity result sets is associated with one piece of service data in the K pieces of service data, and each group of similarity result sets includes L pieces of second similarity;

According to the embodiment of the disclosure, the relevance score meeting the second preset condition is the largest of the K relevance scores.

According to the embodiment of the disclosure, after the indexes are added to the M cluster centers, the M cluster centers are stored in the index base, and the apparatus further includes a second creating module and an adding module.

According to the embodiment of the present disclosure, any plurality of the obtaining module 601, the first determining module 602, the second determining module 603, and the first classifying module 604 may be combined and implemented in one module, or any one of the modules may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the obtaining module 601, the first determining module 602, the second determining module 603, and the first classifying module 604 may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented by any one of three implementations of software, hardware, and firmware, or implemented by a suitable combination of any of them. Alternatively, at least one of the obtaining module 601, the first determining module 602, the second determining module 603 and the first categorizing module 604 may be at least partially implemented as a computer program module, which when executed, may perform a corresponding function.

As shown in fig. 7, an electronic device 700 according to an embodiment of the present disclosure includes a processor 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. The processor 701 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 701 may also include on-board memory for caching purposes. The processor 701 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.

In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are stored. The processor 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. The processor 701 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 702 and/or the RAM 703. It is noted that the programs may also be stored in one or more memories other than the ROM 702 and RAM 703. The processor 701 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.

Electronic device 700 may also include input/output (I/O) interface 705, which input/output (I/O) interface 705 is also connected to bus 704, according to an embodiment of the present disclosure. The electronic device 700 may also include one or more of the following components connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.

The present disclosure also provides a computer-readable storage medium, which may be embodied in the device/apparatus/system described in the above embodiments; or may exist alone without being assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.

According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to an embodiment of the present disclosure, a computer-readable storage medium may include the above-described ROM 702 and/or RAM 703 and/or one or more memories other than the ROM 702 and RAM 703.

Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method illustrated by the flow chart. When the computer program product runs in a computer system, the program code is used for causing the computer system to realize the business data processing method provided by the embodiment of the disclosure.

The computer program performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure when executed by the processor 701. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.

In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, and the like. In another embodiment, the computer program may also be transmitted in the form of a signal on a network medium, distributed, downloaded and installed via the communication section 709, and/or installed from the removable medium 711. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by the processor 701, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.

In accordance with embodiments of the present disclosure, program code for executing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, these computer programs may be implemented using high level procedural and/or object oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, the "C" language, or the like. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In situations involving remote computing devices, the remote computing devices may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to external computing devices (e.g., through the internet using an internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure can be combined and/or incorporated in various ways, even if such combinations or incorporations are not expressly recited in the present disclosure. In particular, the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or incorporated in various ways without departing from the spirit or teaching of the present disclosure. All such combinations and/or incorporations are within the scope of the present disclosure.

The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims

1. A business data processing method, comprising:

acquiring business data to be processed;

determining N recalled cluster centers from M cluster centers obtained after clustering, wherein each cluster center represents at least one piece of business data in the same class of business data, and M≥N;

determining, from the N recalled cluster centers and according to a first similarity between the business data to be processed and each of the N recalled cluster centers, a target cluster center whose first similarity satisfies a first preset condition; and

classifying the business data to be processed into a cluster corresponding to the target cluster center.

2. The method of claim 1, further comprising:

determining L pieces of recall data from K pieces of business data associated with the cluster center, wherein K≥L;

associating each of the K pieces of business data with the L pieces of recall data to obtain K correlation scores, wherein each correlation score is associated with one of the K pieces of business data; and

taking, as a new cluster center associated with the K pieces of business data, target data among the K pieces of business data that is associated with a target correlation score, wherein the target correlation score is a correlation score, among the K correlation scores, that satisfies a second preset condition.

3. The method according to claim 2, wherein associating each of the K pieces of business data with the L pieces of recall data to obtain the K correlation scores comprises:

performing a similarity calculation between each of the K pieces of business data and the L pieces of recall data to obtain K similarity result sets, wherein each similarity result set is associated with one of the K pieces of business data and includes L second similarities; and

calculating the mean of the L second similarities in each similarity result set to obtain the K correlation scores.

4. The method of claim 2, wherein:

the correlation score satisfying the second preset condition is the largest of the K correlation scores.
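
By way of non-limiting illustration, the cluster center update recited in claims 2 to 4 may be sketched in Python as follows. The names `cosine` and `select_new_center` are hypothetical. Representing each piece of business data as a numeric vector, using cosine similarity as the second similarity, and choosing the L pieces of recall data by random sampling are all assumptions made for this sketch; the claims do not fix any of them.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_new_center(members, num_recall, rng=None):
    """Pick a new cluster center among K members.

    members:    K vectors associated with one cluster center.
    num_recall: L, the number of pieces of recall data (1 <= L <= K).
    Returns (index of the new center, list of K correlation scores).
    """
    if not 1 <= num_recall <= len(members):
        raise ValueError("num_recall must satisfy 1 <= L <= K")
    rng = rng or random.Random(0)
    # Assumption: the L pieces of recall data are sampled at random from the K members.
    recall = rng.sample(members, num_recall)
    # Claim 3: each correlation score is the mean of L second similarities.
    scores = [sum(cosine(m, r) for r in recall) / num_recall for m in members]
    # Claim 4: the target correlation score is the largest of the K scores.
    best = max(range(len(members)), key=scores.__getitem__)
    return best, scores
```

With L equal to K, every member is compared against every other member, so the selected center is the member with the highest average similarity to the whole cluster; a smaller L trades accuracy for fewer similarity computations.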

5. The method according to claim 1, wherein determining, from the N recalled cluster centers and according to the first similarity between the business data to be processed and each of the N recalled cluster centers, the target cluster center whose first similarity satisfies the first preset condition comprises:

determining, from the N recalled cluster centers and according to the first similarity between the business data to be processed and each of the N recalled cluster centers, a target cluster center whose first similarity is greater than or equal to a preset similarity threshold.

6. The method of claim 1, further comprising:

creating a new cluster in a case where the N recalled cluster centers do not include a target cluster center whose first similarity satisfies the first preset condition; and

classifying the business data to be processed into the newly created cluster, wherein the business data to be processed serves as the cluster center of the newly created cluster.
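
By way of non-limiting illustration, the classification flow of claims 1, 5 and 6 may be sketched in Python as follows. The names `cosine` and `assign`, the parameters `n_recall` and `threshold`, and their default values are hypothetical. The sketch assumes that business data are numeric vectors and that cosine similarity serves as the first similarity. The recall stage below is an exact ranking that stands in for whatever cheaper recall mechanism an implementation actually uses.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign(vector, centers, clusters, n_recall=2, threshold=0.8):
    """Classify one vector and return the index of the cluster it joins.

    centers:  list of M cluster-center vectors (mutated when a cluster is created).
    clusters: list of member lists, parallel to centers (mutated in place).
    """
    # Recall stage: keep N of the M centers (stand-in for an index lookup).
    ranked = sorted(range(len(centers)),
                    key=lambda i: cosine(vector, centers[i]),
                    reverse=True)
    recalled = ranked[:n_recall]

    # Claim 5: the target center must reach the preset similarity threshold.
    best_index, best_sim = None, threshold
    for i in recalled:
        sim = cosine(vector, centers[i])
        if sim >= best_sim:
            best_index, best_sim = i, sim

    if best_index is not None:
        clusters[best_index].append(vector)
        return best_index

    # Claim 6: no recalled center qualifies, so the vector founds a new cluster
    # and serves as its center.
    centers.append(vector)
    clusters.append([vector])
    return len(centers) - 1
```

For example, with centers `[1.0, 0.0]` and `[0.0, 1.0]`, the vector `[0.9, 0.1]` joins the first cluster, whereas `[-1.0, -1.0]` falls below the threshold for both centers and starts a third cluster.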

7. The method according to claim 6, wherein the M cluster centers are indexed and then stored in an index library, and the method further comprises, after classifying the business data to be processed into the newly created cluster:

creating a new index for the cluster center of the newly created cluster; and

adding the cluster center of the newly created cluster and the new index to the index library, so that the index library is used to determine the recalled cluster centers from the cluster centers obtained after clustering.
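
By way of non-limiting illustration, an index library of the kind recited in claim 7 may be sketched in Python as follows. The class `IndexLibrary` and its methods `add` and `recall` are hypothetical names. The sketch assumes integer indexes, vector-valued cluster centers, and an exhaustive cosine ranking for recall; a practical implementation may instead use a dedicated retrieval engine.

```python
import itertools
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class IndexLibrary:
    """Stores cluster centers under integer indexes and recalls the closest ones."""

    def __init__(self):
        self._centers = {}              # index -> center vector
        self._next_id = itertools.count()

    def add(self, center):
        """Create a new index for a center, store both, and return the index."""
        index = next(self._next_id)
        self._centers[index] = list(center)
        return index

    def recall(self, query, n):
        """Return up to n (index, center) pairs, most similar to the query first."""
        ranked = sorted(self._centers.items(),
                        key=lambda item: cosine(query, item[1]),
                        reverse=True)
        return ranked[:n]
```

Because a newly created cluster center is added to the same library that serves recall, it becomes a candidate for the next piece of business data to be processed.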

8. A business data processing device, comprising:

an acquisition module configured to acquire business data to be processed;

a first determination module configured to determine N recalled cluster centers from M cluster centers obtained after clustering, wherein each cluster center represents at least one piece of business data in the same class of business data, and M≥N;

a second determination module configured to determine, from the N recalled cluster centers and according to a first similarity between the business data to be processed and each of the N recalled cluster centers, a target cluster center whose first similarity satisfies a first preset condition; and

a first classification module configured to classify the business data to be processed into a cluster corresponding to the target cluster center.

9. An electronic device comprising:

one or more processors; and

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method according to any one of claims 1-7.

10. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1-7.

11. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-7.