CN115083442B - Data processing method, device, electronic device, and computer-readable storage medium - Google Patents
- Publication number
- CN115083442B; CN202210476041.2A
- Authority
- CN
- China
- Prior art keywords
- data
- inspected
- category
- pieces
- prediction
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/22—Arrangements for supervision, monitoring or testing
- H04M3/2227—Quality of service monitoring
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Description
Technical Field
The present application relates to the technical field of data processing, and in particular to a data processing method, a data processing apparatus, an electronic device, and a computer-readable storage medium.
Background
A growing volume of data must pass quality inspection before it is used in a subsequent stage, so the accuracy of data quality inspection is particularly important. For example, a customer service system generates a large amount of voice data every day. Performing intelligent quality inspection on this voice data to detect non-compliant content in customer service calls can improve service quality and user satisfaction and reduce manual work; it also allows customer service staff to be evaluated, which improves the staff evaluation system. How to achieve highly accurate data quality inspection is therefore one of the key research problems in this field.
Summary of the Invention
The present application provides a data processing method, apparatus, electronic device, and computer-readable storage medium that can improve the accuracy of the prediction results for data to be inspected.
In one aspect, the present application provides a data processing method, the method comprising:
acquiring N pieces of data to be inspected, where N is a positive integer greater than or equal to two;
inputting the N pieces of data to be inspected into a quality inspection model for category prediction, to obtain a prediction result corresponding to each piece of data to be inspected;
determining, based on the prediction result corresponding to each piece of data to be inspected, a predicted category distribution corresponding to the N pieces of data to be inspected;
if the predicted category distribution does not satisfy a prior category distribution, determining M pieces of data to be inspected from the N pieces based on the prediction result corresponding to each piece, and correcting the prediction results of the M pieces;
wherein the prior category distribution is determined from statistics of the category labels corresponding to the sample data in a sample data set, M is a positive integer, and M is less than or equal to N.
In one aspect, the present application provides a data processing apparatus, the data processing apparatus comprising:
an acquisition unit, configured to acquire N pieces of data to be inspected, where N is a positive integer greater than or equal to two;
a prediction unit, configured to input the N pieces of data to be inspected into a quality inspection model for category prediction, to obtain a prediction result corresponding to each piece of data to be inspected;
a determination unit, configured to determine, based on the prediction result corresponding to each piece of data to be inspected, a predicted category distribution corresponding to the N pieces of data to be inspected;
the determination unit being further configured to, if the predicted category distribution does not satisfy a prior category distribution, determine M pieces of data to be inspected from the N pieces based on the prediction result corresponding to each piece;
a correction unit, configured to correct the prediction results of the M pieces of data to be inspected;
wherein the prior category distribution is determined from statistics of the category labels corresponding to the sample data in a sample data set, M is a positive integer, and M is less than or equal to N.
In one aspect, the present application provides an electronic device comprising a processor and a computer storage medium coupled to the processor. A computer program is stored in the computer storage medium, and the processor is configured to execute the computer program to implement the data processing method described above.
In one aspect, the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the data processing method provided by the above technical solution.
The beneficial effects of the embodiments of the present application are as follows. In contrast to the prior art, the data processing method, apparatus, electronic device, and computer-readable storage medium provided by the present application use a prior category distribution, determined from statistics of the category labels corresponding to the sample data in a sample data set, as the basis for judgment. The method identifies the data to be inspected whose prediction results from the quality inspection model do not satisfy the prior category distribution and corrects those prediction results so that the corrected results satisfy the prior category distribution, which can improve the accuracy of the prediction results for the data to be inspected.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:
FIG. 1 is a schematic diagram of an application scenario of the data processing method provided by the present application;
FIG. 2 is a schematic flowchart of a first embodiment of the data processing method provided by the present application;
FIG. 3 is a schematic diagram of another application scenario of the data processing method provided by the present application;
FIG. 4 is a schematic flowchart of a second embodiment of the data processing method provided by the present application;
FIG. 5 is a schematic diagram of another application scenario of the data processing method provided by the present application;
FIG. 6 is a schematic structural diagram of an embodiment of the data processing apparatus provided by the present application;
FIG. 7 is a schematic diagram of an embodiment of the electronic device provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The specific embodiments described here serve only to explain the present application and do not limit it. For ease of description, the drawings show only the parts related to the present application rather than the complete structures. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the scope of protection of the present application.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. Occurrences of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
As Internet information technology is applied ever more deeply in the financial field and enterprises continue to step up innovation, market competition is becoming increasingly fierce. In this competition, user service has become an increasingly important means of differentiating a company from its competitors, enhancing its image, and raising user satisfaction. Managing and controlling the service quality of the customer service system has therefore become an important routine task for business managers, and intelligent voice quality inspection is a major part of it. A customer service system generates a large amount of voice data every day. If this data is put to good use, by carrying out intelligent quality inspection according to the applicable specifications and detecting the non-compliant points in customer service calls, the quality of customer service and user satisfaction can be improved and manual work reduced; customer service staff can also be evaluated, which improves the staff evaluation system.
In one embodiment, when performing quality inspection on data to be inspected, and on voice data in particular, the applicant observed that artificial intelligence is already widely used in voice data processing. The present application can therefore provide a data processing scheme based on artificial intelligence technology, which mainly trains a quality inspection model and then calls the resulting model to inspect the data. The overall flow can be summarized in the following stages: online data pulling, data preprocessing, data labeling, data analysis, model training, model deployment, online bad-case (badcase) collection, and model iteration. Going from one deployed model to the next model version thus depends on collecting an online bad-case dataset, retraining the model, and deploying it again. Although this approach can achieve fairly high inspection accuracy, it is time-consuming and depends on collecting and processing bad-case data after deployment.
In another embodiment, the present application proposes another data processing scheme, which improves inspection accuracy through a post-processing technique applied to the output of the quality inspection model and yields a certain improvement at almost no additional cost. The idea is simple: use prior knowledge of the data distribution to optimize the prediction results of the model. Consider a binary classification problem whose two categories are denoted 1 and 0. For input a, the quality inspection model gives the prediction result p(a) = [0.01, 0.99], and the predicted category corresponding to this result is taken to be 1. For input b, the model gives p(b) = [0.5, 0.5], which is the most uncertain state: the model cannot tell which category to output. Now suppose two pieces of prior knowledge are available in advance: (1) the category must be either 0 or 1; (2) each of the two categories occurs with probability 0.5. Given this prior knowledge, and since the preceding input a was predicted as 1, a simple uniformity argument favors predicting the category of input b as 0, so that the predictions satisfy the second piece of prior knowledge.
Extending this further, suppose the category proportions of the data are already known (that is, the prior knowledge). The category proportions of the prediction results that the quality inspection model produces for the data should then be very close to the prior proportions. A large difference indirectly indicates that the model is predicting poorly and is assigning the wrong category to some of the data. The mispredicted data can then be identified on the basis of the prior knowledge and corrected.
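The comparison between predicted and prior category proportions can be illustrated with a minimal Python sketch. The text at this point does not define how close the two sets of proportions must be, so the per-category tolerance used here (`tolerance`, default 0.05) and the function name `satisfies_prior` are assumptions made purely for illustration.

```python
from typing import Sequence


def satisfies_prior(
    predicted: Sequence[float],
    prior: Sequence[float],
    tolerance: float = 0.05,  # hypothetical threshold, not taken from the source
) -> bool:
    """Return True if every predicted category proportion lies within
    `tolerance` of the corresponding prior category proportion."""
    if len(predicted) != len(prior):
        raise ValueError("predicted and prior must cover the same categories")
    return all(abs(p - q) <= tolerance for p, q in zip(predicted, prior))


# A balanced prior, as in the two-category example above.
prior = [0.5, 0.5]
print(satisfies_prior([0.52, 0.48], prior))  # close to the prior
print(satisfies_prior([0.9, 0.1], prior))    # far from the prior
```

Other closeness measures could serve equally well; the point of the sketch is only that a large gap between the two sets of proportions is what triggers the correction.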
Comparing the two data processing schemes above shows that the latter, unlike the former, does not require repeated model iteration and redeployment and therefore saves time. The following embodiments of the present application accordingly focus on the latter scheme.
The latter data processing scheme can be summarized as follows:
Acquire N pieces of data to be inspected, where N is a positive integer greater than or equal to two; input the N pieces of data to be inspected into a quality inspection model for category prediction, to obtain a prediction result corresponding to each piece; determine, based on the prediction result corresponding to each piece, a predicted category distribution corresponding to the N pieces; if the predicted category distribution does not satisfy a prior category distribution, determine M pieces of data to be inspected from the N pieces based on the prediction result corresponding to each piece, and correct the prediction results of the M pieces. Here, the prior category distribution is determined from statistics of the category labels corresponding to the sample data in a sample data set, M is a positive integer, and M is less than or equal to N.
The latter scheme uses the prior category distribution, determined from statistics of the category labels corresponding to the sample data in the sample data set, as the basis for judgment. It identifies the data to be inspected whose prediction results from the quality inspection model do not satisfy the prior category distribution and corrects those prediction results so that the corrected results satisfy the prior category distribution, which can improve the accuracy of the prediction results. In addition, the scheme does not require repeated model iteration and redeployment, which saves time.
The data processing scheme of the present application can be executed by an electronic device. The electronic device may be a terminal device, such as a smartphone, tablet computer, notebook computer, desktop computer, intelligent voice interaction device, smart home appliance, smart watch, vehicle-mounted terminal, or aircraft. Alternatively, the electronic device may include a server, such as an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides cloud computing services.
The data processing scheme above can mainly be applied to voice quality inspection of the dialogue data between customer service agents and users in a customer service system. FIG. 1 shows an application scenario provided by an embodiment of the present application. In FIG. 1, the dialog box contains dialogue data between a customer service agent and a user; the dialogue data may be voice data or text data. The dialogue data is pulled periodically, and the data processing scheme of the present application calls the quality inspection model to inspect it, obtaining a predicted inspection result for each piece of dialogue data. The predicted category distribution of the dialogue data is determined from these predicted results. If the predicted category distribution does not satisfy the prior category distribution, several pieces of dialogue data are selected based on the predicted results, and their predicted results are corrected to obtain the final inspection results. An indication of the inspection result is then output to the customer service agent: a good result requires no action, whereas a bad result means the agent needs to adjust their wording.
Based on the data processing scheme above, an embodiment of the present application provides a data processing method. FIG. 2 is a schematic flowchart of the data processing method provided by the present application. The method shown in FIG. 2 can be executed by an electronic device, specifically by a processor of the electronic device, and includes the following steps:
Step 21: Acquire N pieces of data to be inspected, where N is a positive integer greater than or equal to two.
In some embodiments, the data to be inspected may be voice data, image data, text data, or the like.
The data to be inspected can be obtained in different scenarios. For example, image data may come from monitoring equipment in a surveillance scenario, voice data may come from conversations between customer service agents and users, and text data may come from text exchanges between customer service agents and users.
Step 22: Input the N pieces of data to be inspected into the quality inspection model for category prediction, to obtain a prediction result corresponding to each piece of data to be inspected.
The quality inspection model may be built on a convolutional neural network, a deconvolutional neural network, a deep convolutional inverse graphics network, a generative adversarial network, a recurrent neural network, or the like.
As the name suggests, the quality inspection model performs quality inspection on the data to be inspected. Here, quality inspection may mean defining several categories in advance for the given type of data and then calling the quality inspection model to predict which category each piece of data belongs to. For example, if the data to be inspected is dialogue data between customer service agents and customers from a customer service system, the predefined categories may be qualified dialogue data and unqualified dialogue data; or dialogue data that contains a keyword and dialogue data that does not; or dialogue data with positive emotion and dialogue data with negative emotion. As another example, if the data to be inspected is image data, the predefined categories may be image data that contains a target object and image data that does not.
Optionally, if W categories are defined in advance for the data to be inspected, the prediction result that the quality inspection model produces for each piece of data includes the probability that the piece belongs to each of the W categories. For example, suppose the data to be inspected is dialogue data and the predefined categories are dialogue data with positive emotion, denoted 1, and dialogue data with negative emotion, denoted 0. When a piece of data is input into the quality inspection model, the output prediction result consists of the probability that the piece belongs to category 1 and the probability that it belongs to category 0.
The quality inspection model is trained on the labeled sample data in the sample data set. For example, in image recognition the sample data is image data with category labels; in voice quality inspection it may be voice dialogue data with category labels; and in text quality inspection it may be text data with category labels.
In some embodiments, the pieces of data to be inspected and the sample data may be obtained from the same server over time periods of the same length but at different points in time. The inventors found that data generated over periods of the same length follows corresponding regularities; for example, it conforms to a normal distribution, and the ratio of positive data to negative data is essentially the same. The two kinds of data are therefore comparable in the subsequent steps.
Step 23: Determine, based on the prediction result corresponding to each piece of data to be inspected, the predicted category distribution corresponding to the N pieces of data to be inspected.
After the prediction in step 22, a prediction result is obtained for each piece of data to be inspected; it includes the probability that the piece belongs to each of the W categories, where W is a positive integer greater than or equal to two. To determine the predicted category distribution of the N pieces from these prediction results, the predicted category of each piece must first be determined from its prediction result.
In a specific implementation, determining the predicted category of each piece of data to be inspected from its prediction result includes taking the category with the largest probability in the prediction result as the predicted category of that piece. For example, let W equal 2, so that the data to be inspected has two categories, defined as the first category and the second category. If piece a has a probability of 90% for the first category and 10% for the second category, the first category is determined to be the predicted category of piece a.
As another example, let W equal 3, so that the data to be inspected has three categories, defined as the first category, the second category, and the third category. If piece b has a probability of 70% for the first category, 10% for the second category, and 20% for the third category, the first category is determined to be the predicted category of piece b.
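The largest-probability rule in the two examples above can be sketched in Python as follows. The function names are hypothetical, categories are represented by zero-based indices (index 0 standing for the "first category"), and ties are broken toward the lower index, which is an assumption because the text does not say how ties are handled.

```python
from typing import List, Sequence


def predicted_category(probs: Sequence[float]) -> int:
    """Return the zero-based index of the category with the largest probability.

    Ties go to the lowest index (an assumption; the source does not address ties).
    """
    if not probs:
        raise ValueError("probs must not be empty")
    return max(range(len(probs)), key=lambda i: probs[i])


def predicted_categories(results: Sequence[Sequence[float]]) -> List[int]:
    """Apply the largest-probability rule to each prediction result."""
    return [predicted_category(p) for p in results]


# Piece a (W = 2) and piece b (W = 3) from the examples above.
piece_a = [0.9, 0.1]
piece_b = [0.7, 0.1, 0.2]
print(predicted_category(piece_a))  # index 0, i.e. the first category
print(predicted_category(piece_b))  # index 0, i.e. the first category
```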
在得到每个待质检数据对应的预测类别之后,可以基于每个待质检数据对应的预测类别统计N个待质检数据服从的预测类别分布。在一些实施例中,预测类别分布可以用类别比例表示。After the predicted category corresponding to each data to be quality checked is obtained, the predicted category distributions of the N pieces of data to be quality checked can be calculated based on the predicted category corresponding to each data to be quality checked. In some embodiments, predicted class distributions may be represented by class proportions.
具体地,统计属于W个类别中每个类别的待质检数据的数量,然后基于每个类别下待质检数据的数量统计N个待质检数据的预测类别分布。举例来说,假设W等于2,即待质检数据对应2个类别,将这两个类别定义为第一类别和第二类别。统计每一类别对应的待质检数据的数量。即第一类别对应第一数量,第二类别对应第二数量。其中,第一数量和第二数量之和等于N。由此,可以确定出每一类别的待质检数据对应的数量占比。Specifically, count the number of data to be inspected belonging to each of the W categories, and then count the distribution of predicted categories of the data to be inspected based on the number of data to be inspected under each category. For example, assuming that W is equal to 2, that is, the data to be inspected corresponds to two categories, and these two categories are defined as the first category and the second category. Count the number of data to be inspected corresponding to each category. That is, the first category corresponds to the first quantity, and the second category corresponds to the second quantity. Wherein, the sum of the first quantity and the second quantity is equal to N. Thus, the quantity proportion corresponding to each category of data to be quality inspected can be determined.
在W等于3时,即待质检数据对应3个类别,将这三个类别定义为第一类别、第二类别和第三类别。假设第一类别下待质检数据的数量为第一数量,第二类别下待质检数据的数量为第二数量、第三类别下待质检数据的数量为第三数量。其中,第一数量、第二数量和第三数量之和等于N。由此,可以确定出每一类别对应的待质检数据的数量占比。When W is equal to 3, that is, the data to be inspected corresponds to three categories, and these three categories are defined as the first category, the second category, and the third category. Assume that the quantity of data to be quality inspected under the first category is the first quantity, the quantity of data to be quality inspected under the second category is the second quantity, and the quantity of data to be quality inspected under the third category is the third quantity. Wherein, the sum of the first quantity, the second quantity and the third quantity is equal to N. Thus, the proportion of the quantity of data to be inspected corresponding to each category can be determined.
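The counting described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `predicted_distribution`, the list-of-lists input format, and the sample probabilities are assumptions made for the example.

```python
from collections import Counter

def predicted_distribution(prob_rows):
    """Return (predicted categories, per-category proportions).

    prob_rows holds one list of W probability values per data item
    (hypothetical input format chosen for this sketch).
    """
    num_categories = len(prob_rows[0])
    # The predicted category is the index with the largest probability value.
    predicted = [max(range(num_categories), key=row.__getitem__) for row in prob_rows]
    counts = Counter(predicted)
    n = len(prob_rows)
    # Proportion of the N items that fall in each category.
    proportions = [counts.get(k, 0) / n for k in range(num_categories)]
    return predicted, proportions

probs = [
    [0.7, 0.1, 0.2],  # data item b from the text: first category wins
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
predicted, proportions = predicted_distribution(probs)
print(predicted)    # [0, 1, 0, 2]
print(proportions)  # [0.5, 0.25, 0.25]
```

The per-category counts always sum to N, so the proportions sum to 1.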
Step 24: If the predicted category distribution does not satisfy the prior category distribution, determine M data items from the N data items to be inspected based on the prediction result of each data item.
Here, the prior category distribution is determined statistically from the category labels of the sample data in a sample data set, and M is a positive integer less than or equal to N. Optionally, the sample data set may be sampled T times to obtain T sample subsets; the number of sample data items in each category of each sample subset is counted; the category proportions for each sampling pass are determined from these counts; and the prior category distribution is determined from the category proportions of all sampling passes.
Optionally, each sample subset includes a training subset and a test subset. Sampling the sample data set T times to obtain T sample subsets includes: determining, for each sampling pass, the ratio of the number of sample data items in the training subset to the number in the test subset; and sampling the sample data set according to this ratio to obtain T training subsets and T test subsets. The training subset usually contains more sample data than the test subset; for example, the ratio of training data to test data may be 9:1, 8:1, or 7:1.
The sample data set is thus sampled according to the ratio determined for each pass to obtain the corresponding training subset and test subset. After T sampling passes, T sample subsets are obtained.
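One way to picture the T sampling passes is the sketch below. It assumes each pass shuffles the whole labeled set and cuts it at the chosen ratio; the source does not specify the sampling procedure, and the names `sample_subsets`, `train_parts`, `test_parts`, and `seed` are hypothetical.

```python
import random

def sample_subsets(samples, num_rounds, train_parts=9, test_parts=1, seed=0):
    """Shuffle and split the sample set num_rounds times at train_parts:test_parts."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    subsets = []
    for _ in range(num_rounds):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        # Index where the training subset ends for this ratio.
        cut = len(shuffled) * train_parts // (train_parts + test_parts)
        subsets.append((shuffled[:cut], shuffled[cut:]))
    return subsets

# Illustrative labeled data: (item id, label).
labeled = [(f"item{i}", i % 4 == 0) for i in range(100)]
subsets = sample_subsets(labeled, num_rounds=5)
train, test = subsets[0]
print(len(subsets), len(train), len(test))  # 5 90 10
```

With a 9:1 ratio and 100 samples, every pass yields 90 training items and 10 test items, and T = 5 passes yield five such pairs.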
Further, the quality inspection model can be trained and evaluated with these subsets. Specifically, the model is trained on the sample data in the training subset and then tested on the sample data in the test subset to determine its accuracy.
In some embodiments, the categories of the sample data include a first category C1 and a second category C2. The category proportion of C1 may then be c1/(c1+c2), and the category proportion of C2 may be c2/(c1+c2), where c1 is the number of sample data items in C1 and c2 is the number in C2.
In some embodiments, the categories of the sample data include a first category C1, a second category C2, and a third category C3. The category proportion of C1 may be c1/(c1+c2+c3), that of C2 may be c2/(c1+c2+c3), and that of C3 may be c3/(c1+c2+c3), where c3 is the number of sample data items in C3.
That is, each category has one category proportion, and the category proportions sum to 1.
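The proportions c1/(c1+c2+c3) and so on can be computed directly from a list of labels. The following is a small sketch; the function name `category_proportions` and the label strings are illustrative assumptions.

```python
from collections import Counter

def category_proportions(labels, categories):
    """Map each category to (count in that category) / (total count)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts.get(c, 0) / total for c in categories}

# Illustrative subset with c1 = 6, c2 = 3, c3 = 1.
labels = ["C1"] * 6 + ["C2"] * 3 + ["C3"] * 1
ratios = category_proportions(labels, ["C1", "C2", "C3"])
print(ratios)  # {'C1': 0.6, 'C2': 0.3, 'C3': 0.1}
```

Passing the category list explicitly keeps a category with zero samples in the result with proportion 0, so every category always has exactly one proportion.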
In some embodiments, the prior category distribution may be represented by a probability interval. Optionally, determining the prior category distribution from the category proportions of each sampling pass includes: computing the mean and variance of the category proportions over the T sampling passes; taking the weighted difference of the mean and variance as the minimum of the probability interval; and taking the weighted sum of the mean and variance as the maximum of the probability interval.
The mean can be determined by the following formula:
$$\mu = \frac{1}{T}\sum_{i=1}^{T} X_i$$
The variance can be determined by the following formula:
$$\sigma^2 = \frac{1}{T}\sum_{i=1}^{T} \left(X_i - \mu\right)^2$$
Here T is the number of category proportions obtained for each category, X_i is the i-th category proportion, μ is the mean, and σ² is the variance; σ is the quantity that enters the interval bounds μ-3σ and μ+3σ.
When the data to be inspected corresponds to a first category and a second category, T sampling passes yield T category proportions for the first category and T category proportions for the second category. The first category thus has its own mean and variance, and so does the second category.
When the data to be inspected corresponds to a first category, a second category, and a third category, T sampling passes yield T category proportions for each of the three categories. Each of the three categories thus has its own mean and variance.
The weighted difference of the mean and variance is then taken as the minimum of the probability interval, and the weighted sum of the mean and variance is taken as the maximum of the probability interval.
In some embodiments, the weights of the mean and variance can be set as required; for example, the weight of the mean is set to 1 and the weight of the variance is set to 3. The weighted difference is then μ-3σ and the weighted sum is μ+3σ, so the prior category distribution can be expressed as [μ-3σ, μ+3σ].
If the predicted category distribution lies within the probability interval, the predicted category distribution satisfies the prior category distribution.
If the predicted category distribution lies outside the probability interval, the predicted category distribution does not satisfy the prior category distribution. In that case the prediction results of the N data items to be inspected contain errors, and the data items with incorrect prediction results need to be identified.
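The interval test can be sketched as below for a single category. Two assumptions are made that the source does not confirm: σ is taken to be the population standard deviation of the T proportions, and the bounds are treated as inclusive. The function names and sample values are hypothetical.

```python
from statistics import mean, pstdev

def prior_interval(proportions, weight=3.0):
    """Return (mu - weight*sigma, mu + weight*sigma) for one category."""
    mu = mean(proportions)
    sigma = pstdev(proportions)  # assumption: population standard deviation
    return mu - weight * sigma, mu + weight * sigma

def satisfies_prior(predicted_proportion, interval):
    """True if the predicted proportion lies inside the prior interval."""
    low, high = interval
    return low <= predicted_proportion <= high

# Illustrative first-category proportions from T = 5 sampling passes.
first_category_ratios = [0.74, 0.75, 0.76, 0.75, 0.75]
interval = prior_interval(first_category_ratios)
print(satisfies_prior(0.75, interval))  # True
print(satisfies_prior(0.90, interval))  # False
```

With weight 3 and mean weight 1 this reproduces the [μ-3σ, μ+3σ] interval; other weights can be passed if a tighter or looser check is wanted.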
In some embodiments, a difference value can be determined from the predicted category distribution and the prior category distribution; the number M of data items with abnormal prediction results can be determined from the difference value and the number N of data items to be inspected; and M data items are then determined from the N data items.
Step 25: Correct the prediction results of the M data items to be inspected.
When W equals 2, the data to be inspected corresponds to two categories, defined as the first category and the second category. Once the M data items have been determined, correcting their prediction results may consist of changing the prediction result of each of the M data items predicted as the first category to the second category, and changing the prediction result of each of the M data items predicted as the second category to the first category.
When W equals 3, the data to be inspected corresponds to three categories, defined as the first category, the second category, and the third category. Among the M data items, those whose prediction result is the first category, those whose prediction result is the second category, and those whose prediction result is the third category are identified.
During prediction, each data item to be inspected has a probability value for every category, and the category with the largest probability value is taken as the prediction result. That is, each data item has a probability value for the first category, the second category, and the third category. After the M data items have been grouped by predicted category as described above, the category with the second-largest probability value is determined for each of them, and the prediction result of each of the M data items is changed to that category.
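The correction rule above (replace the prediction with the category of second-largest probability) can be sketched as follows. The function names, the index-based representation of categories, and the sample probabilities are assumptions for illustration; ties are broken by category order, which the source does not address.

```python
def second_best_category(probabilities):
    """Index of the category with the second-largest probability value."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda k: probabilities[k], reverse=True)
    return ranked[1]

def correct_predictions(prob_rows, abnormal_indices):
    """Argmax prediction for every item, overridden for the abnormal ones."""
    corrected = [max(range(len(r)), key=r.__getitem__) for r in prob_rows]
    for i in abnormal_indices:
        corrected[i] = second_best_category(prob_rows[i])
    return corrected

probs = [
    [0.70, 0.10, 0.20],
    [0.40, 0.35, 0.25],  # flagged as abnormal in this example
    [0.10, 0.80, 0.10],
]
print(correct_predictions(probs, abnormal_indices=[1]))  # [0, 1, 1]
```

For W equal to 2, the second-largest category is simply the other category, so the same function also covers the two-category swap.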
In this embodiment, the prior category distribution, determined statistically from the category labels of the sample data in the sample data set, serves as the criterion for identifying data items whose category predictions from the quality inspection model do not satisfy the prior category distribution, and the prediction results of those data items are corrected. On the one hand, this mitigates abnormal predictions caused by limited training accuracy without retraining the quality inspection model, which reduces training cost. On the other hand, correcting the prediction results of data items that do not satisfy the prior category distribution improves the accuracy of the prediction results for the data to be inspected.
An application scenario is described below with reference to FIG. 3:
Step 301: Pull online data.
In step 301, data is pulled at random from the online system so that the pulled data follows the same distribution as the real online data. In a voice quality inspection scenario, voice conversation data between customer service agents and users can be pulled. "Online" here refers to the system that produces the data; for example, voice data is produced by a customer service system.
Step 302: Label the data.
The pulled data can be labeled manually to determine the category of each data item. For example, the data obtained online is labeled according to the business logic and a specified labeling format.
Step 303: Divide the data into a training subset and a test subset.
The labeled data is randomly shuffled and then divided into a training subset and a test subset at a given ratio.
Step 304: Compute category proportions.
The number of data items of each category in the training subset and in the test subset is counted, and the corresponding category proportions are calculated.
Step 305: Calculate the mean and variance for each category.
Steps 303 and 304 are repeated T times, so that each category has T category proportions. The mean and variance for each category can be calculated from these T category proportions.
Step 306: Train the quality inspection model.
The quality inspection model is trained iteratively on the labeled data. The training subset and test subset obtained from the division can be used for this iterative training.
Step 307: Test the model.
The trained quality inspection model is used to make predictions on the data in the test subset.
Step 308: Calculate the category proportions of the prediction results.
The category proportions are determined from the prediction results of step 307.
Step 309: Compare.
After the mean and variance for each category have been calculated in step 305, the prior category distribution of each category is determined from its mean and variance.
If the category proportions in the prediction results do not satisfy their corresponding prior category distributions, step 310 is executed. If they do, the prediction is determined to be normal.
Step 310: Identify the data with abnormal prediction results.
The entropy of each data item can be determined from the probability value of each category in its prediction result, and the data with abnormal prediction results can be identified from the entropy. For details, refer to any of the above embodiments; they are not repeated here.
In some embodiments, the data with abnormal prediction results is identified according to the category proportions. Consequently, if M data items with abnormal prediction results are identified among the data predicted as the first category, M data items with abnormal prediction results also exist among the data predicted as the second category, and these must likewise be identified. That is, when the prediction results of data in one of several categories are abnormal, the remaining categories also contain data with abnormal prediction results.
Step 311: Correct.
The prediction results of the data with abnormal prediction results are corrected.
In one application scenario, the data may correspond to a first category and a second category. If a data item with an abnormal prediction result is predicted as the first category, its prediction result is changed to the second category.
Similarly, if data with abnormal prediction results is identified in the first category, data with abnormal prediction results also exists in the second category, and its prediction results must be corrected as well.
In another application scenario, the data may correspond to a first category, a second category, and a third category. If a data item with an abnormal prediction result is predicted as the first category, its probability values for the second category and the third category are determined, and the larger of the two is selected. The prediction result of the data item is then changed from the first category to the category with the larger probability value. For example, if the probability value of the second category is larger, the prediction result of the data item is changed to the second category.
Similarly, if data with abnormal prediction results is identified in the first category, data with abnormal prediction results also exists in the second category and the third category, and its prediction results must be corrected as well.
In other embodiments, manual correction may be used. For details, refer to the description in any of the other embodiments; they are not repeated here.
Referring to FIG. 4, FIG. 4 is a schematic flowchart of a second embodiment of the data processing method provided by the present application. The method includes:
Step 41: Obtain N data items to be inspected, where N is a positive integer greater than or equal to 2.
Step 42: Input the N data items to be inspected into the quality inspection model for category prediction to obtain a prediction result for each data item.
Steps 41 and 42 use the same or similar technical solutions as any of the above embodiments, which are not repeated here.
Step 43: Determine the category with the largest probability value in the prediction result of each data item as the predicted category of that data item.
In this embodiment, the prediction result of each data item to be inspected includes the probability value that the data item belongs to each of the W categories.
The category with the largest probability value in the prediction result of each data item can therefore be determined as the predicted category of that data item.
Step 44: According to the predicted category of each data item, count the number of data items belonging to each category to obtain the predicted category distribution of the N data items.
In some embodiments, when W equals 2, the data to be inspected corresponds to two categories, defined as the first category and the second category.
After the model prediction and predicted-category determination of steps 42 and 43, each data item corresponds to one category. The number of data items in each category is then counted: the first category has a first quantity and the second category has a second quantity, and the two quantities sum to N. The proportion of data items in each category can thus be determined.
When W equals 3, the data to be inspected corresponds to three categories, defined as the first category, the second category, and the third category.
After the model prediction and predicted-category determination of steps 42 and 43, each data item corresponds to one category. The number of data items in each category is then counted: the first category has a first quantity, the second category has a second quantity, and the third category has a third quantity, and the three quantities sum to N. The proportion of data items in each category can thus be determined.
Step 45: If the predicted category distribution does not satisfy the prior category distribution, determine a difference value from the predicted category distribution and the prior category distribution.
Here, the prior category distribution is determined statistically from the category labels of the sample data in the sample data set.
In some embodiments, the predicted category distribution may be expressed as the proportion of each category, and the prior category distribution may be expressed as a probability interval for each category proportion; the difference value can then be determined from each category proportion and its probability interval.
If a category proportion is greater than the maximum of its probability interval, the difference value is obtained by subtracting that maximum from the category proportion.
If a category proportion is less than the minimum of its probability interval, the difference value is obtained by subtracting the category proportion from that minimum.
Specifically, this can be expressed by the following formula:
$$\Delta = \begin{cases} Y - (\mu + 3\sigma), & Y > \mu + 3\sigma \\ (\mu - 3\sigma) - Y, & Y < \mu - 3\sigma \end{cases}$$
Here Δ denotes the difference value, Y denotes a category proportion, μ+3σ is the maximum of the probability interval of that category proportion, and μ-3σ is the minimum of the probability interval.
Step 46: Using the difference value and the number N of data items to be inspected, determine the number M of data items with abnormal prediction results.
Here, M is a positive integer less than or equal to N.
The number M of data items with abnormal prediction results is determined from the difference value. For example, the difference value is multiplied by the number N of data items to be inspected to obtain M.
It will be understood that, because each category proportion and the bounds of its probability interval are all less than 1, the difference value is also a fraction less than 1, so the number M of data items with abnormal prediction results is smaller than the number N of data items to be inspected.
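Steps 45 and 46 can be illustrated with the sketch below. The piecewise difference follows the two cases stated above; returning 0 inside the interval and rounding the product to the nearest integer are assumptions, since the source does not say how a fractional result is handled. The function names and numbers are hypothetical.

```python
def difference_value(y, low, high):
    """Distance of category proportion y from the interval [low, high]."""
    if y > high:
        return y - high
    if y < low:
        return low - y
    return 0.0  # assumption: no difference when y lies inside the interval

def abnormal_count(y, low, high, n):
    """M = difference value * N, rounded to the nearest integer (assumption)."""
    return round(difference_value(y, low, high) * n)

# Illustrative interval [0.70, 0.80] and N = 1000 data items.
print(abnormal_count(0.82, 0.70, 0.80, 1000))  # 20
print(abnormal_count(0.65, 0.70, 0.80, 1000))  # 50
print(abnormal_count(0.75, 0.70, 0.80, 1000))  # 0
```

Because the difference value is below 1, the resulting M never exceeds N.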
Step 47: Determine M data items from the N data items to be inspected.
In some embodiments, step 47 includes: determining the entropy of each data item to be inspected from the probability value of each category in its prediction result; and determining the M data items from the N data items according to the entropy of each data item.
For example, during prediction the quality inspection model predicts the probability value of each category for each data item to be inspected, and the label category is then determined from the magnitudes of these probability values. Specifically, the entropy of each data item can be determined by the following formula:
$$H(X) = -\sum_{i=1}^{m} p(x_i)\log p(x_i)$$
Here H(X) denotes the entropy; p(x_i) denotes the probability value of x_i, that is, the probability value predicted by the quality inspection model for each category; m denotes the number of categories; and x_i denotes the i-th category of the data item to be inspected.
After the entropies of the N data items to be inspected have been determined, they are sorted, either in descending or in ascending order.
When the entropies are sorted in descending order, the first M entropies are taken and the corresponding data items are identified. These are the data items with abnormal prediction results.
When the entropies are sorted in ascending order, the last M entropies are taken and the corresponding data items are identified. These are the data items with abnormal prediction results.
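The entropy-based selection can be sketched as follows. The natural logarithm is an assumption (the source does not state the base, and the base does not change the ranking), and the function names and sample probabilities are hypothetical.

```python
import math

def entropy(probabilities):
    """Shannon entropy of one prediction; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def select_abnormal(prob_rows, m):
    """Indices of the m data items with the largest entropy."""
    order = sorted(range(len(prob_rows)),
                   key=lambda i: entropy(prob_rows[i]), reverse=True)
    return order[:m]

probs = [
    [0.95, 0.05],  # confident prediction, low entropy
    [0.55, 0.45],
    [0.80, 0.20],
    [0.50, 0.50],  # least confident prediction, highest entropy
]
print(select_abnormal(probs, 2))  # [3, 1]
```

Sorting in descending order and taking the first M items is equivalent to sorting in ascending order and taking the last M, so only one direction is implemented.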
步骤48:对M个待质检数据的预测结果进行修正。Step 48: Correct the prediction results of the M data to be inspected.
在一些实施例中,步骤48可以通过人工修改的方式对M个待质检数据的预测结果进行修正。如,显示修改界面,其中,修改界面显示M个待质检数据以及M个待质检数据的预测结果;接收对M个待质检数据中任意待质检数据的修正信息,利用修正信息对任意待质检数据的预测结果进行修正。In some embodiments, step 48 may correct the prediction results of the M data to be quality-checked by manual modification. For example, a modification interface is displayed, wherein, the modification interface displays M data to be quality inspected and prediction results of M data to be quality inspected; receive correction information for any data to be quality inspected in the M data to be quality inspected, and use the correction information to The prediction results of any pending quality inspection data are corrected.
结合图5进行说明:Combined with Figure 5 for illustration:
如图5所示,修改界面显示了待质检数据a、待质检数据b和待质检数据c,以及对应的预测结果。每一待质检数据对应一修改按钮。As shown in FIG. 5 , the modification interface displays data to be inspected a, data to be inspected b, and data to be inspected c, as well as corresponding prediction results. Each data to be inspected corresponds to a modification button.
After the modify button is selected, an input field or a selection field may pop up. If an input field pops up, the correction information entered by the user is received and used to correct the prediction result of any of the data to be inspected. For example, the category of data a to be inspected is changed from the first category to the second category.
If a selection field pops up, each selectable category may be displayed in it. The category selected by the user is received and used to correct the prediction result of any of the data to be inspected. For example, the category of data b to be inspected is changed from the second category to the first category.
In this embodiment, the prior category distribution, determined statistically from the category labels of the sample data in the sample data set, serves as the criterion for identifying data to be inspected whose prediction results from the quality inspection model do not satisfy the prior category distribution, and the prediction results of that data are corrected. On the one hand, this mitigates abnormal predictions caused by limited training accuracy of the quality inspection model without retraining the model, which reduces training cost. On the other hand, correcting the prediction results of the data that do not satisfy the prior category distribution improves the accuracy of the prediction results of the data to be inspected.
Further, counting the number of pieces of data to be inspected under each category yields the predicted category distribution of the N pieces of data to be inspected. Because the count under each category is known, the predicted category distribution is obtained precisely, which makes it easier to subsequently identify the data to be inspected when the predicted category distribution does not satisfy the prior category distribution.
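The counting step can be illustrated with a minimal sketch. It assumes each prediction result is a list of W probability values indexed by category; the function name and the sample probabilities are hypothetical and are not taken from the application.

```python
from collections import Counter


def predicted_category_distribution(prediction_results):
    """Take the highest-probability category of each item as its predicted
    category, then count how many items fall under each category."""
    predicted = [
        max(range(len(probs)), key=probs.__getitem__)
        for probs in prediction_results
    ]
    return predicted, Counter(predicted)


# Hypothetical probabilities for N = 4 items over W = 2 categories.
results = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.55, 0.45]]
labels, distribution = predicted_category_distribution(results)
print(labels, dict(distribution))
```

The resulting counts per category can then be turned into a ratio and compared with the prior category distribution.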
In addition, the method of any of the above embodiments can be used not only for data quality inspection but also in other fields, such as data labeling. In data labeling, the data to be inspected is the data to be labeled. After the quality inspection model predicts the categories of the data to be labeled, the abnormally predicted data to be labeled is identified and its prediction results are corrected. The corrected predicted categories are then used as the labels of that data, and the uncorrected predicted categories are used as the labels of the remaining data to be labeled. In this way, on the one hand, using the network model's predictions as labels improves labeling efficiency; on the other hand, modifying the labels of abnormally labeled samples mitigates abnormal predictions caused by limited training accuracy of the network model and thus improves labeling accuracy.
In one application scenario, 10,000 pieces of data are randomly obtained online and labeled; suppose the positive-to-negative ratio is 3:1. This is repeated independently n times, and the positive-to-negative sample ratio of each group is computed, giving [x1, x2, …, xn]. These ratios follow a normal distribution. Suppose the mean is 3.01 and the variance is 0.05; the category ratio then lies mainly within [2.86, 3.16], that is, 3.01 ± 3 × 0.05. The trained model is then used to predict the samples to be labeled, and in theory the category ratio of the prediction results should fall within [2.86, 3.16]. If it falls outside this range, the model's predictions contain errors and need to be corrected. Suppose the predicted category ratio is 3.2; then 1000*(3.2-3.16)=40 pieces of data in the prediction results are in error. Entropy is then computed from the probability values of the prediction results, the top 40 pieces of data are selected, and they are corrected.
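The arithmetic of this scenario can be reproduced with a short sketch. The function name is hypothetical, the scale factor of 1000 is taken from the example as written, and the symmetric handling of a ratio below the lower bound is an assumption that the text does not spell out.

```python
def abnormal_count(predicted_ratio, low, high, scale):
    """Return how many items are assumed to be mispredicted: the distance by
    which the predicted ratio falls outside [low, high], times `scale`."""
    if predicted_ratio > high:
        difference = predicted_ratio - high
    elif predicted_ratio < low:
        # Assumption: a ratio below the interval is treated symmetrically.
        difference = low - predicted_ratio
    else:
        return 0  # The prediction satisfies the prior interval.
    return round(scale * difference)


m = abnormal_count(3.2, 2.86, 3.16, 1000)
print(m)
```

Rounding is used because the floating-point difference 3.2 - 3.16 is not exactly 0.04.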
Referring to FIG. 6, FIG. 6 is a schematic structural diagram of an embodiment of the data processing device provided in the present application. The data processing device 60 includes an acquisition unit 61, a prediction unit 62, a determination unit 63, and a correction unit 64.
The acquisition unit 61 is configured to acquire N pieces of data to be inspected, where N is a positive integer greater than or equal to two.
The prediction unit 62 is configured to input the N pieces of data to be inspected into the quality inspection model for category prediction, obtaining a prediction result for each piece of data to be inspected.
The determination unit 63 is configured to determine, based on the prediction result of each piece of data to be inspected, the predicted category distribution of the N pieces of data to be inspected; and, if the predicted category distribution does not satisfy the prior category distribution, to determine M pieces of data to be inspected from the N pieces based on the prediction result of each piece.
The correction unit 64 is configured to correct the prediction results of the M pieces of data to be inspected, where the prior category distribution is determined statistically from the category labels of the sample data in the sample data set, M is a positive integer, and M is less than or equal to N.
In one embodiment, the prediction result of each piece of data to be inspected includes a probability value that the piece belongs to each of W categories. When determining the predicted category distribution of the N pieces of data to be inspected based on the prediction result of each piece, the determination unit 63 performs the following steps:
determining the category with the largest probability value in the prediction result of each piece of data to be inspected as the predicted category of that piece;
counting, according to the predicted category of each piece of data to be inspected, the number of pieces belonging to each category, to obtain the predicted category distribution of the N pieces of data to be inspected.
In one embodiment, when determining M pieces of data to be inspected from the N pieces based on the prediction result of each piece, the determination unit 63 performs the following steps:
determining a difference value using the predicted category distribution and the prior category distribution;
determining, using the difference value and the number N of pieces of data to be inspected, the number M of pieces of data to be inspected with abnormal prediction results;
determining M pieces of data to be inspected from the N pieces of data to be inspected.
In one embodiment, when determining M pieces of data to be inspected from the N pieces of data to be inspected, the determination unit 63 performs the following steps:
determining the entropy of each piece of data to be inspected using the probability value of each category in the prediction result of that piece;
determining M pieces of data to be inspected from the N pieces according to the entropy of each piece of data to be inspected.
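The entropy-based selection can be sketched as follows. The sketch assumes Shannon entropy with the natural logarithm and assumes that the M items with the highest entropy (the least certain predictions) are the ones selected; the text only says that entropy is computed and the top M items are chosen. All names and probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of one item's category probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_top_m_by_entropy(prediction_results, m):
    """Return the indices of the m items with the highest entropy."""
    ranked = sorted(
        range(len(prediction_results)),
        key=lambda i: entropy(prediction_results[i]),
        reverse=True,
    )
    return ranked[:m]


# Hypothetical probabilities for four items over two categories.
results = [[0.99, 0.01], [0.5, 0.5], [0.8, 0.2], [0.6, 0.4]]
picked = select_top_m_by_entropy(results, 2)
print(picked)
```

Items whose probabilities are close to uniform rank first, so the near-certain prediction at index 0 is never selected here.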
In one embodiment, the data processing device further includes a processing unit 65, which is configured to:
perform data sampling on the sample data set T times to obtain T sample subsets;
count the number of pieces of sample data under each category in each sample subset;
determine the category ratio for each data sampling pass based on the number of pieces of sample data under each category in each sample subset;
determine the prior category distribution according to the category ratio for each data sampling pass.
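The first three of these steps can be sketched for the two-category case. The sketch assumes labels "pos" and "neg", sampling without replacement, and a positive-to-negative ratio as the category ratio; the function name, the subset size, and the seed are hypothetical.

```python
import random
from collections import Counter


def sampled_category_ratios(labels, t, subset_size, seed=0):
    """Draw t random subsets of the labels and return the positive-to-negative
    ratio observed in each subset."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(t):
        subset = rng.sample(labels, subset_size)
        counts = Counter(subset)
        # Guard against division by zero if a subset has no negatives.
        ratios.append(counts["pos"] / max(counts["neg"], 1))
    return ratios


# Hypothetical sample data set with a 3:1 positive-to-negative ratio.
sample_labels = ["pos"] * 750 + ["neg"] * 250
ratios = sampled_category_ratios(sample_labels, t=5, subset_size=300)
print(ratios)
```

The list of T ratios is the input from which the prior category distribution is then derived.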
In one embodiment, each sample subset includes a training subset and a test subset. When performing data sampling on the sample data set T times to obtain T sample subsets, the processing unit 65 performs the following steps:
determining the ratio between the numbers of pieces of sample data in the training subset and the test subset for each data sampling pass;
performing data sampling on the sample data set based on that ratio to obtain T training subsets and T test subsets.
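One way to read this is as repeated random splits of the sample data set. The sketch below assumes the ratio is given as the fraction of data assigned to the training subset and that each pass shuffles the whole data set; both are assumptions, and the names are hypothetical.

```python
import random


def sample_train_test(dataset, t, train_fraction, seed=0):
    """Split the dataset t times into (training subset, test subset) pairs
    according to the given training fraction."""
    rng = random.Random(seed)
    n_train = int(len(dataset) * train_fraction)
    splits = []
    for _ in range(t):
        shuffled = dataset[:]  # Copy so the original order is untouched.
        rng.shuffle(shuffled)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits


splits = sample_train_test(list(range(10)), t=3, train_fraction=0.8)
for train, test in splits:
    print(len(train), len(test))
```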
In one embodiment, the prior category distribution is represented by a probability interval. When determining the prior category distribution according to the category ratio for each data sampling pass, the determination unit 63 performs the following steps:
computing the mean and variance over the T data sampling passes according to the category ratio for each pass;
performing a weighted difference operation on the mean and the variance and taking the result as the minimum of the probability interval, and performing a weighted sum operation on the mean and the variance and taking the result as the maximum of the probability interval.
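A minimal sketch of the interval computation follows. The weight of 3 on the variance term is an assumption inferred from the application scenario, where a mean of 3.01 and a variance of 0.05 give [2.86, 3.16]; the text itself does not state the weights. The use of the population variance and all names are also hypothetical.

```python
import statistics


def interval_from_stats(mean, variance, weight=3.0):
    """Weighted difference gives the minimum, weighted sum the maximum."""
    return mean - weight * variance, mean + weight * variance


def prior_interval(ratios, weight=3.0):
    """Compute the mean and variance of the per-pass category ratios and
    turn them into a (minimum, maximum) interval."""
    mean = statistics.mean(ratios)
    variance = statistics.pvariance(ratios)  # Assumption: population variance.
    return interval_from_stats(mean, variance, weight)


low, high = interval_from_stats(3.01, 0.05)
print(round(low, 2), round(high, 2))
```

With these inputs the sketch reproduces the interval quoted in the scenario; a predicted ratio outside the interval indicates that the predicted category distribution does not satisfy the prior.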
In one embodiment, when correcting the prediction results of the M pieces of data to be inspected, the correction unit 64 performs the following steps:
displaying a modification interface, where the modification interface displays the M pieces of data to be inspected and their prediction results;
receiving correction information for any of the M pieces of data to be inspected, and using the correction information to correct the prediction result of that piece of data.
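Leaving the interface aside, the correction itself amounts to overwriting the predicted category of selected items. The sketch below assumes the correction information arrives as a mapping from item index to new category and that only the M flagged items may be corrected; the data structures and names are hypothetical.

```python
def apply_corrections(predicted_categories, flagged_indices, corrections):
    """Return a copy of the predicted categories in which the flagged items
    named in `corrections` carry their corrected category."""
    corrected = list(predicted_categories)
    for index, new_category in corrections.items():
        if index not in flagged_indices:
            raise ValueError(f"item {index} was not flagged for correction")
        corrected[index] = new_category
    return corrected


predictions = ["first", "second", "first"]
# Items 0 and 1 were flagged; the user changes item 0 to the second category.
fixed = apply_corrections(predictions, {0, 1}, {0: "second"})
print(fixed)
```

Items that were flagged but received no correction keep their original predicted category.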
The data processing device provided in the present application uses the prior category distribution, determined statistically from the category labels of the sample data in the sample data set, as the criterion for identifying data to be inspected whose prediction results from the quality inspection model do not satisfy the prior category distribution. It corrects the prediction results of that data so that the corrected prediction results satisfy the prior category distribution, which improves the accuracy of the prediction results of the data to be inspected.
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of an embodiment of the electronic device provided in the present application. The electronic device 70 includes a processor 71, an input interface 72, an output interface 73, and a computer storage medium 74. The processor 71 is coupled to the input interface 72, the output interface 73, and the computer storage medium 74. The input interface 72 can be connected to an external device to receive data input by the external device. The output interface 73 can be connected to an external device to output data to the external device.
A computer program is stored in the computer storage medium 74, and the processor 71 is configured to execute the computer program to implement the following method:
acquiring N pieces of data to be inspected, where N is a positive integer greater than or equal to two; inputting the N pieces of data to be inspected into the quality inspection model for category prediction to obtain a prediction result for each piece; determining the predicted category distribution of the N pieces of data to be inspected based on the prediction result of each piece; if the predicted category distribution does not satisfy the prior category distribution, determining M pieces of data to be inspected from the N pieces based on the prediction result of each piece, and correcting the prediction results of the M pieces; where the prior category distribution is determined statistically from the category labels of the sample data in the sample data set, M is a positive integer, and M is less than or equal to N.
The present application further provides a computer-readable storage medium in which a computer program is stored. When executed by the processor 71, the computer program implements the following method:
acquiring N pieces of data to be inspected, where N is a positive integer greater than or equal to two; inputting the N pieces of data to be inspected into the quality inspection model for category prediction to obtain a prediction result for each piece; determining the predicted category distribution of the N pieces of data to be inspected based on the prediction result of each piece; if the predicted category distribution does not satisfy the prior category distribution, determining M pieces of data to be inspected from the N pieces based on the prediction result of each piece, and correcting the prediction results of the M pieces; where the prior category distribution is determined statistically from the category labels of the sample data in the sample data set, M is a positive integer, and M is less than or equal to N.
In one embodiment, the prediction result of each piece of data to be inspected includes a probability value that the piece belongs to each of W categories. When determining the predicted category distribution of the N pieces of data to be inspected based on the prediction result of each piece, the processor 71 performs the following steps:
determining the category with the largest probability value in the prediction result of each piece of data to be inspected as the predicted category of that piece;
counting, according to the predicted category of each piece of data to be inspected, the number of pieces belonging to each category, to obtain the predicted category distribution of the N pieces of data to be inspected.
In one embodiment, when determining M pieces of data to be inspected from the N pieces based on the prediction result of each piece, the processor 71 performs the following steps:
determining a difference value using the predicted category distribution and the prior category distribution;
determining, using the difference value and the number N of pieces of data to be inspected, the number M of pieces of data to be inspected with abnormal prediction results;
determining M pieces of data to be inspected from the N pieces of data to be inspected.
In one embodiment, when determining M pieces of data to be inspected from the N pieces of data to be inspected, the processor 71 performs the following steps:
determining the entropy of each piece of data to be inspected using the probability value of each category in the prediction result of that piece;
determining M pieces of data to be inspected from the N pieces according to the entropy of each piece of data to be inspected.
In one embodiment, the processor 71 is further configured to:
perform data sampling on the sample data set T times to obtain T sample subsets;
count the number of pieces of sample data under each category in each sample subset;
determine the category ratio for each data sampling pass based on the number of pieces of sample data under each category in each sample subset;
determine the prior category distribution according to the category ratio for each data sampling pass.
In one embodiment, each sample subset includes a training subset and a test subset. When performing data sampling on the sample data set T times to obtain T sample subsets, the processor 71 performs the following steps:
determining the ratio between the numbers of pieces of sample data in the training subset and the test subset for each data sampling pass;
performing data sampling on the sample data set based on that ratio to obtain T training subsets and T test subsets.
In one embodiment, the prior category distribution is represented by a probability interval. When determining the prior category distribution according to the category ratio for each data sampling pass, the processor 71 performs the following steps:
computing the mean and variance over the T data sampling passes according to the category ratio for each pass;
performing a weighted difference operation on the mean and the variance and taking the result as the minimum of the probability interval, and performing a weighted sum operation on the mean and the variance and taking the result as the maximum of the probability interval.
In one embodiment, when correcting the prediction results of the M pieces of data to be inspected, the processor 71 performs the following steps:
displaying a modification interface, where the modification interface displays the M pieces of data to be inspected and their prediction results;
receiving correction information for any of the M pieces of data to be inspected, and using the correction information to correct the prediction result of that piece of data.
In summary, in the data processing method, device, electronic device, and computer-readable storage medium provided in the present application, the method uses the prior category distribution, determined statistically from the category labels of the sample data in the sample data set, as the criterion for identifying data to be inspected whose prediction results from the quality inspection model do not satisfy the prior category distribution. The prediction results of that data are corrected so that the corrected prediction results satisfy the prior category distribution, which improves the accuracy of the prediction results of the data to be inspected.
In the several embodiments provided in the present application, it should be understood that the disclosed method and device may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into circuits or units is only a division by logical function, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above are only embodiments of the present application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present application.
Claims (11)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210476041.2A CN115083442B (en) | 2022-04-29 | 2022-04-29 | Data processing method, device, electronic device, and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210476041.2A CN115083442B (en) | 2022-04-29 | 2022-04-29 | Data processing method, device, electronic device, and computer-readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115083442A (en) | 2022-09-20 |
CN115083442B (en) | 2023-08-08 |
Family
ID=83247515
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210476041.2A Active CN115083442B (en) | 2022-04-29 | 2022-04-29 | Data processing method, device, electronic device, and computer-readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115083442B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2016192204A (en) * | 2015-03-30 | 2016-11-10 | 日本電気株式会社 | Data model generation method and system for relational data |
CN108334575A (en) * | 2018-01-23 | 2018-07-27 | 北京三快在线科技有限公司 | A kind of recommendation results sequence modification method and device, electronic equipment |
CN109376786A (en) * | 2018-10-31 | 2019-02-22 | 中国科学院深圳先进技术研究院 | An image classification method, apparatus, terminal device and readable storage medium |
CN110222718A (en) * | 2019-05-09 | 2019-09-10 | 华为技术有限公司 | The method and device of image procossing |
CN111563721A (en) * | 2020-04-21 | 2020-08-21 | 上海爱数信息技术股份有限公司 | Mail classification method suitable for different label distribution occasions |
CN111598153A (en) * | 2020-05-13 | 2020-08-28 | 腾讯科技(深圳)有限公司 | Data clustering processing method and device, computer equipment and storage medium |
CN112149758A (en) * | 2020-10-24 | 2020-12-29 | 中国人民解放军国防科技大学 | A hyperspectral open set classification method based on Euclidean distance and deep learning |
CN114298154A (en) * | 2021-12-01 | 2022-04-08 | 上海高德威智能交通系统有限公司 | Active learning method, device, electronic device and readable storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100426382B1 (en) * | 2000-08-23 | 2004-04-08 | 학교법인 김포대학 | Method for re-adjusting ranking document based cluster depending on entropy information and Bayesian SOM(Self Organizing feature Map) |
US8180717B2 (en) * | 2007-03-20 | 2012-05-15 | President And Fellows Of Harvard College | System for estimating a distribution of message content categories in source data |
Non-Patent Citations (1)
Title |
---|
薛薇 等.Python机器学习 数据建模与分析.机械工业出版社,2021,全文. * |
Also Published As
Publication number | Publication date |
---|---|
CN115083442A (en) | 2022-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111209977B (en) | Classification model training and using method, device, equipment and medium | |
WO2020082734A1 (en) | Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium | |
CN113240510B (en) | Abnormal user prediction method, device, equipment and storage medium | |
CN114118287B (en) | Sample generation method, device, electronic device and storage medium | |
CN113947336A (en) | Method, device, storage medium and computer equipment for evaluating risks of bidding enterprises | |
CN108228684B (en) | Method and device for training clustering model, electronic equipment and computer storage medium | |
CN112131322A (en) | Time series classification method and device | |
CN112949973A (en) | AI-combined robot process automation RPA process generation method | |
CN113643260A (en) | Method, apparatus, apparatus, medium and product for detecting image quality | |
CN110826327A (en) | Emotion analysis method and device, computer readable medium and electronic equipment | |
CN114925674A (en) | File compliance checking method and device, electronic equipment and storage medium | |
CN117611164A (en) | Risk statement identification method and device, electronic equipment and storage medium | |
CN116228301A (en) | Method, device, equipment and medium for determining target user | |
WO2023060954A1 (en) | Data processing method and apparatus, data quality inspection method and apparatus, and readable storage medium | |
CN119025877A (en) | Model performance evaluation method, device, electronic device and storage medium | |
CN119558862A (en) | Customer service evaluation method, device, medium and equipment based on improved NLP model | |
CN115083442B (en) | Data processing method, device, electronic device, and computer-readable storage medium | |
CN117574146B (en) | Text classification labeling method, device, electronic equipment and storage medium | |
CN113391989A (en) | Program evaluation method, device, equipment, medium and program product | |
CN116340831B (en) | Information classification method and device, electronic equipment and storage medium | |
CN117649115A (en) | Risk assessment method and device, electronic equipment and storage medium | |
CN117474669A (en) | Loan overdue prediction method, device, equipment and storage medium | |
CN117609467A (en) | Work order question and answer data processing method and device, electronic equipment and storage medium | |
CN116468479A (en) | Method for determining page quality evaluation dimension, and page quality evaluation method and device | |
CN115099934A (en) | High-latency customer identification method, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
OL01 | Intention to license declared | ||