CN116502003A - Content audit model training method, device, equipment and storage medium

Info

Publication number: CN116502003A
Application number: CN202210057477.8A
Authority: CN
Inventors: 胡传锐
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2022-01-18
Filing date: 2022-01-18
Publication date: 2023-07-28

Abstract

The invention belongs to the field of computer technology and discloses a content review model training method, device, equipment and storage medium. The invention obtains an unlabeled training sample set and a manually labeled sample set, where the number of samples in the unlabeled training sample set is greater than the number of samples in the manually labeled sample set; trains an initial self-supervised review model on the unlabeled training sample set to obtain a content review model; and performs secondary training on the content review model according to the manually labeled sample set. Because the initial self-supervised review model is first trained on a large amount of unlabeled data, the resulting content review model generalizes strongly. Secondary training on the manually labeled sample set then improves its review accuracy, so that the final model combines review accuracy with strong generalization and can meet the needs of Internet content review scenarios.

Description

Content audit model training method, device, equipment and storage medium

Technical Field

The present invention relates to the field of computer technology, and in particular to a content review model training method, device, equipment and storage medium.

Background

In Internet content review scenarios, both the files to be reviewed and the review tasks are highly varied. To cope with this, the content review model must generalize sufficiently well. The traditional way to improve a model's generalization is to label a large amount of data and train the model on it. In Internet content review scenarios, however, harmful samples are often very difficult to obtain, so a large amount of data cannot be labeled. How to make the final model generalize well in Internet content review scenarios has therefore become a problem that urgently needs to be solved.

The above content is provided only to assist in understanding the technical solution of the present invention and does not constitute an admission that it is prior art.

Summary of the Invention

The main purpose of the present invention is to provide a content review model training method, device, equipment and storage medium, aiming to solve the technical problem in the prior art that a large number of harmful samples is difficult to obtain in Internet content review scenarios, which results in poor model generalization.

To achieve the above purpose, the present invention provides a content review model training method, the method comprising the following steps:

obtaining an unlabeled training sample set and a manually labeled sample set, wherein the number of samples in the unlabeled training sample set is greater than the number of samples in the manually labeled sample set;

training an initial self-supervised review model on the unlabeled training sample set to obtain a content review model;

performing secondary training on the content review model according to the manually labeled sample set.

Optionally, after the step of performing secondary training on the content review model according to the manually labeled sample set, the method further includes:

using the content review model after secondary training as a preset content review model;

when a file to be reviewed is received, obtaining the file type corresponding to the file to be reviewed;

obtaining a content parsing rule corresponding to the file type;

parsing the file to be reviewed according to the content parsing rule to obtain content to be reviewed;

performing content review on the content to be reviewed through the preset content review model.

Optionally, before the step of using the content review model after secondary training as the preset content review model, the method further includes:

obtaining a model verification sample set and a standard judgment result set corresponding to the model verification sample set;

analyzing the samples in the model verification sample set through the content review model after secondary training to obtain a review judgment result set;

determining a model review accuracy according to the standard judgment result set and the review judgment result set;

if the model review accuracy is greater than or equal to a preset accuracy threshold, performing the step of using the content review model after secondary training as the preset content review model.

Optionally, the step of determining the model review accuracy according to the standard judgment result set and the review judgment result set includes:

if the model review accuracy is less than the preset accuracy threshold, expanding the manually labeled sample set and returning to the step of performing secondary training on the content review model according to the manually labeled sample set.
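The verification gate described above can be sketched as follows. This is only an illustrative sketch: the text does not define how the model review accuracy is computed, so the fraction of matching verdicts is an assumption, and the names `review_accuracy`, `accept_model` and the default threshold of 0.95 are hypothetical.

```python
def review_accuracy(standard_results, review_results):
    """Fraction of verification samples whose model verdict matches the standard verdict.

    Assumption: accuracy = matching verdicts / total verdicts (not defined in the text).
    """
    if not standard_results or len(standard_results) != len(review_results):
        raise ValueError("result sets must be non-empty and of equal length")
    matches = sum(1 for s, r in zip(standard_results, review_results) if s == r)
    return matches / len(standard_results)


def accept_model(standard_results, review_results, threshold=0.95):
    """Return True when the model may become the preset content review model.

    A False result corresponds to the branch in which the manually labeled
    sample set is expanded and secondary training is repeated.
    """
    return review_accuracy(standard_results, review_results) >= threshold
```

A caller would loop on `accept_model`, expanding the manually labeled sample set and repeating secondary training until it returns True.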

Optionally, the step of obtaining the file type corresponding to the file to be reviewed when the file to be reviewed is received includes:

when the file to be reviewed is received, reading the file suffix of the file to be reviewed;

matching the file suffix against the suffix set corresponding to each file type in a preset file type table to determine the file type corresponding to the file to be reviewed.

Optionally, the step of obtaining the content parsing rule corresponding to the file type includes:

looking up the corresponding parsing rules in a preset rule type mapping table according to the file type;

detecting the number of parsing rules found;

if the number of rules is greater than a preset number, obtaining the rule priority corresponding to each parsing rule;

sorting the parsing rules in descending order of rule priority to obtain a sorting result;

using the first-ranked parsing rule in the sorting result as the content parsing rule corresponding to the file type.

Optionally, the step of performing content review on the content to be reviewed through the preset content review model includes:

performing content analysis on the content to be reviewed through the preset content review model to obtain a content analysis result;

extracting a review result identifier from the content analysis result;

determining, according to the review result identifier, whether the file to be reviewed passes content review.

Optionally, the step of determining, according to the review result identifier, whether the file to be reviewed passes content review includes:

looking up the corresponding violation determination result in a preset result identifier mapping table according to the review result identifier;

if the violation determination result is that no violating content exists, determining that the file to be reviewed passes content review.

Optionally, after the step of looking up the corresponding violation determination result in the preset result identifier mapping table according to the review result identifier, the method further includes:

if the violation determination result is that violating content exists, extracting violating content information from the content analysis result;

constructing a violation confirmation report from the violating content information and the file to be reviewed;

presenting the violation confirmation report to a content reviewer, and receiving the violation confirmation result that the content reviewer returns for the violation confirmation report;

if the violation confirmation result confirms the violation, determining that the file to be reviewed does not pass content review.

Optionally, after the step of presenting the violation confirmation report to the content reviewer and receiving the violation confirmation result that the content reviewer returns for the violation confirmation report, the method further includes:

if the violation confirmation result is that the violation was misjudged, determining that the file to be reviewed passes content review;

constructing a positive labeled sample from the file to be reviewed, the violating content information and the violation confirmation result, and adding the positive labeled sample to the manually labeled sample set.
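The decision flow in the preceding optional steps (identifier lookup, human confirmation, and feeding misjudgments back into the manually labeled sample set) can be sketched as below. The dictionary keys, the identifier values 0 and 1, the confirmation strings and the `confirm_with_reviewer` callback are all hypothetical; the text does not specify any of these formats.

```python
# Hypothetical preset result identifier mapping table.
RESULT_ID_MAP = {0: "no_violation", 1: "violation"}


def passes_review(analysis_result, confirm_with_reviewer, labeled_sample_set):
    """Return True if the file passes content review, False otherwise.

    analysis_result: dict with hypothetical keys "result_id", "file", "violation_info".
    confirm_with_reviewer: callable taking the report and returning
        "confirmed" or "misjudged" (stands in for the human reviewer).
    labeled_sample_set: list that receives a new labeled sample on a misjudgment.
    """
    verdict = RESULT_ID_MAP[analysis_result["result_id"]]
    if verdict == "no_violation":
        return True

    # Build the violation confirmation report for the human reviewer.
    report = {
        "file": analysis_result["file"],
        "violation_info": analysis_result["violation_info"],
    }
    confirmation = confirm_with_reviewer(report)
    if confirmation == "confirmed":
        return False

    # Misjudgment: the file passes, and the case becomes a positive labeled sample.
    labeled_sample_set.append({**report, "confirmation": confirmation})
    return True
```

Passing the reviewer as a callback keeps the sketch independent of how the report is actually displayed and answered.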

Optionally, the step of obtaining the unlabeled training sample set and the manually labeled sample set includes:

extracting sample review files from a preset sample file library;

detecting whether each sample review file has a corresponding manual label;

constructing the manually labeled sample set from the sample review files that have a corresponding manual label, and constructing the unlabeled training sample set from the sample review files that do not.

In addition, to achieve the above purpose, the present invention also proposes a content review model training device, the content review model training device comprising the following modules:

a sample acquisition module, configured to obtain an unlabeled training sample set and a manually labeled sample set, wherein the number of samples in the unlabeled training sample set is greater than the number of samples in the manually labeled sample set;

a primary training module, configured to train an initial self-supervised review model on the unlabeled training sample set to obtain a content review model;

a secondary training module, configured to perform secondary training on the content review model according to the manually labeled sample set.

Optionally, the secondary training module is further configured to: use the content review model after secondary training as a preset content review model; when a file to be reviewed is received, obtain the file type corresponding to the file to be reviewed; obtain a content parsing rule corresponding to the file type; parse the file to be reviewed according to the content parsing rule to obtain content to be reviewed; and perform content review on the content to be reviewed through the preset content review model.

Optionally, the secondary training module is further configured to: obtain a model verification sample set and a standard judgment result set corresponding to the model verification sample set; analyze the samples in the model verification sample set through the content review model after secondary training to obtain a review judgment result set; determine a model review accuracy according to the standard judgment result set and the review judgment result set; and, if the model review accuracy is greater than or equal to a preset accuracy threshold, perform the step of using the content review model after secondary training as the preset content review model.

Optionally, the secondary training module is further configured to, if the model review accuracy is less than the preset accuracy threshold, expand the manually labeled sample set and perform secondary training on the content review model according to the manually labeled sample set.

Optionally, the secondary training module is further configured to: when a file to be reviewed is received, read the file suffix of the file to be reviewed; and match the file suffix against the suffix set corresponding to each file type in a preset file type table to determine the file type corresponding to the file to be reviewed.

Optionally, the secondary training module is further configured to: look up the corresponding parsing rules in a preset rule type mapping table according to the file type; detect the number of parsing rules found; if the number of rules is greater than a preset number, obtain the rule priority corresponding to each parsing rule; sort the parsing rules in descending order of rule priority to obtain a sorting result; and use the first-ranked parsing rule in the sorting result as the content parsing rule corresponding to the file type.

Optionally, the secondary training module is further configured to: perform content analysis on the content to be reviewed through the preset content review model to obtain a content analysis result; extract a review result identifier from the content analysis result; and determine, according to the review result identifier, whether the file to be reviewed passes content review.

In addition, to achieve the above purpose, the present invention also proposes model training equipment, the model training equipment comprising: a processor, a memory, and a content review model training program stored in the memory and executable on the processor, wherein the content review model training program, when executed by the processor, implements the steps of the content review model training method described above.

In addition, to achieve the above purpose, the present invention also proposes a computer-readable storage medium on which a content review model training program is stored, wherein the content review model training program, when executed, implements the steps of the content review model training method described above.

The present invention obtains an unlabeled training sample set and a manually labeled sample set, where the number of samples in the unlabeled training sample set is greater than the number of samples in the manually labeled sample set; trains an initial self-supervised review model on the unlabeled training sample set to obtain a content review model; and performs secondary training on the content review model according to the manually labeled sample set. Because the initial self-supervised review model is first trained on a large amount of unlabeled data, the resulting content review model generalizes strongly. Secondary training on the manually labeled sample set then improves its review accuracy, so that the final model combines review accuracy with strong generalization and can meet the needs of Internet content review scenarios.

Brief Description of the Drawings

Fig. 1 is a schematic structural diagram of an electronic device in the hardware operating environment involved in the embodiments of the present invention;

Fig. 2 is a schematic flowchart of a first embodiment of the content review model training method of the present invention;

Fig. 3 is a schematic flowchart of a second embodiment of the content review model training method of the present invention;

Fig. 4 is a schematic flowchart of a third embodiment of the content review model training method of the present invention;

Fig. 5 is a structural block diagram of a first embodiment of the content review model training device of the present invention.

The realization of the purpose, the functional characteristics and the advantages of the present invention will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.

Referring to FIG. 1, FIG. 1 is a schematic structural diagram of the model training equipment in the hardware operating environment involved in the embodiments of the present invention.

As shown in FIG. 1, the electronic device may include a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004 and a memory 1005. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk memory. Optionally, the memory 1005 may also be a storage device independent of the processor 1001.

Those skilled in the art will understand that the structure shown in FIG. 1 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

As shown in FIG. 1, the memory 1005, as a storage medium, may include an operating system, a network communication module, a user interface module and a content review model training program.

In the electronic device shown in FIG. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 and the memory 1005 of the electronic device of the present invention may be provided in the model training equipment; the electronic device calls, through the processor 1001, the content review model training program stored in the memory 1005 and executes the content review model training method provided by the embodiments of the present invention.

An embodiment of the present invention provides a content review model training method. Referring to FIG. 2, FIG. 2 is a schematic flowchart of a first embodiment of the content review model training method of the present invention.

In this embodiment, the content review model training method includes the following steps:

Step S10: Obtain an unlabeled training sample set and a manually labeled sample set, where the number of samples in the unlabeled training sample set is greater than the number of samples in the manually labeled sample set.

It should be noted that the execution subject of this embodiment may be the model training equipment, which may be an electronic device such as a personal computer or a server, or any other device that can implement the same or similar functions; this embodiment does not limit this. In this embodiment and the following embodiments, the content review model training method of the present invention is described using the model training equipment as an example.

It should be noted that the unlabeled training sample set may be a set formed by aggregating a large number of unlabeled training samples, and an unlabeled training sample may be a model training sample that has not been manually labeled. The manually labeled sample set may be a set formed by aggregating a large number of manually labeled samples, and a manually labeled sample may be a model training sample in which the violating content has been labeled by hand. The number of samples in the unlabeled training sample set is far greater than the number of samples in the manually labeled sample set; for example, it may be 100 times the number of samples in the manually labeled sample set, or more.

In a specific implementation, to make it easier to build the sample sets and train the model, a large number of sample files may be collected and stored in a database. In that case, the step of obtaining the unlabeled training sample set and the manually labeled sample set may include: extracting sample review files from a preset sample file library; detecting whether each sample review file has a corresponding manual label; and constructing the manually labeled sample set from the sample review files that have a corresponding manual label and the unlabeled training sample set from those that do not.

It should be noted that the preset sample file library may be a database set up in advance for storing sample files. Manually labeling samples consumes a great deal of labor, and harmful samples are difficult to obtain. So that harmful samples can be reused, all collected sample review files may be stored in the preset sample file library, and each manual label may be stored in association with its sample review file when the labeling is done. When sample sets need to be built, the sample review files can then be extracted directly from the preset sample file library and separated according to whether a corresponding manual label exists, yielding the manually labeled sample set and the unlabeled training sample set.
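As a minimal sketch of the partition just described, the snippet below splits a library of sample review files by whether a manual label is stored with each file. The `SampleFile` record and its field names are hypothetical; the text only states that the label is stored in association with the file.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SampleFile:
    """Hypothetical record in the preset sample file library."""
    path: str
    manual_label: Optional[str] = None  # None means no manual label is stored


def split_samples(library: List[SampleFile]) -> Tuple[List[SampleFile], List[SampleFile]]:
    """Return (unlabeled_training_set, manually_labeled_set)."""
    labeled = [s for s in library if s.manual_label is not None]
    unlabeled = [s for s in library if s.manual_label is None]
    return unlabeled, labeled
```

For example, a library of three files of which one carries a label yields two unlabeled samples and one labeled sample.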

Step S20: Train the initial self-supervised review model on the unlabeled training sample set to obtain a content review model.

It should be noted that the initial self-supervised review model may be a content review model built on a self-supervised algorithm. The initial self-supervised review model is trained on a large amount of unlabeled data using contrastive learning: the model learns similar features across the unlabeled data and constructs pseudo-labels for it, which pulls samples of the same kind closer together and pushes different samples further apart. The resulting content review model therefore has very strong feature extraction capability, that is, very strong generalization.
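To illustrate what pulling same-kind samples together and pushing different samples apart can look like numerically, here is a small pure-Python sketch of an InfoNCE-style contrastive loss. The text does not name a specific loss, so the choice of InfoNCE, the `temperature` parameter and the function names are assumptions; the pseudo-label of sample i is simply "its own second view at index i".

```python
import math


def _normalize(vector):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style loss over two views (e.g. augmentations) of the same batch.

    view_a[i] and view_b[i] are embeddings of the same unlabeled sample;
    all other pairs in the batch act as negatives.
    """
    a = [_normalize(v) for v in view_a]
    b = [_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of the anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature for other in b]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]  # negative log-softmax of the positive pair
    return total / len(a)
```

The loss is small when each sample is closest to its own second view and large when it is closer to a different sample, which is the behavior the paragraph above describes.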

Step S30: Perform secondary training on the content review model according to the manually labeled sample set.

It should be noted that, after the initial self-supervised review model has been trained on the unlabeled sample set, the resulting content review model already generalizes very strongly, but its content review accuracy is still relatively low. Secondary training on the manually labeled sample set can then be used to improve the accuracy of the content review model.

In this embodiment, an unlabeled training sample set and a manually labeled sample set are obtained, where the number of samples in the unlabeled training sample set is greater than the number of samples in the manually labeled sample set; the initial self-supervised review model is trained on the unlabeled training sample set to obtain a content review model; and secondary training is performed on the content review model according to the manually labeled sample set. Because the initial self-supervised review model is first trained on a large amount of unlabeled data, the resulting content review model generalizes strongly. Secondary training on the manually labeled sample set then improves its review accuracy, so that the final model combines review accuracy with strong generalization and can meet the needs of Internet content review scenarios.
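The order of steps S10 to S30 can be summarized in a short skeleton. It is only a sketch: `pretrain` and `fine_tune` are hypothetical callables standing in for whatever self-supervised and supervised training routines are actually used, and the size check simply mirrors the condition stated in step S10.

```python
def train_content_review_model(initial_model, unlabeled_set, labeled_set, pretrain, fine_tune):
    """Run the two training stages in the order described in steps S10 to S30.

    pretrain(model, unlabeled_set)  -> content review model (self-supervised stage)
    fine_tune(model, labeled_set)   -> model after secondary training
    Both callables are placeholders supplied by the caller.
    """
    # Step S10 condition: the unlabeled set must be larger than the labeled set.
    if len(unlabeled_set) <= len(labeled_set):
        raise ValueError("the unlabeled set must be larger than the manually labeled set")
    content_review_model = pretrain(initial_model, unlabeled_set)  # step S20
    return fine_tune(content_review_model, labeled_set)            # step S30
```

Keeping the two stages as injected callables makes the ordering explicit without committing to any particular training framework.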

Referring to FIG. 3, FIG. 3 is a schematic flowchart of a second embodiment of the content review model training method of the present invention.

Based on the first embodiment above, after step S30 the content review model training method of this embodiment further includes:

Step S40: Use the content review model after secondary training as the preset content review model.

It should be noted that secondary training of the content review model improves the accuracy of its content review while retaining its generalization. The content review model after secondary training can therefore be used as the preset content review model for reviewing files to be reviewed.

Step S50: When a file to be reviewed is received, obtain the file type corresponding to the file to be reviewed.

It should be noted that, in the field of Internet content review, the files to be reviewed are highly varied, and different file types are parsed in completely different ways. When a file to be reviewed is received, its file type therefore needs to be obtained first.

In a specific implementation, to determine the file type accurately, step S50 of this embodiment may include: when the file to be reviewed is received, reading the file suffix of the file to be reviewed; and matching the file suffix against the suffix set corresponding to each file type in a preset file type table to determine the file type corresponding to the file to be reviewed.

It should be noted that different file suffixes correspond to different file types. For example, a suffix of txt indicates a text file, and a suffix of jpeg indicates an image file.

In actual use, files with many different suffixes can be grouped into file types according to the content they carry, and one file type can correspond to multiple file suffixes. For example, the image type can correspond to suffixes such as jpeg, png, bmp and tif. The preset file type table may contain the mapping between file types and file suffixes, and this mapping may be preset by the manager of the model training equipment. The suffix set corresponding to a file type may be the set formed by aggregating the file suffixes that correspond to that file type.

It can be understood that the file type of the file to be reviewed can be determined quickly by obtaining its file suffix and comparing that suffix with the file suffixes in the suffix set of each file type.
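The suffix matching described above can be sketched as a simple table lookup. The table contents below reuse the txt and jpeg/png/bmp/tif examples from the text; the type names "text" and "image", the case-insensitive comparison and the `None` return for unknown suffixes are assumptions.

```python
from pathlib import PurePath

# Hypothetical preset file type table: file type -> suffix set.
FILE_TYPE_TABLE = {
    "text": {"txt"},
    "image": {"jpeg", "png", "bmp", "tif"},
}


def detect_file_type(filename, table=FILE_TYPE_TABLE):
    """Return the file type whose suffix set contains the file's suffix, or None."""
    # PurePath.suffix yields e.g. ".JPEG"; strip the dot and compare in lowercase.
    suffix = PurePath(filename).suffix.lstrip(".").lower()
    for file_type, suffixes in table.items():
        if suffix in suffixes:
            return file_type
    return None
```

A file with an unlisted suffix returns `None` here, leaving it to the caller to decide how unknown types are handled.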

步骤S60：获取所述文件类型对应的内容解析规则。Step S60: Obtain the content parsing rules corresponding to the file type.

需要说明的是，为了应对多种多样的需要审核的文件，可以对不同文件类型的文件设置不同的内容解析规则，获取文件类型对应的内容解析规则可以是查找预先为该文件类型设置的内容解析规则。It should be noted that, in order to deal with a variety of files that need to be reviewed, different content analysis rules can be set for files of different file types, and the content analysis rules corresponding to the file type can be obtained by searching for the content analysis rules set in advance for the file type. rule.

In actual use, as technology develops, the ways of parsing each file may keep changing, and several different parsing methods may exist for files of the same file type. In order to obtain the most suitable parsing rule for parsing the file to be reviewed, step S60 in this embodiment may include:

需要说明的是，预设数量可以由模型训练设备的管理人员预先进行设置，预设数量可以是1。It should be noted that the preset number can be set in advance by a manager of the model training device, and the preset number can be 1.

In actual use, if the number of rules is greater than the preset number, multiple parsing rules correspond to the file type. In this case, the rule priority of each parsing rule can be obtained, the parsing rules can be sorted by rule priority from high to low, and the parsing rule ranked first in the sorting result can be selected as the content parsing rule corresponding to the file type. The preset rule type mapping table may contain the mapping relationship between each parsing rule and the file types, and this mapping relationship may be set in advance by the administrator of the model training device. The higher the rule priority, the more the parsing rule should be preferred; rule priorities may also be set in advance by the administrator of the model training device.

It can be understood that selecting the content parsing rule in this way ensures that the rule chosen is the one with the highest priority preset by the administrator of the model training device, so that the most suitable parsing rule specified by the user is used to parse the file to be reviewed.
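The priority-based selection described above can be sketched as follows. The `ParsingRule` class, its fields, and the function name are hypothetical; the behavior when the rule count does not exceed the preset number (the first rule found is returned) is also an assumption, since the text only describes the multi-rule case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsingRule:
    name: str
    file_type: str
    priority: int  # assumption: a larger value means the rule should be preferred


PRESET_COUNT = 1  # the text notes that the preset number may be 1


def select_parsing_rule(file_type, rule_table, preset_count=PRESET_COUNT):
    """Pick the content parsing rule for a file type from a rule/type mapping table."""
    rules = [rule for rule in rule_table if rule.file_type == file_type]
    if not rules:
        return None
    if len(rules) > preset_count:
        # Several candidates: sort by priority from high to low, take the first.
        rules = sorted(rules, key=lambda rule: rule.priority, reverse=True)
    return rules[0]
```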

Step S70: Parse the file to be reviewed according to the content parsing rule to obtain the content to be reviewed.

It can be understood that once the content parsing rule has been obtained, the file to be reviewed can be parsed in the manner recorded in the content parsing rule and the content recorded in the file can be read, thereby obtaining the content to be reviewed.

Step S80: Perform content review on the content to be reviewed through the preset content review model.

It should be noted that performing content review on the content to be reviewed through the preset content review model may consist of inputting the content to be reviewed into the preset content review model as a model input parameter for analysis, and then determining, from the content review result output by the preset content review model, whether the file to be reviewed passes the content review.

In actual use, to simplify the model's output, a fixed review result identifier can be used to represent the review result. In that case, step S80 in this embodiment may include:

It should be noted that performing content analysis on the content to be reviewed through the preset content review model and obtaining the content analysis result may consist of inputting the content to be reviewed into the preset content review model as model input data for content analysis, and obtaining the content analysis result returned by the preset content review model when the analysis ends. The content analysis result may consist of a review result identifier and violation content information; extracting the review result identifier from the content analysis result may consist of reading the review result identifier contained in it.

在实际使用中，本实施例所述根据所述审核结果标识确定所述待审核文件是否通过内容审核的步骤，可以包括：In actual use, the step of determining whether the file to be reviewed has passed the content review according to the review result identifier described in this embodiment may include:

It should be noted that the preset result identifier mapping table may include the mapping relationship between review result identifiers and violation determination results. For example, if the review result identifier is 1, the corresponding violation determination result is that violating content exists; if the review result identifier is 0, the corresponding violation determination result is that no violating content exists.

It can be understood that, in the secondary training, the preset content review model is trained with manually labeled samples that contain violating content, so it can identify various files to be reviewed that contain violating content. If the violation determination result is that no violating content exists, the preset content review model has determined that the file to be reviewed contains no violating content, and the file can therefore be determined to have passed the content review.
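The identifier lookup described above can be sketched as follows. The dictionary layout of the content analysis result, the key names, and the verdict strings are assumptions made for illustration; only the 1/0 example values come from the text.

```python
# Hypothetical preset result identifier mapping table, using the example
# values given in the text (1: violating content exists, 0: none exists).
RESULT_ID_TABLE = {1: "violation", 0: "no_violation"}


def lookup_verdict(analysis_result, table=RESULT_ID_TABLE):
    """Extract the review result identifier and map it to a violation verdict."""
    result_id = analysis_result["result_id"]  # assumed key name
    return table[result_id]


def passes_directly(analysis_result, table=RESULT_ID_TABLE):
    """True only when the model reports no violating content."""
    return lookup_verdict(analysis_result, table) == "no_violation"
```

A `False` return here does not by itself mean the file fails the review; as described next, a flagged file may still be sent for manual confirmation.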

Further, some content that looks like a violation may not actually constitute a violation in certain real scenarios, and such content can easily cause the preset content review model to misjudge. To limit the impact of such misjudgments on the user who submitted the file for review, after the step of looking up the corresponding violation determination result in the preset result identifier mapping table according to the review result identifier, this embodiment may further include:

It should be noted that some suspected violating content must be analyzed together with its semantics before it can be determined whether it really is a violation, and such content may lead to misjudgment. To avoid this, when the preset content review model determines that violating content exists, the violation content information can be extracted from the content analysis result, and a violation confirmation report can be built from the violation content information and the file to be reviewed. A content reviewer then reviews the report manually and returns the final violation confirmation result, which determines whether the file to be reviewed passes the content review.

It can be understood that if the violation confirmation result is a confirmed violation, the content reviewer has confirmed that the output of the preset content review model is correct, so the file to be reviewed can be directly determined to have failed the content review. Depending on actual needs, a corresponding manual label can also be generated for the file to be reviewed according to the violation confirmation result, and the manual label and the file can then be stored in association in the preset sample file library.

In a specific implementation, if the violation confirmation result is a misjudged violation, the preset content review model has made a misjudgment, and additional processing can be performed. Therefore, after the step of presenting the violation confirmation report to the content reviewer and receiving the violation confirmation result that the content reviewer returns for the violation confirmation report, this embodiment may further include:

It can be understood that if the violation confirmation result is a misjudged violation, the content reviewer has concluded that the preset content review model made a misjudgment and that the violation content information it identified is not actually a violation. The file to be reviewed can therefore be determined to have passed the content review. A positive labeled sample is then built from the file to be reviewed, the violation content information, and the violation confirmation result, and added to the manually labeled sample set. The preset content review model is further trained on the updated manually labeled sample set, which reduces its probability of misjudgment.
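The manual confirmation flow described above can be sketched as follows. The callback `ask_reviewer`, the strings `"confirmed"` and `"misjudged"`, and the sample layout are hypothetical stand-ins for the reviewer interface and the positive labeled sample; the retraining step itself is left out.

```python
def resolve_flagged_file(file, violation_info, ask_reviewer, labeled_set):
    """Return True if the flagged file passes content review after manual confirmation.

    ask_reviewer is a hypothetical callback that shows the violation
    confirmation report to a content reviewer and returns either
    "confirmed" (real violation) or "misjudged" (model was wrong).
    """
    report = {"file": file, "violation_info": violation_info}
    confirmation = ask_reviewer(report)
    if confirmation == "confirmed":
        return False  # the model output was correct: the file fails review
    if confirmation == "misjudged":
        # Build a positive labeled sample and add it to the manually
        # labeled sample set so it can be used for further training.
        labeled_set.append(
            {"file": file, "violation_info": violation_info, "label": confirmation}
        )
        return True
    raise ValueError(f"unexpected confirmation result: {confirmation!r}")
```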

In this embodiment, the secondarily trained content review model is used as the preset content review model; when a file to be reviewed is received, the file type corresponding to the file is obtained; the content parsing rule corresponding to the file type is obtained; the file to be reviewed is parsed according to the content parsing rule to obtain the content to be reviewed; and content review is performed on the content to be reviewed through the preset content review model. Because the content to be reviewed is first extracted from the file using the content parsing rule that matches its file type, and only then reviewed by the preset content review model, the model does not need to parse the files itself, which reduces the difficulty of building the model.
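The flow of steps S50 to S80 summarized above can be sketched as a single function. All four collaborators are passed in as hypothetical callables, so the sketch shows only the order of the steps and the separation between parsing and the model, not any concrete parser or model.

```python
def review_file(path, detect_type, get_rule, model):
    """Run steps S50 to S80 on one file to be reviewed.

    detect_type, get_rule and model are hypothetical callables supplied
    by the caller; get_rule returns a callable that parses the file.
    """
    file_type = detect_type(path)   # S50: determine the file type
    parse = get_rule(file_type)     # S60: obtain the content parsing rule
    content = parse(path)           # S70: parse the file into content
    return model(content)           # S80: the model only sees parsed content
```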

参考图4，图4为本发明一种内容审核模型训练方法第三实施例的流程示意图。Referring to FIG. 4 , FIG. 4 is a schematic flowchart of a third embodiment of a method for training a content review model according to the present invention.

基于上述第二实施例，本实施例内容审核模型训练方法的所述步骤S40之前，还包括：Based on the above-mentioned second embodiment, before the step S40 of the content audit model training method of this embodiment, it also includes:

步骤S31：获取模型验证样本集及所述模型验证样本集对应的标准判定结果集。Step S31: Obtain a model verification sample set and a standard judgment result set corresponding to the model verification sample set.

需要说明的是，模型验证样本集可以是由大量的模型验证样本聚合而成的集合。模型验证样本可以是用于对二次训练后的内容审核模型进行验证的样本。模型验证样本集对应的标准判定结果集可以是由模型验证样本集中各模型验证样本对应的人工审核结果聚合而成的集合。It should be noted that the model verification sample set may be a set aggregated from a large number of model verification samples. The model verification sample may be a sample used to verify the content review model after secondary training. The standard determination result set corresponding to the model verification sample set may be a set formed by aggregating manual review results corresponding to each model verification sample in the model verification sample set.

步骤S32：通过二次训练后的内容审核模型对所述模型验证样本集中的样本进行分析，获得审核判定结果集。Step S32: Analyzing the samples in the model verification sample set through the content review model trained twice to obtain a review judgment result set.

It should be noted that analyzing the samples in the model verification sample set through the secondarily trained content review model to obtain the review determination result set may consist of inputting the model verification samples in the model verification sample set into the secondarily trained content review model one by one in order, and aggregating the review determination results it outputs, thereby obtaining the review determination result set.

步骤S33：根据所述标准判定结果集与所述审核判定结果集确定模型审核准确率。Step S33: Determine the model audit accuracy rate according to the standard judgment result set and the audit judgment result set.

It should be noted that determining the model review accuracy from the standard determination result set and the review determination result set may consist of obtaining the total number of results in the standard determination result set, comparing the results in the two sets to count the determination results in the review determination result set that agree with the standard determination result set (the correct count), and taking the ratio of the correct count to the total number of results as the model review accuracy.
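The accuracy calculation described above can be sketched as follows. The function name and the assumption that both result sets are ordered sequences aligned sample by sample are illustrative choices.

```python
def review_accuracy(standard_results, review_results):
    """Ratio of review determinations that agree with the standard determinations."""
    if len(standard_results) != len(review_results):
        raise ValueError("the two result sets must cover the same samples")
    total = len(standard_results)
    if total == 0:
        raise ValueError("the model verification sample set is empty")
    # Count positions where the model's determination matches the manual one.
    correct = sum(1 for s, r in zip(standard_results, review_results) if s == r)
    return correct / total
```

For example, with standard determinations `[1, 0, 1, 1]` and review determinations `[1, 0, 0, 1]`, three of the four results agree, so the model review accuracy is 0.75.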

步骤S34：若所述模型审核准确率大于或等于预设准确率阈值，则执行所述将二次训练后的内容审核模型作为预设内容审核模型的步骤。Step S34: If the model review accuracy rate is greater than or equal to the preset accuracy rate threshold, execute the step of using the content review model after the secondary training as the preset content review model.

需要说明的是，预设准确率阈值可以由模型审核设备的管理人员预先进行设置。若模型审核准确率大于或等于预设准确率阈值，则表示该模型的审核准确率已经达标，可以尝试投入实际使用，因此，可以将二次训练后的内容审核模型作为预设内容审核模型。It should be noted that the preset accuracy rate threshold may be preset by a manager of the model verification device. If the model review accuracy rate is greater than or equal to the preset accuracy rate threshold, it means that the model review accuracy rate has reached the standard and can be put into practical use. Therefore, the content review model after the second training can be used as the preset content review model.

In a specific implementation, if the model review accuracy is less than the preset accuracy threshold, the review accuracy of the content review model obtained after the secondary training does not meet the standard. A likely cause is that the manually labeled sample set contains too few samples. The manually labeled sample set can therefore be expanded, and the method returns to the step of performing secondary training on the content review model according to the manually labeled sample set, so that the content review model is trained further.
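The train, verify, and expand cycle described above can be sketched as a loop. The callables `fine_tune`, `evaluate`, and `expand` are hypothetical placeholders for the secondary training, the accuracy verification, and the sample set expansion; the `max_rounds` cap is an added assumption so the sketch always terminates, and is not part of the described method.

```python
def train_until_accurate(model, labeled_set, fine_tune, evaluate, expand,
                         threshold, max_rounds=5):
    """Repeat secondary training until the review accuracy reaches the threshold.

    Returns the accepted model, or None if max_rounds is exhausted.
    """
    for _ in range(max_rounds):
        model = fine_tune(model, labeled_set)   # secondary training
        if evaluate(model) >= threshold:        # steps S31 to S34
            return model                        # use as the preset model
        labeled_set = expand(labeled_set)       # too few labeled samples
    return None
```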

In this embodiment, a model verification sample set and the standard determination result set corresponding to it are obtained; the samples in the model verification sample set are analyzed by the secondarily trained content review model to obtain a review determination result set; the model review accuracy is determined from the standard determination result set and the review determination result set; and if the model review accuracy is greater than or equal to the preset accuracy threshold, the step of using the secondarily trained content review model as the preset content review model is executed. Because the review accuracy of the secondarily trained content review model is verified against the model verification sample set and its standard determination result set before the model is put into actual use, the trained model is deployed only when its review accuracy meets the standard, which safeguards its effectiveness in actual use.

In addition, an embodiment of the present invention further provides a storage medium on which a content review model training program is stored; when the content review model training program is executed by a processor, the steps of the content review model training method described above are implemented.

参照图5，图5为本发明内容审核模型训练装置第一实施例的结构框图。Referring to FIG. 5 , FIG. 5 is a structural block diagram of the first embodiment of the content review model training device of the present invention.

如图5所示，本发明实施例提出的内容审核模型训练装置包括：As shown in Figure 5, the content audit model training device proposed by the embodiment of the present invention includes:

样本获取模块10，用于获取无标签训练样本集及人工标记样本集，所述无标签训练样本集中的样本数量大于所述人工标记样本集中的样本数量。The sample acquisition module 10 is configured to acquire an unlabeled training sample set and a manually labeled sample set, where the number of samples in the unlabeled training sample set is greater than the number of samples in the manually labeled sample set.

The primary training module 20 is configured to train the initial self-supervised review model with the unlabeled training sample set to obtain the content review model;

二次训练模块30，用于根据所述人工标记样本集对所述内容审核模型进行二次训练。The secondary training module 30 is configured to perform secondary training on the content review model according to the manually marked sample set.

In this embodiment, an unlabeled training sample set and a manually labeled sample set are obtained, where the number of samples in the unlabeled training sample set is greater than the number of samples in the manually labeled sample set; the initial self-supervised review model is trained with the unlabeled training sample set to obtain the content review model; and the content review model undergoes secondary training according to the manually labeled sample set. Because a large amount of unlabeled data is first used to train the initial self-supervised review model, the resulting content review model generalizes well; the secondary training on the manually labeled sample set then improves its review accuracy. The final model therefore combines review accuracy with strong generalization and can meet the needs of Internet content review scenarios.

Further, the secondary training module 30 is also configured to use the secondarily trained content review model as the preset content review model; when a file to be reviewed is received, obtain the file type corresponding to the file to be reviewed; obtain the content parsing rule corresponding to the file type; parse the file to be reviewed according to the content parsing rule to obtain the content to be reviewed; and perform content review on the content to be reviewed through the preset content review model.

Further, the secondary training module 30 is also configured to obtain a model verification sample set and the standard determination result set corresponding to the model verification sample set; analyze the samples in the model verification sample set through the secondarily trained content review model to obtain a review determination result set; determine the model review accuracy according to the standard determination result set and the review determination result set; and, if the model review accuracy is greater than or equal to the preset accuracy threshold, execute the step of using the secondarily trained content review model as the preset content review model.

Further, the secondary training module 30 is also configured to, if the model review accuracy is less than the preset accuracy threshold, expand the manually labeled sample set and perform secondary training on the content review model according to the manually labeled sample set.

Further, the secondary training module 30 is also configured to read the file suffix of the file to be reviewed when the file to be reviewed is received, and to match the file suffix against the suffix set corresponding to each file type in the preset file type table to determine the file type corresponding to the file to be reviewed.

Further, the secondary training module 30 is also configured to look up the corresponding parsing rules in the preset rule type mapping table according to the file type; detect the number of parsing rules found; if the number of rules is greater than the preset number, obtain the rule priority corresponding to each parsing rule; sort the parsing rules by rule priority from high to low to obtain a sorting result; and use the parsing rule ranked first in the sorting result as the content parsing rule corresponding to the file type.

Further, the secondary training module 30 is also configured to perform content analysis on the content to be reviewed through the preset content review model to obtain a content analysis result; extract the review result identifier from the content analysis result; and determine, according to the review result identifier, whether the file to be reviewed passes the content review.

Further, the secondary training module 30 is also configured to look up the corresponding violation determination result in the preset result identifier mapping table according to the review result identifier, and, if the violation determination result is that no violating content exists, determine that the file to be reviewed passes the content review.

Further, the secondary training module 30 is also configured to, if the violation determination result is that violating content exists, extract the violation content information from the content analysis result; build a violation confirmation report from the violation content information and the file to be reviewed; present the violation confirmation report to a content reviewer and receive the violation confirmation result that the content reviewer returns for the violation confirmation report; and, if the violation confirmation result is a confirmed violation, determine that the file to be reviewed does not pass the content review.

Further, the secondary training module 30 is also configured to, if the violation confirmation result is a misjudged violation, determine that the file to be reviewed passes the content review; build a positive labeled sample from the file to be reviewed, the violation content information, and the violation confirmation result; and add the positive labeled sample to the manually labeled sample set.

Further, the sample acquisition module 10 is also configured to extract sample review files from the preset sample file library; detect whether each sample review file has a corresponding manual label; build the manually labeled sample set from the sample review files that have a corresponding manual label; and build the unlabeled training sample set from the sample review files that do not.
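The way the sample acquisition module splits the sample file library can be sketched as follows. Representing each sample review file as a dictionary with an optional `manual_label` key is an assumption made for illustration; the source does not specify how manual labels are stored.

```python
def split_sample_library(sample_files):
    """Split sample review files into (unlabeled set, manually labeled set).

    Each sample is assumed to be a dict; a missing or None "manual_label"
    means the file has no corresponding manual label.
    """
    unlabeled, labeled = [], []
    for sample in sample_files:
        if sample.get("manual_label") is not None:
            labeled.append(sample)
        else:
            unlabeled.append(sample)
    return unlabeled, labeled
```

In the described method the unlabeled set is expected to be the larger of the two, since manually labeled harmful samples are hard to obtain.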

应当理解的是，以上仅为举例说明，对本发明的技术方案并不构成任何限定，在具体应用中，本领域的技术人员可以根据需要进行设置，本发明对此不做限制。It should be understood that the above is only an example, and does not constitute any limitation to the technical solution of the present invention. In specific applications, those skilled in the art can make settings according to needs, and the present invention is not limited thereto.

需要说明的是，以上所描述的工作流程仅仅是示意性的，并不对本发明的保护范围构成限定，在实际应用中，本领域的技术人员可以根据实际的需要选择其中的部分或者全部来实现本实施例方案的目的，此处不做限制。It should be noted that the workflow described above is only illustrative and does not limit the protection scope of the present invention. In practical applications, those skilled in the art can select part or all of them to implement according to actual needs. The purpose of the scheme of this embodiment is not limited here.

另外，未在本实施例中详尽描述的技术细节，可参见本发明任意实施例所提供的内容审核模型训练方法，此处不再赘述。In addition, for technical details that are not exhaustively described in this embodiment, refer to the method for training a content review model provided by any embodiment of the present invention, which will not be repeated here.

此外，需要说明的是，在本文中，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者系统不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者系统所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括该要素的过程、方法、物品或者系统中还存在另外的相同要素。Furthermore, it should be noted that in this document, the term "comprises", "comprises" or any other variation thereof is intended to cover a non-exclusive inclusion such that a process, method, article or system comprising a set of elements includes not only those elements, but also other elements not expressly listed, or elements inherent in such a process, method, article, or system. Without further limitations, an element defined by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article or system comprising that element.

上述本发明实施例序号仅仅为了描述，不代表实施例的优劣。The serial numbers of the above embodiments of the present invention are for description only, and do not represent the advantages and disadvantages of the embodiments.

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质(如只读存储器(Read Only Memory，ROM)/RAM、磁碟、光盘)中，包括若干指令用以使得一台终端设备(可以是手机，计算机，服务器，或者网络设备等)执行本发明各个实施例所述的方法。Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is better implementation. Based on such an understanding, the essence of the technical solution of the present invention or the part that contributes to the prior art can be embodied in the form of a software product, and the computer software product is stored in a storage medium (such as a read-only memory (Read Only Memory) , ROM)/RAM, magnetic disk, optical disk), including several instructions to make a terminal device (which can be a mobile phone, computer, server, or network device, etc.) execute the methods described in various embodiments of the present invention.

以上仅为本发明的优选实施例，并非因此限制本发明的专利范围，凡是利用本发明说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本发明的专利保护范围内。The above are only preferred embodiments of the present invention, and are not intended to limit the patent scope of the present invention. Any equivalent structure or equivalent process conversion made by using the description of the present invention and the contents of the accompanying drawings, or directly or indirectly used in other related technical fields , are all included in the scope of patent protection of the present invention in the same way.

本发明还公开了A1、一种内容审核模型训练方法，所述内容审核模型训练方法包括以下步骤：The present invention also discloses A1, a content audit model training method, the content audit model training method includes the following steps:

A2、如A1所述的内容审核模型训练方法，所述根据所述人工标记样本集对所述内容审核模型进行二次训练的步骤之后，还包括：A2. The content audit model training method as described in A1, after the step of performing secondary training on the content audit model according to the manually marked sample set, further comprising:

A3、如A2所述的内容审核模型训练方法，所述将二次训练后的内容审核模型作为预设内容审核模型的步骤之前，还包括：A3. The content audit model training method as described in A2, before the step of using the content audit model after the secondary training as the preset content audit model, it also includes:

A4、如A3所述的内容审核模型训练方法，所述根据所述标准判定结果集与所述审核判定结果集确定模型审核准确率的步骤，包括：A4. The content review model training method as described in A3, the step of determining the model review accuracy rate according to the standard judgment result set and the review judgment result set includes:

A5、如A2所述的内容审核模型训练方法，所述在接收到待审核文件时，获取所述待审核文件对应的文件类型的步骤，包括：A5. The content audit model training method as described in A2, the step of obtaining the file type corresponding to the file to be audited when the file to be audited is received includes:

A6、如A2所述的内容审核模型训练方法，所述获取所述文件类型对应的内容解析规则的步骤，包括：A6. The method for training a content review model as described in A2, the step of obtaining the content parsing rules corresponding to the file type includes:

A7、如A2所述的内容审核模型训练方法，所述通过所述预设内容审核模型对所述待审核内容进行内容审核的步骤，包括：A7. The content audit model training method as described in A2, the step of performing content audit on the content to be audited through the preset content audit model, including:

A8、如A7所述的内容审核模型训练方法，所述根据所述审核结果标识确定所述待审核文件是否通过内容审核的步骤，包括：A8. The content audit model training method as described in A7, the step of determining whether the file to be audited has passed the content audit according to the audit result identification includes:

A9、如A8所述的内容审核模型训练方法，所述根据所述审核结果标识在预设结果标识映射表中查找对应的违规判定结果的步骤之后，还包括：A9. The content audit model training method as described in A8, after the step of searching the corresponding violation determination result in the preset result identification mapping table according to the audit result identification, it also includes:

A10、如A9所述的内容审核模型训练方法，所述将所述违规确认报告向内容审核人员进行展示，并接收所述内容审核人员给予所述违规确认报告反馈的违规确认结果的步骤之后，还包括：A10. The content review model training method described in A9, after the step of presenting the violation confirmation report to the content review personnel, and receiving the violation confirmation result fed back by the content review personnel to the violation confirmation report, Also includes:

A11、如A1-A10任一项所述的内容审核模型训练方法，所述获取无标签训练样本集及人工标记样本集的步骤，包括：A11. The content review model training method described in any one of A1-A10, the step of obtaining an unlabeled training sample set and a manually marked sample set includes:

本发明还公开了B12、一种内容审核模型训练装置，所述内容审核模型训练装置包括以下模块：The present invention also discloses B12, a content review model training device, the content review model training device includes the following modules:

B13. The content review model training device according to B12, wherein the secondary training module is further configured to use the secondarily trained content review model as the preset content review model; when a file to be reviewed is received, obtain the file type corresponding to the file to be reviewed; obtain the content parsing rule corresponding to the file type; parse the file to be reviewed according to the content parsing rule to obtain the content to be reviewed; and perform content review on the content to be reviewed through the preset content review model.

B14. The content review model training device according to B13, wherein the secondary training module is further configured to: obtain a model verification sample set and a standard judgment result set corresponding to the model verification sample set; analyze the samples in the model verification sample set through the content review model after secondary training to obtain a review judgment result set; determine a model review accuracy rate according to the standard judgment result set and the review judgment result set; and, if the model review accuracy rate is greater than or equal to a preset accuracy rate threshold, perform the step of using the content review model after secondary training as the preset content review model.

B15. The content review model training device according to B14, wherein the secondary training module is further configured to, if the model review accuracy rate is less than the preset accuracy rate threshold, expand the manually labeled sample set and perform secondary training on the content review model according to the manually labeled sample set.

B16. The content review model training device according to B13, wherein the secondary training module is further configured to: upon receiving a file to be reviewed, read the file suffix of the file to be reviewed; and match the file suffix against the suffix set corresponding to each file type in a preset file type table to determine the file type corresponding to the file to be reviewed.

B17. The content review model training device according to B13, wherein the secondary training module is further configured to: search a preset rule type mapping table for the parsing rules corresponding to the file type; detect the number of parsing rules found; if the number of rules is greater than a preset number, obtain the rule priority corresponding to each parsing rule; sort the parsing rules by rule priority from highest to lowest to obtain a sorting result; and use the parsing rule ranked first in the sorting result as the content parsing rule corresponding to the file type.

B18. The content review model training device according to B13, wherein the secondary training module is further configured to: perform content analysis on the content to be reviewed through the preset content review model to obtain a content analysis result; extract a review result identifier from the content analysis result; and determine whether the file to be reviewed has passed content review according to the review result identifier.

The present invention also discloses C19, a model training device, the model training device comprising: a processor, a memory, and a content review model training program stored in the memory and executable on the processor, wherein the content review model training program, when executed by the processor, implements the steps of the content review model training method described above.

The present invention also discloses D20, a computer-readable storage medium, wherein a content review model training program is stored on the computer-readable storage medium, and the content review model training program, when executed, implements the steps of the content review model training method described above.

Claims

1. A content review model training method, characterized in that the content review model training method comprises the following steps:

obtaining an unlabeled training sample set and a manually labeled sample set, wherein the number of samples in the unlabeled training sample set is greater than the number of samples in the manually labeled sample set;

training an initial self-supervised review model through the unlabeled training sample set to obtain a content review model;

performing secondary training on the content review model according to the manually labeled sample set.
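The two training stages of claim 1 can be sketched as a small orchestration function. This is a minimal illustration under stated assumptions, not the patented implementation: the name `train_content_review_model` and the `pretrain` and `finetune` callables are hypothetical placeholders for whatever self-supervised and supervised training routines are actually used.

```python
from typing import Any, Callable, Sequence, Tuple


def train_content_review_model(
    initial_model: Any,
    unlabeled_samples: Sequence[Any],
    labeled_samples: Sequence[Tuple[Any, Any]],
    pretrain: Callable[[Any, Sequence[Any]], Any],
    finetune: Callable[[Any, Sequence[Tuple[Any, Any]]], Any],
) -> Any:
    """Run self-supervised training first, then secondary training on labels.

    `pretrain` and `finetune` are hypothetical callables supplied by the caller.
    """
    # Claim 1 requires the unlabeled set to be larger than the labeled set.
    if len(unlabeled_samples) <= len(labeled_samples):
        raise ValueError("unlabeled set must be larger than the labeled set")
    # Stage 1: train the initial self-supervised model on unlabeled samples.
    content_review_model = pretrain(initial_model, unlabeled_samples)
    # Stage 2: secondary training on the manually labeled samples.
    return finetune(content_review_model, labeled_samples)
```

Passing the training routines in as arguments keeps the sketch independent of any particular model architecture or framework.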

2. The content review model training method according to claim 1, characterized in that, after the step of performing secondary training on the content review model according to the manually labeled sample set, the method further comprises:

using the content review model after secondary training as a preset content review model;

upon receiving a file to be reviewed, obtaining the file type corresponding to the file to be reviewed;

obtaining the content parsing rule corresponding to the file type;

parsing the file to be reviewed according to the content parsing rule to obtain content to be reviewed;

performing content review on the content to be reviewed through the preset content review model.
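The review flow of claim 2 chains a type lookup, a rule lookup, a parsing step and a model call. A rough sketch follows, in which `review_file` and the three injected callables are hypothetical names, and a parsing rule is assumed to be representable as a function from raw bytes to parsed content:

```python
from typing import Any, Callable


def review_file(
    file_name: str,
    raw_bytes: bytes,
    get_file_type: Callable[[str], str],
    get_parsing_rule: Callable[[str], Callable[[bytes], Any]],
    preset_model: Callable[[Any], Any],
) -> Any:
    """Review one incoming file with the preset content review model."""
    # Obtain the file type corresponding to the file to be reviewed.
    file_type = get_file_type(file_name)
    # Obtain the content parsing rule for that file type.
    parse = get_parsing_rule(file_type)
    # Parse the file to obtain the content to be reviewed.
    content = parse(raw_bytes)
    # Perform content review through the preset model.
    return preset_model(content)
```

For example, a text file could be decoded as UTF-8 by its parsing rule and then passed to a model that returns a review result.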

3. The content review model training method according to claim 2, characterized in that, before the step of using the content review model after secondary training as the preset content review model, the method further comprises:

obtaining a model verification sample set and a standard judgment result set corresponding to the model verification sample set;

analyzing the samples in the model verification sample set through the content review model after secondary training to obtain a review judgment result set;

determining a model review accuracy rate according to the standard judgment result set and the review judgment result set;

if the model review accuracy rate is greater than or equal to a preset accuracy rate threshold, performing the step of using the content review model after secondary training as the preset content review model.

4. The content review model training method according to claim 3, wherein the step of determining the model review accuracy rate according to the standard judgment result set and the review judgment result set comprises:

if the model review accuracy rate is less than the preset accuracy rate threshold, expanding the manually labeled sample set and returning to the step of performing secondary training on the content review model according to the manually labeled sample set.
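The accuracy gate of claims 3 and 4 can be sketched as a loop. The sketch assumes that the model review accuracy rate is the fraction of verification samples whose review judgment matches the standard judgment, which the claims do not spell out. The names `review_accuracy`, `finetune_until_accurate`, `finetune`, `predict` and `expand_labeled` are hypothetical, and `max_rounds` is an added safety bound that is not part of the claims.

```python
from typing import Any, Callable, List, Sequence


def review_accuracy(standard_results: Sequence[Any], review_results: Sequence[Any]) -> float:
    """Fraction of review judgments that match the standard judgments."""
    if not standard_results or len(standard_results) != len(review_results):
        raise ValueError("result sets must be non-empty and of equal length")
    matches = sum(s == r for s, r in zip(standard_results, review_results))
    return matches / len(standard_results)


def finetune_until_accurate(
    model: Any,
    labeled: List[Any],
    validation_samples: Sequence[Any],
    standard_results: Sequence[Any],
    finetune: Callable[[Any, List[Any]], Any],
    predict: Callable[[Any, Any], Any],
    expand_labeled: Callable[[List[Any]], List[Any]],
    threshold: float,
    max_rounds: int = 5,  # assumption: the claims do not bound the loop
) -> Any:
    """Repeat secondary training until the accuracy threshold is met."""
    for _ in range(max_rounds):
        # Secondary training on the current manually labeled set.
        model = finetune(model, labeled)
        # Review the verification samples with the retrained model.
        results = [predict(model, sample) for sample in validation_samples]
        if review_accuracy(standard_results, results) >= threshold:
            return model  # accepted as the preset content review model
        # Below threshold: expand the labeled set and train again.
        labeled = expand_labeled(labeled)
    raise RuntimeError("accuracy threshold not reached within max_rounds")
```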

5. The content review model training method according to claim 2, wherein the step of obtaining the file type corresponding to the file to be reviewed upon receiving the file to be reviewed comprises:

upon receiving the file to be reviewed, reading the file suffix of the file to be reviewed;

matching the file suffix against the suffix set corresponding to each file type in a preset file type table to determine the file type corresponding to the file to be reviewed.
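The suffix matching of claim 5 can be illustrated with a dictionary that maps each file type to its suffix set. The table contents, the name `get_file_type`, and the choice to compare suffixes case-insensitively are assumptions made for this sketch only.

```python
import os
from typing import Dict, Optional, Set

# Hypothetical preset file type table: file type -> set of suffixes.
PRESET_FILE_TYPE_TABLE: Dict[str, Set[str]] = {
    "text": {".txt", ".md"},
    "image": {".jpg", ".jpeg", ".png"},
    "video": {".mp4", ".avi"},
}


def get_file_type(
    file_name: str,
    table: Dict[str, Set[str]] = PRESET_FILE_TYPE_TABLE,
) -> Optional[str]:
    """Return the file type whose suffix set contains the file's suffix."""
    # Read the file suffix; lowercasing is an assumption of this sketch.
    suffix = os.path.splitext(file_name)[1].lower()
    # Match the suffix against the suffix set of each file type.
    for file_type, suffixes in table.items():
        if suffix in suffixes:
            return file_type
    return None  # no matching type in the table
```

Returning `None` for an unknown suffix leaves the handling of unsupported files to the caller.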

6. The content review model training method according to claim 2, wherein the step of obtaining the content parsing rule corresponding to the file type comprises:

searching a preset rule type mapping table for the parsing rules corresponding to the file type;

detecting the number of parsing rules found;

if the number of rules is greater than a preset number, obtaining the rule priority corresponding to each parsing rule;

sorting the parsing rules by rule priority from highest to lowest to obtain a sorting result;

using the parsing rule ranked first in the sorting result as the content parsing rule corresponding to the file type.
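The rule selection of claim 6 amounts to a lookup followed by a priority sort. In the sketch below, the `ParsingRule` type, the function name `select_parsing_rule`, and a preset number of 1 are illustrative assumptions. The claim does not say what happens when the rule count does not exceed the preset number; the sketch assumes the first rule found is used directly.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ParsingRule:
    name: str
    priority: int  # larger value means higher priority


def select_parsing_rule(
    file_type: str,
    rule_type_mapping: Dict[str, List[ParsingRule]],
    preset_count: int = 1,  # assumed value of the "preset number"
) -> Optional[ParsingRule]:
    """Pick the content parsing rule for a file type."""
    # Look up the parsing rules for the file type in the mapping table.
    rules = rule_type_mapping.get(file_type, [])
    if not rules:
        return None
    # More rules than the preset number: sort by priority, highest first.
    if len(rules) > preset_count:
        ranked = sorted(rules, key=lambda rule: rule.priority, reverse=True)
        return ranked[0]
    # Assumption: otherwise use the first rule found.
    return rules[0]
```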

7. The content review model training method according to claim 2, wherein the step of performing content review on the content to be reviewed through the preset content review model comprises:

performing content analysis on the content to be reviewed through the preset content review model to obtain a content analysis result;

extracting a review result identifier from the content analysis result;

determining whether the file to be reviewed has passed content review according to the review result identifier.
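One way to read the last two steps of claim 7, following the mapping-table lookup mentioned in the description, is sketched here. The field name `review_result_id`, the identifier values, and the table contents are all hypothetical; the source does not specify the format of the content analysis result.

```python
from typing import Any, Mapping

# Hypothetical preset result identifier mapping table:
# review result identifier -> violation determination result.
PRESET_RESULT_ID_MAPPING: Mapping[str, str] = {
    "0": "no_violation",
    "1": "violation",
}


def passes_content_review(
    analysis_result: Mapping[str, Any],
    result_id_mapping: Mapping[str, str] = PRESET_RESULT_ID_MAPPING,
) -> bool:
    """Decide whether a file passes review from the model's analysis result."""
    # Extract the review result identifier (field name is an assumption).
    review_result_id = str(analysis_result["review_result_id"])
    # Look up the violation determination result; unknown ids raise KeyError.
    verdict = result_id_mapping[review_result_id]
    return verdict == "no_violation"
```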

8. A content review model training device, characterized in that the content review model training device comprises the following modules:

a sample acquisition module, configured to obtain an unlabeled training sample set and a manually labeled sample set, wherein the number of samples in the unlabeled training sample set is greater than the number of samples in the manually labeled sample set;

a training module, configured to train an initial self-supervised review model through the unlabeled training sample set to obtain a content review model;

a secondary training module, configured to perform secondary training on the content review model according to the manually labeled sample set.

9. A model training device, characterized in that the model training device comprises: a processor, a memory, and a content review model training program stored in the memory and executable on the processor, wherein the content review model training program, when executed by the processor, implements the steps of the content review model training method according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that a content review model training program is stored on the computer-readable storage medium, and the content review model training program, when executed, implements the steps of the content review model training method according to any one of claims 1-7.