CN110019946A - A kind of method and its system identifying harmful video - Google Patents
A kind of method and its system identifying harmful video
- Publication number
- CN110019946A (application CN201711499942.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- image file
- weight factor
- domain name
- address
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/40—Network security protocols
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/02—Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
- H04L63/0227—Filtering policies
- H04L63/0236—Filtering by address, protocol, port number or service, e.g. IP-address or URL
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/30—Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information
- H04L63/306—Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information intercepting packet switched data communications, e.g. Web, Internet or IMS communications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/24—Systems for the transmission of television signals using pulse code modulation
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Technology Law (AREA)
- Computational Linguistics (AREA)
- Library & Information Science (AREA)
- Information Transfer Between Computers (AREA)
Abstract
A method for identifying harmful videos and a system thereof. The method includes: obtaining the URL path of a video, then obtaining the domain name and the IP address from the URL path, and outputting a first weighting factor and a second weighting factor based on queries related to the IP address and the domain name; further obtaining image files of multiple frames of the video and extracting the DC coefficients in the compressed domain of each image file, so that the image file can be recognized after only partial decompression, and outputting a third weighting factor according to the result of recognizing the image files; and combining the first weighting factor, the second weighting factor and the third weighting factor to identify whether the video is a harmful video. Combined with databases built from big data, the present disclosure provides a multi-mode scheme for identifying harmful videos with as little image processing as possible.
Description
Technical Field
The present disclosure belongs to the field of information security and relates, for example, to a method and system for identifying harmful videos.
Background Art
In the information society, streams of information are everywhere, including but not limited to text, video, audio and pictures. Among these, video files usually carry both auditory and visual information and are therefore more comprehensively expressive. However, with the spread of the mobile Internet, the network is flooded with a large amount of harmful video content which, owing to its visual directness and impact, is even more damaging than harmful text, harmful pictures or harmful audio. It is therefore highly necessary to identify these harmful videos and then filter or delete them to eliminate the harm.
Current techniques for identifying harmful videos on the network can be divided into two broad categories. The first is the traditional approach, which itself comprises two kinds of method: (1) recognition methods based on single-modal features, which mainly extract the visual features of the video and build a classifier on those features; in violent-video recognition, for example, common features include motion vectors, color, texture and shape; (2) recognition methods based on multi-modal feature fusion, which extract features from several modalities of the video and fuse them to build a classifier; in violent-video recognition, for example, many methods extract audio features in addition to the video features, including short-time energy and sudden sounds, and some methods also consider the text surrounding the network video and extract further features from that text for fused recognition. The second category is the deep-learning approach: (1) CNN methods use a convolutional neural network to recognize the sensitive, harmful images in a reference library, obtain the internal features of harmful videos, and use the learned harmful-video model to judge whether a given video frame contains harmful information; (2) RNN methods feed the video sequences in the library directly into a recurrent neural network to recognize harmful video information, learn a model of harmful videos, and use the learned model to judge whether a new video is harmful; (3) CNN+RNN methods use a CNN to learn the spatial-domain information of the image frames in the video, use an RNN to capture the temporal-domain information of the video sequence, and finally combine the two for recognition, using the learned model to identify the video.
Existing image-processing techniques mainly comprise two approaches: traditional methods and deep-learning methods. The classic traditional method is the bag-of-words model, which consists of four parts: (1) low-level feature extraction; (2) feature encoding; (3) feature pooling; (4) classification with a suitable classifier. Deep-learning models are another family of image-processing models and mainly include autoencoders, restricted Boltzmann machines, deep belief networks, convolutional neural networks and recurrent neural networks. With the continuous progress of computer hardware and the improvement of databases, the computation involved in traditional methods is simpler than in deep learning, whereas deep-learning methods can learn more meaningful representations and continuously adjust their parameters to the task; for image processing, deep-learning models therefore have a stronger feature-representation capability.
Existing identification methods all fall short in identification efficiency. Given the development of big data and artificial intelligence, how to identify harmful videos efficiently has become a problem that needs to be considered.
Summary of the Invention
The present disclosure provides a method for identifying harmful videos, including:
step a): obtaining the URL path of a video, then obtaining the domain name and the IP address from the URL path; based on the IP address, querying a first database for the presence of that IP address or of IP addresses in the same network segment, and outputting an IP-related first weighting factor according to the result of the IP address query;
step b): based on the domain name, performing a whois query against a second database, and outputting a domain-related second weighting factor according to the whois query result;
step c): obtaining image files of multiple frames of the video and extracting the DC coefficients in the compressed domain of each image file, so that the image file can be recognized after only partial decompression, and outputting a third weighting factor according to the result of recognizing the image files;
step d): combining the first weighting factor, the second weighting factor and the third weighting factor to identify whether the video is a harmful video.
In addition, the present disclosure also discloses a system for identifying harmful videos, including:
a first weighting factor generation module, configured to: obtain the URL path of a video, then obtain the domain name and the IP address from the URL path; based on the IP address, query a first database for the presence of that IP address or of IP addresses in the same network segment, and output an IP-related first weighting factor according to the result of the IP address query;
a second weighting factor generation module, configured to: based on the domain name, perform a whois query against a second database, and output a domain-related second weighting factor according to the whois query result;
a third weighting factor generation module, configured to: obtain image files of multiple frames of the video and extract the DC coefficients in the compressed domain of each image file, so that the image file can be recognized after only partial decompression, and output a third weighting factor according to the result of recognizing the image files;
an identification module, configured to combine the first weighting factor, the second weighting factor and the third weighting factor to identify whether the video is a harmful video.
Through the method and the system, the present disclosure can combine databases built from big data and, with as little image processing as possible, provide a relatively efficient scheme for identifying harmful videos.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the method described in an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of the system described in an embodiment of the present disclosure.
Detailed Description of the Embodiments
In order to enable those skilled in the art to understand the technical solutions disclosed in the present disclosure, the technical solutions of the various embodiments are described below with reference to the embodiments and the related drawings; the described embodiments are some, but not all, of the embodiments of the present disclosure. The terms "first", "second" and the like used in the present disclosure are used to distinguish different objects rather than to describe a specific order. Furthermore, "including" and "having", and any variations thereof, are intended to be inclusive and not exclusive: for example, a process, method, system, product or device that comprises a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to such a process, method, system, product or device.
Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present disclosure. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art will understand that the embodiments described herein may be combined with other embodiments.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a method for identifying harmful videos provided by an embodiment of the present disclosure. As shown in the figure, the method includes:
Step S100: obtaining the URL path of a video, then obtaining the domain name and the IP address from the URL path; based on the IP address, querying a first database for the presence of that IP address or of IP addresses in the same network segment, and outputting an IP-related first weighting factor according to the result of the IP address query.
It can be understood that the first database maintains a list of IP addresses known to have published harmful videos.
For example, when the IP address is 192.168.10.3:
if that IP address is recorded in the first database, the first weighting factor may be, for example, 1.0;
if the database records only 192.168.10.4, then 192.168.10.3 is moderately suspected of being a backup or recently changed address of the website the video belongs to, and the first weighting factor may be, for example, 0.6;
if the database records 192.168.10.4 and 192.168.10.5, or even all IP addresses of the 192.168.10.X segment, then 192.168.10.3 is strongly suspected of being a backup or recently changed address of the website the video belongs to, and the first weighting factor may be, for example, 0.9;
if the IP addresses recorded in the database span several 192.168.X.X segments but not the 192.168.10.X segment, then 192.168.10.3 is only cautiously suspected of being the address of a website hosting harmful videos, and the first weighting factor may be, for example, 0.4.
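By way of illustration only, the lookup in step S100 could be sketched as follows in Python; the set-based database interface (`known_harmful_ips`) and the concrete factor values are assumptions of this sketch rather than limitations of the disclosure.

```python
import ipaddress

def first_weight_factor(ip_str, known_harmful_ips):
    """Return an IP-related weighting factor in [0, 1] (illustrative values, IPv4 assumed)."""
    ip = ipaddress.ip_address(ip_str)
    known = {ipaddress.ip_address(k) for k in known_harmful_ips}

    if ip in known:                       # exact hit in the first database
        return 1.0

    # Same /24 segment (e.g. 192.168.10.x) as one or more known addresses.
    same_24 = {k for k in known
               if ipaddress.ip_network(f"{k}/24", strict=False).network_address
               == ipaddress.ip_network(f"{ip}/24", strict=False).network_address}
    if len(same_24) >= 2:                 # the segment is heavily represented
        return 0.9
    if len(same_24) == 1:                 # a single neighbouring address
        return 0.6

    # Same /16 (e.g. 192.168.x.x) but a different /24 segment.
    if any(ipaddress.ip_network(f"{k}/16", strict=False).network_address
           == ipaddress.ip_network(f"{ip}/16", strict=False).network_address
           for k in known):
        return 0.4
    return 0.0
```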
Step S200: based on the domain name, performing a whois query against a second database, and outputting a domain-related second weighting factor according to the whois query result.
It can be understood that the second database maintains a list of domain names known to have published harmful videos.
The whois query is made to examine the association between domain name registrants and harmful videos. The second database may maintain the following information: domain names, information on registrants who have published large amounts of harmful video on the Internet, and the identifiers of the corresponding harmful videos.
For example, when the domain name is www.a.com:
if the second database records the domain name, the identifiers of the corresponding harmful videos and its whois information, the second weighting factor may be, for example, 1.0;
if the second database records no harmful-video identifier for www.a.com, but the registrant of the domain name can be found, together with the domain names of other websites registered by the same registrant, and the second database contains identifiers of harmful videos that those other websites have published in large numbers on the Internet, then even though no harmful-video identifier is recorded for www.a.com itself, the website corresponding to www.a.com is still strongly suspected of being a source of harmful videos, and the second weighting factor may be, for example, 0.9;
if the second database records no harmful-video identifier for www.a.com, and although the registrant and the domain names of the registrant's other websites can be found, the second database contains no identifier of harmful videos published by those other websites, the second weighting factor may be, for example, 0;
it is easy to understand that if the second database records no harmful-video identifier for www.a.com and no other websites registered by the same registrant can be found, the second weighting factor may likewise be, for example, 0.
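Again purely as a sketch, the domain-side decision of step S200 might be expressed as follows; the record layout of the second database ("harmful_ids", "registrant_other_sites") is a hypothetical structure chosen for illustration, and the factor values mirror the examples above.

```python
def second_weight_factor(domain, whois_db):
    """Return a domain-related weighting factor in [0, 1] (illustrative values).

    whois_db is assumed to map a domain name to a record such as
    {"harmful_ids": [...], "registrant": "...", "registrant_other_sites": {...}},
    where registrant_other_sites maps other domains of the same registrant to
    their known harmful-video identifiers.
    """
    record = whois_db.get(domain)
    if record is None:
        return 0.0                                   # nothing known about the domain

    if record.get("harmful_ids"):                    # direct evidence for this domain
        return 1.0

    other_sites = record.get("registrant_other_sites") or {}
    if any(ids for ids in other_sites.values()):     # the registrant's other sites are flagged
        return 0.9

    return 0.0                                       # registrant known, but no harmful record
```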
Step S300: obtaining image files of multiple frames of the video, extracting the DC coefficients in the compressed domain of each image file so that the image file can be recognized after only partial decompression, and outputting a third weighting factor according to the result of recognizing the image files.
Step S300 obtains the image files from the video and outputs the third weighting factor based on the result of recognizing those image files. If typical harmful video content or other unsafe content is detected, this is reflected in the third weighting factor. It can be understood that, when the number of occurrences of typical harmful video content or other unsafe content meets the corresponding threshold condition, the third weighting factor may be 1.0, 0.8 or 0.4, depending on the specific threshold condition.
In addition, it should be emphasized that, in order to reduce the computing resources and time required by this embodiment, when an image file is recognized, the DC coefficients are first extracted from the compressed domain of the image file, so that only a partial decompression is needed before the image can be recognized. The inventors exploit the property that most of the image information is concentrated in the DC coefficient and the nearby low-frequency spectrum: by means of the DC coefficients the image file can be partially decompressed, and the partially decompressed image information can be used for recognition without using all the information in the complete image file, which reduces the workload. Typically, image files conforming to the JPEG coding standard can all be processed in this way.
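As an illustration of such partial decompression, the sketch below builds a low-resolution luminance thumbnail from the per-block DC coefficients of a baseline JPEG; `read_dct_blocks` is a hypothetical helper standing in for whichever entropy-decoding front end is actually available, since only the DC handling is of interest here.

```python
import numpy as np

def dc_thumbnail(jpeg_path, read_dct_blocks):
    """Build a thumbnail from JPEG DC coefficients without full decompression.

    read_dct_blocks is a hypothetical callable returning the quantised 8x8 DCT
    blocks of the luminance channel as an array of shape (H/8, W/8, 8, 8),
    together with the luminance quantisation table of shape (8, 8).
    """
    blocks, qtable = read_dct_blocks(jpeg_path)

    # The DC coefficient sits at position (0, 0) of each 8x8 block; dequantise it,
    # undo the DCT scaling (the DC term equals 8 times the level-shifted block mean),
    # then shift back to the unsigned 0..255 range used by JPEG.
    dc = blocks[:, :, 0, 0].astype(np.float32) * qtable[0, 0]
    thumb = dc / 8.0 + 128.0

    return np.clip(thumb, 0, 255).astype(np.uint8)   # one pixel per 8x8 block
```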
It can be understood that any technique in the art for recognizing harmful information in image files can be applied to the image files of the video files described in the present disclosure. In step S300, the images may be processed either with traditional methods or with deep-learning models in order to identify harmful videos.
More particularly, in one case, recognizing the image file after partial decompression in step S300 specifically includes: after partially decompressing the image file, comparing its features with the known harmful image files maintained in a third-party image database in order to recognize the image file; when the image file is recognized as harmful, it is added to the third-party image database. The third-party image database is built in advance by crawling the image files of known harmful websites.
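One conceivable realization of this feature comparison, assuming the DC thumbnail from the previous sketch, is a difference hash matched against stored hashes by Hamming distance; the 64-bit hash and the distance threshold of 16 are illustrative choices, not part of the disclosure.

```python
import numpy as np

def dhash(thumb, size=8):
    """64-bit difference hash of a greyscale thumbnail (numpy uint8 array)."""
    rows = np.linspace(0, thumb.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, thumb.shape[1] - 1, size + 1).astype(int)
    small = thumb[np.ix_(rows, cols)].astype(np.int16)   # subsample to size x (size+1)
    bits = (small[:, 1:] > small[:, :-1]).flatten()      # compare horizontal neighbours
    return int("".join("1" if b else "0" for b in bits), 2)

def is_known_harmful(thumb, harmful_hashes, max_distance=16):
    """Compare the frame's hash with a set of known harmful-image hashes."""
    h = dhash(thumb)
    return any(bin(h ^ known).count("1") <= max_distance for known in harmful_hashes)
```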
Step S400: combining the first weighting factor, the second weighting factor and the third weighting factor to identify whether the video is a harmful video.
As an example, let the first weighting factor be x, the second weighting factor be y and the third weighting factor be z, where 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 and 0 ≤ z ≤ 1. A harmfulness coefficient W of the video can then be computed from these weighting factors by the following formula:
W = a×x + b×y + c×z, where a + b + c = 1 and a, b and c denote the weights of the respective weighting factors.
For example, a = b = c = 1/3;
or, for example, a, b and c are unequal and can be adjusted according to the individual weighting factors and the practical circumstances of identifying harmful content.
It can be understood that the closer W is to 1, the higher the probability that the video in question is a harmful video.
The above formula for W is linear; in practical applications a nonlinear formula may also be used.
Furthermore, whether the formula is linear or nonlinear, the formula and its parameters may be determined by training or fitting.
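Putting the factors together, step S400 with the linear formula and equal weights a = b = c = 1/3 could be sketched as below; the decision threshold of 0.5 is an illustrative assumption, not a value fixed by the disclosure.

```python
def harmfulness(x, y, z, a=1/3, b=1/3, c=1/3):
    """Linear fusion W = a*x + b*y + c*z of the three weighting factors."""
    assert abs(a + b + c - 1.0) < 1e-9
    return a * x + b * y + c * z

def is_harmful_video(x, y, z, threshold=0.5):
    """Decide whether the video is treated as harmful (illustrative threshold)."""
    return harmfulness(x, y, z) >= threshold
```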
In summary, in the above embodiment only step S300 performs image processing, while the remaining steps take a different route and obtain the corresponding weighting factors through related queries; step S400 then combines (or, in other words, fuses) the multiple weighting factors to identify harmful videos. Those skilled in the art know that processing and recognizing every frame of a video is very costly in time, whereas queries are comparatively cheap. The above embodiment thus provides an efficient method for identifying harmful videos. In addition, the above embodiment can obviously be further combined with big data and/or artificial intelligence to build and update the first database, the second database and other databases.
In another embodiment, the second database is a third-party database.
Examples are the many websites that perform whois queries, and third-party-maintained databases of lists of websites carrying harmful videos.
In another embodiment, for a video determined to be harmful after identification, the IP address information of the publishers of the harmful video recorded on its source web address (for example a forum or web page) is collected and used to update the first database. The reason is that harmful videos generally attract a number of sticky users, some of whom take part in spreading harmful videos and most of whose IP addresses are relatively fixed; if the source web address itself records the IP address information of the publishers of the harmful video, the present disclosure updates the aforementioned first database by collecting that IP address information.
In another embodiment, step S200 further includes:
further querying the security of the domain name in a third-party domain name security list so as to output a security factor, and correcting the domain-related second weighting factor by means of the security factor.
An example of such a list is virustotal.com, a third-party domain security screening website. It can be understood that if the third-party information indicates that the domain contains viruses or Trojans, the second weighting factor should be increased, the underlying reason being that the website in question is all the more unsafe.
It can be understood that this embodiment focuses on correcting the second weighting factor from the perspective of network security, to prevent users from suffering further losses. Network security concerns users' privacy and property rights: if the website associated with a harmful video also carries hidden network-security risks, then besides the harm of the video itself it also exposes users to privacy leakage or property loss.
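One possible way to fold such a security factor into the second weighting factor, sketched under the assumption that the third-party list yields a score in [0, 1] with 1 meaning flagged for malware, is shown below.

```python
def corrected_second_factor(y, security_factor):
    """Raise the domain-related factor y when the domain is flagged as unsafe.

    security_factor is assumed to be a score in [0, 1] derived from a third-party
    domain security list (for example 1.0 when flagged for viruses or Trojans).
    """
    corrected = y + (1.0 - y) * security_factor   # push y towards 1 as the risk rises
    return min(corrected, 1.0)
```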
In another embodiment, the image files of the multiple frames of the video in step S300 are obtained in a random manner.
For this embodiment, this means randomly selecting pictures from the video, for example selecting the image files of one or more frames from the first third of the playing time, and likewise one or more frames from the middle third and from the last third. Video recognition is usually based on key-frame extraction, which is more time-consuming than random selection; by selecting one or more frames, and in particular multiple frames, at random, this embodiment can therefore save a significant amount of time. Obtaining the image files of multiple frames at random not only saves time significantly but also, to a certain extent, keeps the processing result reasonably reliable.
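A sketch of this random sampling, assuming OpenCV is available for frame access, might read as follows; sampling two frames per third is an arbitrary illustrative choice.

```python
import random
import cv2  # assumed available for frame access

def sample_random_frames(video_path, per_third=2):
    """Randomly pick frames from the first, middle and last third of a video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for start, end in ((0, total // 3), (total // 3, 2 * total // 3),
                       (2 * total // 3, total)):
        for idx in random.sample(range(start, end), min(per_third, max(end - start, 0))):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)   # seek to the sampled frame index
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
    cap.release()
    return frames
```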
In another embodiment, obtaining the image files of the multiple frames of the video in step S300 further includes the following:
Step c1): extracting the audio from the video;
Step c2): recognizing whether the audio contains harmful content and, if so, obtaining the image files of multiple frames within the start and end times of that audio according to those start and end times.
For this embodiment, if harmful content is recognized in the audio, its time is located and, taking the start and end times of the audio as the basis, the image files of multiple frames within those start and end times are obtained. In this way the relevant harmful pictures can be found in a more targeted manner.
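The audio-guided variant could be sketched as follows; `detect_harmful_audio_segments` is a hypothetical callable returning the start and end times (in seconds) of the audio spans recognized as harmful, and three frames per span is an illustrative choice.

```python
import random
import cv2  # assumed available for frame access

def sample_frames_by_audio(video_path, detect_harmful_audio_segments, per_segment=3):
    """Pick frames inside the time spans where the audio was flagged as harmful.

    detect_harmful_audio_segments is a hypothetical callable returning a list of
    (start_seconds, end_seconds) spans for the extracted audio track.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    frames = []
    for start_s, end_s in detect_harmful_audio_segments(video_path):
        lo, hi = int(start_s * fps), int(end_s * fps)
        for idx in random.sample(range(lo, hi), min(per_segment, max(hi - lo, 0))):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
    cap.release()
    return frames
```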
As stated above, when combined with big-data techniques the present disclosure can productively combine multiple dimensions and multiple modes, using IP information, domain name information, image information and audio information to identify harmful videos quickly.
Furthermore, the above embodiments can be implemented on the router side or on the network provider's side, so that the relevant videos are filtered in advance.
Corresponding to the method, and referring to FIG. 2, the present disclosure in another embodiment discloses a system for identifying harmful videos, including:
a first weighting factor generation module, configured to: obtain the URL path of a video, then obtain the domain name and the IP address from the URL path; based on the IP address, query a first database for the presence of that IP address or of IP addresses in the same network segment, and output an IP-related first weighting factor according to the result of the IP address query;
a second weighting factor generation module, configured to: based on the domain name, perform a whois query against a second database, and output a domain-related second weighting factor according to the whois query result;
a third weighting factor generation module, configured to: obtain image files of multiple frames of the video and extract the DC coefficients in the compressed domain of each image file, so that the image file can be recognized after only partial decompression, and output a third weighting factor according to the result of recognizing the image files;
an identification module, configured to combine the first weighting factor, the second weighting factor and the third weighting factor to identify whether the video is a harmful video.
Similarly to the embodiments of the methods described above:
Preferably, the second database is a third-party database.
More preferably, the second weighting factor generation module further includes:
a correction unit, configured to: further query the security of the domain name in a third-party domain name security list so as to output a security factor, and correct the domain-related second weighting factor by means of the security factor.
More preferably, the image files of the multiple frames of the video referred to in the third weighting factor generation module are obtained in a random manner.
More preferably, the third weighting factor generation module further obtains the image files of the multiple frames of the video through the following units:
an audio extraction unit, configured to extract the audio from the video;
an audio recognition unit, configured to recognize whether the audio contains harmful content and, if so, obtain the image files of multiple frames within the start and end times of that audio according to those start and end times.
In another embodiment, the present disclosure discloses a system for identifying harmful videos, including:
a processor and a memory, the memory storing executable instructions which the processor executes to perform the following operations:
step a): obtaining the URL path of a video, then obtaining the domain name and the IP address from the URL path; based on the IP address, querying a first database for the presence of that IP address or of IP addresses in the same network segment, and outputting an IP-related first weighting factor according to the result of the IP address query;
step b): based on the domain name, performing a whois query against a second database, and outputting a domain-related second weighting factor according to the whois query result;
step c): obtaining image files of multiple frames of the video and extracting the DC coefficients in the compressed domain of each image file, so that the image file can be recognized after only partial decompression, and outputting a third weighting factor according to the result of recognizing the image files;
step d): combining the first weighting factor, the second weighting factor and the third weighting factor to identify whether the video is a harmful video.
In another embodiment, the present disclosure also discloses a computer storage medium storing executable instructions, the instructions being used to perform the following method of identifying harmful videos:
step a): obtaining the URL path of a video, then obtaining the domain name and the IP address from the URL path; based on the IP address, querying a first database for the presence of that IP address or of IP addresses in the same network segment, and outputting an IP-related first weighting factor according to the result of the IP address query;
step b): based on the domain name, performing a whois query against a second database, and outputting a domain-related second weighting factor according to the whois query result;
step c): obtaining image files of multiple frames of the video and extracting the DC coefficients in the compressed domain of each image file, so that the image file can be recognized after only partial decompression, and outputting a third weighting factor according to the result of recognizing the image files;
step d): combining the first weighting factor, the second weighting factor and the third weighting factor to identify whether the video is a harmful video.
The above system may include: at least one processor (for example a CPU), at least one sensor (for example an accelerometer, a gyroscope, a GPS module or another positioning module), at least one memory and at least one communication bus, the communication bus being used to realize the connection and communication between the components. The device may further include at least one receiver and at least one transmitter, where the receiver and the transmitter may be wired transmission ports or wireless devices (for example including antenna arrangements) used for signalling or data transmission with other node devices. The memory may be a high-speed RAM memory or a non-volatile memory, for example at least one disk memory, and may optionally be at least one storage device located remotely from the aforementioned processor. A set of program code is stored in the memory, and the processor can call the code stored in the memory through the communication bus to perform the related functions.
Embodiments of the present disclosure further provide a computer storage medium, where the computer storage medium can store a program which, when executed, includes some or all of the steps of any of the methods for identifying harmful videos described in the above method embodiments.
The steps in the methods of the embodiments of the present disclosure may be reordered, combined or deleted according to actual needs.
The modules and units in the systems of the embodiments of the present disclosure may be combined, divided or deleted according to actual needs. It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but those skilled in the art should know that the present invention is not limited by the described order of actions, since according to the present invention certain steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and that the actions, modules and units involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in a given embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed system may be implemented in other manners. For example, the embodiments described above are merely illustrative; for instance, the division into units is only a division by logical function, and in actual implementation there may be other ways of dividing them, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. Furthermore, the mutual couplings, direct couplings or communication connections between the units or components may be realized through certain interfaces, and the indirect couplings or communication connections between devices or units may be electrical or take other forms.
The units described as separate components may or may not be physically separate; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist separately, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a smartphone, a personal digital assistant, a wearable device, a laptop or a tablet) to perform all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or replace some of the technical features with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present disclosure.
Claims (10)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711499942.9A CN110019946A (en) | 2017-12-30 | 2017-12-30 | A kind of method and its system identifying harmful video |
PCT/CN2018/072224 WO2019127651A1 (en) | 2017-12-30 | 2018-01-11 | Method and system thereof for identifying malicious video |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711499942.9A CN110019946A (en) | 2017-12-30 | 2017-12-30 | A kind of method and its system identifying harmful video |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110019946A true CN110019946A (en) | 2019-07-16 |
Family
ID=67062929
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711499942.9A Pending CN110019946A (en) | 2017-12-30 | 2017-12-30 | A kind of method and its system identifying harmful video |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110019946A (en) |
WO (1) | WO2019127651A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112989950A (en) * | 2021-02-11 | 2021-06-18 | Wenzhou University | Violent video recognition system oriented to multi-mode feature semantic correlation features |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104615760A (en) * | 2015-02-13 | 2015-05-13 | 北京瑞星信息技术有限公司 | Phishing website recognizing method and phishing website recognizing system |
CN106055574A (en) * | 2016-05-19 | 2016-10-26 | 微梦创科网络科技(中国)有限公司 | Method and device for recognizing illegal URL |
CN106101740A (en) * | 2016-07-13 | 2016-11-09 | 百度在线网络技术(北京)有限公司 | A kind of video content recognition method and apparatus |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6751348B2 (en) * | 2001-03-29 | 2004-06-15 | Fotonation Holdings, Llc | Automated detection of pornographic images |
CN100362805C (en) * | 2005-11-18 | 2008-01-16 | 郑州金惠计算机系统工程有限公司 | Multifunctional management system for network pornographic image and bad information detection |
CN100490532C (en) * | 2006-04-30 | 2009-05-20 | 华为技术有限公司 | Video code stream filtering method and filtering node |
CN101751782A (en) * | 2009-12-30 | 2010-06-23 | 北京大学深圳研究生院 | Crossroad traffic event automatic detection system based on multi-source information fusion |
CN102208992B (en) * | 2010-06-13 | 2015-09-02 | 天津海量信息技术有限公司 | The malicious information filtering system of Internet and method thereof |
- 2017-12-30: Application filed in China as CN201711499942.9A (published as CN110019946A); status: Pending
- 2018-01-11: International application filed as PCT/CN2018/072224 (published as WO2019127651A1); status: Application Filing
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104615760A (en) * | 2015-02-13 | 2015-05-13 | 北京瑞星信息技术有限公司 | Phishing website recognizing method and phishing website recognizing system |
CN106055574A (en) * | 2016-05-19 | 2016-10-26 | 微梦创科网络科技(中国)有限公司 | Method and device for recognizing illegal URL |
CN106101740A (en) * | 2016-07-13 | 2016-11-09 | 百度在线网络技术(北京)有限公司 | A kind of video content recognition method and apparatus |
Non-Patent Citations (1)
Title |
---|
YANG Hui et al.: "Research on the influence of DCT coefficients in the compressed domain on image and video retrieval", Journal of Nanjing Institute of Posts and Telecommunications * |
Also Published As
Publication number | Publication date |
---|---|
WO2019127651A1 (en) | 2019-07-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10909425B1 (en) | Systems and methods for mobile image search | |
US11334768B1 (en) | Ephemeral content management | |
US20160132938A1 (en) | Extending audience reach in messaging campaigns using probabilistic id linking | |
CN109564570B (en) | Browsing node creation using frequent pattern mining | |
CN103733190B (en) | Method and device for protecting privacy of network data | |
CN108027820A (en) | For producing phrase blacklist to prevent some contents from appearing in the method and system in search result in response to search inquiry | |
CN116796015A (en) | Visual multi-mode data acquisition, transmission and processing method | |
CN113076981B (en) | Data processing method and device | |
CN110019892B (en) | A method and system for identifying harmful pictures based on user ID | |
CN110069649A (en) | Graphics Document Retrieval Method method, apparatus, equipment and computer readable storage medium | |
CN105653717B (en) | Information sharing method and device | |
CN110020256A (en) | The method and system of the harmful video of identification based on User ID and trailer content | |
CN110019946A (en) | A kind of method and its system identifying harmful video | |
CN110020252B (en) | A method and system for identifying harmful videos based on credit content | |
KR20250092058A (en) | System for supporting causal inference of crime | |
CN109993036A (en) | A method and system for identifying harmful videos based on user ID | |
CN103841121A (en) | Comment and interaction system and method based on local files | |
CN110020253A (en) | The method and its system of the harmful video of the identification of video copy based on content | |
CN110020254A (en) | The method and system of the harmful video of identification based on User IP and video copy | |
WO2019127654A1 (en) | Method and system for identifying harmful videos on basis of user ip and credits content | |
CN110020255A (en) | A kind of method and its system identifying harmful video based on User IP | |
CN112148962A (en) | Method and device for pushing information | |
CN110020257A (en) | The method and system of the harmful video of identification based on User ID and video copy | |
US9426173B2 (en) | System and method for elimination of spam in a data stream according to information density | |
WO2019127663A1 (en) | Harmful picture identification method and system therefor |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190716 |