WO2019109529A1

WO2019109529A1 - Webpage identification method, device, computer apparatus, and computer storage medium

Info

Publication number: WO2019109529A1
Application number: PCT/CN2018/077064
Authority: WO
Inventors: 王元铭
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2017-12-08
Filing date: 2018-02-23
Publication date: 2019-06-13
Anticipated expiration: 2020-06-08
Also published as: CN108092963A; CN108092963B

Abstract

A webpage identification method, a device, a computer apparatus, and a storage medium. The method comprises: acquiring a webpage whose identified risk level is greater than a preset level, and extracting the domain name of the website corresponding to the webpage; acquiring, according to the domain name of the website, a network address corresponding to the website; searching for a domain name associated with the network address and, when such a domain name is found, using the associated domain name as a domain name to be identified; acquiring webpage data from the website corresponding to the domain name to be identified; and obtaining, according to the acquired webpage data, a webpage that corresponds to the domain name to be identified and has a risk level greater than the preset level.

Description

Web page identification method, device, computer device and computer storage medium

This application claims priority to Chinese Patent Application No. 201711297266.7, filed with the Chinese Patent Office on December 8, 2017 and entitled "Webpage identification method, apparatus, computer device, and computer storage medium", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of network security, and in particular to a webpage identification method, apparatus, computer device, and storage medium.

Background art

With the development of Internet technology, more and more activities are carried out online, such as conducting transactions and handling banking business. As a result, some websites disguise themselves as banks and, when users visit them, steal private information such as the bank account numbers and passwords that the users submit. If such threatening websites are not discovered in time, they endanger users' property and harm users' interests.

Traditionally, because a large number of webpages are generated every day, potentially threatening target webpages must first be selected from the large number of webpages generated on the Internet, and the selected target webpages must then undergo cumbersome analysis. As a result, identifying whether the risk level of a target webpage is greater than a preset level is inefficient.

Summary of the invention

According to various embodiments of the present application, a webpage identification method, apparatus, computer device, and computer storage medium are provided, which solve one or more of the problems described in the background.

A website identification method, comprising:

acquiring a webpage whose identified risk level is greater than a preset level, and extracting the website domain name corresponding to the webpage;

acquiring the network address corresponding to the website according to the website domain name;

searching for a domain name associated with the network address and, when a domain name associated with the network address is found, using the associated domain name as a domain name to be identified;

acquiring webpage data from the website corresponding to the domain name to be identified; and

obtaining, according to the acquired webpage data, a webpage that corresponds to the domain name to be identified and has a risk level greater than the preset level.

A webpage identification device, the device comprising:

a first acquisition module, configured to acquire a webpage whose identified risk level is greater than a preset level, and to extract the website domain name corresponding to the webpage;

a second acquisition module, configured to acquire the network address corresponding to the website according to the website domain name;

a search module, configured to search for a domain name associated with the network address and, when a domain name associated with the network address is found, to use the associated domain name as a domain name to be identified;

a third acquisition module, configured to acquire webpage data from the website corresponding to the domain name to be identified; and

an identification module, configured to obtain, according to the acquired webpage data, a webpage that corresponds to the domain name to be identified and has a risk level greater than the preset level.

A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor, when executing the computer-readable instructions, implements the following steps:

One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring a webpage whose identified risk level is greater than a preset level, and extracting the website domain name corresponding to the webpage;

Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.

Description of the drawings

In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is an application scenario diagram of a webpage identification method in an embodiment;

FIG. 2 is a flowchart of a webpage identification method in an embodiment;

FIG. 3 is a schematic structural diagram of a webpage identification device in an embodiment;

FIG. 4 is a schematic structural diagram of a computer device in an embodiment.

Detailed description

In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present application and are not intended to limit it.

Before the embodiments of the present application are described in detail, it should be noted that the described embodiments reside primarily in combinations of steps and apparatus components related to the webpage identification method, apparatus, computer device, and storage medium. Accordingly, the apparatus components and method steps are represented where appropriate by conventional symbols in the drawings, which show only the details relevant to understanding the embodiments of the present application, so that the disclosure is not obscured by details that would be readily apparent to those of ordinary skill in the art having the benefit of the present application.

In this document, relational terms such as left and right, up and down, front and back, and first and second are used only to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. The terms "comprise", "include", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device.

Referring to FIG. 1, FIG. 1 is an application scenario diagram of the webpage identification method in an embodiment, which includes a webpage identification platform and a server. The webpage identification platform acquires from the server a stored webpage whose identified risk level is greater than a preset level, obtains the webpage address of that webpage, and extracts from the webpage address the website domain name corresponding to the webpage. The platform then obtains the network address corresponding to the website according to the website domain name and, using that network address, searches the address association library stored on the platform for domain names associated with the network address. When a domain name associated with the network address is found, the associated domain name is used as the domain name to be identified. Finally, the platform acquires the webpage data of the webpages contained in the website corresponding to the domain name to be identified and, according to the acquired webpage data, obtains the webpages that correspond to the domain name to be identified and have a risk level greater than the preset level.

Referring to FIG. 2, one embodiment provides a flowchart of a webpage identification method. This embodiment is illustrated by applying the method to the webpage identification platform in FIG. 1 above; a webpage identification program runs on the platform, and the webpage identification processing is carried out by that program. The method includes the following steps:

S202: Acquire a webpage whose identified risk level is greater than a preset level, and extract the website domain name corresponding to the webpage.

Specifically, the risk level is a security indicator used to evaluate whether a webpage is safe, and may consist of preset levels. For example, risk levels may be set from low to high, where a higher risk level indicates a higher risk in the corresponding webpage; if the risk levels are set from level 1 to level 5, the risk of the webpage increases from level 1 to level 5. The website domain name is the identifier of a website, and a single website domain name may host multiple webpages. For example, the website domain name of the website "Baidu" is "baidu.com", and there are multiple webpages under that domain name, such as the "Baidu Encyclopedia" webpage. A risk database is provided in the server. The risk database stores webpages whose risk level is greater than the preset level, that is, high-risk webpages. The webpage identification platform acquires from the server a webpage whose identified risk level is greater than the preset level, obtains the webpage address of that webpage, and extracts the website domain name from the webpage address. It should be noted that the webpage address of a webpage is the unique identifier that each webpage has on the network, and may be a URL (Uniform Resource Locator) address. The risk database is a database that stores webpages whose risk level is greater than the preset level.
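Step S202 can be illustrated with a minimal Python sketch. It assumes that the risk database is available as a list of (webpage URL, risk level) pairs; the names `RISK_DB`, `PRESET_LEVEL`, `extract_site_domain`, and `risky_domains` are hypothetical and do not come from the application. The sketch returns the full host name found in the URL; reducing a host such as "baike.baidu.com" to "baidu.com" would need additional rules that the text does not specify.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlparse

# Hypothetical contents of the risk database: (webpage URL, risk level 1..5).
RISK_DB = [
    ("http://bad-bank.test/login", 5),
    ("http://ok.test/", 1),
    ("http://bad-bank.test/prize", 4),
]
PRESET_LEVEL = 3  # assumed threshold, for illustration only


def extract_site_domain(page_url: str) -> str:
    """Return the host part of a webpage URL (lower-cased, without the port)."""
    host = urlparse(page_url).hostname
    if host is None:
        raise ValueError(f"no host found in URL: {page_url!r}")
    return host


def risky_domains(rows: Iterable[Tuple[str, int]], preset_level: int) -> List[str]:
    """Collect the distinct domains of pages whose risk level exceeds the preset level."""
    return sorted({extract_site_domain(url) for url, level in rows if level > preset_level})
```

With the sample rows above, `risky_domains(RISK_DB, PRESET_LEVEL)` yields the single domain shared by the two high-risk pages.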

S204: Acquire the network address corresponding to the website according to the website domain name.

Specifically, a network address is a communication identifier used when computers on a network connect to or communicate with each other. It may be the network address of a computer in a network, uniquely identifies that computer device in the network, and can be used as the communication identifier when the computer communicates with other computers. For example, the network address may be an IP (Internet Protocol) address, and different website domain names have corresponding network addresses. Further, the webpage identification platform queries the network address corresponding to the website according to the website domain name. For example, the platform may send test data to the website server corresponding to the website according to the acquired website domain name; when the website server returns response data, the platform extracts the corresponding network address from the received response data.
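As a rough illustration of step S204, the following Python sketch obtains the IP addresses for a domain name through the operating system's resolver (`socket.getaddrinfo`). This is a substitute technique: the text describes sending test data to the website server and reading the address from the response, which is not what the sketch does. The function name `resolve_addresses` and its `lookup` parameter are hypothetical; the parameter exists only so the function can be exercised without network access.

```python
import socket
from typing import Callable, List


def resolve_addresses(domain: str, lookup: Callable = socket.getaddrinfo) -> List[str]:
    """Return the distinct IP addresses for `domain`, in first-seen order.

    An empty list is returned when the name cannot be resolved.
    """
    try:
        infos = lookup(domain, None)
    except socket.gaierror:
        return []
    addresses: List[str] = []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses
```

Calling `resolve_addresses("example.com")` performs a live lookup, so its result depends on the network environment.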

S206: Search for a domain name associated with the network address; when a domain name associated with the network address is found, use the associated domain name as the domain name to be identified.

Specifically, associated domain names are domain names that can share the same network address. When the websites corresponding to different website domain names are hosted on the same website server, they can share the same network address; the websites have different access ports on the website server, and the websites corresponding to different domain names are distinguished by access port. Further, the webpage identification platform pre-stores different network addresses and their corresponding website domain names. According to the acquired network address, the platform queries the domain names associated with that network address; an associated domain name is different from the website domain name already identified as having a risk level greater than the preset level. When a domain name associated with the network address of the website whose risk level is greater than the preset level is found, the associated domain name is used as the domain name to be identified.
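The lookup in step S206 can be sketched in Python as a query against a pre-stored mapping from network addresses to domain names. The mapping `ADDRESS_LIBRARY`, its sample entries, and the function `associated_domains` are hypothetical and serve only to show the idea of excluding the already identified domain from the result.

```python
from typing import Dict, List, Set

# Hypothetical pre-stored mapping: network address -> domain names seen on it.
ADDRESS_LIBRARY: Dict[str, Set[str]] = {
    "203.0.113.7": {"bad-bank.test", "prize-center.test", "points-mall.test"},
    "198.51.100.20": {"unrelated.test"},
}


def associated_domains(
    address: str,
    known_risky_domain: str,
    library: Dict[str, Set[str]],
) -> List[str]:
    """Return the domains sharing `address`, excluding the already identified one."""
    candidates = library.get(address, set())
    return sorted(d for d in candidates if d != known_risky_domain)
```

Each returned domain would then be treated as a domain name to be identified; an empty list corresponds to the case where no associated domain name is found.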

S208: Acquire webpage data from the website corresponding to the domain name to be identified.

Specifically, webpage data is the content displayed on a webpage, and may be text data, image data, numeric data, and so on. A website may contain different webpages. After the domain name associated with the network address of the website whose identified risk level is greater than the preset level has been taken as the domain name to be identified, the webpage identification platform locates the website corresponding to the domain name to be identified and acquires the webpage data of the different webpages contained in that website, for example the text data displayed on those webpages.

S210: Obtain, according to the acquired webpage data, a webpage that corresponds to the domain name to be identified and has a risk level greater than the preset level.

Specifically, the webpage identification platform analyzes the acquired webpage data; when suspicious data is present in the webpage data, the webpage containing that data is treated as a webpage whose risk level is greater than the preset level. For example, the platform may examine the characters in the text data of the acquired webpage data one by one; when suspicious text data is recognized, the webpage containing that text data is a webpage that corresponds to the domain name to be identified and has a risk level greater than the preset level. It should be noted that suspicious data may be preset data: when a webpage contains the preset data, the webpage is a webpage whose risk level is greater than the preset level. Suspicious data may be text data, image data, numeric data, and so on; for example, the suspicious data may be set to the words "bank", "points", or "prize".
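A minimal Python sketch of the text check in step S210 follows. It assumes the page text has already been fetched and uses the three example words from the paragraph above as the preset suspicious data. The function names are hypothetical, and plain case-insensitive substring matching is a simplification of the character-by-character recognition described in the text (for instance, "bank" would also match inside "embankment").

```python
from typing import Dict, List, Sequence

# Preset suspicious words, taken from the example in the text.
SUSPICIOUS_TERMS = ("bank", "points", "prize")


def find_suspicious_terms(page_text: str, terms: Sequence[str] = SUSPICIOUS_TERMS) -> List[str]:
    """Return the preset terms that occur in the page text (case-insensitive)."""
    lowered = page_text.lower()
    return [term for term in terms if term.lower() in lowered]


def flag_risky_pages(
    pages: Dict[str, str], terms: Sequence[str] = SUSPICIOUS_TERMS
) -> Dict[str, List[str]]:
    """Map each page URL that contains suspicious text to the terms found on it."""
    flagged: Dict[str, List[str]] = {}
    for url, text in pages.items():
        hits = find_suspicious_terms(text, terms)
        if hits:
            flagged[url] = hits
    return flagged
```

Pages that appear in the result would be treated as webpages whose risk level is greater than the preset level; pages without any preset term are left out.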

In this embodiment, the webpage identification platform starts from one webpage already identified as having a risk level greater than the preset level, queries the other domain names associated with it, and obtains further webpages whose risk level is greater than the preset level from the webpage data of the websites corresponding to the associated domain names. Because one such webpage leads by association to other webpages whose risk level is greater than the preset level, query efficiency is improved.

In one embodiment, step S206, that is, the step of searching for a domain name associated with the network address, may include the following process:

The network address is matched against the network addresses pre-stored in the address association library. Specifically, the address association library is a database that stores different network addresses and the domain names corresponding to those network addresses. The webpage identification platform acquires a webpage whose risk level is greater than the preset level, obtains its webpage address, extracts the corresponding website domain name from the webpage address, and obtains the network address of the website according to the website domain name. The platform then matches that network address against all network addresses pre-stored in the address association library one by one, traversing every network address stored in the library.

When the network address successfully matches a network address pre-stored in the address association library, the associated domain name to be matched that is associated with the pre-stored network address is obtained. Specifically, an associated domain name to be matched is a domain name associated with a network address pre-stored in the address association library. The domain name may be the identifier of a website, and once the network address is found in the address association library, the associated domain name to be matched that corresponds to it can be obtained. The webpage identification platform matches the network address of the website whose identified risk level is greater than the preset level against all network addresses stored in the address association library one by one, selects the stored network address that matches, and obtains from the address association library the associated domain name to be matched that is associated with the matched network address.

The valid expiration time of the associated domain name to be matched is obtained from the address association library. Specifically, the valid expiration time is the final time until which the associated domain name to be matched remains valid. It may be expressed as a year, as a specific month of a year, or as a specific date; for example, the valid expiration time may be the year 2017, the month December 2017, or the date December 31, 2017. After the webpage identification platform has matched the network address of the webpage whose identified risk level is greater than the preset level against a network address stored in the address association library and obtained the associated domain name to be matched, it obtains from the address association library the valid expiration time corresponding to that domain name, that is, the final time until which the domain name is valid.

If the current time is less than or equal to the valid expiration time, the associated domain name to be matched is extracted as the domain name to be identified. Specifically, the current time is the time at which the associated domain name to be matched is obtained, and may be the system time; for example, it may be expressed as a year, a specific month of a year, or a specific date. The webpage identification platform obtains the associated domain name to be matched, obtains the current time, and compares the current time with the valid expiration time corresponding to the domain name. If the current time is less than or equal to the valid expiration time, the associated domain name to be matched has not passed its valid expiration time and is therefore valid; the platform then takes it as the associated domain name, and in turn as the domain name to be identified.
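The validity filter described above can be sketched in Python as a date comparison. The sketch assumes that each record is a pair of a domain name and a valid expiration date; the record layout and the function name `valid_candidates` are hypothetical, and the optional `today` parameter stands in for the system time.

```python
from datetime import date
from typing import Iterable, List, Optional, Tuple


def valid_candidates(
    records: Iterable[Tuple[str, date]],
    today: Optional[date] = None,
) -> List[str]:
    """Keep the domains whose valid expiration date has not passed.

    A domain is kept when today <= valid_until, matching the
    "less than or equal to" condition in the text.
    """
    if today is None:
        today = date.today()  # system time, as suggested in the text
    return [domain for domain, valid_until in records if today <= valid_until]
```

For example, with one record valid until December 31, 2017 and another valid until November 30, 2017, a check made on December 31, 2017 keeps only the first record.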

It should be noted that, in this embodiment, the address association library may be a passive DNS (passive Domain Name System) database. The webpage identification platform matches the network address of the website whose identified risk level is greater than the preset level against the network addresses stored in the passive DNS database. When the match succeeds, the platform obtains the associated domain name to be matched that corresponds to the matched network address in the passive DNS database; when the current time at which that domain name is obtained is less than or equal to its valid expiration time, the domain name is taken as the associated domain name.

It should be noted that a webpage whose risk level is greater than the preset level may be a high-risk webpage disguised as a normal webpage, such as a phishing webpage, which steals information such as the user's bank card details when visited and thereby threatens the user's property. It may also be another webpage to which access is restricted for risk-control purposes; for example, some enterprises assign access permissions to internal webpages, and a webpage with restricted access can be regarded as a webpage whose risk level is greater than the preset level. In this embodiment, the webpage identification platform obtains the associated domain name to be matched according to the pre-stored network address matched in the address association library, and compares the current time with the valid expiration time of that domain name. When the current time is less than or equal to the valid expiration time, the domain name is valid and can be taken as the associated domain name and, in turn, as the domain name to be identified. Filtering out invalid domain names directly by comparing the current time with the valid expiration time is simple and efficient, and improves the accuracy of selecting associated domain names.

In one embodiment, the webpage identification method may further include the following step, which may be performed after step S206, that is, after searching for a domain name associated with the network address:

When no domain name associated with the network address is found, the registration data corresponding to the domain name of the website is obtained, and the corresponding domain name queried according to the registration data is used as the domain name to be identified. Specifically, registration data is data giving details of the user who registered the domain name of the website, and may be text data, image data, numeric data, and so on; for example, it may be a personal name, a personal email address, a personal telephone number, or a personal photograph. When the network address does not match any network address pre-stored in the address association library, no associated domain name to be matched is obtained. The webpage identification platform then obtains the registration data corresponding to the domain name of the website whose identified risk level is greater than the preset level, and queries the domain names corresponding to that registration data. A domain name found in this way that is different from the domain name of the website whose risk level is greater than the preset level is used as the domain name to be identified.
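The fallback query by registration data can be sketched in Python as follows. The text does not say how registration records are stored or compared, so the sketch assumes an in-memory mapping from domain name to a record with `name`, `email`, and `phone` fields, and treats two domains as related when any one of these fields is identical. All names and sample values are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical registration records keyed by domain name.
REGISTRATIONS: Dict[str, Dict[str, str]] = {
    "bad-bank.test": {"name": "ZhangSan", "email": "zs@mail.test", "phone": "13800000000"},
    "points-mall.test": {"name": "LiSi", "email": "zs@mail.test", "phone": "13900000000"},
    "unrelated.test": {"name": "WangWu", "email": "ww@mail.test", "phone": "13700000000"},
}


def domains_by_registration(
    risky_domain: str,
    registrations: Dict[str, Dict[str, str]],
    fields: Sequence[str] = ("name", "email", "phone"),
) -> List[str]:
    """Return other domains that share at least one registration field
    with `risky_domain` (assumed matching rule, not taken from the text)."""
    target = registrations.get(risky_domain)
    if target is None:
        return []
    matches = []
    for domain, record in registrations.items():
        if domain == risky_domain:
            continue  # the result must differ from the already identified domain
        if any(record.get(f) and record.get(f) == target.get(f) for f in fields):
            matches.append(domain)
    return sorted(matches)
```

In the sample data, "points-mall.test" shares the registrant email of "bad-bank.test" and would therefore become a domain name to be identified.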

In this embodiment, when no domain name associated with the network address of the website whose identified risk level is greater than the preset level is found in the address association library, a different domain name is queried according to the registration data of that website and used as the domain name to be identified. In other words, associated domain names can be queried a second time through the registration information, which improves the accuracy of finding websites whose risk level is greater than the preset level.

In one embodiment, the step of obtaining the registration data corresponding to the domain name of the website and querying the corresponding domain name according to the registration data as the domain name to be identified may include the following process:

Obtain the registration data corresponding to the domain name of the website, and select the conversion logic corresponding to the registration data from the conversion logic library. Specifically, the conversion logic library is a database storing the conversion logic that converts registration data into a fixed format. Conversion logic is a rule for converting registration data; for example, it may replace characters in the registration data with preset characters, or delete invalid characters. Further, when the webpage identification platform obtains a webpage whose identified risk level is greater than the preset level, it extracts the domain name of the corresponding website from the webpage address and then obtains the registration data for that domain name. If the obtained registration data is not in the prescribed display format, the platform selects the conversion logic for that registration data from the conversion logic library according to the type of the registration data, so that the data can be converted to the prescribed display format. For example, the platform obtains the registration data, such as the registered name, registered email address, and registered phone number, for the domain name of a website whose identified risk level is greater than the preset level. If the registered name contains spaces and the registered phone number contains hyphens, the platform selects conversion logic by data type: for the registered name, the logic that deletes the spaces; for the registered phone number, the logic that deletes the hyphens.

Convert the registration data according to the conversion logic to obtain converted registration data. Specifically, once the webpage identification platform has selected the conversion logic, that is, the rule for converting the registration data (such as replacing characters in the registration data with preset characters or deleting invalid characters), it converts the registration data accordingly. The converted registration data can then be displayed in the prescribed display format. For example, if the registration data includes a registered name, a registered email address, and a registered phone number, and the platform has selected the conversion logic for the registered name and the registered phone number, it deletes the invalid space characters from the registered name and deletes the hyphens from the registered phone number.
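The selection and application of conversion logic described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary `CONVERSION_LOGIC`, the field names `name`, `phone`, and `email`, and the function `convert_registration_data` are hypothetical. The rules for names and phone numbers follow the examples in the text; the email rule is an added assumption.

```python
import re

# Hypothetical "conversion logic library": one rule per registration data type.
CONVERSION_LOGIC = {
    "name": lambda value: re.sub(r"\s+", "", value),    # delete spaces (from the text)
    "phone": lambda value: re.sub(r"[-\s]", "", value),  # delete hyphens (from the text)
    "email": lambda value: value.strip().lower(),        # assumption, not in the source
}


def convert_registration_data(registration):
    """Return a copy of the registration data in a fixed format.

    Fields without a rule in the library (for example a photo) are
    passed through unchanged.
    """
    converted = {}
    for field, value in registration.items():
        rule = CONVERSION_LOGIC.get(field)
        converted[field] = rule(value) if rule else value
    return converted


raw = {"name": "Zhang  San", "phone": "0755-1234-5678", "email": " ZS@Example.com "}
print(convert_registration_data(raw))
```

Keying the rules by data type mirrors the statement that the logic is selected "according to the type of the registration data"; a real library would likely hold more rules per type.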

Match the converted registration data against the information data stored in the information repository. Specifically, the information repository is a database storing registration information and the domain names associated with that registration information. It may store registered names, registered email addresses, registered phone numbers, and so on; the stored names, email addresses, and phone numbers may correspond to one another, and the repository may store the website domain names associated with the registration information. Information data is data giving the details of the registrant of a domain name. It may be text data, numeric data, or picture data; for example, a name, a phone number, an email address, or a photo. The webpage identification platform matches the registration data against the information data stored in the information repository item by item. For example, if the registration data obtained is a registered name, a registered email address, and a registered phone number, the platform first converts each of them according to the conversion rules. It then matches the converted registered name against the names stored in the repository, the converted registered phone number against the stored phone numbers, and the converted registered email address against the stored email addresses.

When the converted registration data matches information data stored in the information repository, the domain name associated with the matched information data is obtained as the domain name to be identified. Specifically, the webpage identification platform matches the converted registration data against the information data stored in the information repository item by item; when matching information data is found, the platform obtains the domain name associated with it and uses that domain name as the domain name to be identified. In one variant, the platform matches every item of the registration data, and obtains the associated domain name only when every item matches. The platform first matches the converted registered name against the names stored in the repository. If that succeeds, it matches the registered email address against the stored email address corresponding to that name. If that also succeeds, it matches the registered phone number against the stored phone number corresponding to that name and email address. When the phone number matches as well, the platform extracts the domain name associated with the matched name, phone number, and email address and uses it as the domain name to be identified. It should be noted that the platform may instead match using any single item of the registration data; when that item matches, the domain name associated with the matched information data is used as the domain name to be identified. For example, if the converted registered name matches a name stored in the repository, the domain name associated with that name is extracted directly as the domain name to be identified.
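The two matching variants described above (every item must match, or any single item is enough) can be sketched as follows. The record layout, the sample data, and the function `find_domains` are hypothetical; the information repository is modeled as an in-memory list rather than a real database. Skipping the already identified domain follows the earlier statement that the queried domain name differs from the domain name of the identified website.

```python
# Hypothetical information repository: registrant details plus the associated domain.
INFO_REPOSITORY = [
    {"name": "ZhangSan", "email": "zs@example.com", "phone": "075512345678", "domain": "risky.test"},
    {"name": "ZhangSan", "email": "zs@example.com", "phone": "075512345678", "domain": "sibling.test"},
    {"name": "ZhangSan", "email": "other@example.com", "phone": "000", "domain": "loose.test"},
]


def find_domains(converted, repository, known_domain, require_all=True):
    """Return domains whose registrant data matches the converted registration data.

    require_all=True  -> name, email and phone must all match (strict variant).
    require_all=False -> any single field matching is enough (loose variant).
    """
    fields = [f for f in ("name", "email", "phone") if f in converted]
    if not fields:
        return []
    combine = all if require_all else any
    result = []
    for record in repository:
        if record["domain"] == known_domain:
            continue  # only domains different from the identified one are of interest
        if combine(record.get(f) == converted[f] for f in fields):
            if record["domain"] not in result:
                result.append(record["domain"])
    return result


query = {"name": "ZhangSan", "email": "zs@example.com", "phone": "075512345678"}
print(find_domains(query, INFO_REPOSITORY, "risky.test"))
print(find_domains(query, INFO_REPOSITORY, "risky.test", require_all=False))
```

The strict variant trades recall for precision: a shared name alone is weak evidence, while a shared name, email address, and phone number together point much more strongly at the same registrant.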

It should be noted that, in this embodiment, the information repository may be a WHOIS database. When the webpage identification platform has obtained the domain name of a website whose identified risk level is greater than the preset level, and has obtained the registration data for that website according to the domain name, it may match the registration data against the information data stored in the WHOIS database. When the match succeeds, the domain name associated with the matched information data is obtained as the domain name to be identified.

In this embodiment, the webpage identification platform first converts the obtained registration data according to the conversion logic, yielding converted registration data that conforms to the display rules, which improves the accuracy of identifying associated domain names to be identified. The converted registration data is then matched against the information data stored in the information repository, and when the match succeeds, the domain name associated with the matched information data is obtained as the domain name to be identified. Different domain names to be identified can thus be obtained from the registration information, which improves identification efficiency.

In one embodiment, the step of obtaining, according to the acquired webpage data, a webpage that corresponds to the domain name to be identified and whose risk level is greater than the preset level may include:

Match the webpage data against the first filtering data stored in a preset blacklist; when the webpage data matches the first filtering data, add a suspicious label to the domain name to be identified. Specifically, the blacklist stores data whose risk level is greater than the preset level. Such data may be text data, picture data, numeric data, and so on; for example, the blacklist may store strings such as "bank" and "points". First filtering data is data whose risk level is greater than the preset level; when a webpage contains first filtering data, the webpage may be one whose risk level is greater than the preset level. A suspicious label is a mark indicating that the risk level of the domain name to be identified may be greater than the preset level. Specifically, once the webpage identification platform has extracted webpage data from all the webpages of the website corresponding to the domain name to be identified, it matches the extracted webpage data item by item against the first filtering data stored in the preset blacklist. When the webpage data matches any item of first filtering data in the blacklist, the platform adds a suspicious label to the domain name to be identified of the website from which the webpage data came. It should be noted that a match count threshold may also be set: the platform matches all the webpage data against the first filtering data in the blacklist item by item, and adds the suspicious label only when the webpage data matches the preset number of items of first filtering data. The match count threshold may be preset to 1, 3, 4, and so on. Alternatively, the suspicious label may be added when the webpage data of a preset number of webpages in the website corresponding to the domain name to be identified matches the first filtering data in the blacklist.

Match the webpage data of the website corresponding to the domain name to be identified that carries the suspicious label against the second filtering data stored in a preset whitelist. Specifically, the whitelist is a database storing trusted data, that is, data whose risk level is less than or equal to the preset level. Trusted data may be text data, picture data, numeric data, and so on; for example, the whitelist may store strings such as "博彩" ("gaming"). Second filtering data is data whose risk level is less than or equal to the preset level, in other words trusted data; when a webpage contains second filtering data, the website may be a trusted website. Specifically, the webpage identification platform extracts the domain names to be identified that carry the suspicious label, and matches the webpage data on all the webpages of the corresponding website against the second filtering data stored in the preset whitelist, item by item. When the webpage data on all of those webpages matches second filtering data pre-stored in the whitelist, the suspicious label carried by the domain name to be identified is removed. It should be noted that, alternatively, the suspicious label may be removed when the webpage data on a preset number of webpages of that website matches the second filtering data stored in the preset whitelist.

When the webpage data does not match the second filtering data, extract the domain name to be identified that carries the suspicious label, and take the webpages of the website corresponding to that domain name as webpages whose risk level is greater than the preset level. Specifically, when the webpage data of the website corresponding to a labeled domain name to be identified does not match the second filtering data, the domain name still carries the suspicious label. The webpage identification platform extracts the domain names to be identified that still carry the suspicious label, obtains the corresponding websites, and extracts the webpages of those websites as webpages whose risk level is greater than the preset level.

In this embodiment, the webpage data is filtered first with the first filtering data stored in the blacklist and then with the second filtering data stored in the whitelist, yielding the webpages whose risk level is greater than the preset level. This prevents a webpage that contains high-risk webpage data but is in fact trustworthy from being flagged. The two-level filtering improves the accuracy of identifying webpages whose risk level is greater than the preset level.
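The two-level filtering can be sketched as follows, under simplifying assumptions: webpage data is modeled as plain text, "matching" is modeled as substring containment, and the function name `is_high_risk` and the parameter `black_threshold` are hypothetical. The sketch uses the variant in which the suspicious label is removed only when every webpage matches the whitelist.

```python
def is_high_risk(pages, blacklist, whitelist, black_threshold=1):
    """Apply blacklist then whitelist filtering to the pages of one candidate domain.

    pages: list of page texts taken from the website of the domain to be identified.
    Returns True when the domain still carries the suspicious label after both levels.
    """
    # Level 1: count the distinct blacklist terms found anywhere on the site.
    hits = {term for term in blacklist for page in pages if term in page}
    suspicious = len(hits) >= black_threshold
    if not suspicious:
        return False  # no suspicious label was ever added

    # Level 2: the label is removed when every page matches some whitelist term.
    if pages and all(any(term in page for term in whitelist) for page in pages):
        return False

    return True  # label kept: treat the site's pages as high risk


pages = ["bank points login", "about us"]
print(is_high_risk(pages, {"bank", "points"}, {"official"}))
```

Raising `black_threshold` corresponds to the match count threshold mentioned in the text (for example 1, 3, or 4) and makes the first level less sensitive to a single incidental keyword.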

In one embodiment, the webpage identification method may further include:

When no domain name to be identified carries the suspicious label after data identification with the preset blacklist and the preset whitelist, obtain the identifier corresponding to the domain name to be identified. Specifically, an identifier is a mark specific to the website corresponding to the domain name to be identified. It may be an enterprise mark, for example a corporate logo. When the webpage identification platform has run the webpage data from the webpages of all the websites corresponding to the obtained domain names to be identified through the preset blacklist and the preset whitelist, and none of the domain names carries the suspicious label, webpage data identification has found no webpage whose risk level is greater than the preset level. The platform then obtains the identifier corresponding to each domain name to be identified.

Match the identifier against the security identifiers pre-stored in the security identifier repository. Specifically, the security identifier repository is a database storing the identifiers of trusted websites together with the website domain names corresponding to those identifiers. A security identifier is the mark of a trusted website, such as the mark of the enterprise behind a secure webpage; for example, the logo on the ICBC website or the logo on the Ping An Group website. The webpage identification platform matches the obtained identifier against the security identifiers stored in the security identifier repository one by one. For example, if the identifier obtained for the domain name to be identified is the Ping An Group logo, that logo is matched against the security identifiers stored in the security identifier repository.

When a security identifier matches the identifier corresponding to the domain name to be identified, obtain the secure domain name associated with the matched security identifier in the security identifier repository, and match the secure domain name against the domain name to be identified. Specifically, when the identifier corresponding to the domain name to be identified matches a security identifier stored in the security identifier repository, the domain name to be identified may be a secure domain name, and further matching is required. The webpage identification platform therefore obtains the secure domain name associated with the matched security identifier in the security identifier repository and matches it against the domain name to be identified. For example, when the identifier obtained for the domain name to be identified is the Ping An Group logo and it matches the Ping An Group logo stored in the security identifier repository, the platform obtains the domain name "pingan.com" associated with that logo in the repository and matches the domain name to be identified against "pingan.com".

When the secure domain name does not match the domain name to be identified, the webpages of the website corresponding to the domain name to be identified are taken as webpages whose risk level is greater than the preset level. Specifically, when the domain name to be identified fails to match the secure domain name, the identifier corresponding to the domain name to be identified is a forged security identifier, and the webpages of the corresponding website are taken as webpages whose risk level is greater than the preset level. For example, suppose the identifier obtained for the domain name to be identified is the Ping An Group logo. When that logo matches a security identifier stored in the security identifier repository, the platform obtains the associated domain name "pingan.com" from the repository. If the domain name to be identified is not "pingan.com", the website has forged the Ping An Group logo, and its webpages are taken as webpages whose risk level is greater than the preset level.

In this embodiment, when identification based on webpage data yields no suspicious domain name to be identified, the identifier carried by the domain name to be identified is examined further to determine whether the webpages of the corresponding website have a risk level greater than the preset level. Using multiple identification methods improves the accuracy of identifying webpages whose risk level is greater than the preset level.
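The identifier check can be sketched as follows. Comparing real logos would require image matching, which the text does not specify; here a string key such as `"pingan-logo"` stands in for the logo, and the repository contents and the function `uses_forged_identifier` are hypothetical.

```python
# Hypothetical security identifier repository: identifier -> associated secure domain.
SECURITY_IDENTIFIERS = {
    "pingan-logo": "pingan.com",
}


def uses_forged_identifier(candidate_domain, identifier, repository):
    """Return True when the site shows a trusted identifier but is not the trusted domain."""
    secure_domain = repository.get(identifier)
    if secure_domain is None:
        # The identifier is not a known security identifier, so this check
        # draws no conclusion about the candidate domain.
        return False
    return candidate_domain.lower() != secure_domain


print(uses_forged_identifier("pingan-login.test", "pingan-logo", SECURITY_IDENTIFIERS))
print(uses_forged_identifier("pingan.com", "pingan-logo", SECURITY_IDENTIFIERS))
```

An exact string comparison is the simplest reading of "matching" the secure domain name against the domain name to be identified; whether subdomains of the secure domain should also count as a match is left open by the text.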

In one embodiment, after step S210, that is, after the step of obtaining, according to the acquired webpage data, a webpage that corresponds to the domain name to be identified and whose risk level is greater than the preset level, the method may further include:

Extract keywords from the webpage data of the webpages whose risk level is greater than the preset level, and add corresponding category labels to the domain name to be identified according to the keywords. Specifically, a category label identifies the type of the webpage data. Category labels may correspond to different risk categories; for example, a bank category label or a shopping category label. After identifying a webpage whose risk level is greater than the preset level, the webpage identification platform extracts keywords from its webpage data and, according to those keywords, adds the corresponding category labels to the domain name to be identified of the website containing that webpage. For example, if the platform extracts the keywords "points" and "bank" from different high-risk webpages, it adds the corresponding category labels, a "points" label or a "bank" label, to the domain name to be identified of the website containing those webpages.

Match the category labels of the domain name to be identified whose risk level is greater than the preset level against the stored category labels. Specifically, the webpage identification platform matches each category label added to the domain name to be identified against the category labels already stored on the platform, one by one, until all stored category labels have been traversed. For example, if the labels added to the domain name to be identified are "bank" and "points", the label "bank" is matched against the stored category labels one by one, and then the label "points" is matched against the stored category labels one by one.

When no match is found, add the category label of the domain name to be identified to the stored category labels, and store the webpages whose risk level is greater than the preset level under that category label. Specifically, when an added category label does not match any stored category label, it is a new category label. The unmatched category label is added to the stored category labels, and the high-risk webpages of the website corresponding to the domain name to be identified carrying that label are stored under it. For example, suppose the category labels added to the domain name to be identified are "bank" and "points", and each is matched against the stored category labels one by one. If the label "bank" finds no match, "bank" is added to the stored category labels, and the high-risk webpages of the website corresponding to the domain name labeled "bank" are stored under that label.

It should be noted that the webpage identification platform may use a preset time interval to send the updated category labels, together with the high-risk webpages stored under them, to a server for storage. For example, the updated category labels and their corresponding webpages may be sent to the server once every hour.

In this embodiment, keywords are extracted from the webpage data of webpages whose risk level is greater than the preset level, and corresponding category labels are added to the domain name to be identified according to the keywords. If an added category label does not match any stored category label, it is added to the stored category labels, and the high-risk webpages are stored under it. The set of stored category labels thus grows step by step, which broadens the applicability of the method.
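The labeling and storage steps can be sketched as follows. The keyword table `KEYWORD_LABELS` and the functions `extract_labels` and `store_by_label` are hypothetical, and keyword extraction is reduced to substring lookup. The text only describes what happens when a label is new; appending pages under a label that already exists is an assumption made here.

```python
# Hypothetical mapping from extracted keywords to category labels.
KEYWORD_LABELS = {"bank": "bank", "points": "points"}


def extract_labels(page_text, keyword_labels):
    """Return the category labels whose keywords occur in the page text."""
    return sorted({label for keyword, label in keyword_labels.items() if keyword in page_text})


def store_by_label(stored, labels, risky_pages):
    """Store risky pages under each label; return the labels that were newly added."""
    added = []
    for label in labels:
        if label not in stored:      # no match among the stored category labels
            stored[label] = []       # so the label is added as a new category
            added.append(label)
        for page in risky_pages:     # assumption: also append under existing labels
            if page not in stored[label]:
                stored[label].append(page)
    return added


stored = {"points": ["http://old.test/a"]}
labels = extract_labels("bank points reward", KEYWORD_LABELS)
print(store_by_label(stored, labels, ["http://risky.test/login"]))
print(stored)
```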

In one embodiment, the webpage whose risk level is greater than the preset level is a phishing webpage. By way of example, when the webpage identification platform obtains an identified phishing webpage, it extracts the domain name of the website corresponding to that phishing webpage and then obtains, according to the domain name, the network address of the website. The platform then searches for domain names associated with that network address. The search may proceed as follows: the platform matches the network address of the website corresponding to the phishing webpage against the network addresses pre-stored in the address association library; when the match succeeds, it obtains the to-be-matched associated domain name associated with the pre-stored network address and determines, from the validity period of that domain name, whether it is still valid. That is, when the current time is less than or equal to the validity deadline, the to-be-matched associated domain name is extracted as a domain name to be identified. In this way, when the platform finds a domain name associated with the network address, the associated domain name is used as the domain name to be identified.

When no domain name associated with the network address is found by the above method, the platform obtains the registration data corresponding to the domain name of the website and queries the corresponding domain name, according to the registration data, as the domain name to be identified. This query may proceed as follows: the platform obtains the registration data corresponding to the domain name of the website of the phishing webpage, selects from the conversion logic library the conversion logic corresponding to the registration data, and converts the registration data according to that logic to obtain converted registration data. It then matches the converted registration data against the information data stored in the information repository; when the match succeeds, the domain name associated with the matched information data is obtained as the domain name to be identified.

The platform thus first looks for domain names to be identified among the domain names associated with the network address of the website of the identified phishing webpage and, only when none is found, falls back to the registration data of that website. Querying in these two ways helps ensure that no domain name to be identified is missed.
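The two-stage query described above can be sketched as follows. This is a minimal illustration, not the platform's actual implementation: the function name `find_candidate_domains` and the two injected lookup callables are hypothetical, and the sketch only shows the fallback order (address association first, registration data second).

```python
from typing import Callable, Iterable, List


def find_candidate_domains(
    network_address: str,
    registration_data: dict,
    lookup_by_address: Callable[[str], Iterable[str]],
    lookup_by_registration: Callable[[dict], Iterable[str]],
) -> List[str]:
    """Return the domain names to be identified for one known phishing site."""
    # First query: domain names associated with the site's network address.
    candidates = list(lookup_by_address(network_address))
    if candidates:
        return candidates
    # Second query: used only when the first query finds nothing.
    return list(lookup_by_registration(registration_data))
```

Passing the two lookups in as parameters keeps the fallback logic independent of how the address association library and the information repository are actually stored.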

When the webpage identification platform obtains a domain name to be identified, it acquires the webpage data of the webpages contained in the website corresponding to that domain name and matches the webpage data against the first filtering data stored in the preset blacklist. When the match succeeds, a suspicious tag is added to the domain name to be identified of the website from which the matching webpage originates. The webpage data in the website corresponding to the tagged domain name is then matched against the second filtering data stored in the preset whitelist. When the webpage data does not match the second filtering data, the domain name to be identified that carries the suspicious tag is extracted, and the webpages in the corresponding website are treated as phishing webpages.

Further, when no domain name to be identified carries a suspicious tag after data matching against both the preset blacklist and the preset whitelist, the platform obtains the identifier corresponding to the domain name to be identified, such as an enterprise logo, and matches it against the security identifiers pre-stored in the security identifier repository. When the match succeeds, the platform obtains the secure domain name associated with the matched security identifier and matches that secure domain name against the domain name to be identified. When this match fails, the domain name to be identified is masquerading as the secure domain name, and the webpages in the corresponding website are treated as phishing webpages. By querying both the webpage data and the webpage identifier of the webpages contained in the website corresponding to the domain name to be identified, the platform determines whether those webpages are phishing webpages; performing this second round of detection on the webpage identifier in addition to the webpage data improves the accuracy of phishing detection.
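The blacklist and whitelist filtering can be illustrated with the short sketch below. It assumes that "matching" means a case-insensitive substring test against the page text, which the description does not specify; the function name `is_high_risk` and the sample terms are hypothetical.

```python
from typing import Sequence


def is_high_risk(
    pages: Sequence[str],
    blacklist: Sequence[str],
    whitelist: Sequence[str],
) -> bool:
    """Apply the blacklist filter, then the whitelist filter, to one site."""
    texts = [page.lower() for page in pages]
    # Step 1: a blacklist hit puts a "suspicious" tag on the domain name.
    suspicious = any(term.lower() in text for text in texts for term in blacklist)
    if not suspicious:
        return False
    # Step 2: a whitelist hit clears the tagged domain name again.
    cleared = any(term.lower() in text for text in texts for term in whitelist)
    return not cleared
```

A site is reported only when it hits the blacklist and misses the whitelist, mirroring the order of the two filtering steps above.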

Further, when a phishing webpage is identified, keywords are extracted from the webpage data of the phishing webpage, and a category label is added, according to the keywords, to the domain name to be identified corresponding to the phishing webpage. If this category label does not match any stored category label, it is added to the stored category labels, and the phishing webpage is then added under the category label.
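The category labelling step might look like the following sketch. The keyword-to-label mapping, the function name `categorize`, and the use of simple lowercase substring matching as the keyword extractor are all assumptions made for illustration; the description does not say how keywords are extracted.

```python
from typing import Dict, List, Mapping


def categorize(
    page_text: str,
    page_url: str,
    keyword_to_label: Mapping[str, str],
    stored: Dict[str, List[str]],
) -> List[str]:
    """Label a phishing page by keyword and file it under each label."""
    text = page_text.lower()
    # Keywords are assumed to be lowercase; a hit selects that keyword's label.
    labels = sorted({label for kw, label in keyword_to_label.items() if kw in text})
    for label in labels:
        # A label that is not stored yet is added, growing the label set.
        pages = stored.setdefault(label, [])
        if page_url not in pages:
            pages.append(page_url)
    return labels
```

Because unknown labels are created on first use, the stored label set grows as new kinds of phishing pages are identified.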

In this embodiment, a single phishing webpage can be used to find multiple associated domain names to be identified, which improves query efficiency and enhances applicability. Whether the webpages corresponding to a domain name to be identified are phishing webpages is determined by querying both the webpage data of the webpages in the corresponding website and the webpage identifier, which makes the result accurate. The phishing webpages found are classified by category, which facilitates subsequent query and push.

In one embodiment, referring to FIG. 3, a schematic structural diagram of a webpage identification apparatus is provided. The webpage identification apparatus 300 may include:

A first obtaining module 310, configured to obtain a webpage whose identified risk level is greater than a preset level, and to extract the website domain name corresponding to the webpage.

A second obtaining module 320, configured to obtain the network address corresponding to the website according to the website domain name.

A search module 330, configured to search for a domain name associated with the network address and, when a domain name associated with the network address is found, to use the associated domain name as the domain name to be identified.

A third obtaining module 340, configured to obtain the webpage data in the website corresponding to the domain name to be identified.

An identification module 350, configured to obtain, according to the acquired webpage data, a webpage that corresponds to the domain name to be identified and whose risk level is greater than the preset level.

In one embodiment, the search module 330 may include:

A first matching unit, configured to match the network address against the network addresses pre-stored in the address association library.

A domain name obtaining unit, configured to obtain, when the network address matches a network address pre-stored in the address association library, the to-be-matched associated domain name associated with the pre-stored network address.

A time obtaining unit, configured to obtain the validity deadline of the to-be-matched associated domain name.

An extracting unit, configured to extract the to-be-matched associated domain name as the domain name to be identified if the current time is less than or equal to the validity deadline.
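The four units of the search module can be sketched together as one lookup. The in-memory dictionary standing in for the address association library, the `AssociatedDomain` record, and the function name are hypothetical; only the rule "keep the domain name if the current time is less than or equal to the validity deadline" comes from the description.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class AssociatedDomain:
    domain: str
    valid_until: datetime  # validity deadline of this address association


def valid_associated_domains(
    library: Dict[str, List[AssociatedDomain]],
    network_address: str,
    now: datetime,
) -> List[str]:
    """Return associated domain names whose validity deadline has not passed."""
    # An address that is absent from the library simply yields no candidates.
    entries = library.get(network_address, [])
    # Keep an entry only while the current time is <= its validity deadline.
    return [entry.domain for entry in entries if now <= entry.valid_until]
```

Taking `now` as a parameter rather than reading the clock inside the function makes the deadline rule easy to test.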

In one embodiment, the webpage identification apparatus may further include:

A query module, configured to obtain, when no domain name associated with the network address is found, the registration data corresponding to the domain name of the website, and to query the corresponding domain name, according to the registration data, as the domain name to be identified.

In one embodiment, the query module may include:

A selecting unit, configured to obtain the registration data corresponding to the domain name of the website and to select, from the conversion logic library, the conversion logic corresponding to the registration data.

A conversion unit, configured to convert the registration data according to the conversion logic to obtain converted registration data.

A second matching unit, configured to match the converted registration data against the information data stored in the information repository.

A to-be-identified domain name obtaining unit, configured to obtain, when the converted registration data matches information data stored in the information repository, the domain name associated with the matched information data as the domain name to be identified.
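The description does not say what the conversion logic does. The sketch below assumes it normalizes each registration field (for example a registrant e-mail address or phone number) so that it can be compared with stored records; the field names, the `CONVERSION_LOGIC` table, and the repository layout are all hypothetical.

```python
from typing import Callable, Dict, List, Mapping, Tuple

# Hypothetical conversion logic library: one normalizer per registration field.
CONVERSION_LOGIC: Dict[str, Callable[[str], str]] = {
    "email": lambda value: value.strip().lower(),
    "phone": lambda value: "".join(ch for ch in value if ch.isdigit()),
}


def domains_by_registration(
    registration: Mapping[str, str],
    repository: Mapping[Tuple[str, str], List[str]],
) -> List[str]:
    """Find domain names whose stored registration data matches the input."""
    found: List[str] = []
    for field, raw in registration.items():
        convert = CONVERSION_LOGIC.get(field)
        if convert is None:
            continue  # no conversion logic exists for this kind of data
        # Match the converted value against the information repository.
        for domain in repository.get((field, convert(raw)), []):
            if domain not in found:
                found.append(domain)
    return found
```

Under this assumption, differently formatted copies of the same registrant data (extra spaces, mixed case, punctuation in phone numbers) still lead to the same stored domain names.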

In one embodiment, the identification module 350 may further include:

A first filtering unit, configured to match the webpage data against the first filtering data stored in the preset blacklist and, when the webpage data matches the first filtering data, to add a suspicious tag to the domain name to be identified.

A second filtering unit, configured to match the webpage data in the website corresponding to the domain name to be identified to which the suspicious tag has been added against the second filtering data stored in the preset whitelist.

A tagged domain name obtaining unit, configured to extract, when the webpage data does not match the second filtering data, the domain name to be identified that carries the suspicious tag, and to obtain the webpages in the website corresponding to that domain name as webpages whose risk level is greater than the preset level.

In one embodiment, the webpage identification apparatus 300 may further include:

An identifier obtaining module, configured to obtain the identifier corresponding to the domain name to be identified when, after data identification against the preset blacklist and the preset whitelist, no domain name to be identified carries a suspicious tag.

An identifier matching module, configured to match the identifier against the security identifiers pre-stored in the security identifier repository.

A secure domain name matching module, configured to obtain, when a security identifier matches the identifier corresponding to the domain name to be identified, the secure domain name associated with the matched security identifier stored in the security identifier repository, and to match the secure domain name against the domain name to be identified.

A suspicious domain name extraction module, configured to treat, when the secure domain name does not match the domain name to be identified, the webpages in the website corresponding to the domain name to be identified as webpages whose risk level is greater than the preset level.
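A minimal sketch of the identifier check follows. It assumes the identifier is a logo image and compares logos by an exact SHA-256 hash of their bytes, which is a simplification chosen for brevity (a real system would likely need a tolerant image comparison); the function names and the dictionary standing in for the security identifier repository are hypothetical.

```python
import hashlib
from typing import Dict


def logo_fingerprint(logo_bytes: bytes) -> str:
    """Reduce a logo image to a comparable identifier (exact-match hash)."""
    return hashlib.sha256(logo_bytes).hexdigest()


def is_masquerading(
    candidate_domain: str,
    logo_bytes: bytes,
    secure_store: Dict[str, str],
) -> bool:
    """True if the site shows a known logo but is not that logo's domain."""
    # secure_store maps a logo fingerprint to its legitimate (secure) domain.
    secure_domain = secure_store.get(logo_fingerprint(logo_bytes))
    if secure_domain is None:
        return False  # unknown logo: this check gives no verdict
    # A known logo on a different domain name indicates masquerading.
    return candidate_domain.lower() != secure_domain.lower()
```

The check flags a site only when its logo is known and its domain name differs from the secure domain name registered for that logo.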

In one embodiment, the webpage identification apparatus 300 may further include:

A keyword extraction module, configured to extract keywords from the webpage data of a webpage whose risk level is greater than the preset level, and to add, according to the keywords, a corresponding category label to the domain name to be identified corresponding to that webpage.

A label matching module, configured to match the category label of the domain name to be identified whose risk level is greater than the preset level against the stored category labels.

An adding module, configured to add, when the match fails, the category label of the domain name to be identified whose risk level is greater than the preset level, and to store the webpage whose risk level is greater than the preset level under that category label.

For the specific limitations of the webpage identification apparatus, reference may be made to the limitations of the webpage identification method above, which are not repeated here. Each module of the webpage identification apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or may be stored in a memory of the computer device in software form so that the processor can invoke them to perform the operations corresponding to each module. The processor may be a central processing unit (CPU), a microprocessor, a single-chip microcomputer, or the like. The webpage identification apparatus may be implemented in the form of computer readable instructions, which can run on the webpage data processing platform device shown in FIG. 1.

An embodiment of the present application provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer readable instructions, and the database. The internal memory provides an environment for running the operating system and the computer readable instructions stored in the non-volatile storage medium. The database of the computer device stores webpage identification data. The network interface of the computer device communicates with external terminals over a network connection. The computer readable instructions, when executed by the processor, implement a webpage identification method.

Those skilled in the art will understand that the structure shown in FIG. 4 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components. When executing the computer readable instructions, the processor implements the following steps: obtaining a webpage whose identified risk level is greater than a preset level, and extracting the website domain name corresponding to the webpage; obtaining the network address corresponding to the website according to the website domain name; searching for a domain name associated with the network address and, when a domain name associated with the network address is found, using the associated domain name as the domain name to be identified; obtaining the webpage data in the website corresponding to the domain name to be identified; and obtaining, according to the acquired webpage data, a webpage that corresponds to the domain name to be identified and whose risk level is greater than the preset level.

For the specific limitations of the computer device, reference may be made to the limitations of the webpage identification method above, which are not repeated here.

In one embodiment, still referring to FIG. 4, a non-volatile computer readable storage medium storing computer readable instructions is provided. When executed by one or more processors, the computer readable instructions cause the one or more processors to perform the following steps: obtaining a webpage whose identified risk level is greater than a preset level, and extracting the website domain name corresponding to the webpage; obtaining the network address corresponding to the website according to the website domain name; searching for a domain name associated with the network address and, when a domain name associated with the network address is found, using the associated domain name as the domain name to be identified; obtaining the webpage data in the website corresponding to the domain name to be identified; and obtaining, according to the acquired webpage data, a webpage that corresponds to the domain name to be identified and whose risk level is greater than the preset level.

For the specific limitations of the storage medium, reference may be made to the limitations of the webpage identification method above, which are not repeated here.

Those of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments may be carried out by computer readable instructions that direct the relevant hardware. The computer readable instructions may be stored in a non-volatile computer readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments have been described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be determined by the appended claims.

Claims

A webpage identification method, comprising:

Obtaining a webpage whose identified risk level is greater than a preset level, and extracting a website domain name corresponding to the webpage;

Obtaining a network address corresponding to the website according to the website domain name;

Searching for a domain name associated with the network address, and when a domain name associated with the network address is found, using the associated domain name as the domain name to be identified;

Obtaining webpage data in a website corresponding to the domain name to be identified; and

Obtaining, according to the obtained webpage data, a webpage that corresponds to the domain name to be identified and whose risk level is greater than the preset level.

The method of claim 1, wherein the step of searching for a domain name associated with the network address comprises:

Matching the network address against the network addresses pre-stored in the address association library;

Obtaining a to-be-matched associated domain name associated with the pre-stored network address when the network address matches a network address pre-stored in the address association library;

Obtaining a validity deadline of the to-be-matched associated domain name; and

Extracting the to-be-matched associated domain name as the domain name to be identified if the current time is less than or equal to the validity deadline.

The method of claim 1 further comprising:

When no domain name associated with the network address is found, obtaining the registration data corresponding to the domain name of the website, and querying the corresponding domain name, according to the registration data, as the domain name to be identified.

The method according to claim 3, wherein the step of obtaining the registration data corresponding to the domain name of the website, and querying the corresponding domain name as the domain name to be identified according to the registration data comprises:

Obtaining registration data corresponding to the domain name of the website, and selecting, from the conversion logic library, conversion logic corresponding to the registration data;

Converting the registration data according to the conversion logic to obtain converted registration data;

Matching the converted registration data with the information data stored in the information repository; and

When the converted registration data matches information data stored in the information repository, obtaining the domain name associated with the matched information data as the domain name to be identified.

The method according to claim 1, wherein the step of obtaining, according to the obtained webpage data, a webpage that corresponds to the domain name to be identified and whose risk level is greater than the preset level comprises:

Matching the webpage data against the first filtering data stored in the preset blacklist, and when the webpage data matches the first filtering data, adding a suspicious tag to the domain name to be identified;

Matching the webpage data in the website corresponding to the domain name to be identified to which the suspicious tag has been added against the second filtering data stored in the preset whitelist; and

When the webpage data does not match the second filtering data, extracting the domain name to be identified that carries the suspicious tag, and obtaining the webpages in the website corresponding to that domain name as webpages whose risk level is greater than the preset level.

The method of claim 5, wherein the method further comprises:

Obtaining an identifier corresponding to the domain name to be identified when, after data identification against the preset blacklist and the preset whitelist, no domain name to be identified carries a suspicious tag;

Matching the identifier against the security identifiers pre-stored in the security identifier repository;

When a security identifier matches the identifier corresponding to the domain name to be identified, obtaining the secure domain name associated with the matched security identifier stored in the security identifier repository, and matching the secure domain name against the domain name to be identified; and

When the secure domain name does not match the domain name to be identified, treating the webpages in the website corresponding to the domain name to be identified as webpages whose risk level is greater than the preset level.

The method according to claim 1, wherein the step of obtaining, according to the acquired webpage data, a webpage that corresponds to the domain name to be identified and whose risk level is greater than the preset level further comprises:

Extracting keywords from the webpage data of the webpage whose risk level is greater than the preset level, and adding, according to the keywords, a corresponding category label to the domain name to be identified corresponding to that webpage;

Matching the category label of the domain name to be identified whose risk level is greater than the preset level against the stored category labels; and

When the match fails, adding the category label of the domain name to be identified whose risk level is greater than the preset level, and storing the webpage whose risk level is greater than the preset level under the category label.

A webpage identification device, characterized in that the device comprises:

a first acquiring module, configured to acquire a webpage whose identified risk level is greater than a preset level, and extract a website domain name corresponding to the webpage;

a second obtaining module, configured to obtain a network address corresponding to the website according to the website domain name;

a search module, configured to search for a domain name associated with the network address and, when a domain name associated with the network address is found, to use the associated domain name as the domain name to be identified;

a third obtaining module, configured to acquire webpage data in a website corresponding to the domain name to be identified;

an identification module, configured to obtain, according to the acquired webpage data, a webpage that corresponds to the domain name to be identified and whose risk level is greater than the preset level.

The device according to claim 8, wherein the searching module comprises:

a first matching unit, configured to match the network address against the network addresses pre-stored in the address association library;

a domain name obtaining unit, configured to obtain a to-be-matched associated domain name associated with the pre-stored network address when the network address matches a network address pre-stored in the address association library;

a time obtaining unit, configured to obtain a validity deadline of the to-be-matched associated domain name; and

an extracting unit, configured to extract the to-be-matched associated domain name as the domain name to be identified if the current time is less than or equal to the validity deadline.

A computer apparatus comprising a memory and a processor, the memory storing computer readable instructions, wherein the processor, when executing the computer readable instructions, implements the following steps:

The computer device according to claim 10, wherein the step of searching for the domain name associated with the network address implemented by the processor when the computer readable instructions are executed further comprises the step of:

The computer apparatus according to claim 10, wherein said processor further implements the following steps when said computer readable instructions are executed:

The computer device according to claim 12, wherein the step, implemented by the processor when executing the computer readable instructions, of obtaining the registration data corresponding to the domain name of the website and querying the corresponding domain name, according to the registration data, as the domain name to be identified further comprises the following steps:

The computer device according to claim 10, wherein the step, implemented by the processor when executing the computer readable instructions, of obtaining, according to the acquired webpage data, a webpage that corresponds to the domain name to be identified and whose risk level is greater than the preset level further comprises the following steps:

The computer apparatus according to claim 14, wherein said processor further implements the following steps when said computer readable instructions are executed:

The computer device according to claim 10, wherein, after the step of obtaining, according to the acquired webpage data, a webpage that corresponds to the domain name to be identified and whose risk level is greater than the preset level, the processor, when executing the computer readable instructions, further implements the following steps:

One or more non-transitory computer readable storage media storing computer readable instructions, wherein the computer readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:

The storage medium of claim 17, wherein the step, performed by the one or more processors when the computer readable instructions are executed, of searching for the domain name associated with the network address comprises:

The storage medium of claim 17, wherein the computer readable instructions, when executed by one or more processors, further cause the one or more processors to perform the following steps:

The storage medium of claim 19, wherein the step, performed by the one or more processors when the computer readable instructions are executed, of obtaining the registration data corresponding to the domain name of the website and querying the corresponding domain name, according to the registration data, as the domain name to be identified comprises: obtaining registration data corresponding to the domain name of the website, and selecting, from the conversion logic library, conversion logic corresponding to the registration data;