CN108304395B

CN108304395B - Webpage cheating detection

Info

Publication number: CN108304395B
Application number: CN201610083780.XA
Authority: CN
Inventors: 王飞; 蒋汉平; 常智山
Original assignee: BEIJING XUNAO TECHNOLOGY CO LTD
Current assignee: Zhongkeji Big Data Technology Nanjing Co ltd
Priority date: 2016-02-05
Filing date: 2016-02-05
Publication date: 2022-09-06
Anticipated expiration: 2036-02-05
Also published as: CN108304395A

Abstract

This webpage cheating detection method improves on existing methods for detecting cheating (spam) web pages. On the one hand, it raises the precision and recall of cheating-page detection, which in turn improves the coverage and reliability of the seed set used by the TrustRank algorithm. On the other hand, it computes new PageRank values from a newly established link relation, which improves the relevance of the returned web pages and the quality of the search engine, so that users are better served.

Description

Webpage cheating detection

Technical Field

Cheating websites raise their PageRank values through link cheating (link spam), which seriously degrades search engine quality, lengthens users' query time, and reduces users' trust in the search engine they use. This patent addresses the problem of how to identify cheating web pages and quantify their degree of cheating, which is one of the core technical problems in the search engine field.

Background

At present, the main methods for detecting cheating web pages include TrustRank, BadRank, and the like. These techniques are somewhat effective but have significant limitations. They rely on a trusted seed set (of high-quality pages or of spam pages) whose scores are propagated along links to evaluate every other page. This reduces the influence of link farms on page ranking, but it exposes several defects. First, the methods depend heavily on how the seed set is selected, which raises problems of coverage and reliability. Second, pages in the seed set often obtain high rankings simply because the seed set boosts them. In addition, the crawled link relation is incomplete and the data is noisy, so the algorithms do not perform well in practical applications.

Disclosure of Invention

The method is essentially based on a statistical model: it builds a webpage cheating detection model from statistics on the distribution of in-linked web pages. The method first computes the PageRank value of each web page from the original links, then collects the in-linked pages of each web page and analyzes whether the PageRank distribution within that in-link set follows a power-law distribution (in an implementation, the simplified 80/20 law can be used for efficiency), and finally combines this with the original PageRank value to give a measure of the reliability or cheating of each web page. The specific steps are as follows (described here for the domain-based variant):

1. Crawl web pages, extract the URL-based network link relation, and build a URL-based out-link relation library.

2. Build a domain-based out-link relation library from the URL-based out-link relation library.

3. Compute the domain-based PageRank value of each web page using the domain-based out-link relation library.

4. Build a domain-based in-link relation library from the out-link relation library.

5. Using the in-link relation library, collect the PageRank values of the domains that link to each domain and analyze the distribution of that in-link set.

6. Compute the cheating size of each web page from this statistical information.

7. Set a threshold and select the cheating web pages.
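Steps 1 to 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function names, the use of the URL host name as the "domain", the dropping of links that stay inside one domain, the damping factor of 0.85, the fixed iteration count, and the even spreading of dangling-node scores are all choices made for this sketch. The description itself only says that the score of a dangling node is distributed to each web page.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Host name of a URL, used here as a stand-in for its domain (assumption)."""
    return urlsplit(url).hostname or ""


def build_domain_out_links(url_links):
    """Collapse (source_url, target_url) pairs into domain -> set of out-linked domains."""
    out_links = {}
    for src, dst in url_links:
        s, d = domain_of(src), domain_of(dst)
        if not s or not d:
            continue
        targets = out_links.setdefault(s, set())
        if s != d:  # assumption: links inside one domain are ignored
            targets.add(d)
    return out_links


def invert_links(out_links):
    """Build the in-link library (domain -> set of domains linking to it)."""
    in_links = {}
    for src, targets in out_links.items():
        in_links.setdefault(src, set())
        for dst in targets:
            in_links.setdefault(dst, set()).add(src)
    return in_links


def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank; dangling-node scores are spread evenly over all nodes."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Total score held by nodes that have no out-links.
        dangling = sum(rank[node] for node in nodes if not out_links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank
```

A single-machine dictionary is used here for readability; the description instead partitions the link relation file across distributed machines.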

FIG. 1 is the processing flow chart of the webpage cheating detection. The following points apply:

a. When building the domain-based out-link relation library, we sometimes also take into account the number of URLs contained in the domain (since a domain is then equivalent to a combination of many URLs).

b. When the domain-based PageRank values are computed on distributed machines, the link relation file is partitioned by modulo into 1024 files, so that each partition fits in the memory of one machine and the computation runs faster. To handle dangling nodes, we distribute the score of each dangling node to every web page.

c. When computing the cheating size of each web page from the statistical information, three factors are considered together: whether the in-link set satisfies the 80/20 law, the degree to which it does or does not, and the PageRank value of the web page itself. The measure is given by the following formula.

a) Given a domain name A with PageRank value PA, let SA be the sorted set of PageRank values of the domains that link to A. Let S1 be the sum of the top 20% of the web page domains in SA and S2 the sum of the remaining 80%. The ratio of S1 to the sum over all web page domains is recorded as: ratio = S1 / (S1 + S2).

b) The cheating size of domain name A is defined as: spam(A) = (ratio - 1) × PA. Its physical meaning is as follows: when spam(A) is positive, the larger the value, the more likely the domain is normal; when spam(A) is negative, the smaller the value, the more likely the domain is cheating.
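The measure in a) and b), together with the thresholding of step 7, can be written out as a short Python sketch. The function names, the rounding rule for "top 20%" (floor, with at least one element), the value returned for a domain with no in-links, and the direction of the threshold comparison are assumptions made here, not details given in the description. The formula is implemented exactly as printed; since ratio lies between 0 and 1, the printed expression never yields a positive value, so in this sketch domains are simply ranked by how negative their score is.

```python
def top_share(in_link_ranks, top_fraction=0.2):
    """ratio = S1 / (S1 + S2), where S1 sums the largest 20% of the values."""
    ordered = sorted(in_link_ranks, reverse=True)
    if not ordered:
        return 0.0  # assumption: no in-links means no concentration
    k = max(1, int(len(ordered) * top_fraction))  # assumption: floor, at least 1
    s1 = sum(ordered[:k])
    s2 = sum(ordered[k:])
    total = s1 + s2
    return s1 / total if total > 0 else 0.0


def spam_score(pa, in_link_ranks):
    """spam(A) = (ratio - 1) * PA, as printed in the description."""
    return (top_share(in_link_ranks) - 1.0) * pa


def select_cheaters(scores, threshold):
    """Return domains whose score falls below the threshold (smaller = more suspicious)."""
    return sorted(domain for domain, score in scores.items() if score < threshold)
```

With these definitions, a domain whose in-links all carry nearly equal PageRank (as a link farm might) gets a ratio near 0.2 and a score near -0.8 × PA, while a domain whose in-link PageRank is concentrated in a few strong linkers gets a ratio closer to 1 and a score closer to 0.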

Claims

1. A webpage cheating detection method, characterized in that a model is built by collecting statistics on the distribution of in-linked web pages, wherein the PageRank value of each web page is calculated on the basis of the original links, the PageRank value of each web page being obtained by crawling web pages, extracting the URL-based network link relation, establishing a domain-based out-link relation library from the out-link relation library, and calculating with the domain-based out-link relation library;

establishing a domain-based in-link relation library from the out-link relation library, collecting statistics on the in-linked web pages of each web page, and analyzing whether the PageRank distribution in the in-link set follows a power-law distribution or the 80/20 distribution law;

combining the original PageRank value to give a measure of the reliability or cheating of each web page;

specifically: given a domain name A whose PageRank value is PA, the sorted set of the PageRank values of the domain names linking to A is SA, the sum of the first 20% of the web page domains in SA is S1, the sum of the last 80% of the web page domains is S2, and the ratio of S1 to the sum of all the web page domains is recorded as: ratio = S1 / (S1 + S2);

the cheating size of domain name A is defined as: spam(A) = (ratio - 1) × PA, whose physical meaning is that when spam(A) is positive, the larger the value, the greater the likelihood of being normal; when spam(A) is negative, the smaller the value, the greater the likelihood of cheating;

and finally, setting a threshold and selecting the cheating web pages.