CN108304395B - Webpage cheating detection - Google Patents

Info

Publication number
CN108304395B
CN108304395B CN201610083780.XA CN201610083780A CN108304395B CN 108304395 B CN108304395 B CN 108304395B CN 201610083780 A CN201610083780 A CN 201610083780A CN 108304395 B CN108304395 B CN 108304395B
Authority
CN
China
Prior art keywords
webpage
cheating
domain
value
pagerank
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610083780.XA
Other languages
Chinese (zh)
Other versions
CN108304395A (en)
Inventor
王飞
蒋汉平
常智山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongkeji Big Data Technology Nanjing Co ltd
Original Assignee
BEIJING XUNAO TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING XUNAO TECHNOLOGY CO LTD
Priority to CN201610083780.XA
Publication of CN108304395A
Application granted
Publication of CN108304395B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

This webpage cheating detection method improves on existing methods for detecting cheating web pages. On the one hand, it raises the accuracy and recall of cheating-page detection, which improves the coverage and reliability of the seed set used by the TrustRank algorithm. On the other hand, it calculates a new PageRank value from the newly established link relations, which improves the relevance of web pages, raises the quality of the search engine, and serves users better.

Description

Webpage cheating detection
Technical Field
Cheating websites raise their PageRank values through link cheating, which seriously degrades the quality of a search engine, lengthens users' query time, and reduces users' trust in the search engine they use. This patent addresses how to identify cheating web pages and how to quantify the degree of cheating, which is one of the core technical problems in the search engine field.
Background
At present, the main methods for detecting cheating web pages include TrustRank and BadRank. These techniques have some effect but significant limitations. They rely on a trusted seed set (of high-quality pages or of spam pages) whose scores are propagated along links to rate every other page. This removes some of the influence of link farms on web pages, but it exposes several defects. First, the result depends excessively on the coverage and reliability of the seed set that is selected. Second, web pages in the seed set often obtain high rankings simply because they belong to the seed set. In addition, the link relations are incomplete and the crawled link data is noisy, so these algorithms do not perform well in practical applications.
Disclosure of Invention
The method is essentially a statistical model: it builds a webpage cheating detection model from the distribution of the web pages that link to each page. It first calculates the PageRank value of each web page from the original links, then collects the linked-in web pages of each web page and analyzes whether the PageRank values in that linked-in set follow a power-law distribution (for efficiency, an implementation may use the simplified 80/20 rule instead), and finally combines the result with the original PageRank value to give each web page a measure of trustworthiness or cheating. The specific steps are as follows (described here for the domain-based variant):
1. Crawl web pages, extract the URL-based link relations, and establish a URL-based link-out relation library.
2. Establish a domain-based link-out relation library from the URL-based link-out relation library.
3. Calculate the domain-based PageRank value of each web page from the domain-based link-out relation library.
4. Establish a domain-based link-in relation library from the link-out relation library.
5. Using the link-in relation library, collect the PageRank values of the domains that link to each domain and analyze the distribution of each linked-in set.
6. Calculate the degree of cheating of each web page from these statistics.
7. Set a threshold and select the cheating web pages.
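Steps 1 to 4 above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names (`domain_of`, `domain_out_links`, `invert`, `pagerank`), the use of the URL host name as the domain, the damping factor of 0.85 and the fixed iteration count are all assumptions. Domains with no outgoing links (dangling nodes) have their score shared among all domains on every iteration.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Reduce a URL to its host name (assumption: host name == domain)."""
    return urlsplit(url).hostname or ""


def domain_out_links(url_links):
    """Step 2: collapse (source URL, target URL) pairs into domain -> {domains}."""
    out = {}
    for src, dst in url_links:
        s, d = domain_of(src), domain_of(dst)
        if not s or not d:
            continue
        out.setdefault(s, set())
        out.setdefault(d, set())  # keep targets that never link out (dangling)
        if s != d:                # ignore links inside one domain
            out[s].add(d)
    return out


def invert(out_links):
    """Step 4: build the link-in library, domain -> {domains linking to it}."""
    inn = {d: set() for d in out_links}
    for src, targets in out_links.items():
        for dst in targets:
            inn[dst].add(src)
    return inn


def pagerank(out_links, damping=0.85, iterations=50):
    """Step 3: power iteration; dangling mass is spread over every domain."""
    nodes = list(out_links)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {d: 1.0 / n for d in nodes}
    for _ in range(iterations):
        dangling = sum(rank[d] for d in nodes if not out_links[d])
        base = (1.0 - damping) / n + damping * dangling / n
        new = {d: base for d in nodes}
        for src in nodes:
            targets = out_links[src]
            for dst in targets:
                new[dst] += damping * rank[src] / len(targets)
        rank = new
    return rank
```

A possible usage: pass the crawled `(source_url, target_url)` pairs to `domain_out_links`, feed the result to `pagerank`, and call `invert` on the same structure to obtain the linked-in set of each domain for the statistical analysis in step 5.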
FIG. 1 shows the processing flow of the webpage cheating detection, with the following notes:
a. When building the domain-based link-out relation library, the number of URLs contained in a domain is sometimes also taken into account, since a domain is in effect a combination of many URLs.
b. When the domain-based PageRank values are calculated on distributed machines, the link relation file is split into 1024 files by a modulo operation, so that each part fits in the memory of one machine and the calculation runs faster. To handle dangling nodes, the score of each dangling node is distributed across all web pages.
c. When the degree of cheating of each web page is calculated from the statistics, three factors are considered together: whether the 80/20 rule is satisfied, the degree to which it is satisfied or violated, and the PageRank value of the web page. The measure is given by the following formulas.
a) Given a domain name A whose PageRank value is PA, let SA be the sorted set of PageRank values of the domains that link to A. Let S1 be the sum of the top 20% of the values in SA and S2 the sum of the remaining 80%. The share of the top 20% in the sum over all linking domains is recorded as: ratio = S1 / (S1 + S2).
b) The degree of cheating of domain name A is defined as: Spam(A) = (ratio - 1) × PA. In practical terms, when Spam(A) is positive, the larger the value, the more likely the domain is normal; when Spam(A) is negative, the smaller the value, the more likely the domain is cheating.
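The two formulas and the final threshold step can be illustrated with a short Python sketch. The function names, the rounding rule for how many values count as the top 20% (rounded up, at least one), and the direction of the threshold test are assumptions. The constant subtracted from ratio is exposed as a hypothetical `baseline` parameter that defaults to 1, the value that appears in the text; with that default the score cannot be positive, because ratio never exceeds 1.

```python
import math


def top_share(in_ranks, top_fraction=0.2):
    """ratio = S1 / (S1 + S2) over the PageRank values linking to a domain."""
    ordered = sorted(in_ranks, reverse=True)
    if not ordered:
        return 0.0
    # Assumption: the "top 20%" is rounded up and holds at least one value.
    k = max(1, math.ceil(top_fraction * len(ordered)))
    s1 = sum(ordered[:k])   # sum of the top 20%
    s2 = sum(ordered[k:])   # sum of the remaining 80%
    total = s1 + s2
    return s1 / total if total > 0 else 0.0


def spam_score(pa, in_ranks, baseline=1.0):
    """Spam(A) = (ratio - baseline) * PA; lower values suggest cheating."""
    return (top_share(in_ranks) - baseline) * pa


def select_cheaters(scores, threshold):
    """Assumption: domains scoring below the threshold are flagged."""
    return sorted(d for d, s in scores.items() if s < threshold)
```

For a domain whose in-links are dominated by a few strong domains, for example ten linking domains where the top two hold 80% of the PageRank, `top_share` is about 0.8. For ten equally weighted in-links, as a link farm might produce, it is about 0.2, which gives a lower score for the same PA.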

Claims (1)

1. A webpage cheating detection method, characterized in that a model is built from statistics on the distribution of linked-in web pages, wherein the PageRank value of each web page is calculated on the basis of the original links: web pages are crawled, the URL-based link relations are extracted, a domain-based link-out relation library is established from the link-out relation library, and the PageRank value of each web page is calculated from the domain-based link-out relation library;
a domain-based link-in relation library is established from the link-out relation library, the linked-in web pages of each web page are counted, and it is analyzed whether the PageRank distribution in the linked-in set follows a power-law distribution or the 80/20 rule;
the result is combined with the original PageRank value to give a measure of the trustworthiness or cheating of each web page;
specifically: given a domain name A whose PageRank value is PA, the sorted set of PageRank values of the domains linking to A is SA, the sum of the first 20% of the values in SA is S1, the sum of the last 80% is S2, and the ratio of S1 to the sum over all linking domains is recorded as: ratio = S1 / (S1 + S2);
the degree of cheating of domain name A is defined as: Spam(A) = (ratio - 1) × PA, meaning that when Spam(A) is positive, the larger the value, the more likely the domain is normal, and when Spam(A) is negative, the smaller the value, the more likely the domain is cheating;
and finally, a threshold is set and the cheating web pages are selected.
CN201610083780.XA 2016-02-05 2016-02-05 Webpage cheating detection Active CN108304395B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610083780.XA CN108304395B (en) 2016-02-05 2016-02-05 Webpage cheating detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610083780.XA CN108304395B (en) 2016-02-05 2016-02-05 Webpage cheating detection

Publications (2)

Publication Number Publication Date
CN108304395A (en) 2018-07-20
CN108304395B (en) 2022-09-06

Family

ID=62870988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610083780.XA Active CN108304395B (en) 2016-02-05 2016-02-05 Webpage cheating detection

Country Status (1)

Country Link
CN (1) CN108304395B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101493819A (en) * 2008-01-24 2009-07-29 中国科学院自动化研究所 Method for optimizing detection of search engine cheat
CN102663101A (en) * 2012-04-13 2012-09-12 北京交通大学 Sina microblog-based user grade sequencing algorithm

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7533092B2 (en) * 2004-10-28 2009-05-12 Yahoo! Inc. Link-based spam detection
US20060112089A1 (en) * 2004-11-22 2006-05-25 International Business Machines Corporation Methods and apparatus for assessing web page decay
US7624104B2 (en) * 2006-06-22 2009-11-24 Yahoo! Inc. User-sensitive pagerank
US7974970B2 (en) * 2008-10-09 2011-07-05 Yahoo! Inc. Detection of undesirable web pages
US9092516B2 (en) * 2011-06-20 2015-07-28 Primal Fusion Inc. Identifying information of interest based on user preferences
CN102456064B (en) * 2011-04-25 2013-04-24 中国人民解放军国防科学技术大学 Method for realizing community discovery in social networking
CN102508859B (en) * 2011-09-29 2014-10-29 北京亿赞普网络技术有限公司 Advertisement classification method and device based on webpage characteristic
CN102693308B (en) * 2012-05-24 2014-02-12 北京迅奥科技有限公司 Cache method for real time search
CN104092567B (en) * 2014-06-26 2017-10-27 华为技术有限公司 Determine the method and apparatus of the influence power sequence of user
CN105183784B (en) * 2015-08-14 2020-04-28 天津大学 Content-based spam webpage detection method and detection device thereof

Also Published As

Publication number Publication date
CN108304395A (en) 2018-07-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
DD01 Delivery of document by public notice

Addressee: Beijing Xunao Technology Co.,Ltd.

Document name: Notification of Publication and of Entering the Substantive Examination Stage of the Application for Invention

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230113

Address after: 210000 201, Floor 2, Building 34, Chaoyang Xiyuan Business Building, Banqiao Street, Yuhuatai District, Nanjing, Jiangsu Province

Patentee after: Zhongkeji big data technology (Nanjing) Co.,Ltd.

Address before: Room 506, East Building, No. 15, North West Fourth Ring Road, Haidian District, Beijing 100097

Patentee before: BEIJING XUNAO TECHNOLOGY Co.,Ltd.
