WO2019148712A1 - Phishing website detection method, device, computer equipment and storage medium - Google Patents

Phishing website detection method, device, computer equipment and storage medium

Info

Publication number
WO2019148712A1
Authority
WO
WIPO (PCT)
Prior art keywords
website
phishing
phishing website
keyword
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2018/088935
Other languages
French (fr)
Chinese (zh)
Inventor
王元铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Publication of WO2019148712A1
Anticipated expiration
Legal status: Ceased

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/16 Implementing security features at a particular protocol layer
    • H04L63/168 Implementing security features at a particular protocol layer above the transport layer
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing

Definitions

  • The application relates to a phishing website detection method, apparatus, computer device, and storage medium.
  • The inventor realized that traditional phishing website detection relies on keywords in the webpage content of a website, and that detection by keywords alone is not comprehensive enough.
  • As a result, the detection accuracy for phishing websites is low.
  • According to various embodiments disclosed herein, a phishing website detection method, apparatus, computer device, and storage medium are provided.
  • a phishing website detection method includes:
  • the phishing website warning information is returned to the terminal.
  • a phishing website detecting device includes:
  • An access request receiving module configured to receive a website access request sent by the terminal
  • a website data query module configured to query website data according to the website access request
  • a model tree building module configured to build a document object model tree according to the queried website data
  • a model tree comparison module configured to compare the constructed document object model tree with a pre-stored phishing website model tree to obtain a comparison result
  • a phishing website determining module configured to determine, according to the comparison result, whether the website corresponding to the website access request is a phishing website
  • the warning information returning module is configured to return a phishing website warning message to the terminal when determining that the website corresponding to the website access request is a phishing website.
  • A computer device comprising a memory and one or more processors, the memory having stored therein computer readable instructions that, when executed by the one or more processors, cause the one or more processors to execute the following steps:
  • the phishing website warning information is returned to the terminal.
  • One or more non-volatile storage media storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the phishing website warning information is returned to the terminal.
  • FIG. 1 is an application scenario diagram of a phishing website detection method according to one or more embodiments.
  • FIG. 2 is a flow chart of a method for detecting a phishing website according to one or more embodiments.
  • FIG. 3 is a flow diagram of the steps of extracting a phishing website keyword in accordance with one or more embodiments.
  • FIG. 4 is a flow diagram of the steps of detecting a suspected phishing website in accordance with one or more embodiments.
  • FIG. 5 is a flow diagram showing the steps of determining weight values for each phishing website keyword in accordance with one or more embodiments.
  • FIG. 6 is a flow diagram showing the steps of calculating the similarity between website text information and phishing website text information in accordance with one or more embodiments.
  • FIG. 7 is a block diagram of a phishing website detecting apparatus in accordance with one or more embodiments.
  • FIG. 8 is a block diagram of a phishing website detecting apparatus in accordance with another embodiment.
  • FIG. 9 is a block diagram of a computer device in accordance with one or more embodiments.
  • Terminal 102 communicates with server 104 over a network.
  • the terminal 102 can be, but is not limited to, various personal computers, notebook computers, smart phones, tablets, and portable wearable devices, and the server 104 can be implemented with a stand-alone server or a server cluster composed of a plurality of servers.
  • a phishing website detection method is provided.
  • Taking the application of the method to the server in FIG. 1 as an example, the method includes the following steps:
  • the user inputs the website address in the website access page displayed by the terminal, the terminal obtains the input website address, and generates a website access request according to the website address.
  • the terminal sends a website access request to the server.
  • the server receives a website access request sent by the terminal.
  • the user inputs a website identifier on the website access page displayed by the terminal, and the terminal obtains the website address according to the input website identifier, and generates a website access request according to the obtained website address.
  • the server parses the website access request, extracts the website address from it, and queries the website data corresponding to the extracted website address.
  • Website data includes website frame data and website text information.
  • the website frame data is used to construct the document object model tree of the website, and the website text information is the content-related data displayed in the webpages of the website.
  • the Document Object Model (DOM) tree is a tree structure in which each element in a web page is regarded as an object and the nodes are arranged according to the dependency relationships among the elements; each node of the tree contains only the element tag.
  • the server identifies the website frame data and the website text information in the website data according to the element tags, uses the element tags in the identified website frame data as tree nodes, and builds the document object model tree from these tree nodes according to the dependency relationships among the elements in the website frame data.
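  • The tag-only tree described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: it assumes the website data is an HTML string and uses Python's standard `html.parser`; the names `TagTreeBuilder`, `build_tag_tree`, and `VOID_TAGS` are hypothetical. Text content is ignored, so only element tags become nodes.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagTreeBuilder(HTMLParser):
    """Builds a tag-only tree in which each node is (tag, [children])."""

    def __init__(self):
        super().__init__()
        self.root = ("#document", [])
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        # Attach the new node to the innermost element that is still open.
        self._stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def build_tag_tree(html):
    """Return the tag-only tree for an HTML string (text nodes are dropped)."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

  • Because attributes and text are discarded, two pages that differ only in displayed content yield the same tree under this sketch.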
  • a plurality of phishing website model trees are pre-stored in the server.
  • the server compares the constructed document object model tree with the pre-stored phishing website model trees and obtains a comparison result.
  • the comparison result may be either that a phishing website model tree identical to the constructed document object model tree exists among the pre-stored phishing website model trees, or that no such tree exists.
  • if the comparison result is that a phishing website model tree consistent with the constructed document object model tree exists among the pre-stored phishing website model trees, the server determines that the website corresponding to the website access request is a phishing website; if no such tree exists, the server determines that the website is not a phishing website.
  • the comparison result may be 1 or 0.
  • "1" indicates that a phishing website model tree consistent with the constructed document object model tree exists among the pre-stored phishing website model trees; "0" indicates that no such tree exists among the pre-stored phishing website model trees.
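  • One way to sketch the comparison step is shown below. It assumes each tree is a nested `(tag, [children])` tuple and treats "consistent" as exact structural equality, which is an interpretation rather than something the text specifies; `canonical` and `compare_with_stored` are hypothetical names.

```python
def canonical(node):
    """Serialize a (tag, children) tree; structurally equal trees give equal strings."""
    tag, children = node
    return tag + "(" + ",".join(canonical(child) for child in children) + ")"


def compare_with_stored(tree, stored_trees):
    """Return 1 if a pre-stored phishing model tree is identical to `tree`, else 0."""
    stored_keys = {canonical(t) for t in stored_trees}
    return 1 if canonical(tree) in stored_keys else 0
```

  • Serializing each tree to a string makes the stored trees usable as set members, so a lookup does not need to walk every stored tree node by node.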
  • when determining that the website corresponding to the website access request is a phishing website, the server extracts the terminal address and the website address from the website access request, generates phishing website warning information according to the website address, and sends the warning information to the terminal corresponding to the terminal address.
  • when determining that the website corresponding to the website access request is a phishing website, the server extracts the website address from the website access request and marks it as a phishing website address.
  • the server extracts the terminal address and the website address from the website access request and detects whether the website address is marked as a phishing website address; if it is, the server generates phishing website warning information according to the website address and sends the warning information to the terminal corresponding to the terminal address.
  • the website data is queried according to the website access request, and the document object model tree is constructed from the queried website data. Most phishing websites share the same document object model tree and merely replace the webpage content to disguise themselves as different websites.
  • the constructed document object model tree is therefore compared with the pre-stored phishing website model trees to determine whether the website corresponding to the website access request is a phishing website. Detecting phishing websites by the document object model tree of the website frame is more comprehensive and improves the accuracy of phishing website detection.
  • the step S202 specifically includes the step of extracting a phishing website keyword, and the step specifically includes the following content:
  • the server receives the phishing website data uploaded by the terminal, and stores the uploaded phishing website data to obtain pre-stored phishing website data.
  • in the phishing website data pre-stored by the server, the phishing website text information and the phishing website model tree data are identified; the identified phishing website text information is extracted, and the phishing website model tree is obtained from the identified phishing website model tree data.
  • the server receives the phishing website addresses sent by the terminal, obtains the phishing website data of each phishing website according to its address, and stores the obtained data to obtain the pre-stored phishing website data corresponding to each phishing website.
  • the server extracts the phishing website text information and the phishing website model tree for the pre-stored phishing website data corresponding to each phishing website, and stores the extracted phishing website text information and the phishing website model tree corresponding to the website address.
  • the server matches the extracted phishing website text information according to the split vocabulary, extracts the matched words, and obtains a phishing website split word.
  • the server splits the phishing website text information of each phishing website according to the split vocabulary, and obtains the phishing website split word by splitting.
  • the word frequency is the frequency of occurrence of a word in the phishing website text information.
  • the server counts the total number of phishing website split words obtained by splitting, reads each phishing website split word, and counts the number of occurrences of that split word in the phishing website text information. The number of occurrences of each phishing website split word is divided by the total number of phishing website split words to obtain the word frequency of each phishing website split word.
  • a word frequency threshold is preset in the server.
  • the server compares the word frequency of each phishing website split word with the preset word frequency threshold, determines the phishing website split word whose word frequency exceeds the preset word frequency threshold, and extracts the determined phishing website split word as the phishing website keyword.
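  • The word frequency calculation and threshold comparison can be sketched as follows. The sketch assumes the text has already been split into a list of words; `word_frequencies` and `extract_keywords` are hypothetical names, and the threshold value used in any call is illustrative only.

```python
from collections import Counter


def word_frequencies(split_words):
    """Map each split word to (occurrences of the word) / (total split words)."""
    total = len(split_words)
    if total == 0:
        return {}
    counts = Counter(split_words)
    return {word: count / total for word, count in counts.items()}


def extract_keywords(split_words, threshold):
    """Return the split words whose word frequency exceeds the threshold."""
    freqs = word_frequencies(split_words)
    return sorted(word for word, freq in freqs.items() if freq > threshold)
```

  • For example, in a list of eight split words where "login" appears three times and "bank" twice, a threshold of 0.2 keeps both, since 3/8 and 2/8 each exceed it.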
  • S308 specifically includes the following steps: identifying stop words in the phishing website split words according to a preset stop word table; deleting the identified stop words from the phishing website split words to obtain the remaining phishing website split words; and extracting the phishing website keywords from the remaining phishing website split words according to the counted word frequency.
  • Stop words are words that, in information retrieval, are automatically filtered out before or after processing natural language data or text in order to save storage space and improve search efficiency. They are typically words that appear frequently but carry no substantial meaning, such as “this”, “that”, and “of”.
  • a preset stop word table is stored in the server.
  • the server calls the preset stop word table to identify stop words in the phishing website split words; any stop word identified is deleted from the phishing website split words to obtain the remaining phishing website split words.
  • the split words whose word frequency exceeds the preset word frequency threshold are then extracted from the remaining phishing website split words as the phishing website keywords.
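  • Stop word removal can be sketched as a simple filter. The function name `remove_stop_words` is hypothetical, and matching case-insensitively is an assumption of this sketch rather than something stated in the text.

```python
def remove_stop_words(split_words, stop_words):
    """Return the split words that are not in the stop word table."""
    # Assumption: stop words are matched without regard to letter case.
    stop = {word.lower() for word in stop_words}
    return [word for word in split_words if word.lower() not in stop]
```

  • The remaining split words keep their original order and case, so word frequencies can be counted on the filtered list afterwards.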
  • S310 Store the extracted phishing website keywords to obtain pre-stored phishing website keywords.
  • the server stores the extracted phishing website keywords into the phishing website keyword library, and uses the phishing website keywords in the phishing website keyword library as the pre-stored phishing website keywords.
  • the server extracts the pre-stored phishing website keywords, queries them in the website data, counts the number of pre-stored phishing website keywords found in the website data, and compares the counted number with a preset threshold. If the counted number is greater than the preset threshold, the website corresponding to the website access request is detected as a suspected phishing website.
  • the server extracts the pre-stored phishing website keywords, counts the total number of pre-stored phishing website keywords, queries the pre-stored phishing website keywords in the website data, and counts the number of pre-stored phishing website keywords found in the website data.
  • the server divides the counted number by the total number of pre-stored phishing website keywords to obtain the occurrence ratio of the phishing website keywords.
  • the occurrence ratio of the phishing website keywords is compared with a preset occurrence ratio threshold. When the occurrence ratio is greater than the preset occurrence ratio threshold, the website corresponding to the website access request is detected as a suspected phishing website.
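  • The occurrence ratio check can be sketched as below. The sketch assumes the website data has been reduced to a list of words; `keyword_occurrence_ratio` and `is_suspected` are hypothetical names, and the threshold is a parameter chosen by the caller.

```python
def keyword_occurrence_ratio(website_words, stored_keywords):
    """Fraction of pre-stored phishing keywords that appear in the website words."""
    stored = set(stored_keywords)
    if not stored:
        return 0.0
    found = stored & set(website_words)
    return len(found) / len(stored)


def is_suspected(website_words, stored_keywords, ratio_threshold):
    """True when the occurrence ratio is greater than the preset ratio threshold."""
    return keyword_occurrence_ratio(website_words, stored_keywords) > ratio_threshold
```

  • With four pre-stored keywords of which two appear on the page, the ratio is 0.5, so the page is flagged under a threshold of 0.4 but not under 0.5.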
  • the phishing website text information is split to obtain phishing website split words, phishing website keywords are extracted from the split words according to word frequency, and the extracted phishing website keywords are used to query the queried website data.
  • S316 specifically includes the step of detecting a suspected phishing website, and the step specifically includes the following contents:
  • the server identifies the webpage tags in the website data and determines whether each webpage tag contains text information. If a webpage tag contains text information, the text information is extracted; once the text information contained in all the webpage tags in the website data has been extracted, the website text information is obtained.
  • the server performs word splitting on the extracted website text information to obtain the text split words of the website text information, identifies the stop words in the text split words, and deletes the identified stop words from the text split words to obtain the remaining text split words.
  • the server queries the pre-stored phishing website keywords in the remaining text split words and takes the pre-stored phishing website keywords that are found as the phishing website keywords contained in the website text information.
  • the method further includes: calculating the similarity between the extracted website text information and the phishing website text information according to the detected phishing website keyword and the pre-stored phishing website keyword; When the similarity is greater than the preset similarity threshold, S406 is performed.
  • S406 Count the word frequency of each detected phishing website keyword in the extracted website text information.
  • the server counts the number of occurrences of each phishing website keyword in the remaining text split words and the total number of remaining text split words, and divides the number of occurrences of each phishing website keyword by the total number of split words to obtain the word frequency of each phishing website keyword.
  • that is, for a given keyword, if the website text information includes M split words and the keyword appears N times, the word frequency of the keyword is N/M.
  • the server compares the counted word frequency of each phishing website keyword with a preset word frequency threshold. If the word frequency of each phishing website keyword is greater than the preset word frequency threshold, the website corresponding to the website access request is detected as a suspected phishing website, and the step of constructing a document object model tree according to the queried website data is performed.
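  • The word frequency test on the visited website can be sketched as follows, assuming the website text has already been split and stripped of stop words. The function names are hypothetical, and treating a page with no detected keywords as not suspected is an assumption of this sketch.

```python
from collections import Counter


def detected_keyword_frequencies(text_words, stored_keywords):
    """Word frequency N/M for each pre-stored keyword found in the text words."""
    total = len(text_words)
    counts = Counter(text_words)
    return {kw: counts[kw] / total for kw in set(stored_keywords) if counts[kw] > 0}


def is_suspected_by_frequency(text_words, stored_keywords, freq_threshold):
    """True when every detected keyword has a word frequency above the threshold."""
    freqs = detected_keyword_frequencies(text_words, stored_keywords)
    # Assumption: a page containing none of the keywords is not suspected.
    return bool(freqs) and all(f > freq_threshold for f in freqs.values())
```

  • Only when this check returns true would the more expensive document object model tree comparison be carried out.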
  • the phishing website keywords contained in the extracted website text information are detected according to the pre-stored phishing website keywords, the word frequency of each detected keyword in the website text information is counted, and whether the website is a suspected phishing website is determined from these word frequencies. When the website is detected as a suspected phishing website, the document object model tree is then checked, so that multiple detection stages together improve the accuracy of phishing website detection.
  • the method further includes the step of determining a weight value of each phishing website keyword, and the step specifically includes the following content:
  • S502 Count the total number of phishing websites corresponding to the pre-stored phishing website data and the number of phishing websites corresponding to each phishing website keyword.
  • the server extracts the phishing website identifiers corresponding to the pre-stored phishing website data, counts the extracted identifiers, and uses the count as the total number of phishing websites.
  • the server reads each phishing website keyword, searches for it in the phishing website data corresponding to each phishing website identifier, and counts the number of phishing website identifiers whose phishing website data contains the keyword.
  • the count is stored in association with the phishing website keyword and is used as the number of phishing websites corresponding to that keyword.
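  • The two counts in S502 can be sketched as below. The sketch assumes the pre-stored data is available as a mapping from phishing website identifier to the set of words on that site; `count_sites_per_keyword` and the sample identifiers are hypothetical.

```python
def count_sites_per_keyword(site_words, keywords):
    """Return (total number of sites, {keyword: number of sites containing it})."""
    total_sites = len(site_words)
    per_keyword = {
        kw: sum(1 for words in site_words.values() if kw in words)
        for kw in keywords
    }
    return total_sites, per_keyword
```

  • The returned values correspond to the total number of phishing websites and the per-keyword website counts that the weight calculation in S504 takes as input.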
  • S504 Determine a weight value corresponding to each phishing website keyword according to the total number of phishing websites and the number of phishing websites corresponding to each phishing website keyword.
  • IDF denotes the weight value corresponding to a phishing website keyword.
  • D denotes the total number of phishing websites.
  • D_W denotes the number of phishing websites corresponding to the phishing website keyword.
  • the server stores the weight values corresponding to the keywords of each phishing website.
  • the weight value corresponding to each phishing website keyword is calculated such that the more phishing websites correspond to a phishing website keyword, the higher the weight value of that keyword. Determining the weight value according to the importance of the keyword improves the accuracy of the weight value.
  • calculating the similarity between the extracted website text information and the phishing website text information according to the detected phishing website keywords and the pre-stored phishing website keywords specifically includes the following:
  • the server stores the pre-stored phishing website keywords together with the weight value corresponding to each pre-stored phishing website keyword.
  • the server queries each of the pre-stored phishing website keywords and the weight value corresponding to each pre-stored phishing website keyword.
  • the server sorts the pre-stored phishing website keywords, arranges the corresponding weight values in the same order, and constructs the phishing website feature vector corresponding to the pre-stored phishing website keywords.
  • based on the phishing website feature vector, the server places the weight value of each detected phishing website keyword at the corresponding position of the vector and sets the weight value of each undetected phishing website keyword to zero, obtaining the visited website feature vector corresponding to the detected phishing website keywords.
  • the corresponding weight values are 50, 40, 30, 20, and 10
  • the phishing website keywords corresponding to the phishing website keywords are pre-stored.
  • the similarity between the visited website feature vector and the phishing website feature vector is calculated, and the calculated similarity is used as the similarity between the extracted website text information and the pre-stored phishing website text information.
  • the similarity between the phishing website feature vector A and the visited website feature vector B is denoted Sim(A, B).
  • Sim(A, B) is calculated according to the following formula, where 1 ≤ k ≤ n:
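  • The formula itself is not reproduced in this text. The sketch below builds the two feature vectors as described and, as an assumption, uses cosine similarity over the n vector components, which is one common choice for comparing weight vectors and may differ from the formula of the application. The names `build_vectors` and `cosine_similarity` and the sample keywords are hypothetical.

```python
import math


def build_vectors(stored_weights, detected_keywords):
    """Return (A, B): the phishing website vector and the visited website vector."""
    keywords = sorted(stored_weights)  # fixed keyword order shared by both vectors
    detected = set(detected_keywords)
    a = [stored_weights[k] for k in keywords]
    # Undetected keywords contribute a weight of zero to the visited website vector.
    b = [stored_weights[k] if k in detected else 0 for k in keywords]
    return a, b


def cosine_similarity(a, b):
    """Assumed similarity measure: dot(A, B) / (|A| * |B|), or 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

  • Under this assumed measure, a page that contains every pre-stored keyword scores 1, and a page that contains none scores 0.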
  • Although the steps in the flowcharts of FIGS. 2-6 are displayed sequentially as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-6 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be executed at different times; nor are these sub-steps or stages necessarily performed sequentially, as they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
  • a phishing website detecting apparatus 700 is provided, including: an access request receiving module 702, a website data query module 704, a model tree building module 706, a model tree comparison module 708, a phishing website determining module 710, and a warning information returning module 712, wherein:
  • the access request receiving module 702 is configured to receive a website access request sent by the terminal.
  • the website data query module 704 is configured to query the website data according to the website access request.
  • the model tree building module 706 is configured to construct a document object model tree according to the queried website data.
  • the model tree comparison module 708 is configured to compare the constructed document object model tree with the pre-stored phishing website model tree to obtain a comparison result.
  • the phishing website determining module 710 is configured to determine, according to the comparison result, whether the website corresponding to the website access request is a phishing website.
  • the warning information returning module 712 is configured to return the phishing website warning information to the terminal when determining that the website corresponding to the website access request is a phishing website.
  • the phishing website detecting apparatus 700 further includes: a model tree extraction module 714, a text information splitting module 716, a word frequency statistics module 718, a keyword extraction module 720, a keyword storage module 722, and a suspected website detection module 724, wherein:
  • the model tree extraction module 714 is configured to extract and store the phishing website text information and the phishing website model tree from the pre-stored phishing website data.
  • the text information splitting module 716 is configured to split the phishing website text information to obtain a phishing website split word.
  • the word frequency statistics module 718 is configured to count the word frequency of the split words of each phishing website in the phishing website text information.
  • the keyword extraction module 720 is configured to extract a phishing website keyword from the phishing website split word according to the statistical word frequency.
  • the keyword storage module 722 is configured to store the extracted phishing website keywords to obtain pre-stored phishing website keywords.
  • the suspected website detecting module 724 is configured to detect, according to the pre-stored phishing website keyword and the website data, whether the website corresponding to the website access request is a suspected phishing website;
  • the model tree construction module 706 is further configured to construct a document object model tree according to the queried website data when detecting that the website corresponding to the website access request is a suspected phishing website.
  • the keyword extraction module 720 is further configured to identify stop words in the phishing website split words according to the preset stop word table; delete the identified stop words from the phishing website split words to obtain the remaining phishing website split words; and extract the phishing website keywords from the remaining phishing website split words according to the counted word frequency.
  • the suspected website detection module 724 is further configured to extract the website text information from the website data; detect the phishing website keywords contained in the extracted website text information according to the pre-stored phishing website keywords; count the word frequency of each detected phishing website keyword in the extracted website text information; and, when the word frequency of each phishing website keyword is greater than the preset word frequency threshold, detect the website corresponding to the website access request as a suspected phishing website.
  • the suspected website detection module 724 calculates the similarity between the extracted website text information and the phishing website text information according to the detected phishing website keywords and the pre-stored phishing website keywords; when the calculated similarity is greater than the preset similarity threshold, the word frequency of each detected phishing website keyword in the extracted website text information is counted.
  • the phishing website detecting apparatus 700 further includes: a weight value determining module.
  • the weight value determining module is configured to count the total number of phishing websites corresponding to the pre-stored phishing website data and the number of phishing websites corresponding to each phishing website keyword; and determine the weight value of each phishing website keyword according to the total number of phishing websites and the number of phishing websites corresponding to that keyword.
  • the suspected website detection module 724 is further configured to query the weight values corresponding to the pre-stored phishing website keywords; construct, according to the queried weight values, a visited-website feature vector corresponding to the detected phishing website keywords and a phishing website feature vector corresponding to the pre-stored phishing website keywords; and determine the similarity between the extracted website text information and the pre-stored phishing website text information according to the visited-website feature vector and the phishing website feature vector.
  • the website data is queried according to the website access request, and a document object model tree is constructed from the queried website data. Most phishing websites share the same document object model tree and only replace the page content in order to masquerade as different websites.
  • the constructed document object model tree is compared with the pre-stored phishing website model trees to determine whether the website corresponding to the website access request is a phishing website; detecting phishing websites from the document object model tree of the website framework is more comprehensive and improves the accuracy of phishing website detection.
  • each module of the above-described phishing website detecting apparatus 700 may be implemented in whole or in part by software, hardware, or a combination thereof.
  • Each of the above modules may be embedded in or independent of the processor in the computer device, or may be stored in a memory in the computer device in a software form, so that the processor can invoke the operations corresponding to the above modules.
  • a computer device which may be a server, and its internal structure diagram may be as shown in FIG.
  • the computer device includes a processor, memory, network interface, and database connected by a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for running the operating system and the computer readable instructions stored in the non-volatile storage medium.
  • the database of the computer device is used to store phishing website data.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • the computer readable instructions are executed by the processor to implement a phishing website detection method.
  • FIG. 9 is only a block diagram of the part of the structure related to the solution of the present application, and does not limit the computer device to which the solution is applied.
  • a specific computer device may include more or fewer components than those shown in the figure, combine certain components, or have a different arrangement of components.
  • a computer apparatus comprising a memory and one or more processors having stored therein computer readable instructions that, when executed by a processor, cause one or more processors.
  • non-volatile storage media storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to implement the steps of the phishing website detection method provided in any embodiment of the present application.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of formats, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Bioethics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A phishing website detection method, comprising: receiving a website access request sent by a terminal; querying website data according to the website access request; constructing a document object model tree according to the queried website data; comparing the constructed document object model tree to a pre-stored phishing website model tree, and obtaining a comparison result; determining whether a website corresponding to the website access request is a phishing website according to the comparison result; and returning phishing website warning information to the terminal when determining that the website corresponding to the website access request is a phishing website.

Description

Phishing website detection method, apparatus, computer device and storage medium

Cross-reference to related applications

This application claims priority to Chinese patent application No. 2018100912528, filed with the Chinese Patent Office on January 30, 2018 and entitled "Phishing Website Detection Method, Apparatus, Computer Device and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to a phishing website detection method, apparatus, computer device and storage medium.

Background

With the development of network technology, companies build their own websites, and users obtain company information or company services through them. At the same time, malicious websites have appeared, such as phishing websites that masquerade as bank or e-commerce websites in order to steal users' private information.

To protect users' private information, the websites that users visit need to be checked. However, the inventor realized that traditional phishing website detection relies on keywords in the website's page content. Detecting by keywords alone does not examine the website comprehensively, which leads to low detection accuracy for phishing websites.

Summary

According to various embodiments disclosed in the present application, a phishing website detection method, apparatus, computer device and storage medium are provided.

A phishing website detection method includes:

receiving a website access request sent by a terminal;

querying website data according to the website access request;

constructing a document object model tree from the queried website data;

comparing the constructed document object model tree with pre-stored phishing website model trees to obtain a comparison result;

determining, according to the comparison result, whether the website corresponding to the website access request is a phishing website; and

when the website corresponding to the website access request is determined to be a phishing website, returning phishing website warning information to the terminal.

A phishing website detecting apparatus includes:

an access request receiving module, configured to receive a website access request sent by a terminal;

a website data query module, configured to query website data according to the website access request;

a model tree building module, configured to construct a document object model tree from the queried website data;

a model tree comparison module, configured to compare the constructed document object model tree with pre-stored phishing website model trees to obtain a comparison result;

a phishing website determining module, configured to determine, according to the comparison result, whether the website corresponding to the website access request is a phishing website; and

a warning information returning module, configured to return phishing website warning information to the terminal when the website corresponding to the website access request is determined to be a phishing website.

A computer device includes a memory and one or more processors, the memory storing computer readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:

receiving a website access request sent by a terminal;

querying website data according to the website access request;

constructing a document object model tree from the queried website data;

comparing the constructed document object model tree with pre-stored phishing website model trees to obtain a comparison result;

determining, according to the comparison result, whether the website corresponding to the website access request is a phishing website; and

when the website corresponding to the website access request is determined to be a phishing website, returning phishing website warning information to the terminal.

One or more non-volatile storage media store computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:

receiving a website access request sent by a terminal;

querying website data according to the website access request;

constructing a document object model tree from the queried website data;

comparing the constructed document object model tree with pre-stored phishing website model trees to obtain a comparison result;

determining, according to the comparison result, whether the website corresponding to the website access request is a phishing website; and

when the website corresponding to the website access request is determined to be a phishing website, returning phishing website warning information to the terminal.

Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings and the claims.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is an application scenario diagram of a phishing website detection method according to one or more embodiments.

FIG. 2 is a flow diagram of a phishing website detection method according to one or more embodiments.

FIG. 3 is a flow diagram of the steps of extracting phishing website keywords according to one or more embodiments.

FIG. 4 is a flow diagram of the steps of detecting a suspected phishing website according to one or more embodiments.

FIG. 5 is a flow diagram of the steps of determining the weight value of each phishing website keyword according to one or more embodiments.

FIG. 6 is a flow diagram of the steps of calculating the similarity between website text information and phishing website text information according to one or more embodiments.

FIG. 7 is a block diagram of a phishing website detecting apparatus according to one or more embodiments.

FIG. 8 is a block diagram of a phishing website detecting apparatus in another embodiment.

FIG. 9 is a block diagram of a computer device according to one or more embodiments.

Detailed description

To make the technical solutions and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the present application and do not limit it.

The phishing website detection method provided by the present application can be applied in the application environment shown in FIG. 1. The terminal 102 communicates with the server 104 over a network. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet or portable wearable device, and the server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.

In one embodiment, as shown in FIG. 2, a phishing website detection method is provided. Taking the method as applied to the server in FIG. 1 as an example, it includes the following steps:

S202: receive a website access request sent by the terminal.

Specifically, the user enters a website address in the website access page displayed by the terminal. The terminal obtains the entered website address, generates a website access request from it, and sends the request to the server. The server receives the website access request sent by the terminal.

In one embodiment, the user enters a website identifier in the website access page displayed by the terminal. The terminal obtains the website address corresponding to the entered identifier and generates the website access request from the obtained address.

S204: query website data according to the website access request.

Specifically, after receiving the website access request, the server parses it, extracts the website address contained in the request, and queries the website data corresponding to the extracted website address.

S206: construct a document object model tree from the queried website data.

The website data includes website framework data and website text information. The website framework data is used to construct the website's document object model tree, and the website text information is data related to the content displayed in the website's web pages. A document object model (DOM) tree treats each element in a web page as an object and arranges the elements into a tree structure according to their dependency relationships; the nodes of this tree contain only element tags.

Specifically, after querying the website data, the server identifies the website framework data and the website text information in the website data according to the element tags, uses the element tags in the identified website framework data as tree nodes, and constructs the document object model tree from these tree nodes and the dependency relationships between the elements in the website framework data.
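As an illustration of S206, the following is a minimal sketch of building a tag-only tree from HTML with Python's standard `html.parser` module. The names `TagTreeBuilder`, `build_dom_tree` and `VOID_TAGS`, and the `(tag, children)` tuple representation, are assumptions made for this example and do not come from the application.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagTreeBuilder(HTMLParser):
    """Builds a tree whose nodes hold only element tag names; text is ignored."""

    def __init__(self):
        super().__init__()
        self.root = ("#document", [])  # each node is (tag, list_of_children)
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self._stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: add a leaf without descending.
        self._stack[-1][1].append((tag, []))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, tolerating sloppy HTML.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def build_dom_tree(html):
    """Return the tag-only tree for an HTML string."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Because text and attributes are discarded, two pages that differ only in their displayed content produce the same tree, which is the property the comparison step relies on.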

S208: compare the constructed document object model tree with the pre-stored phishing website model trees to obtain a comparison result.

Specifically, multiple phishing website model trees are pre-stored in the server. The server compares the constructed document object model tree with the pre-stored phishing website model trees and obtains a comparison result. The comparison result may be that an identical phishing website model tree exists among the pre-stored phishing website model trees, or that no identical phishing website model tree exists among them.

S210: determine, according to the comparison result, whether the website corresponding to the website access request is a phishing website.

Specifically, if the comparison result is that a phishing website model tree consistent with the constructed document object model tree exists among the pre-stored phishing website model trees, the server determines that the website corresponding to the website access request is a phishing website; if the comparison result is that no such consistent phishing website model tree exists, the server determines that the website is not a phishing website.

For example, the comparison result may be 1 or 0. "1" indicates that a phishing website model tree consistent with the constructed document object model tree exists among the pre-stored phishing website model trees; "0" indicates that no such tree exists. The server checks whether the comparison result is "1" or "0". If the result is "1", the server determines that the website corresponding to the website access request is a phishing website; if the result is "0", the server determines that it is not a phishing website.
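The comparison in S208 and the 1/0 decision in S210 can be sketched as follows. This sketch assumes that "consistent" means structurally identical and that each tree is a nested `(tag, children)` tuple; the `serialize` signature format and all function names are hypothetical choices for this example.

```python
def serialize(tree):
    """Turn a (tag, children) tree into a canonical string signature."""
    tag, children = tree
    return tag + "(" + ",".join(serialize(child) for child in children) + ")"


def compare_with_stored(tree, stored_signatures):
    """Return 1 if an identical tree is among the stored ones, otherwise 0."""
    return 1 if serialize(tree) in stored_signatures else 0


def is_phishing(comparison_result):
    """Interpret the comparison result: 1 means phishing, 0 means not."""
    return comparison_result == 1


# Pre-stored phishing website model trees, kept as a set of signatures.
phishing_template = ("html", [("body", [("form", [("input", []), ("input", [])])])])
stored = {serialize(phishing_template)}

visited = ("html", [("body", [("form", [("input", []), ("input", [])])])])
print(is_phishing(compare_with_stored(visited, stored)))
```

Storing signatures in a set makes each lookup a single hash check instead of a scan over every stored tree. A real system might prefer a tolerant similarity measure, which the application does not specify.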

S212: when the website corresponding to the website access request is determined to be a phishing website, return phishing website warning information to the terminal.

Specifically, when the server determines that the website corresponding to the website access request is a phishing website, it extracts the terminal address and the website address from the request, generates phishing website warning information from the website address, and sends the warning information to the terminal corresponding to the terminal address.

In one embodiment, when the server determines that the website corresponding to the website access request is a phishing website, it extracts the website address from the request and marks that address as a phishing website address. When the server later receives another website access request, it extracts the terminal address and the website address from the request and checks whether the website address has been marked as a phishing website address. If it has, the server generates phishing website warning information from the website address and sends it to the terminal corresponding to the terminal address.
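The marking behaviour in this embodiment can be sketched with an in-memory set of flagged addresses. The function `handle_request`, the `detect` callback and the warning text are all assumptions for illustration; the application does not describe how the mark is stored.

```python
def handle_request(site_address, terminal_address, flagged_addresses, detect):
    """Return (terminal_address, warning) for a phishing site, or None otherwise.

    flagged_addresses is a set of website addresses already marked as phishing;
    detect is a callable that runs the full detection for an unmarked address.
    """
    if site_address not in flagged_addresses:
        if not detect(site_address):
            return None
        # Mark the address so later requests can skip the full detection.
        flagged_addresses.add(site_address)
    warning = f"Warning: {site_address} is flagged as a phishing website"
    return terminal_address, warning
```

On a repeated request for a marked address, the warning is produced directly from the mark and the detector is not called again.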

In this embodiment, website data is queried according to the website access request, and a document object model tree is constructed from the queried website data. Most phishing websites share the same document object model tree and only replace the page content in order to masquerade as different websites. By constructing the document object model tree of the website and comparing it with the pre-stored phishing website model trees, the server determines whether the website corresponding to the website access request is a phishing website. Detecting phishing websites from the document object model tree of the website framework is more comprehensive and improves the accuracy of phishing website detection.

In one embodiment, as shown in FIG. 3, the method further includes, before S202, a step of extracting phishing website keywords, which specifically includes the following:

S302: extract phishing website text information and phishing website model trees from the pre-stored phishing website data and store them.

Specifically, the server receives phishing website data uploaded by the terminal and stores it to obtain the pre-stored phishing website data. In the pre-stored phishing website data, the server identifies the phishing website text information and the phishing website model tree data, extracts the identified phishing website text information, and obtains the phishing website model trees from the identified model tree data.

In one embodiment, the server receives phishing website addresses sent by the terminal, obtains the phishing website data of each phishing website from its address, and stores the obtained data, giving pre-stored phishing website data for each phishing website. For each phishing website, the server extracts the phishing website text information and the phishing website model tree from the corresponding pre-stored data, and stores them in association with the website address.

S304: split the phishing website text information to obtain phishing website split words.

Specifically, the server matches the extracted phishing website text information against a word-splitting lexicon and extracts the matched words to obtain the phishing website split words.

In one embodiment, for each phishing website, the server splits that website's phishing website text information according to the word-splitting lexicon to obtain the phishing website split words.
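The application does not say how the text is matched against the lexicon. One common lexicon-based approach is forward maximum matching, sketched below as a stand-in rather than as the application's method; `split_text`, `lexicon` and `max_len` are hypothetical names.

```python
def split_text(text, lexicon, max_len=None):
    """Split text into lexicon words using forward maximum matching.

    At each position the longest lexicon entry starting there is taken;
    characters that start no lexicon entry are skipped.
    """
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=0)
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                words.append(candidate)
                i += size
                break
        else:
            # No lexicon entry starts at this position.
            i += 1
    return words


lexicon = {"网上银行", "银行", "登录", "密码"}
print(split_text("请登录网上银行输入密码", lexicon))
```

Taking the longest match first means "网上银行" is kept as one word rather than being split at "银行".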

S306: count the word frequency of each phishing website split word in the phishing website text information.

The word frequency is how often a word occurs in the phishing website text information.

Specifically, the server counts the total number of phishing website split words obtained by the split. The server then reads each split word and counts its number of occurrences in the phishing website text information; dividing each split word's occurrence count by the total number of split words gives that split word's word frequency.
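The word frequency computation in S306 (occurrence count divided by the total number of split words) can be sketched with `collections.Counter`. The function name `word_frequencies` and the sample words are assumptions for this example.

```python
from collections import Counter


def word_frequencies(split_words):
    """Map each split word to its occurrence count divided by the total count."""
    total = len(split_words)
    if total == 0:
        return {}
    counts = Counter(split_words)
    return {word: count / total for word, count in counts.items()}


freqs = word_frequencies(["bank", "login", "bank", "password"])
print(freqs["bank"], freqs["login"])
```

Here the total is taken over all split-word occurrences, duplicates included, so the frequencies of all distinct words sum to 1.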

S308: extract phishing website keywords from the phishing website split words according to the counted word frequencies.

Specifically, a word frequency threshold is preset in the server. The server compares the word frequency of each phishing website split word with the preset word frequency threshold, identifies the split words whose word frequency exceeds the threshold, and extracts those split words as phishing website keywords.

In one embodiment, S308 further includes the following steps: identifying stop words among the phishing website split words according to a preset stop word list; deleting the identified stop words from the phishing website split words to obtain the remaining phishing website split words; and extracting phishing website keywords from the remaining split words according to the counted word frequencies.

A stop word is a character or word that is automatically filtered out before or after processing natural language data or text in information retrieval, in order to save storage space and improve search efficiency. Stop words occur frequently but carry no substantive meaning, for example "this", "that" and "of".

Specifically, a preset stop word list is stored in the server. The server uses the preset stop word list to identify stop words among the phishing website split words. Any stop words it identifies are deleted from the phishing website split words, leaving the remaining split words. From the remaining split words, the server extracts those whose word frequency exceeds the preset word frequency threshold and uses them as phishing website keywords.
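Stop word removal followed by threshold filtering can be sketched as below. The sketch assumes the word frequencies are those counted in S306, that is, over all split words before stop words are deleted; `extract_keywords` and its parameters are hypothetical names.

```python
from collections import Counter


def extract_keywords(split_words, stop_words, freq_threshold):
    """Return the non-stop words whose frequency exceeds freq_threshold, sorted."""
    total = len(split_words)
    if total == 0:
        return []
    # Frequencies are counted over all split words, as in S306.
    counts = Counter(split_words)
    keywords = [
        word for word, count in counts.items()
        if word not in stop_words and count / total > freq_threshold
    ]
    return sorted(keywords)


words = ["the", "bank", "the", "login", "bank", "the", "verify", "bank"]
print(extract_keywords(words, stop_words={"the"}, freq_threshold=0.2))
```

Without the stop word list, "the" would pass the threshold with the same frequency as "bank", which is why the deletion step comes before keyword extraction.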

S310: store the extracted phishing website keywords to obtain the pre-stored phishing website keywords.

Specifically, the server stores the extracted phishing website keywords in a phishing website keyword library, and the keywords in that library serve as the pre-stored phishing website keywords.

S312: receive a website access request sent by the terminal.

S314: query website data according to the website access request.

S316: detect, according to the pre-stored phishing website keywords and the website data, whether the website corresponding to the website access request is a suspected phishing website.

具体地,服务器提取预存钓鱼网站关键词,在网站数据中查询预存钓鱼网站关键词,并统计在网站数据中查询到的预存钓鱼网站关键词个数,将统计到的个数与预设阈值相比,若大于预设阈值,则检测到网站访问请求对应的网站为疑似钓鱼网站。Specifically, the server extracts the pre-stored phishing website keyword, queries the pre-stored phishing website keyword in the website data, and counts the number of pre-stored phishing website keywords queried in the website data, and compares the counted number with the preset threshold. If the ratio is greater than the preset threshold, the website corresponding to the website access request is detected as a suspected phishing website.

在其中一个实施例中,服务器提取预存钓鱼网站关键词,统计预存钓鱼网站关键词总数,在网站数据中查询预存钓鱼网站关键词,并统计在网站数据中查询到的预存钓鱼网站关键词个数。服务器将统计到的个数除以预存钓鱼网站关键词总数,得到钓鱼网站关键词出现比例,将钓鱼网站关键词出现比例与预设出现比例阈值进行比较,当钓鱼网站关键词出现比例高于预设出现比例阈值时,则检测到网站访问请求对应的网站为疑似钓鱼网站。In one embodiment, the server extracts the pre-stored phishing website keywords, counts the total number of pre-stored phishing website keywords, queries the pre-stored phishing website keywords in the website data, and counts the number of pre-stored phishing website keywords queried in the website data. . The server divides the number of statistics by the total number of keywords in the pre-existing phishing website, and obtains the proportion of keywords in the phishing website. The ratio of the keyword occurrence of the phishing website is compared with the threshold of the preset appearance ratio. When the proportional threshold is set, the website corresponding to the website access request is detected as a suspected phishing website.
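Both variants of S316 (the absolute count and the occurrence ratio) can be sketched as follows. The function names, the plain substring search, and the sample data are assumptions made for illustration only.

```python
def count_hits(website_data, stored_keywords):
    """Count how many pre-stored keywords occur at least once in the website data."""
    # Assumption: a simple substring test stands in for the keyword lookup.
    return sum(1 for kw in stored_keywords if kw in website_data)


def is_suspected_by_count(website_data, stored_keywords, count_threshold):
    """Variant 1: suspected when the number of keywords found exceeds the threshold."""
    return count_hits(website_data, stored_keywords) > count_threshold


def is_suspected_by_ratio(website_data, stored_keywords, ratio_threshold):
    """Variant 2: suspected when found / total pre-stored keywords exceeds the threshold."""
    if not stored_keywords:
        return False
    ratio = count_hits(website_data, stored_keywords) / len(stored_keywords)
    return ratio > ratio_threshold


page = "<p>please login to your bank account</p>"
stored = ["login", "bank", "password", "verify"]
by_count = is_suspected_by_count(page, stored, 1)
by_ratio = is_suspected_by_ratio(page, stored, 0.4)
```

Here two of the four stored keywords are found, so the page is flagged with a count threshold of 1 and with a ratio threshold of 0.4.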

In this embodiment, the phishing website text information is split into phishing website split words, and phishing website keywords are extracted from the split words according to word frequency. The extracted keywords are then used to check the queried website data. When many pre-stored phishing website keywords appear in the website data, the website corresponding to the website access request is detected as a suspected phishing website and must be further checked against the document object model tree, which improves the accuracy of phishing website detection.

In one embodiment, as shown in FIG. 4, S316 further includes a step of detecting a suspected phishing website, which specifically includes the following:

S402: Extract website text information from the website data.

Specifically, the server identifies the web page tags in the website data and determines whether each tag contains text information. If a tag contains text information, the server extracts it. Once the text information in all web page tags of the website data has been extracted, the website text information is obtained.
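One possible way to collect the text inside web page tags is shown below, using the `html.parser` module from the Python standard library. The class and function names are hypothetical, and skipping `script` and `style` content is an added assumption that the description does not mention.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found inside tags, ignoring script and style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_website_text(html):
    """Return the text of all tags in the page, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = ("<html><head><title>Bank</title><script>var x=1;</script></head>"
          "<body><p>Please login</p></body></html>")
website_text = extract_website_text(sample)
```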

S404: Detect, according to the pre-stored phishing website keywords, the phishing website keywords included in the extracted website text information.

Specifically, the server splits the extracted website text information into words to obtain text split words, identifies the stop words among them, and deletes the identified stop words to obtain the remaining text split words. The server then searches the remaining text split words for the pre-stored phishing website keywords, and the pre-stored keywords that are found are taken as the phishing website keywords included in the website text information.
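A minimal sketch of S404 follows. Splitting on whitespace is an assumption that only suits space-delimited languages; Chinese text would need a word segmenter, which the description does not name. The function name and sample data are hypothetical.

```python
def detect_keywords(website_text, stored_keywords, stop_words):
    """Return (remaining split words, pre-stored keywords found among them)."""
    # Assumption: whitespace splitting stands in for the word-splitting step.
    split_words = website_text.lower().split()
    remaining = [w for w in split_words if w not in stop_words]
    present = set(remaining)
    detected = [kw for kw in stored_keywords if kw in present]
    return remaining, detected


remaining_words, detected_keywords = detect_keywords(
    "Please login to the bank and confirm the login",
    ["login", "bank", "password"],
    {"to", "the", "and"},
)
```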

In one embodiment, the following is further included after S404: calculating, according to the detected phishing website keywords and the pre-stored phishing website keywords, the similarity between the extracted website text information and the phishing website text information; and performing S406 when the calculated similarity is greater than a preset similarity threshold.

S406: Count the word frequency of each detected phishing website keyword in the extracted website text information.

Specifically, the server counts the number of occurrences of each phishing website keyword in the remaining text split words and also counts the total number of remaining text split words. Dividing each keyword's number of occurrences by the total number of split words gives that keyword's word frequency.

Let TF denote the word frequency, N the number of occurrences of a keyword in the website text information, and M the total number of words in the website text information. The word frequency is calculated as TF = N/M. That is, the word frequency is how often a keyword appears in the website text information: a given keyword appears N times in website text information containing M words.
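The formula TF = N/M can be written directly as a small function. The name `term_frequency` and the sample words are illustrative assumptions.

```python
def term_frequency(keyword, remaining_words):
    """TF = N / M: occurrences of the keyword divided by the total word count."""
    m = len(remaining_words)
    if m == 0:
        return 0.0
    n = remaining_words.count(keyword)
    return n / m


sample_words = ["login", "bank", "login", "password"]
tf_login = term_frequency("login", sample_words)  # N = 2, M = 4
```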

S408: When the word frequency of every detected phishing website keyword is greater than a preset word-frequency threshold, detect that the website corresponding to the website access request is a suspected phishing website.

Specifically, the server compares the word frequency of each phishing website keyword with the preset word-frequency threshold. If every keyword's word frequency is greater than the threshold, the website corresponding to the website access request is detected as a suspected phishing website, and the step of constructing a document object model tree from the queried website data is performed.
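The S408 decision can be sketched as an "all keywords above threshold" test. Returning `False` when no keyword was detected is an assumption; the description does not address that case. All names are hypothetical.

```python
from collections import Counter


def is_suspected_by_tf(detected_keywords, remaining_words, tf_threshold):
    """Suspected only if every detected keyword's word frequency exceeds the threshold."""
    # Assumption: with nothing detected (or no words), the site is not flagged.
    if not detected_keywords or not remaining_words:
        return False
    counts = Counter(remaining_words)
    total = len(remaining_words)
    return all(counts[kw] / total > tf_threshold for kw in detected_keywords)


page_words = ["login", "bank", "login", "bank", "password"]
flag_all_high = is_suspected_by_tf(["login", "bank"], page_words, 0.3)
flag_one_low = is_suspected_by_tf(["login", "password"], page_words, 0.3)
```

In this example "login" and "bank" each have a word frequency of 0.4, while "password" has 0.2, so only the first call flags the page.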

In this embodiment, the phishing website keywords included in the extracted website text information are detected according to the pre-stored phishing website keywords, the word frequency of each keyword in the website text information is counted, and whether the website is a suspected phishing website is determined from those word frequencies. When the website is detected as a suspected phishing website, its document object model tree is further checked. Combining several detection methods improves the accuracy of phishing website detection.

In one embodiment, as shown in FIG. 5, a step of determining the weight value of each phishing website keyword is further included after S308, which specifically includes the following:

S502: Count the total number of phishing websites corresponding to the pre-stored phishing website data and the number of phishing websites corresponding to each phishing website keyword.

Specifically, the server extracts the phishing website identifiers corresponding to the pre-stored phishing website data, counts them, and takes the count as the total number of phishing websites. The server then reads a phishing website keyword, searches for it in the phishing website data of each phishing website identifier, and counts the identifiers whose data contains the keyword. This count is stored in association with the keyword as the number of phishing websites corresponding to that keyword.

S504: Determine the weight value corresponding to each phishing website keyword according to the total number of phishing websites and the number of phishing websites corresponding to each keyword.

Specifically, let IDF denote the weight value of a phishing website keyword, D the total number of phishing websites, and D_W the number of phishing websites corresponding to the keyword. The IDF of each keyword is calculated as IDF = log(D/D_W). The server stores each weight value in association with its keyword.
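Steps S502 and S504 can be sketched as follows. The dictionary layout (site identifier mapped to site data), the substring search, and the natural logarithm are assumptions; the description does not state the base of the logarithm. Keywords with D_W = 0 are skipped here to avoid division by zero, which is also an assumption.

```python
import math


def keyword_weights(site_data_by_id, keywords):
    """Compute IDF = log(D / D_W) for each keyword over the stored phishing sites."""
    d = len(site_data_by_id)  # D: total number of phishing websites
    weights = {}
    for kw in keywords:
        # D_W: number of phishing websites whose data contains the keyword.
        d_w = sum(1 for data in site_data_by_id.values() if kw in data)
        if d_w > 0:
            weights[kw] = math.log(d / d_w)
    return weights


sites = {"s1": "login bank", "s2": "login password", "s3": "login", "s4": "bank"}
weights = keyword_weights(sites, ["login", "bank", "verify"])
```

Note that, as the formula is written, log(D/D_W) becomes smaller as D_W grows: in this sample "login" (three sites) receives a lower value than "bank" (two sites).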

In this embodiment, the weight value of each phishing website keyword is calculated according to the keyword's importance among phishing websites; that is, the more phishing websites a keyword corresponds to, the higher its weight value. Determining the weight value by importance improves the accuracy of the weight value.

In one embodiment, as shown in FIG. 6, calculating the similarity between the extracted website text information and the phishing website text information according to the detected phishing website keywords and the pre-stored phishing website keywords specifically includes the following:

S602: Query the weight values corresponding to the pre-stored phishing website keywords.

Specifically, the server stores the pre-stored phishing website keywords together with their corresponding weight values. The server queries each pre-stored phishing website keyword and its corresponding weight value.

S604: Construct, according to the queried weight values, a visited-website feature vector corresponding to the detected phishing website keywords and a phishing-website feature vector corresponding to the pre-stored phishing website keywords.

Specifically, the server sorts the pre-stored phishing website keywords and arranges their weight values in the same order to construct the phishing-website feature vector. Following the layout of the phishing-website feature vector, the server places the weight value of each detected phishing website keyword at its corresponding position and sets the positions of undetected keywords to zero, which yields the visited-website feature vector.

For example, suppose there are five pre-stored phishing website keywords a, b, c, d and e with weight values 50, 40, 30, 20 and 10, respectively. The phishing-website feature vector is then A = (50, 40, 30, 20, 10). If the phishing website keywords detected in the website data of the visited website are a, b, c and d, the visited-website feature vector is B = (50, 40, 30, 20, 0).
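The worked example above can be reproduced with a short sketch. The function name and the alphabetical sort order are assumptions; the description only requires that both vectors use the same keyword order.

```python
def build_feature_vectors(stored_weights, detected_keywords):
    """Return (phishing-website vector A, visited-website vector B)."""
    # Assumption: keywords are ordered alphabetically; any fixed order would do.
    ordered = sorted(stored_weights)
    phishing_vector = [stored_weights[kw] for kw in ordered]
    detected = set(detected_keywords)
    # Undetected keywords contribute zero at their position.
    visited_vector = [stored_weights[kw] if kw in detected else 0 for kw in ordered]
    return phishing_vector, visited_vector


example_weights = {"a": 50, "b": 40, "c": 30, "d": 20, "e": 10}
vec_a, vec_b = build_feature_vectors(example_weights, ["a", "b", "c", "d"])
```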

S606: Determine, according to the visited-website feature vector and the phishing-website feature vector, the similarity between the extracted website text information and the pre-stored phishing website text information.

Specifically, the similarity between the visited-website feature vector and the phishing-website feature vector is calculated and taken as the similarity between the extracted website text information and the pre-stored phishing website text information.

For example, suppose there are n pre-stored phishing website keywords, the phishing-website feature vector is A = (A_1, A_2, A_3, …, A_n), and the visited-website feature vector is B = (B_1, B_2, B_3, …, B_n). Let Sim(A, B) denote the similarity between the phishing-website feature vector A and the visited-website feature vector B. Sim(A, B) is calculated according to the following formula, where 1 ≤ k ≤ n:

Figure PCTCN2018088935-appb-000001
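The formula itself is an image that is not reproduced in this text. One common similarity measure for weighted keyword vectors is cosine similarity, shown below purely as an assumption about what Sim(A, B) might be; the function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity: sum(A_k * B_k) / (|A| * |B|), for k = 1..n."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no detected keywords: treat as no similarity
    return dot / (norm_a * norm_b)


sim = cosine_similarity([50, 40, 30, 20, 10], [50, 40, 30, 20, 0])
```

Under this assumed measure, the vectors A and B from the earlier example give a similarity of roughly 0.99, which would then be compared with the preset similarity threshold.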

In this embodiment, the similarity between the visited-website feature vector and the phishing-website feature vector is calculated accurately, and whether the website is a suspected phishing website is determined from that similarity. Using the similarity calculation together with the document object model tree to check whether the visited website is a phishing website improves the efficiency of phishing website detection.

It should be understood that although the steps in the flowcharts of FIGS. 2-6 are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict order restriction on these steps, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-6 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be performed at different times, and they need not be performed sequentially; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 7, a phishing website detection apparatus 700 is provided, including an access request receiving module 702, a website data query module 704, a model tree construction module 706, a model tree comparison module 708, a phishing website determination module 710 and a warning information return module 712, wherein:

The access request receiving module 702 is configured to receive a website access request sent by a terminal.

The website data query module 704 is configured to query website data according to the website access request.

The model tree construction module 706 is configured to construct a document object model tree from the queried website data.

The model tree comparison module 708 is configured to compare the constructed document object model tree with a pre-stored phishing website model tree to obtain a comparison result.

The phishing website determination module 710 is configured to determine, according to the comparison result, whether the website corresponding to the website access request is a phishing website.

The warning information return module 712 is configured to return phishing website warning information to the terminal when the website corresponding to the website access request is determined to be a phishing website.

In this embodiment, website data is queried according to the website access request, and a document object model tree is constructed from the queried website data. Most phishing websites share the same document object model tree and merely swap the page content to pose as different websites. By constructing the document object model tree of the website and comparing it with the pre-stored phishing website model tree to determine whether the website corresponding to the website access request is a phishing website, phishing websites are detected more comprehensively on the basis of the document object model tree in the website framework, which improves the accuracy of phishing website detection.
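The idea that phishing pages reuse one tag structure while changing the content can be illustrated with a simplified structural signature. This is only a sketch: representing the tree as a sequence of (depth, tag) pairs and requiring an exact match are assumptions, since this passage does not specify how the trees are compared. All names are hypothetical.

```python
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StructureParser(HTMLParser):
    """Record each start tag with its nesting depth, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.signature = []

    def handle_starttag(self, tag, attrs):
        self.signature.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def dom_signature(html):
    """Return a hashable summary of the page's tag structure."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return tuple(parser.signature)


def matches_stored_tree(html, stored_signatures):
    """Assumption: a site matches when its structure equals a stored one exactly."""
    return dom_signature(html) in stored_signatures


template = "<html><body><form><input><input></form></body></html>"
stored_trees = {dom_signature(template)}
disguised = ("<html><body><form><input name='u'><input name='p'>"
             "</form></body></html>")
is_match = matches_stored_tree(disguised, stored_trees)
```

The disguised page differs from the template only in attributes, so its signature matches the stored one.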

In one embodiment, as shown in FIG. 8, the phishing website detection apparatus 700 further includes a model tree extraction module 714, a text information splitting module 716, a word frequency statistics module 718, a keyword extraction module 720, a keyword storage module 722 and a suspected website detection module 724, wherein:

The model tree extraction module 714 is configured to extract and store phishing website text information and a phishing website model tree from pre-stored phishing website data.

The text information splitting module 716 is configured to split the phishing website text information into phishing website split words.

The word frequency statistics module 718 is configured to count the word frequency of each phishing website split word in the phishing website text information.

The keyword extraction module 720 is configured to extract phishing website keywords from the phishing website split words according to the counted word frequencies.

The keyword storage module 722 is configured to store the extracted phishing website keywords as pre-stored phishing website keywords.

The suspected website detection module 724 is configured to detect, according to the pre-stored phishing website keywords and the website data, whether the website corresponding to the website access request is a suspected phishing website.

The model tree construction module 706 is further configured to construct the document object model tree from the queried website data when the website corresponding to the website access request is detected as a suspected phishing website.

In this embodiment, the phishing website text information is split into phishing website split words, and phishing website keywords are extracted from the split words according to word frequency. The extracted keywords are then used to check the queried website data. When many pre-stored phishing website keywords appear in the website data, the website corresponding to the website access request is detected as a suspected phishing website and must be further checked against the document object model tree, which improves the accuracy of phishing website detection.

In one embodiment, the keyword extraction module 720 is further configured to identify stop words among the phishing website split words according to a preset stop-word list; delete the identified stop words to obtain the remaining phishing website split words; and extract phishing website keywords from the remaining phishing website split words according to the counted word frequencies.

In one embodiment, the suspected website detection module 724 is further configured to extract website text information from the website data; detect, according to the pre-stored phishing website keywords, the phishing website keywords included in the extracted website text information; count the word frequency of each detected phishing website keyword in the extracted website text information; and, when the word frequency of every phishing website keyword is greater than a preset word-frequency threshold, detect that the website corresponding to the website access request is a suspected phishing website.

In one embodiment, the suspected website detection module 724 is further configured to calculate, according to the detected phishing website keywords and the pre-stored phishing website keywords, the similarity between the extracted website text information and the phishing website text information; and, when the calculated similarity is greater than a preset similarity threshold, count the word frequency of each detected phishing website keyword in the extracted website text information.

In this embodiment, the phishing website keywords included in the extracted website text information are detected according to the pre-stored phishing website keywords, the word frequency of each keyword in the website text information is counted, and whether the website is a suspected phishing website is determined from those word frequencies. When the website is detected as a suspected phishing website, its document object model tree is further checked. Combining several detection methods improves the accuracy of phishing website detection.

In one embodiment, the phishing website detection apparatus 700 further includes a weight value determination module.

The weight value determination module is configured to count the total number of phishing websites corresponding to the pre-stored phishing website data and the number of phishing websites corresponding to each phishing website keyword; and determine the weight value corresponding to each phishing website keyword according to the total number of phishing websites and the number of phishing websites corresponding to each keyword.

In one embodiment, the suspected website detection module 724 is further configured to query the weight values corresponding to the pre-stored phishing website keywords; construct, according to the queried weight values, a visited-website feature vector corresponding to the detected phishing website keywords and a phishing-website feature vector corresponding to the pre-stored phishing website keywords; and determine, according to the two feature vectors, the similarity between the extracted website text information and the pre-stored phishing website text information.

In this embodiment, website data is queried according to the website access request, and a document object model tree is constructed from the queried website data. Most phishing websites share the same document object model tree and merely swap the page content to pose as different websites. By constructing the document object model tree of the website and comparing it with the pre-stored phishing website model tree to determine whether the website corresponding to the website access request is a phishing website, phishing websites are detected more comprehensively on the basis of the document object model tree in the website framework, which improves the accuracy of phishing website detection.

For specific limitations on the phishing website detection apparatus 700, refer to the limitations on the phishing website detection method above; they are not repeated here. Each module in the phishing website detection apparatus 700 may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor in a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 9. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database stores phishing website data. The network interface communicates with external terminals over a network connection. When executed by the processor, the computer-readable instructions implement a phishing website detection method.

Those skilled in the art will understand that the structure shown in FIG. 9 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, including a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to implement the steps of the phishing website detection method provided in any embodiment of the present application.

In this embodiment, website data is queried according to the website access request, and a document object model tree is constructed from the queried website data. Most phishing websites share the same document object model tree and merely swap the page content to pose as different websites. By constructing the document object model tree of the website and comparing it with the pre-stored phishing website model tree to determine whether the website corresponding to the website access request is a phishing website, phishing websites are detected more comprehensively on the basis of the document object model tree in the website framework, which improves the accuracy of phishing website detection.

In one embodiment, one or more non-volatile storage media storing computer-readable instructions are provided. When executed by one or more processors, the computer-readable instructions cause the one or more processors to implement the steps of the phishing website detection method provided in any embodiment of the present application.

In this embodiment, website data is queried according to the website access request, and a document object model tree is constructed from the queried website data. Most phishing websites share the same document object model tree and merely swap the page content to pose as different websites. By constructing the document object model tree of the website and comparing it with the pre-stored phishing website model tree to determine whether the website corresponding to the website access request is a phishing website, phishing websites are detected more comprehensively on the basis of the document object model tree in the website framework, which improves the accuracy of phishing website detection.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by computer-readable instructions that direct the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database or another medium used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features is described. However, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.

The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the scope of protection of this patent application shall be determined by the appended claims.

Claims (20)

1. A phishing website detection method, comprising: receiving a website access request sent by a terminal; querying website data according to the website access request; constructing a document object model tree according to the queried website data; comparing the constructed document object model tree with a pre-stored phishing website model tree to obtain a comparison result; determining, according to the comparison result, whether the website corresponding to the website access request is a phishing website; and when it is determined that the website corresponding to the website access request is a phishing website, returning phishing website warning information to the terminal.
2. The method according to claim 1, wherein before the receiving of the website access request sent by the terminal, the method further comprises: extracting phishing website text information and a phishing website model tree from pre-stored phishing website data and storing them; splitting the phishing website text information to obtain phishing website split words; counting the word frequency of each phishing website split word in the phishing website text information; extracting phishing website keywords from the phishing website split words according to the counted word frequencies; and storing the extracted phishing website keywords to obtain pre-stored phishing website keywords; and wherein after the querying of website data according to the website access request, the method further comprises: detecting, according to the pre-stored phishing website keywords and the website data, whether the website corresponding to the website access request is a suspected phishing website; and if so, performing the step of constructing the document object model tree according to the queried website data.
3. The method according to claim 2, wherein the extracting of phishing website keywords from the phishing website split words according to the counted word frequencies comprises: identifying stop words among the phishing website split words according to a preset stop word list; deleting the identified stop words from the phishing website split words to obtain remaining phishing website split words; and extracting phishing website keywords from the remaining phishing website split words according to the counted word frequencies.
4. The method according to claim 3, wherein the detecting, according to the pre-stored phishing website keywords and the website data, of whether the website corresponding to the website access request is a suspected phishing website comprises: extracting website text information from the website data; detecting, according to the pre-stored phishing website keywords, the phishing website keywords included in the extracted website text information; counting the word frequency of the detected phishing website keywords in the extracted website text information; and when the counted word frequency of every detected phishing website keyword is greater than a preset word frequency threshold, detecting that the website corresponding to the website access request is a suspected phishing website.
5. The method according to claim 4, wherein after the detecting, according to the pre-stored phishing website keywords, of the phishing website keywords included in the extracted website text information, the method further comprises: calculating the similarity between the extracted website text information and the phishing website text information according to the detected phishing website keywords and the pre-stored phishing website keywords; and when the calculated similarity is greater than a preset similarity threshold, performing the step of counting the word frequency of the detected phishing website keywords in the extracted website text information.
6. The method according to claim 5, wherein after the extracting of phishing website keywords from the phishing website split words according to the counted word frequencies, the method further comprises: counting the total number of phishing websites corresponding to the pre-stored phishing website data and the number of phishing websites corresponding to each phishing website keyword; and determining the weight value corresponding to each phishing website keyword according to the total number of phishing websites and the number of phishing websites corresponding to each phishing website keyword.
7. The method according to claim 6, wherein the calculating of the similarity between the extracted website text information and the phishing website text information according to the detected phishing website keywords and the pre-stored phishing website keywords comprises: querying the weight values corresponding to the pre-stored phishing website keywords; constructing, according to the queried weight values, a visited website feature vector corresponding to the detected phishing website keywords and a phishing website feature vector corresponding to the pre-stored phishing website keywords; and determining, according to the visited website feature vector and the phishing website feature vector, the similarity between the extracted website text information and the pre-stored phishing website text information.
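The weighting and similarity steps recited in the method claims can be illustrated with a short sketch. The claims state only that each weight is determined from the total number of phishing websites and the number of phishing websites containing the keyword, and that similarity is determined from two feature vectors. The concrete formulas below, a smoothed logarithmic ratio for the weight and cosine similarity for the comparison, are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math


def keyword_weights(total_sites, sites_per_keyword):
    """Assumed weight: log(1 + total / count), which stays positive."""
    return {kw: math.log(1 + total_sites / n)
            for kw, n in sites_per_keyword.items() if n > 0}


def feature_vector(keywords_present, weights):
    """One dimension per stored keyword, in a fixed (sorted) order."""
    return [weights[kw] if kw in keywords_present else 0.0
            for kw in sorted(weights)]


def cosine_similarity(u, v):
    """Assumed similarity measure between the two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical counts: 10 stored phishing sites, keyword -> sites containing it.
weights = keyword_weights(10, {"login": 5, "password": 2})

# Phishing website feature vector: built from all pre-stored keywords.
phishing_vec = feature_vector(set(weights), weights)

# Visited website feature vector: built from the keywords detected on the page.
visited_vec = feature_vector({"login"}, weights)

similarity = cosine_similarity(visited_vec, phishing_vec)
print(round(similarity, 3))
```

The resulting value would then be compared with the preset similarity threshold before the word frequency check is carried out.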
8. A phishing website detection device, comprising: an access request receiving module, configured to receive a website access request sent by a terminal; a website data query module, configured to query website data according to the website access request; a model tree construction module, configured to construct a document object model tree according to the queried website data; a model tree comparison module, configured to compare the constructed document object model tree with a pre-stored phishing website model tree to obtain a comparison result; a phishing website determination module, configured to determine, according to the comparison result, whether the website corresponding to the website access request is a phishing website; and a warning information return module, configured to return phishing website warning information to the terminal when it is determined that the website corresponding to the website access request is a phishing website.
9. The device according to claim 8, further comprising: a model tree extraction module, configured to extract phishing website text information and a phishing website model tree from pre-stored phishing website data and store them; a text information splitting module, configured to split the phishing website text information to obtain phishing website split words; a word frequency counting module, configured to count the word frequency of each phishing website split word in the phishing website text information; a keyword extraction module, configured to extract phishing website keywords from the phishing website split words according to the counted word frequencies; a keyword storage module, configured to store the extracted phishing website keywords to obtain pre-stored phishing website keywords; and a suspected website detection module, configured to detect, according to the pre-stored phishing website keywords and the website data, whether the website corresponding to the website access request is a suspected phishing website; wherein the model tree construction module is further configured to construct the document object model tree according to the queried website data when it is detected that the website corresponding to the website access request is a suspected phishing website.
10. The device according to claim 9, wherein the keyword extraction module is further configured to identify stop words among the phishing website split words according to a preset stop word list; the keyword extraction module is further configured to delete the identified stop words from the phishing website split words to obtain remaining phishing website split words; and the keyword extraction module is further configured to extract phishing website keywords from the remaining phishing website split words according to the counted word frequencies.
11. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps: receiving a website access request sent by a terminal; querying website data according to the website access request; constructing a document object model tree according to the queried website data; comparing the constructed document object model tree with a pre-stored phishing website model tree to obtain a comparison result; determining, according to the comparison result, whether the website corresponding to the website access request is a phishing website; and when it is determined that the website corresponding to the website access request is a phishing website, returning phishing website warning information to the terminal.
12. The computer device according to claim 11, wherein the processor, when executing the computer-readable instructions, further performs the following steps: extracting phishing website text information and a phishing website model tree from pre-stored phishing website data and storing them; splitting the phishing website text information to obtain phishing website split words; counting the word frequency of each phishing website split word in the phishing website text information; extracting phishing website keywords from the phishing website split words according to the counted word frequencies; and storing the extracted phishing website keywords to obtain pre-stored phishing website keywords; and wherein after the querying of website data according to the website access request, the steps further comprise: detecting, according to the pre-stored phishing website keywords and the website data, whether the website corresponding to the website access request is a suspected phishing website; and if so, performing the step of constructing the document object model tree according to the queried website data.
13. The computer device according to claim 12, wherein the processor, when executing the computer-readable instructions, further performs the following steps: identifying stop words among the phishing website split words according to a preset stop word list; deleting the identified stop words from the phishing website split words to obtain remaining phishing website split words; and extracting phishing website keywords from the remaining phishing website split words according to the counted word frequencies.
14. The computer device according to claim 13, wherein the processor, when executing the computer-readable instructions, further performs the following steps: extracting website text information from the website data; detecting, according to the pre-stored phishing website keywords, the phishing website keywords included in the extracted website text information; counting the word frequency of the detected phishing website keywords in the extracted website text information; and when the counted word frequency of every detected phishing website keyword is greater than a preset word frequency threshold, detecting that the website corresponding to the website access request is a suspected phishing website.
15. The computer device according to claim 14, wherein the processor, when executing the computer-readable instructions, further performs the following steps: calculating the similarity between the extracted website text information and the phishing website text information according to the detected phishing website keywords and the pre-stored phishing website keywords; and when the calculated similarity is greater than a preset similarity threshold, performing the step of counting the word frequency of the detected phishing website keywords in the extracted website text information.
16. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: receiving a website access request sent by a terminal; querying website data according to the website access request; constructing a document object model tree according to the queried website data; comparing the constructed document object model tree with a pre-stored phishing website model tree to obtain a comparison result; determining, according to the comparison result, whether the website corresponding to the website access request is a phishing website; and when it is determined that the website corresponding to the website access request is a phishing website, returning phishing website warning information to the terminal.
17. The storage medium according to claim 16, wherein the processor, when executing the computer-readable instructions, further performs the following steps: extracting phishing website text information and a phishing website model tree from pre-stored phishing website data and storing them; splitting the phishing website text information to obtain phishing website split words; counting the word frequency of each phishing website split word in the phishing website text information; extracting phishing website keywords from the phishing website split words according to the counted word frequencies; and storing the extracted phishing website keywords to obtain pre-stored phishing website keywords; and wherein after the querying of website data according to the website access request, the steps further comprise: detecting, according to the pre-stored phishing website keywords and the website data, whether the website corresponding to the website access request is a suspected phishing website; and if so, performing the step of constructing the document object model tree according to the queried website data.
18. The storage medium according to claim 17, wherein the processor, when executing the computer-readable instructions, further performs the following steps: identifying stop words among the phishing website split words according to a preset stop word list; deleting the identified stop words from the phishing website split words to obtain remaining phishing website split words; and extracting phishing website keywords from the remaining phishing website split words according to the counted word frequencies.
19. The storage medium according to claim 18, wherein the processor, when executing the computer-readable instructions, further performs the following steps: extracting website text information from the website data; detecting, according to the pre-stored phishing website keywords, the phishing website keywords included in the extracted website text information; counting the word frequency of the detected phishing website keywords in the extracted website text information; and when the counted word frequency of every detected phishing website keyword is greater than a preset word frequency threshold, detecting that the website corresponding to the website access request is a suspected phishing website.
20. The storage medium according to claim 19, wherein the processor, when executing the computer-readable instructions, further performs the following steps: calculating the similarity between the extracted website text information and the phishing website text information according to the detected phishing website keywords and the pre-stored phishing website keywords; and when the calculated similarity is greater than a preset similarity threshold, performing the step of counting the word frequency of the detected phishing website keywords in the extracted website text information.
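The keyword steps recited in the claims (splitting text into words, removing stop words, counting word frequency, and flagging a page as suspected when every detected keyword exceeds a frequency threshold) can be sketched as below. Several details are assumptions made for illustration only: the regular-expression tokenizer stands in for a real word segmenter, the stop word list is hypothetical, keywords are taken as the `top_k` most frequent words, and word frequency is measured as a share of all words on the page.

```python
import re
from collections import Counter

# Hypothetical preset stop word list.
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "your", "is", "in"}


def split_words(text):
    """Assumed tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(phishing_texts, top_k=3):
    """Counts non-stop-word frequencies and keeps the top_k words."""
    counts = Counter()
    for text in phishing_texts:
        counts.update(w for w in split_words(text) if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_k)]


def is_suspected(page_text, stored_keywords, min_freq=0.05):
    """True if every stored keyword found on the page exceeds min_freq."""
    words = split_words(page_text)
    if not words:
        return False
    counts = Counter(words)
    detected = [kw for kw in stored_keywords if counts[kw] > 0]
    if not detected:
        return False
    return all(counts[kw] / len(words) > min_freq for kw in detected)


phishing_texts = [
    "Verify your account password now",
    "Enter your password to verify the account",
]
keywords = extract_keywords(phishing_texts)

print(is_suspected("Please verify your account password", keywords))
print(is_suspected("Weather forecast for the weekend", keywords))
```

Only pages flagged here would go on to the more expensive document object model tree comparison, which is the ordering the dependent claims describe.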
PCT/CN2018/088935 2018-01-30 2018-05-30 Phishing website detection method, device, computer equipment and storage medium Ceased WO2019148712A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810091252.8 2018-01-30
CN201810091252.8A CN108306878A (en) 2018-01-30 2018-01-30 Detection method for phishing site, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2019148712A1 true WO2019148712A1 (en) 2019-08-08

Family

ID=62867336

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/088935 Ceased WO2019148712A1 (en) 2018-01-30 2018-05-30 Phishing website detection method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN108306878A (en)
WO (1) WO2019148712A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109450880A (en) * 2018-10-26 2019-03-08 平安科技(深圳)有限公司 Detection method for phishing site, device and computer equipment based on decision tree
CN114222301B (en) * 2021-12-13 2024-04-12 奇安盘古(上海)信息技术有限公司 Fraud site processing method, fraud site processing device and storage medium
CN114492370B (en) * 2022-01-29 2023-09-01 北京百度网讯科技有限公司 Web page recognition method, device, electronic device and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101534306A (en) * 2009-04-14 2009-09-16 深圳市腾讯计算机系统有限公司 Detecting method and a device for fishing website
CN102316081A (en) * 2010-06-30 2012-01-11 北京启明星辰信息技术股份有限公司 Method and device for identifying similar webpage
CN104462152A (en) * 2013-09-23 2015-03-25 深圳市腾讯计算机系统有限公司 Webpage recognition method and device
CN104899508A (en) * 2015-06-17 2015-09-09 中国互联网络信息中心 Multistage phishing website detecting method and system
US20170078286A1 (en) * 2015-09-16 2017-03-16 RiskIQ, Inc. Using hash signatures of dom objects to identify website similarity

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100553242C (en) * 2007-01-19 2009-10-21 深圳市深信服电子科技有限公司 Method of preventing phishing websites based on gateway and network bridge
CN101510887B (en) * 2009-03-27 2012-01-25 腾讯科技(深圳)有限公司 Method and device for identifying website
CN103179095B (en) * 2011-12-22 2016-03-30 阿里巴巴集团控股有限公司 A kind of method and client terminal device detecting fishing website
US9716726B2 (en) * 2014-11-13 2017-07-25 Cleafy S.r.l. Method of identifying and counteracting internet attacks
CN106603490A (en) * 2016-11-10 2017-04-26 上海斐讯数据通信技术有限公司 Phishing website detecting method and system
CN106685936B (en) * 2016-12-14 2020-07-31 深信服科技股份有限公司 Webpage tampering detection method and device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806732A (en) * 2020-06-16 2021-12-17 深信服科技股份有限公司 Webpage tampering detection method, device, equipment and storage medium
CN113806732B (en) * 2020-06-16 2023-11-03 深信服科技股份有限公司 Webpage tampering detection method, device, equipment and storage medium
CN111767493A (en) * 2020-07-07 2020-10-13 杭州安恒信息技术股份有限公司 Method, device, equipment and storage medium for displaying content data of website
CN112667943A (en) * 2020-11-10 2021-04-16 中科金审(北京)科技有限公司 Illegal website identification and locking method
CN113609493A (en) * 2021-08-05 2021-11-05 工银科技有限公司 Identification method, device, equipment and medium of phishing website
CN116248415A (en) * 2023-05-11 2023-06-09 北京匠数科技有限公司 Website distinguishing method and device
CN116248415B (en) * 2023-05-11 2023-08-15 北京匠数科技有限公司 Website distinguishing method and device
CN118018274A (en) * 2024-02-01 2024-05-10 徐州好一家科技有限公司 Internet access method and system

Also Published As

Publication number Publication date
CN108306878A (en) 2018-07-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18903504

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 11/11/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18903504

Country of ref document: EP

Kind code of ref document: A1