Background
WEB backdoor: a WEB backdoor is a command execution environment in the form of a WEB page file such as ASP, PHP, JSP, or CGI, and may also be referred to as a webshell. After a hacker invades a website, the ASP or PHP backdoor file is usually mixed in with the normal WEB page files in the WEB directory of the website server; the hacker can then access the backdoor with a browser to obtain a command execution environment and thereby control the website server.
Webpage crawler technology: the first step in active WEB backdoor detection is identifying suspicious WEB paths. Usually a WEB crawler fetches the website's pages to traverse the entire directory and file structure and obtain a path tree of the website. However, since a WEB backdoor is implanted by a hacker through a website vulnerability and placed in a hidden location, it generally cannot be reached by a WEB crawler.
WEB backdoor collection: collecting information such as the file names, paths, and script types of published WEB backdoors.
WEB backdoor path combination: path combination is the main means of WEB backdoor path detection. Features of historically discovered WEB backdoors, such as paths, file names, script types, and common words, are combined with the paths of the current website to generate a large number of candidate paths.
A WEB backdoor is a command execution environment in the form of a WEB page file and may also be referred to as a webshell. After a hacker invades a website, the backdoor file is mixed in with the normal webpage files in the WEB directory of the website server; the hacker can then access the WEB backdoor with a browser to obtain a command execution environment and thereby control the website server. A WEB backdoor is therefore a serious danger to the user's server, and finding and deleting it in time is critical to guaranteeing server security.
After a hacker invades a website, the backdoor file is placed in a hidden location, usually as an isolated link (a link not present in the website's URL tree), and is difficult for an administrator to find. Finding a WEB backdoor is therefore difficult, and there are two general approaches:
firstly, inspecting the website's access records;
and secondly, active scanning and identification.
Accordingly, there is a need for improvements in the art.
Disclosure of Invention
The invention aims to provide an efficient web backdoor path detection method.
In order to solve the technical problem, the invention provides a web backdoor path detection method, which comprises the following steps:
1) acquiring a path set $Webshell_Path and a file name set $Webshell_Name;
2) crawling from the website home page with a Web crawler to obtain the website's directory tree $Web_Catalog, URL tree $Web_Url_Tree, root directory $Web_Root, and custom error page $Error_Page;
3) acquiring the URL set to be detected $Target_Url from the path set $Webshell_Path, the file name set $Webshell_Name, the directory tree $Web_Catalog, and the URL tree $Web_Url_Tree;
and accessing the links in the URL set to be detected $Target_Url with HTTP requests to obtain the suspicious URL set $Suspicious_Url.
As an improvement to the web backdoor path detection method of the present invention:
step 1 is as follows: adding the path of each known WEB backdoor URL to the path set $Webshell_Path, and adding the file name of each known WEB backdoor URL to the file name set $Webshell_Name.
As a further improvement to the web backdoor path detection method of the present invention:
in step 1: common English words and common person names are also added to the file name set $Webshell_Name.
As a further improvement to the web backdoor path detection method of the present invention:
step 3 comprises the following steps:
3.1) taking the Cartesian product of the directory tree $Web_Catalog and the file name set $Webshell_Name, and adding the result to the URL set to be detected $Target_Url;
taking the Cartesian product of the website root directory $Web_Root and the path set $Webshell_Path, then taking the Cartesian product of that result and the file name set $Webshell_Name, and adding the final result to the URL set to be detected $Target_Url;
de-duplicating the links in the URL set to be detected $Target_Url, and subtracting the intersection of $Target_Url and the URL tree $Web_Url_Tree from $Target_Url (i.e., removing links already present in the URL tree), so as to obtain the final URL set to be detected $Target_Url;
3.2) the suspicious path identification phase:
sequentially detecting the links in the final URL set $Target_Url obtained in step 3.1), accessing each link with an HTTP request; a link whose response code is 200 and whose page is not the custom error page is judged to be a suspicious link and is added to the suspicious URL set $Suspicious_Url.
As a further improvement to the web backdoor path detection method of the present invention:
in step 3.2, the custom error page is judged as follows: the content similarity between the custom error page and the page returned by the accessed link is calculated; when the similarity exceeds a preset threshold, the accessed page is judged to be the custom error page; otherwise it is not.
As a further improvement to the web backdoor path detection method of the present invention:
in step 3.2, similarity is judged using the simhash algorithm.
As a further improvement to the web backdoor path detection method of the present invention:
the custom error page $Error_Page is obtained by requesting a batch of non-existent pages and malicious requests and recording the response page the website returns to each;
the non-existent or malicious requests are constructed as follows:
1) website homepage address + random character string;
2) website homepage address + random character string + script environment;
3) website homepage address + malicious request URL.
The web backdoor path detection method of the present invention has the following technical advantages:
after a hacker implants a WEB backdoor into a website, the method yields a batch of candidate WEB paths, supporting subsequent analysis of the WEB backdoor. The invention provides a WEB backdoor path detection method that can detect the suspicious paths of a website and support further detection of the WEB backdoor.
Detailed Description
The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto.
Embodiment 1, a web backdoor path detection method, as shown in fig. 1-3, includes three stages:
1. Knowledge collection stage:
a) collecting existing WEB backdoors from a corresponding knowledge base to serve as path and file name features;
b) collecting common English words as file name features;
c) collecting common person names as file name features.
2. Target website information collection stage:
crawling the directory tree and URL tree of the target website.
3. Suspicious path identification stage:
a) combining the directory tree collected in stage 2 with the path and file name features collected in stage 1 to form the URLs to be detected;
b) removing the URLs already present in the website's URL tree from the URL set to be detected, giving the paths to be detected;
c) accessing each path to be detected and examining the HTTP response code and content; a path whose response code is 200 and whose page is not the custom error page is a suspicious path.
For the sake of accurate description, the following definitions are made:
Knowledge base:
path set: $Webshell_Path
file name set: $Webshell_Name
Website information:
directory tree: $Web_Catalog
URL tree: $Web_Url_Tree
website root directory: $Web_Root
custom error page: $Error_Page
URL set to be detected: $Target_Url
Suspicious URL set: $Suspicious_Url
The method specifically comprises the following steps:
1) Knowledge collection stage:
existing WEB backdoors are collected from a knowledge base; for each WEB backdoor URL, its path is added to the path set $Webshell_Path and its file name is added to the file name set $Webshell_Name.
Common English words and common person names are also added to the file name set $Webshell_Name.
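A minimal sketch of how the two knowledge sets might be held in memory follows (illustrative only; the seed values are the example entries used later in this description, and the appended words are hypothetical placeholders):

```python
# Illustrative sketch of the knowledge base. The seed values below are
# the example entries from this description; a real deployment would
# load them from a curated collection of published webshell cases.

# $Webshell_Path: paths under which published WEB backdoors appeared.
webshell_path = {"/user/other/", "/data/th/b/"}

# $Webshell_Name: file names of published WEB backdoors, extended with
# common English words and common person names (placeholders here).
webshell_name = {"log.php", "index.php", "tools.php"}
webshell_name.update(w + ".php" for w in ("admin", "test", "david"))
```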
2) Target website information collection stage
A Web crawler crawls from the website home page to obtain the website's directory tree $Web_Catalog, URL tree $Web_Url_Tree, root directory $Web_Root, and custom error page $Error_Page.
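As an illustrative sketch of this stage (one possible implementation, not the only one), the following code performs a breadth-first crawl; it assumes the third-party `requests` package, and the regex-based link extraction is a deliberate simplification of real HTML parsing:

```python
# Sketch of the target-website information collection stage.
import re
from urllib.parse import urljoin, urlparse

import requests

def crawl(home_page: str, max_pages: int = 500):
    """Breadth-first crawl from the home page; returns $Web_Root,
    the URL tree ($Web_Url_Tree) and the directory tree ($Web_Catalog)."""
    root = home_page.rstrip("/") + "/"            # $Web_Root
    url_tree, catalog = set(), {root}
    queue = [root]
    while queue and len(url_tree) < max_pages:
        url = queue.pop(0)
        if url in url_tree:
            continue
        url_tree.add(url)
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        for href in re.findall(r'href=["\']([^"\']+)["\']', resp.text):
            link = urljoin(url, href)
            if not link.startswith(root):
                continue                          # internal links only
            queue.append(link)
            # record every directory prefix of the link in $Web_Catalog
            parts = urlparse(link).path.split("/")[1:-1]
            for i in range(len(parts)):
                catalog.add(root + "/".join(parts[:i + 1]) + "/")
    return root, url_tree, catalog
```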
The custom error page $Error_Page is obtained by requesting a batch of non-existent pages and malicious requests and recording the response page the website returns to each.
The non-existent or malicious requests are constructed as follows:
1. website homepage address + random character string
2. website homepage address + random character string + script environment suffix
3. website homepage address + malicious request URL (e.g., an SQL injection request)
The website script environments are:
php, jsp, asp, aspx
Three examples are given below, using Baidu for illustration:
1、http://www.baidu.com/fsdhjfhsdcfbnsdfkj
2、http://www.baidu.com/44kd9sn39dj.php
3、http://www.baidu.com/?id=1&1=1
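A sketch of this probing step follows, mirroring the three request forms above; `probe_error_pages` is a hypothetical helper name, and the `?id=1&1=1` query stands in for a harmless malicious-looking request:

```python
# Sketch of custom-error-page fingerprinting: request pages that
# cannot exist plus a malicious-looking query, keep the response bodies.
import random
import string

import requests

SCRIPT_ENVS = ("php", "jsp", "asp", "aspx")

def probe_error_pages(root: str):
    """Collect candidate $Error_Page bodies from deliberately bad requests."""
    rand = "".join(random.choices(string.ascii_lowercase + string.digits, k=20))
    probes = [root + rand]                                      # 1) random string
    probes += [root + rand + "." + env for env in SCRIPT_ENVS]  # 2) + script env
    probes.append(root + "?id=1&1=1")                           # 3) malicious URL
    pages = []
    for url in probes:
        try:
            pages.append(requests.get(url, timeout=5).text)
        except requests.RequestException:
            pass
    return pages
```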
3) suspicious path identification stage
3.1) combining the paths to be detected, as shown in FIG. 2;
From steps 1 and 2, the path set $Webshell_Path, the file name set $Webshell_Name, the directory tree $Web_Catalog, and the URL tree $Web_Url_Tree are available.
The Cartesian product of the directory tree $Web_Catalog and the file name set $Webshell_Name is computed, and the result is added to the URL set to be detected $Target_Url.
The Cartesian product of the website root directory $Web_Root and the path set $Webshell_Path is computed, the Cartesian product of that result and the file name set $Webshell_Name is computed, and the final result is added to the URL set to be detected $Target_Url.
The links in $Target_Url are then de-duplicated, and the intersection of $Target_Url and the URL tree $Web_Url_Tree is subtracted from $Target_Url (removing links the crawler already found), yielding the final URL set to be detected $Target_Url.
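The combination and subtraction just described can be sketched with `itertools.product` over the sets produced by the earlier stages (`combine_targets` is a hypothetical helper name):

```python
# Sketch of the path-combination step 3.1). Inputs are the sets built
# in the earlier stages; all names are illustrative.
from itertools import product

def combine_targets(root, catalog, webshell_path, webshell_name, url_tree):
    target_url = set()
    # $Web_Catalog x $Webshell_Name
    target_url.update(d + n for d, n in product(catalog, webshell_name))
    # ($Web_Root x $Webshell_Path) x $Webshell_Name
    target_url.update(root.rstrip("/") + p + n
                      for p, n in product(webshell_path, webshell_name))
    # the set de-duplicates links implicitly; finally drop every link
    # the crawler already saw, since those are legitimate site pages
    return target_url - url_tree
```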
3.2), suspicious path identification phase, as shown in FIG. 3
The links in the final URL set $Target_Url obtained in step 3.1) are detected in turn: each link is accessed with an HTTP request, and a link whose response code is 200 and whose page is not the custom error page is judged to be a suspicious link and added to the suspicious URL set $Suspicious_Url. Whether a page is the custom error page is judged by comparison with $Error_Page: if the page contents are similar, the page is the custom error page; otherwise it is not.
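A sketch of this loop follows (again assuming the `requests` package); `is_similar` is the simhash comparison sketched in the next step:

```python
# Sketch of the suspicious-path identification loop in step 3.2).
import requests

def find_suspicious(target_url, error_pages, is_similar):
    suspicious_url = set()
    for url in sorted(target_url):
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        if resp.status_code != 200:
            continue  # only 200 responses can be suspicious links
        # a 200 page that matches the custom error page is a soft 404
        if any(is_similar(resp.text, err) for err in error_pages):
            continue
        suspicious_url.add(url)
    return suspicious_url
```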
Whether page contents are similar is judged with the existing simhash algorithm, as follows:
3.21) computing the fingerprint value of each page with the simhash algorithm;
3.22) computing the Hamming distance between the two pages from their fingerprint values;
3.23) setting thresholds empirically: if the page length is greater than 500, the pages are judged similar when the Hamming distance is less than 3; if the page length is less than 500, they are judged similar when the Hamming distance is less than 10.
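A minimal simhash sketch implementing the rule in 3.21)-3.23) follows; tokenizing on word characters and hashing tokens with truncated MD5 are simplifications of this sketch, not requirements of the method:

```python
# Minimal simhash sketch. A 64-bit fingerprint is built per page; two
# pages are "similar" under the length-dependent Hamming thresholds
# described in step 3.23).
import hashlib
import re

def simhash(text: str, bits: int = 64) -> int:
    vector = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        # 64-bit hash of the token (truncated MD5, a simplification)
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def is_similar(page_a: str, page_b: str) -> bool:
    # empirical thresholds: distance < 3 for long pages, < 10 for short
    threshold = 3 if min(len(page_a), len(page_b)) > 500 else 10
    return hamming(simhash(page_a), simhash(page_b)) < threshold
```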
The suspicious URL set $Suspicious_Url is output.
Web backdoor path detection tool instance
Knowledge base
The path and file name detection libraries for web backdoor detection are preset as follows.
1.1 acquiring the file path sets and file name features that have appeared in published web backdoor cases, through user experience and Internet search;
1.2 collecting common English words and adding them to the file name set;
1.3 collecting common person names and adding them to the file name set.
Aggregating and de-duplicating the collection results of 1.1, 1.2, and 1.3 gives the following file path set:
/user/other/
/data/th/b/
The file name set is as follows:
log.php
index.php
tools.php
Detecting the WEB backdoor of the website http://192.168.5.1:
2. Crawling the directory tree of http://192.168.5.1 with a web crawler
By recursively accessing the pages of http://192.168.5.1 and the internal links in all of those pages via HTTP GET, the directory tree of the website is identified as follows:
http://192.168.5.1/
http://192.168.5.1/a/
http://192.168.5.1/b/
http://192.168.5.1/b/c/
2.1 Suspicious path identification phase
2.1.1 Path combination phase
Combining the paths and file names in the knowledge base with the directories of the target website (via Cartesian products) generates the set of files to be detected.
The file set is as follows:
http://192.168.5.1/log.php
http://192.168.5.1/index.php
http://192.168.5.1/tools.php
http://192.168.5.1/user/other/log.php
http://192.168.5.1/user/other/index.php
http://192.168.5.1/user/other/tools.php
http://192.168.5.1/data/th/b/log.php
http://192.168.5.1/data/th/b/index.php
http://192.168.5.1/data/th/b/tools.php
http://192.168.5.1/a/log.php
http://192.168.5.1/a/index.php
http://192.168.5.1/a/tools.php
http://192.168.5.1/a/user/other/log.php
http://192.168.5.1/a/user/other/index.php
http://192.168.5.1/a/user/other/tools.php
http://192.168.5.1/a/data/th/b/log.php
http://192.168.5.1/a/data/th/b/index.php
http://192.168.5.1/a/data/th/b/tools.php
http://192.168.5.1/b/log.php
http://192.168.5.1/b/index.php
http://192.168.5.1/b/tools.php
http://192.168.5.1/b/user/other/log.php
http://192.168.5.1/b/user/other/index.php
http://192.168.5.1/b/user/other/tools.php
http://192.168.5.1/b/data/th/b/log.php
http://192.168.5.1/b/data/th/b/index.php
http://192.168.5.1/b/data/th/b/tools.php
Identifying the custom error page:
request access to random page by http Get means gets web site response page htm3 (text is too long and symbols are used instead).
The random page is http://192.168.5.1/5tdshfdskjf8ds7fu90dsfjqwkj
2.1.2 Accessing the paths to be detected
The files generated in 2.1.1 are requested via HTTP GET and the response code of each is checked. If a file's response code is 200, its response packet and file path are recorded for subsequent detection; other response codes are not recorded.
For each recorded response page html4, the similarity between html4 and html3 is calculated with the simhash algorithm; if they are similar, the page is discarded.
Here, http://192.168.5.1/b/user/other/log.php responds with 200, and its response page html4 is not similar to html3. Therefore http://192.168.5.1/b/user/other/log.php is a suspicious web backdoor file.
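Tying the sketches above together against this instance (hypothetical: it reuses the helper functions and seed sets sketched in the detailed description and assumes the example host is reachable):

```python
# Hypothetical end-to-end run against the example site. Per this
# instance, the expected output is:
#   {'http://192.168.5.1/b/user/other/log.php'}
root, url_tree, catalog = crawl("http://192.168.5.1")
error_pages = probe_error_pages(root)
targets = combine_targets(root, catalog, webshell_path,
                          webshell_name, url_tree)
print(find_suspicious(targets, error_pages, is_similar))
```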
Term 1: webshell, another name for a website backdoor.
Term 2: isolated link, a link not present in the URL tree of a website.
Finally, it is also noted that the above-mentioned lists merely illustrate a few specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.