
CN111107074B - A method, training method and device for preventing web crawler from stealing private data - Google Patents


Info

Publication number
CN111107074B
Authority
CN
China
Prior art keywords
data
web crawler
api
sample
time period
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911284559.0A
Other languages
Chinese (zh)
Other versions
CN111107074A (en)
Inventor
宗志远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Digital Service Technology Co ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911284559.0A
Publication of CN111107074A
Application granted
Publication of CN111107074B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6263 Protecting personal data, e.g. for financial or medical purposes during internet communication, e.g. revealing personal data from cookies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computing Systems (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of this specification provide a method, a training method, and a device for preventing a web crawler from stealing private data. The method for preventing a web crawler from stealing private data comprises: extracting an Application Program Interface (API) access record of a target client within a preset time period from the network traffic data of the target client; generating data to be identified based on the API access record of the target client, wherein the data to be identified comprises an API access two-dimensional graph of the target client within the preset time period, with time and API access amount as dimensions; inputting the data to be identified into a web crawler recognition model to obtain a web crawler recognition result of the target client, wherein the web crawler recognition model is trained on sample data and the web crawler classification labels of the sample data, and the sample data comprises an API access two-dimensional graph of a sample client within the preset time period, with time and API access amount as dimensions; and executing, on the target client, a privacy data protection measure matched with the web crawler recognition result.

Description

Method, training method and device for preventing network crawler from stealing private data
Technical Field
The present disclosure relates to the field of data security technologies, and in particular, to a method, a training method, and an apparatus for preventing a web crawler from stealing private data.
Background
Internet companies provide services to users and, in doing so, also create opportunities for information crawling. A web crawler operator only needs to write an automated script to acquire, in excess, the private data that users hold at each Internet company, under authorization that users grant knowingly or unknowingly. The sensitive information of individual users is then stored at the crawling company, which can easily lead to large-scale data leakage.
Therefore, a technical scheme capable of automatically identifying the web crawlers and preventing the web crawlers from stealing private data is urgently needed at present.
Disclosure of Invention
Embodiments of the present specification provide a method, a training method, an apparatus, and an electronic device for preventing a web crawler from stealing private data, which are capable of automatically identifying a web crawler and preventing it from stealing private data.
In order to achieve the above object, the embodiments of the present specification are implemented as follows:
In a first aspect, a method for preventing a web crawler from stealing private data is provided, which includes:
extracting an Application Program Interface (API) access record of a target client within a preset time period from network traffic data of the target client;
generating data to be identified based on an Application Program Interface (API) access record of the target client in the preset time period, wherein the data to be identified comprises an API access two-dimensional graph of the target client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensionality;
inputting the data to be identified into a web crawler identification model to obtain a web crawler identification result of the target client, wherein the web crawler identification model is trained on sample data and the web crawler classification labels of the sample data, and the sample data comprises an API access two-dimensional graph of a sample client in the preset time period, with time and API access amount as dimensions;
and executing, on the target client, a privacy data protection measure matched with the web crawler identification result.
In a second aspect, a training method for a web crawler recognition model is provided, including:
extracting an Application Program Interface (API) access record of a sample client in a preset time period from network flow data of the sample client;
generating sample data based on an Application Program Interface (API) access record of the sample client in the preset time period, wherein the sample data comprises an API access two-dimensional graph of the sample client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensions;
and training a web crawler recognition model based on the sample data and the web crawler classification marks labeled for the sample data in advance.
In a third aspect, an apparatus for preventing a web crawler from stealing private data is provided, including:
the record extraction module is used for extracting an Application Program Interface (API) access record of a target client within a preset time period from network flow data of the target client;
the image generation module is used for generating data to be identified based on an Application Program Interface (API) access record of the target client in the preset time period, wherein the data to be identified comprises an API access two-dimensional graph of the target client in the preset time period by taking time and API access amount as dimensionality;
the crawler identification module is used for inputting the data to be identified into a web crawler identification model to obtain a web crawler identification result of the target client, wherein the web crawler identification model is trained on sample data and the web crawler classification labels of the sample data, and the sample data comprises an API access two-dimensional graph of a sample client in the preset time period, with time and API access amount as dimensions;
and the data protection module is used for executing, on the target client, a privacy data protection measure matched with the web crawler identification result.
In a fourth aspect, an electronic device is provided comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program being executed by the processor to:
extracting an Application Program Interface (API) access record of a target client within a preset time period from network traffic data of the target client;
generating data to be identified based on an Application Program Interface (API) access record of the target client in the preset time period, wherein the data to be identified comprises an API access two-dimensional graph of the target client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensionality;
inputting the data to be identified into a web crawler identification model to obtain a web crawler identification result of the target client, wherein the web crawler identification model is trained on sample data and the web crawler classification labels of the sample data, and the sample data comprises an API access two-dimensional graph of a sample client in the preset time period, with time and API access amount as dimensions;
and executing, on the target client, a privacy data protection measure matched with the web crawler identification result.
In a fifth aspect, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, performs the steps of:
extracting an Application Program Interface (API) access record of a target client within a preset time period from network traffic data of the target client;
generating data to be identified based on an Application Program Interface (API) access record of the target client in the preset time period, wherein the data to be identified comprises an API access two-dimensional graph of the target client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensionality;
inputting the data to be identified into a web crawler identification model to obtain a web crawler identification result of the target client, wherein the web crawler identification model is trained on sample data and the web crawler classification labels of the sample data, and the sample data comprises an API access two-dimensional graph of a sample client in the preset time period, with time and API access amount as dimensions;
and executing, on the target client, a privacy data protection measure matched with the web crawler identification result.
In a sixth aspect, an apparatus for training a web crawler recognition model is provided, including:
the record extraction module is used for extracting an Application Program Interface (API) access record of a sample client in a preset time period from network flow data of the sample client;
the image generation module generates sample data based on an Application Program Interface (API) access record of the sample client in the preset time period, wherein the sample data comprises an API access two-dimensional graph of the sample client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensionality;
and the model training module is used for training the web crawler recognition model based on the sample data and the web crawler classification marks labeled for the sample data in advance.
In a seventh aspect, an electronic device is provided that includes: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program being executed by the processor to:
extracting an Application Program Interface (API) access record of a sample client in a preset time period from network flow data of the sample client;
generating sample data based on an Application Program Interface (API) access record of the sample client in the preset time period, wherein the sample data comprises an API access two-dimensional graph of the sample client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensions;
and training a web crawler recognition model based on the sample data and the web crawler classification marks labeled for the sample data in advance.
In an eighth aspect, a computer readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
extracting an Application Program Interface (API) access record of a sample client in a preset time period from network flow data of the sample client;
generating sample data based on an Application Program Interface (API) access record of the sample client in the preset time period, wherein the sample data comprises an API access two-dimensional graph of the sample client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensions;
and training a web crawler recognition model based on the sample data and the web crawler classification marks labeled for the sample data in advance.
According to the technical solution of the embodiments of this specification, API access records are extracted from network traffic data and represented as a two-dimensional image of time and API access amount. The web crawler recognition model is trained on the API access two-dimensional images of the sample data, so that it learns the rhythm sequence with which a web crawler initiates API access. When it is necessary to determine whether a target client is a web crawler, the network traffic data of the target client can be converted into an API access two-dimensional image and input into the web crawler recognition model for identification, and a matching privacy data protection measure is then executed on the target client according to the web crawler recognition result, which can effectively prevent the web crawler from stealing private data. In addition, because the training process does not require feature extraction on the sample clients, the loss of feature information is reduced, training efficiency is improved, and the trained web crawler recognition model achieves higher accuracy and recall.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the embodiments of the present specification, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative efforts.
Fig. 1 is a flowchart illustrating a method for preventing a web crawler from stealing private data according to an embodiment of the present disclosure.
Fig. 2 is a schematic flowchart of a training method for a web crawler recognition model provided in an embodiment of the present specification.
Fig. 3 is a schematic structural diagram of an apparatus for preventing a web crawler from stealing private data according to an embodiment of the present disclosure.
Fig. 4 is a schematic structural diagram of a training apparatus for a web crawler recognition model provided in an embodiment of this specification.
Fig. 5 is a schematic view of an electronic device provided in an embodiment of the present specification.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step should fall within the scope of protection of the present specification.
At present, many data companies use web crawlers to steal users' private data. Even when this process is authorized by the user (in many cases unknowingly), it still involves excessive acquisition. Data crawling companies steal users' sensitive information in order to exploit it, which easily leads to large-scale data leakage and seriously harms user privacy and security.
Against this background, this document aims to provide a technical solution for identifying a web crawler based on a deep learning model and preventing the web crawler from stealing private data.
FIG. 1 is a flow chart of a method for preventing a web crawler from stealing private data according to an embodiment of the present disclosure. The method shown in fig. 1 may be performed by a corresponding apparatus, comprising:
Step S102, extracting an Application Program Interface (API) access record of the target client within a preset time period from the network traffic data of the target client.
Specifically, in this step, an API access record of the target client within a preset time period may be obtained from a network traffic log of the target client, where the API access record may include, but is not limited to, an API that the target client accesses each time within the preset time period and a corresponding time.
Step S104, generating data to be identified based on the API access record of the target client in the preset time period, wherein the data to be identified comprises an API access two-dimensional graph of the target client in the preset time period, with time and API access amount as dimensions.
It should be understood that the API access two-dimensional graph represents the recording of API access by the target client within a preset time period in two dimensions of time and API access amount, and therefore may represent the rhythm of API access initiated by the target client.
Step S106, inputting the data to be recognized into a web crawler recognition model to obtain a web crawler recognition result of the target client, wherein the web crawler recognition model is trained on sample data and the web crawler classification labels of the sample data, and the sample data comprises an API access two-dimensional graph of a sample client in the preset time period, with time and API access amount as dimensions.
It should be understood that a web crawler steals private data by means of a pre-written automated script, so the rhythm with which it initiates API access follows a certain pattern. By training on sample data, the web crawler recognition model can learn this rhythm and thereby gain the ability to accurately identify web crawlers.
Step S108, executing, on the target client, a privacy data protection measure matched with the web crawler recognition result.
It should be noted that the embodiments of the present specification do not specifically limit the privacy data protection measure. As an exemplary introduction, in this step, if the result of identifying the web crawler indicates that the target client belongs to the web crawler, the target client may be added to a blacklist to prevent the API access, or the target client may be prevented from performing the API access at a specified time interval every day.
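The two exemplary measures named above (blacklisting, or refusing API access during a specified daily time interval) can be sketched as follows. This is a minimal illustration, not the patented implementation: the class `ProtectionPolicy`, its method names, the measure names, and the 00:00 to 06:00 window are all hypothetical choices made for the example.

```python
from datetime import time


class ProtectionPolicy:
    """Hypothetical policy store; the specification does not prescribe one."""

    def __init__(self, blocked_windows):
        # Daily (start, end) windows during which restricted clients are refused.
        self.blocked_windows = list(blocked_windows)
        self.blacklist = set()
        self.restricted = set()

    def apply(self, client_id, is_crawler, measure="blacklist"):
        """Record the measure that matches the recognition result."""
        if not is_crawler:
            return  # clients not recognized as crawlers are left untouched
        if measure == "blacklist":
            self.blacklist.add(client_id)
        elif measure == "time_window":
            self.restricted.add(client_id)
        else:
            raise ValueError(f"unknown measure: {measure}")

    def allows(self, client_id, at):
        """Return True if the client may call the API at time-of-day `at`."""
        if client_id in self.blacklist:
            return False
        if client_id in self.restricted:
            return not any(start <= at < end for start, end in self.blocked_windows)
        return True


policy = ProtectionPolicy(blocked_windows=[(time(0, 0), time(6, 0))])
policy.apply("client-a", is_crawler=True)                         # blacklisted
policy.apply("client-b", is_crawler=True, measure="time_window")  # restricted
policy.apply("client-c", is_crawler=False)                        # untouched
```

Keeping the measure separate from the recognition result reflects the statement that the protection measure is not specifically limited: other measures could be added without changing the recognition step.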
Based on the method shown in FIG. 1, the scheme of the embodiments of this specification extracts API access records from network traffic data and represents them as a two-dimensional image of time and API access amount. The web crawler recognition model is trained on the API access two-dimensional images of the sample data, so that it learns the rhythm sequence with which a web crawler initiates API access. When it is necessary to determine whether a target client is a web crawler, the network traffic data of the target client can be converted into an API access two-dimensional image and input into the web crawler recognition model for identification, and a matching privacy data protection measure is then executed on the target client according to the web crawler recognition result, which can effectively prevent the web crawler from stealing private data.
Corresponding to the above method for preventing a web crawler from stealing private data, an embodiment of the present specification further provides a training method for a web crawler recognition model. FIG. 2 is a flowchart of the training method for a web crawler recognition model according to an embodiment of the present specification. The method shown in FIG. 2 may be performed by a corresponding apparatus and comprises:
Step S202, extracting an Application Program Interface (API) access record of the sample client in a preset time period from the network traffic data of the sample client.
In this specification embodiment, the sample client may include two classifications of white samples and black samples. The black sample refers to the client determined as the web crawler, and the white sample refers to the client determined as the non-web crawler, that is, the client of the normal user. The black samples and the white samples are distinguished through the labeled web crawler classification marks.
Specifically, in this step, an API access record of the sample client within a preset time period may be obtained from a network traffic log of the sample client, where the API access record may include, but is not limited to, an API that the sample client accesses each time within the preset time period and a corresponding time.
Step S204, generating sample data based on the API access record of the sample client in the preset time period, wherein the sample data comprises an API access two-dimensional graph of the sample client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensions.
As described above, the API access two-dimensional graph represents the recording of API access performed by the sample client within the preset time period in two dimensions, i.e., time and API access amount, so that the rhythm of API access initiated by the sample client can be presented.
Step S206, training the web crawler recognition model based on the sample data and the web crawler classification labels annotated for the sample data in advance.
Specifically, in this step, the web crawler recognition model may be trained with the sample data as input and the web crawler classification label of the sample data as the expected output. During training, the web crawler recognition model outputs a training result, namely its prediction of whether the sample data is a white sample or a black sample, which may deviate from the ground truth indicated by the web crawler classification label. In this step, a loss function is derived through maximum likelihood estimation, the loss between the training result and the web crawler classification label is calculated, and the parameters of the web crawler recognition model are optimized with the aim of reducing the loss.
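The specification does not state the loss function explicitly. For a two-class (black or white) label, one standard loss obtained from maximum likelihood estimation is the binary cross-entropy, shown below as an illustration. The notation is an assumption introduced here: x_i is the API access two-dimensional graph of the i-th sample, y_i is its label (1 for a black sample, 0 for a white sample), and p_θ(x_i) is the probability the model assigns to "web crawler".

```latex
% Likelihood of the labels under the model (assumed Bernoulli output):
\mathcal{L}(\theta) = \prod_{i=1}^{N} p_\theta(x_i)^{y_i}\,\bigl(1 - p_\theta(x_i)\bigr)^{1 - y_i}

% Maximizing \mathcal{L} is equivalent to minimizing the averaged negative log-likelihood:
J(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\Bigl[\, y_i \log p_\theta(x_i) + (1 - y_i)\log\bigl(1 - p_\theta(x_i)\bigr) \Bigr]
```

Reducing J(θ) by gradient-based optimization corresponds to the parameter optimization described in this step.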
Based on the training method shown in FIG. 2, the scheme of the embodiments of this specification extracts API access records from the network traffic data of sample clients and represents them in the two dimensions of time and API access amount to obtain an API access two-dimensional image for each sample client. The web crawler recognition model is then trained on these images so that it learns the rhythm sequence with which web crawlers initiate API access and thereby acquires web crawler recognition capability. Throughout the process, no feature extraction needs to be performed on the sample clients, which reduces the loss of feature information and improves training efficiency. The trained web crawler recognition model also achieves higher accuracy and recall.
The method for preventing the web crawler from stealing the private data and the training method of the web crawler recognition model in the embodiments of the present specification are described in detail below with reference to an actual application scenario.
In this application scenario, a ResNet model serves as the web crawler recognition model. ResNet models have strong image classification capability; training a ResNet model on the API access two-dimensional images of black and white samples enables it to distinguish web crawler clients from normal user clients by the rhythm of API access over time.
Assume that, in this application scenario, an Internet company extracts a batch of network traffic data every day to identify web crawlers. One day may then be used as the preset time period (12 hours or 1 hour are also possible; this is not specifically limited herein), and the time axis of the API access two-dimensional graph is configured to consist of 1440 unit times, each 1 minute long (24 hours per day at 60 minutes per hour gives 1440 minutes in total).
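The arithmetic behind the time axis can be checked with a short sketch. The helper name `num_unit_times` is hypothetical; it simply divides the preset time period by the unit time and rejects periods that are not a whole multiple of it.

```python
from datetime import timedelta


def num_unit_times(period, unit=timedelta(minutes=1)):
    """Number of unit times on the time axis for a given preset time period."""
    bins, remainder = divmod(period, unit)
    if remainder:
        raise ValueError("period must be a whole multiple of the unit time")
    return bins


one_day = num_unit_times(timedelta(days=1))       # 24 * 60 unit times
half_day = num_unit_times(timedelta(hours=12))
one_hour = num_unit_times(timedelta(hours=1))
```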
First, the Internet company randomly selects a certain number of clients already determined to be web crawlers and clients of normal users as sample clients. To obtain a better training effect, the ratio of the number of black samples to the number of white samples is preferably between 1:20 and 1:9.
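A quick way to check a candidate sample set against this preferred ratio is sketched below. The function name `ratio_in_range` and its parameters are hypothetical; integer comparisons are used so that the boundary cases 1:9 and 1:20 are handled exactly.

```python
def ratio_in_range(black, white, low=9, high=20):
    """True if black:white lies between 1:high and 1:low, both inclusive."""
    if black <= 0 or white <= 0:
        return False
    # black/white >= 1/high  <=>  white <= high * black
    # black/white <= 1/low   <=>  white >= low * black
    return low * black <= white <= high * black


balanced = ratio_in_range(100, 900)       # exactly 1:9
sparse = ratio_in_range(100, 2000)        # exactly 1:20
too_many_black = ratio_in_range(100, 500)
too_few_black = ratio_in_range(100, 2500)
```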
Then, the Internet company extracts, from the network traffic logs of the sample clients, the APIs requested by each sample client within 1 day and the corresponding request times, and converts the extracted records according to the configuration of the API access two-dimensional graph. The API access two-dimensional graph obtained for each sample client serves as sample data.
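The extraction step can be sketched as follows. The specification does not define a log format, so the three-field line layout used here (ISO timestamp, client identifier, API path), the function name `extract_api_records`, and the sample log entries are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def extract_api_records(log_lines, client_id, start, period=timedelta(days=1)):
    """Return (timestamp, api) pairs for one client inside [start, start + period)."""
    end = start + period
    records = []
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        ts_text, cid, api = parts
        if cid != client_id:
            continue
        ts = datetime.fromisoformat(ts_text)
        if start <= ts < end:
            records.append((ts, api))
    return records


# Hypothetical traffic log: "<ISO timestamp> <client id> <API path>"
log = [
    "2019-12-13T00:00:05 client-a /api/profile",
    "2019-12-13T00:00:40 client-a /api/orders",
    "2019-12-13T00:01:10 client-b /api/profile",
    "2019-12-14T00:00:01 client-a /api/profile",
]
records = extract_api_records(log, "client-a", datetime(2019, 12, 13))
```

Only the two `client-a` requests that fall on 2019-12-13 are kept; the request from `client-b` and the request on the following day fall outside the selection.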
In the application scenario, the Internet company selects 10% of the sample data as a test set and the remaining 90% as a training set. The Internet company trains the ResNet model on the training set over repeated iterations so that the model acquires web crawler recognition capability. After training is completed, the Internet company tests the ResNet model on the test set. If the test requirements are met, the ResNet model is put into use, that is, deployed online at the Internet company. If the test requirements are not met, the sample data is rebuilt (for example, by selecting a new batch of sample clients), and the ResNet model is trained and tested again until it can be put into use.
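The 10%/90% split can be sketched with the standard library alone. The function name `split_samples`, the fixed seed, and the placeholder sample tuples are assumptions; the specification does not say how the test set is drawn, and a random shuffle is only one reasonable choice.

```python
import random


def split_samples(samples, test_fraction=0.1, seed=0):
    """Shuffle the samples and return (training set, test set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder sample data: (graph identifier, label), where 1 marks a black sample.
samples = [(f"graph-{i}", 1 if i % 10 == 0 else 0) for i in range(200)]
train_set, test_set = split_samples(samples)
```

With black samples in the minority, a stratified split that preserves the black-to-white ratio in both sets may be preferable; that refinement is not described in the source and is mentioned only as a design consideration.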
After the ResNet model is in use, the Internet company may review the daily network traffic data to identify web crawlers. Assuming that an internet company needs to perform web crawler identification on a target client performing API access on the same day, an API access record of the target client may be extracted from the network traffic data of the same day, and the API access record of the target client is converted into an API access two-dimensional graph. And then, the Internet company inputs the API access two-dimensional graph of the target client into a ResNet model, and the ResNet model judges whether the target client is a web crawler.
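The conversion of an API access record into an API access two-dimensional graph can be sketched as below. The specification only says that the graph has time and API access amount as its dimensions; the rasterization used here (one column per minute, a filled bar whose height is the clipped access count) and the names `access_counts` and `to_image` are assumptions for illustration.

```python
from datetime import datetime, timedelta


def access_counts(timestamps, start, bins=1440, bin_seconds=60):
    """Count API accesses per unit time along the time axis."""
    counts = [0] * bins
    for ts in timestamps:
        idx = int((ts - start).total_seconds() // bin_seconds)
        if 0 <= idx < bins:
            counts[idx] += 1
    return counts


def to_image(counts, height=64):
    """Rasterize counts into a height x len(counts) grid of 0/1 pixels.

    Row 0 is the top of the image; each column holds a bar whose height is
    the access count for that unit time, clipped to `height`.
    """
    image = [[0] * len(counts) for _ in range(height)]
    for col, count in enumerate(counts):
        for k in range(min(count, height)):
            image[height - 1 - k][col] = 1
    return image


start = datetime(2019, 12, 13)
timestamps = [start + timedelta(seconds=s) for s in (5, 40, 120)]
counts = access_counts(timestamps, start)
image = to_image(counts, height=4)
```

The resulting grid can be fed to an image classifier in place of hand-crafted features, which is the property the scheme relies on.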
If the ResNet model determines that the target client is a web crawler, the Internet company can add the target client to a blacklist to prevent it from initiating API access to the Internet company's website, thereby preventing it from stealing the private data of the Internet company's users.
The above is a description of the method of the embodiments of the present specification. It will be appreciated that appropriate modifications may be made without departing from the principles outlined herein, and such modifications are intended to be included within the scope of the embodiments herein. For example, the web crawler recognition model in the embodiments of the present specification is not limited to the ResNet model; any convolutional neural network model suitable for image classification may be used. In addition, the API access two-dimensional graph may also be adjusted appropriately; for example, the time axis may be divided unevenly according to the access peak period.
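An uneven division of the time axis could look like the sketch below, which uses fine bins inside an access peak period and coarse bins elsewhere. The peak window (09:00 to 11:00), the bin widths (1 minute and 10 minutes), and the function names are hypothetical; the specification only states that uneven division is possible.

```python
from bisect import bisect_right


def build_edges(peak_start, peak_end, fine=1, coarse=10, total=1440):
    """Left edges (in minutes since midnight) of each bin on the time axis."""
    edges = list(range(0, peak_start, coarse))        # coarse bins before the peak
    edges += list(range(peak_start, peak_end, fine))  # fine bins inside the peak
    edges += list(range(peak_end, total, coarse))     # coarse bins after the peak
    return edges


def bin_index(minute, edges):
    """Index of the bin that contains the given minute of the day."""
    return bisect_right(edges, minute) - 1


# Assumed peak period: 09:00 (minute 540) to 11:00 (minute 660).
edges = build_edges(peak_start=540, peak_end=660)
```

With these settings the time axis shrinks from 1440 columns to 252 while keeping minute-level resolution where traffic is densest.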
In correspondence with the above method for preventing a web crawler from stealing private data, as shown in fig. 3, an embodiment of the present specification further provides an apparatus 300 for preventing a web crawler from stealing private data, including:
the record extraction module 310 is configured to extract an API access record of a target client within a preset time period from network traffic data of the target client;
the image generation module 320 is configured to generate data to be identified based on the Application Program Interface (API) access record of the target client within the preset time period, where the data to be identified includes an API access two-dimensional graph of the target client within the preset time period, with time and API access amount as its dimensions;
the crawler recognition module 330 is configured to input the data to be identified into a web crawler recognition model to obtain a web crawler recognition result of the target client, where the web crawler recognition model is trained on sample data and web crawler classification labels of the sample data, and the sample data includes an API access two-dimensional graph of a sample client within the preset time period, with time and API access amount as its dimensions;
and the data protection module 340 is configured to execute, on the target client, privacy data protection measures matched with the web crawler recognition result.
Based on the apparatus shown in fig. 3, in the scheme of the embodiment of the present specification, API access records are extracted from network traffic data and represented as a two-dimensional image of time and API access amount. The web crawler recognition model is trained on the API access two-dimensional images of the sample data, so that it learns the rhythm sequence with which web crawlers initiate API access. When it needs to be determined whether a target client is a web crawler, the network traffic data of the target client can be converted into an API access two-dimensional image and input into the web crawler recognition model, and privacy data protection measures matched with the web crawler recognition result are then executed on the target client, which effectively prevents web crawlers from stealing private data.
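The record extraction performed by module 310 can be illustrated with a short sketch. The specification does not define the format of the network traffic data, so the `TrafficEntry` fields, the `/api/` path prefix used to recognise API accesses, and the function name are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class TrafficEntry:
    """One line of network traffic data (hypothetical schema)."""
    client_id: str
    timestamp: datetime
    path: str


def extract_api_records(
    traffic: Iterable[TrafficEntry],
    client_id: str,
    window_start: datetime,
    window: timedelta,
    api_prefix: str = "/api/",  # assumed convention for API endpoints
) -> List[datetime]:
    """Return the sorted API access times of one client within the preset time period."""
    window_end = window_start + window
    return sorted(
        entry.timestamp
        for entry in traffic
        if entry.client_id == client_id
        and entry.path.startswith(api_prefix)
        and window_start <= entry.timestamp < window_end
    )
```

Passing `timedelta(days=1)`, `timedelta(hours=12)` or `timedelta(hours=1)` as `window` corresponds to the preset time periods mentioned in the specification.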
Optionally, when executing, the data protection module 340 adds the target client to a blacklist to block its API access, or blocks the target client from API access during a specified time period every day, if the web crawler recognition result indicates that the target client is a web crawler.
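The two protection measures named above, a permanent blacklist or a daily blocked time period, can be sketched as a small policy object. The class name `AccessPolicy`, its methods, and the handling of periods that wrap past midnight are illustrative assumptions rather than details taken from the specification.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Dict, Optional, Set, Tuple


@dataclass
class AccessPolicy:
    """Tracks which clients are blacklisted or blocked during a daily period."""
    blacklist: Set[str] = field(default_factory=set)
    daily_blocks: Dict[str, Tuple[time, time]] = field(default_factory=dict)

    def apply(
        self,
        client_id: str,
        is_crawler: bool,
        daily_window: Optional[Tuple[time, time]] = None,
    ) -> None:
        """Record the measure matched with the recognition result."""
        if not is_crawler:
            return  # no measure for clients not recognised as crawlers
        if daily_window is None:
            self.blacklist.add(client_id)
        else:
            self.daily_blocks[client_id] = daily_window

    def allows(self, client_id: str, at: time) -> bool:
        """Return True if the client may make an API call at the given time of day."""
        if client_id in self.blacklist:
            return False
        window = self.daily_blocks.get(client_id)
        if window is None:
            return True
        start, end = window
        if start <= end:
            blocked = start <= at < end
        else:  # the period wraps past midnight, e.g. 22:00 to 06:00
            blocked = at >= start or at < end
        return not blocked
```

A gateway in front of the API would call `allows` for every incoming request; how the client is identified (IP address, account, device fingerprint) is left open here, as it is in the text.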
Obviously, the apparatus of the embodiment of the present specification may be an execution subject of the method for preventing a web crawler from stealing private data shown in fig. 1, and thus can implement the functions of the method implemented in fig. 1. Since the principle is the same, the detailed description is omitted here.
Corresponding to the above training method for the web crawler recognition model, as shown in fig. 4, an embodiment of the present specification further provides a training apparatus 400 for a web crawler recognition model, including:
the record extraction module 410 is used for extracting an Application Program Interface (API) access record of a sample client within a preset time period from network traffic data of the sample client;
the image generation module 420 is configured to generate sample data based on an API access record of the sample client within the preset time period, where the sample data includes an API access two-dimensional graph of the sample client within the preset time period, where time and an API access amount are dimensions;
and the model training module 430 trains the web crawler recognition model based on the sample data and the web crawler classification marks labeled for the sample data in advance.
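The shape of the training step, labelled two-dimensional graphs in and a fitted classifier out, can be shown with a dependency-free sketch. Note that a nearest-centroid classifier is used below purely as a stand-in so that the example runs with the standard library; it is not the convolutional neural network the specification describes, and the label convention (1 = web crawler, 0 = normal client) is an assumption.

```python
from typing import Dict, List, Sequence, Tuple

Image = List[List[int]]  # API access two-dimensional graph as a pixel matrix


def flatten(image: Image) -> List[float]:
    """Turn the two-dimensional graph into a flat feature vector of pixel values."""
    return [float(pixel) for row in image for pixel in row]


def train_centroids(samples: Sequence[Tuple[Image, int]]) -> Dict[int, List[float]]:
    """Average the pixel vectors of each class (stand-in for CNN training)."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for image, label in samples:
        vector = flatten(image)
        total = sums.setdefault(label, [0.0] * len(vector))
        for index, value in enumerate(vector):
            total[index] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}


def predict(centroids: Dict[int, List[float]], image: Image) -> int:
    """Return the label whose centroid is closest to the image (squared distance)."""
    vector = flatten(image)

    def distance(centroid: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))

    return min(centroids, key=lambda label: distance(centroids[label]))
```

In a real deployment the `(image, label)` pairs would instead be fed to an image classification network such as ResNet; the data preparation around it would stay the same.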
Based on the training apparatus shown in fig. 4, in the scheme of the embodiment of the present specification, API access records are extracted from the network traffic data of a sample client and represented in two dimensions, time and API access amount, to obtain an API access two-dimensional image of the sample client. The web crawler recognition model is trained on this API access two-dimensional image, so that it learns the rhythm sequence with which web crawlers initiate API access and thereby acquires the ability to recognize web crawlers. No manual feature extraction on the sample client is needed at any point in the process, which reduces the loss of feature information and improves training efficiency. At the same time, the trained web crawler recognition model achieves higher accuracy and recall.
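The accuracy and recall mentioned above can be measured on a held-out set of labelled sample clients. The helper below uses the standard definitions (accuracy = correct predictions / all predictions, recall = detected crawlers / actual crawlers); the function name and the label convention (1 = web crawler, 0 = normal client) are assumptions for illustration.

```python
from typing import Sequence, Tuple


def accuracy_and_recall(labels: Sequence[int], predictions: Sequence[int]) -> Tuple[float, float]:
    """Compute accuracy and crawler-class recall for binary labels."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and of equal length")
    pairs = list(zip(labels, predictions))
    correct = sum(1 for truth, guess in pairs if truth == guess)
    true_positive = sum(1 for truth, guess in pairs if truth == 1 and guess == 1)
    false_negative = sum(1 for truth, guess in pairs if truth == 1 and guess == 0)
    accuracy = correct / len(pairs)
    actual_crawlers = true_positive + false_negative
    # Recall is undefined when no crawler is present; report 0.0 in that case.
    recall = true_positive / actual_crawlers if actual_crawlers else 0.0
    return accuracy, recall
```

Recall matters here because a missed crawler keeps its access to private data, whereas accuracy alone can look high when crawlers are rare in the traffic.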
Optionally, the preset time period is one of 1 day, 12 hours, and 1 hour.
Optionally, the web crawler recognition model is a convolutional neural network model, such as a residual network (ResNet) model.
Obviously, the training device in the embodiment of the present specification may be an execution subject of the training method of the web crawler recognition model shown in fig. 2, and thus the functions of the training method realized in fig. 2 can be realized. Since the principle is the same, the detailed description is omitted here.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present specification. Referring to fig. 5, at the hardware level, the electronic device includes a processor, and optionally further includes an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be connected to each other via an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 5, but this does not indicate only one bus or one type of bus.
The memory is used for storing programs. Specifically, a program may include program code, and the program code includes computer operating instructions. The memory may include an internal memory and a non-volatile memory, and provides instructions and data to the processor.
Optionally, the processor reads a corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming, at the logic level, an apparatus for preventing a web crawler from stealing private data. The processor executes the program stored in the memory and is specifically configured to perform the following operations:
extracting an Application Program Interface (API) access record of a target client within a preset time period from network traffic data of the target client;
generating data to be identified based on the API access record of the target client within the preset time period, wherein the data to be identified comprises an API access two-dimensional graph of the target client within the preset time period, with time and API access amount as its dimensions;
inputting the data to be identified into a web crawler recognition model to obtain a web crawler recognition result of the target client, wherein the web crawler recognition model is trained on sample data and web crawler classification labels of the sample data, and the sample data comprises an API access two-dimensional graph of a sample client within the preset time period, with time and API access amount as its dimensions;
and executing, on the target client, a privacy data protection measure matched with the web crawler recognition result.
Alternatively, the processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming, at the logic level, a training apparatus for the web crawler recognition model. The processor executes the program stored in the memory and is specifically configured to perform the following operations:
extracting an Application Program Interface (API) access record of a sample client in a preset time period from network flow data of the sample client;
generating sample data based on an Application Program Interface (API) access record of the sample client in the preset time period, wherein the sample data comprises an API access two-dimensional graph of the sample client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensions;
and training a web crawler recognition model based on the sample data and the web crawler classification marks labeled for the sample data in advance.
The method for preventing a web crawler from stealing private data disclosed in the embodiment shown in fig. 1 of the present specification, or the training method for the web crawler recognition model disclosed in the embodiment shown in fig. 2, can be applied to a processor or implemented by the processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above methods may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Such a processor can implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present specification. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present specification may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
It should be understood that the electronic device of the embodiments of the present specification may implement the functions of the above-mentioned apparatus for preventing a web crawler from stealing private data in the embodiment shown in fig. 1, or implement the functions of the above-mentioned training apparatus for a web crawler recognition model in the embodiment shown in fig. 2. Since the principle is the same, the detailed description is omitted here.
Of course, besides a software implementation, the electronic device in this specification does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or logic devices.
Furthermore, the present specification embodiments also propose a computer-readable storage medium storing one or more programs, the one or more programs including instructions.
Optionally, the above instructions, when executed by a portable electronic device including a plurality of application programs, can cause the portable electronic device to perform the method of the embodiment shown in fig. 1, specifically the following steps:
extracting an Application Program Interface (API) access record of a target client within a preset time period from network traffic data of the target client;
generating data to be identified based on an Application Program Interface (API) access record of the target client in the preset time period, wherein the data to be identified comprises an API access two-dimensional graph of the target client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensionality;
inputting the data to be identified into a web crawler recognition model to obtain a web crawler recognition result of the target client, wherein the web crawler recognition model is trained on sample data and web crawler classification labels of the sample data, and the sample data comprises an API access two-dimensional graph of a sample client within the preset time period, with time and API access amount as its dimensions;
and executing, on the target client, a privacy data protection measure matched with the web crawler recognition result.
Alternatively, the above instructions, when executed by a portable electronic device including a plurality of application programs, can cause the portable electronic device to perform the method of the embodiment shown in fig. 2, specifically the following steps:
extracting an Application Program Interface (API) access record of a sample client within a preset time period from network traffic data of the sample client;
generating sample data based on the API access record of the sample client within the preset time period, wherein the sample data comprises an API access two-dimensional graph of the sample client within the preset time period, with time and API access amount as its dimensions;
and training a web crawler recognition model based on the sample data and the web crawler classification labels annotated for the sample data in advance.
It should be appreciated that the above-described instructions, when executed by a portable electronic device comprising a plurality of applications, can cause the apparatus for preventing a web crawler from stealing private data described above to implement the functions of the embodiment shown in fig. 1, or cause the apparatus for training a web crawler recognition model described above to implement the functions of the embodiment shown in fig. 2. Since the principle is the same, the detailed description is omitted here.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification. Moreover, all other embodiments obtained by a person skilled in the art without making any inventive step shall fall within the scope of protection of this document.

Claims (12)

1. A method of preventing privacy data theft by a web crawler, comprising:
extracting an Application Program Interface (API) access record of a target client within a preset time period from network traffic data of the target client;
generating data to be identified based on an Application Program Interface (API) access record of the target client in the preset time period, wherein the data to be identified comprises an API access two-dimensional graph of the target client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensionality;
inputting the data to be identified into a web crawler identification model to obtain a web crawler identification result of the target client, wherein the web crawler identification model is trained on sample data and web crawler classification labels of the sample data, and the sample data comprises an API access two-dimensional graph of a sample client within the preset time period, with time and API access amount as its dimensions;
and executing, on the target client, a privacy data protection measure matched with the web crawler identification result.
2. The method of claim 1, wherein
executing, on the target client, the privacy data protection measure matched with the web crawler identification result comprises:
if the web crawler identification result indicates that the target client is a web crawler, adding the target client to a blacklist to block its API access, or blocking the target client from API access during a specified time period every day.
3. A training method of a web crawler recognition model comprises the following steps:
extracting an Application Program Interface (API) access record of a sample client in a preset time period from network flow data of the sample client;
generating sample data based on an Application Program Interface (API) access record of the sample client in the preset time period, wherein the sample data comprises an API access two-dimensional graph of the sample client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensions;
training a web crawler recognition model based on the sample data and the web crawler classification marks labeled for the sample data in advance, wherein the web crawler recognition model is used for web crawler recognition, and the output web crawler recognition result is used for making a decision on privacy data protection measures.
4. The method of claim 3, wherein
the preset time period is one of 1 day, 12 hours and 1 hour.
5. The method of claim 3, wherein
the web crawler recognition model is a convolutional neural network model.
6. The method of claim 5, wherein
the web crawler recognition model comprises a residual network (ResNet) model.
7. An apparatus for preventing a web crawler from stealing private data, comprising:
the record extraction module is used for extracting an Application Program Interface (API) access record of a target client within a preset time period from network flow data of the target client;
the image generation module is used for generating data to be identified based on an Application Program Interface (API) access record of the target client in the preset time period, wherein the data to be identified comprises an API access two-dimensional graph of the target client in the preset time period by taking time and API access amount as dimensionality;
the crawler identification module is used for inputting the data to be identified into a web crawler identification model to obtain a web crawler identification result of the target client, wherein the web crawler identification model is trained on sample data and web crawler classification labels of the sample data, and the sample data comprises an API access two-dimensional graph of a sample client within the preset time period, with time and API access amount as its dimensions;
and the data protection module is used for executing, on the target client, privacy data protection measures matched with the web crawler identification result.
8. An electronic device includes: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program being executed by the processor to:
extracting an Application Program Interface (API) access record of a target client within a preset time period from network traffic data of the target client;
generating data to be identified based on an Application Program Interface (API) access record of the target client in the preset time period, wherein the data to be identified comprises an API access two-dimensional graph of the target client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensionality;
inputting the data to be identified into a web crawler identification model to obtain a web crawler identification result of the target client, wherein the web crawler identification model is trained on sample data and web crawler classification labels of the sample data, and the sample data comprises an API access two-dimensional graph of a sample client within the preset time period, with time and API access amount as its dimensions;
and executing, on the target client, a privacy data protection measure matched with the web crawler identification result.
9. A computer-readable storage medium having a computer program stored thereon, which when executed by a processor, performs the steps of:
extracting an Application Program Interface (API) access record of a target client within a preset time period from network traffic data of the target client;
generating data to be identified based on an Application Program Interface (API) access record of the target client in the preset time period, wherein the data to be identified comprises an API access two-dimensional graph of the target client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensionality;
inputting the data to be identified into a web crawler identification model to obtain a web crawler identification result of the target client, wherein the web crawler identification model is trained on sample data and web crawler classification labels of the sample data, and the sample data comprises an API access two-dimensional graph of a sample client within the preset time period, with time and API access amount as its dimensions;
and executing, on the target client, a privacy data protection measure matched with the web crawler identification result.
10. A training device of a web crawler recognition model comprises:
the record extraction module is used for extracting an Application Program Interface (API) access record of a sample client in a preset time period from network flow data of the sample client;
the image generation module generates sample data based on an Application Program Interface (API) access record of the sample client in the preset time period, wherein the sample data comprises an API access two-dimensional graph of the sample client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensionality;
and the model training module is used for training a web crawler recognition model based on the sample data and the web crawler classification mark labeled for the sample data in advance, wherein the web crawler recognition model is used for web crawler recognition, and the output web crawler recognition result is used for making a decision on privacy data protection measures.
11. An electronic device includes: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program being executed by the processor to:
extracting an Application Program Interface (API) access record of a sample client in a preset time period from network flow data of the sample client;
generating sample data based on an Application Program Interface (API) access record of the sample client in the preset time period, wherein the sample data comprises an API access two-dimensional graph of the sample client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensions;
training a web crawler recognition model based on the sample data and the web crawler classification marks labeled for the sample data in advance, wherein the web crawler recognition model is used for web crawler recognition, and the output web crawler recognition result is used for making a decision on privacy data protection measures.
12. A computer-readable storage medium having a computer program stored thereon, which when executed by a processor, performs the steps of:
extracting an Application Program Interface (API) access record of a sample client in a preset time period from network flow data of the sample client;
generating sample data based on an Application Program Interface (API) access record of the sample client in the preset time period, wherein the sample data comprises an API access two-dimensional graph of the sample client in the preset time period, and the API access two-dimensional graph takes time and API access amount as dimensions;
training a web crawler recognition model based on the sample data and the web crawler classification marks labeled for the sample data in advance, wherein the web crawler recognition model is used for web crawler recognition, and the output web crawler recognition result is used for making a decision on privacy data protection measures.
CN201911284559.0A 2019-12-13 2019-12-13 A method, training method and device for preventing web crawler from stealing private data Active CN111107074B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911284559.0A CN111107074B (en) 2019-12-13 2019-12-13 A method, training method and device for preventing web crawler from stealing private data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911284559.0A CN111107074B (en) 2019-12-13 2019-12-13 A method, training method and device for preventing web crawler from stealing private data

Publications (2)

Publication Number Publication Date
CN111107074A CN111107074A (en) 2020-05-05
CN111107074B true CN111107074B (en) 2022-04-08

Family

ID=70421905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911284559.0A Active CN111107074B (en) 2019-12-13 2019-12-13 A method, training method and device for preventing web crawler from stealing private data

Country Status (1)

Country Link
CN (1) CN111107074B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100606B (en) * 2020-09-28 2021-12-17 武汉厚溥数字科技有限公司 Online education processing method based on cloud big data calculation and online education platform
CN113987309B (en) * 2021-12-29 2022-03-11 深圳红途科技有限公司 Personal privacy data identification method and device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9489401B1 (en) * 2015-06-16 2016-11-08 My EyeSpy PTY Ltd. Methods and systems for object recognition
CN107943949A (en) * 2017-11-24 2018-04-20 厦门集微科技有限公司 A kind of method and server of definite web crawlers
CN108429721A (en) * 2017-02-15 2018-08-21 腾讯科技(深圳)有限公司 A kind of recognition methods of web crawlers and device
CN110245280A (en) * 2019-05-06 2019-09-17 北京三快在线科技有限公司 Identify method, apparatus, storage medium and the electronic equipment of web crawlers
CN110535777A (en) * 2019-08-12 2019-12-03 新华三大数据技术有限公司 Access request control method, device, electronic equipment and readable storage medium storing program for executing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
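Research on anti-scraping technology for web page information based on canvas drawing; Chen Liqing (陈丽卿); Network Security Technology & Application (《网络安全技术与应用》); 20180915 (No. 09); full text *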

Also Published As

Publication number Publication date
CN111107074A (en) 2020-05-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 310000 Zhejiang Province, Hangzhou City, Xihu District, Xixi Road 543-569 (continuous odd numbers) Building 1, Building 2, 5th Floor, Room 518

Patentee after: Alipay (Hangzhou) Digital Service Technology Co.,Ltd.

Country or region after: China

Address before: 310000 801-11 section B, 8th floor, 556 Xixi Road, Xihu District, Hangzhou City, Zhejiang Province

Patentee before: Alipay (Hangzhou) Information Technology Co., Ltd.

Country or region before: China