CN115827947B - Method and device for collecting paging website data and electronic equipment
- Publication number: CN115827947B (CN202310053204.0A)
- Authority: CN (China)
- Prior art keywords: data, paging, text, prediction, preset
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a method, a device and electronic equipment for collecting paging website data. The method comprises: acquiring a data acquisition task and the webpage content of a website to be crawled; preprocessing the webpage content to obtain preprocessed data; inputting the preprocessed data into a pre-trained paging prediction model, and outputting a paging prediction result of the preprocessed data; and determining target data corresponding to the data acquisition task from the webpage content based on the paging prediction result. By predicting the paging condition of the webpage and determining the target data corresponding to the data acquisition task from the webpage content on that basis, the method improves the efficiency and accuracy of collecting paging website data.
Description
Technical Field
The present invention relates to the field of data acquisition technologies, and in particular, to a method, an apparatus, and an electronic device for acquiring paging website data.
Background
The rapid development of information network technology has brought explosive growth in the amount of network information. With so much information available, how to quickly and selectively obtain the network information that users need has become a concern, and this has driven the emergence of search engines such as crawlers.
A search engine is a retrieval technology that, according to user requirements and a certain algorithm, uses a specific strategy to retrieve specified information from the Internet, organizes and processes that information, and feeds it back to the user. A crawler engine is a search engine that automatically browses the web and analyzes web content. During web crawling, websites with paging structures are often encountered. A paging structure typically implements jumps in one of two ways. In the first, the page tags carry links, and the crawler can directly obtain the link in order to scan the next page. In the second, a click event triggers a function that performs the jump; crawlers currently on the market cannot capture this directly, and the usual approach is to analyze the request interface and obtain the content link directly. However, if the paging structure of a website has no link-bearing page tags, the approach of analyzing the request interface has the following drawbacks: first, the content of the catalog page cannot be acquired, so scanned content is lost; second, each website requires its own specific analysis, which is time-consuming and labor-intensive.
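The two jump styles described above can be told apart by inspecting the attributes of each paging control. The following is a minimal sketch using only the Python standard library; the class name `PagerScanner`, the helper `scan_pager`, the sample markup and the `goPage` handler are all hypothetical and are not taken from the patent.

```python
from html.parser import HTMLParser

CONTROL_TAGS = ("a", "button", "span", "li")  # assumed set of tags used for paging controls


class PagerScanner(HTMLParser):
    """Collect paging controls and note how each one triggers a jump."""

    def __init__(self):
        super().__init__()
        self.controls = []  # finished (text, kind, target) tuples
        self._open = None   # [kind, target, text_parts] for the control being read

    def handle_starttag(self, tag, attrs):
        if tag not in CONTROL_TAGS:
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        if tag == "a" and href and not href.startswith(("#", "javascript:")):
            self._open = ["link", href, []]              # style 1: the tag carries a real link
        elif "onclick" in attr:
            self._open = ["click", attr["onclick"], []]  # style 2: a click event triggers the jump

    def handle_data(self, data):
        if self._open is not None:
            self._open[2].append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag in CONTROL_TAGS:
            kind, target, parts = self._open
            self.controls.append(("".join(parts).strip(), kind, target))
            self._open = None


def scan_pager(html):
    scanner = PagerScanner()
    scanner.feed(html)
    return scanner.controls


sample = (
    '<div class="pager">'
    '<a href="/list?page=2">2</a>'
    '<a href="javascript:void(0)" onclick="goPage(3)">3</a>'
    '<span onclick="goPage(2)">next page</span>'
    '</div>'
)
for text, kind, target in scan_pager(sample):
    print(text, kind, target)
```

Controls reported as `link` can be followed directly, while controls reported as `click` expose no URL, which is the case the background identifies as problematic.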
On the whole, existing methods for collecting paging website data suffer from low efficiency and low accuracy.
Disclosure of Invention
The invention aims to provide a method, a device and electronic equipment for collecting paging website data, so as to improve the efficiency and accuracy of collecting the paging website data.
In a first aspect, an embodiment of the present invention provides a method for collecting paging website data, where the method includes: acquiring a data acquisition task and webpage content of a website to be crawled; preprocessing the webpage content to obtain preprocessed data; inputting the preprocessed data into a pre-trained paging prediction model, and outputting a paging prediction result of the preprocessed data; and determining target data corresponding to the data acquisition task from the webpage content based on the paging prediction result.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where the paging prediction model is constructed by: acquiring a preset training data set, where the training data set comprises a preset text and a prediction label of the preset text, and the prediction label comprises the paging condition of the clauses corresponding to the preset text; and training an initial LSTM model by taking the preset text as input and the prediction label of the preset text as output until a preset training end condition is met, so as to obtain a trained paging prediction model.
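The text does not say what the preset training end condition is. As a minimal sketch, assume it is either an epoch cap or a stall in the loss; both thresholds, the function names `train_until_done` and `make_toy_step`, and the one-parameter toy problem that stands in for the LSTM are assumptions made purely for illustration.

```python
def train_until_done(step, max_epochs=100, min_improvement=1e-4):
    """Call step() once per epoch; stop at the epoch cap or when the loss stalls."""
    previous = float("inf")
    history = []
    for _ in range(max_epochs):
        loss = step()
        history.append(loss)
        if previous - loss < min_improvement:  # assumed end condition: loss no longer improving
            break
        previous = loss
    return history


def make_toy_step(learning_rate=0.25):
    """Toy stand-in for one training epoch: gradient descent on (w - 3)^2."""
    state = {"w": 0.0}

    def step():
        gradient = 2.0 * (state["w"] - 3.0)
        state["w"] -= learning_rate * gradient
        return (state["w"] - 3.0) ** 2

    return step, state


step, state = make_toy_step()
history = train_until_done(step)
print(len(history), round(state["w"], 3))
```

In a real setting `step` would run one epoch of LSTM training and return the validation loss; the stopping logic would be unchanged.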
With reference to the first possible implementation manner of the first aspect, the embodiment of the present invention provides a second possible implementation manner of the first aspect, where before the step of obtaining the preset training data set, the method includes: acquiring webpage text data of a preset website; preprocessing the webpage text data to obtain preprocessed text data; and determining the training data set according to the preprocessed text data.
With reference to the second possible implementation manner of the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where the step of preprocessing the web page text data to obtain preprocessed text data includes: performing data filtering on the webpage text data based on preset parameters to obtain first intermediate preprocessing text data; performing sentence processing on the first intermediate preprocessing text data according to the html tag corresponding to the first intermediate preprocessing text data to obtain second intermediate preprocessing text data; and marking the paging condition in the second intermediate preprocessing text data to obtain the preprocessing text data.
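The filtering and tag-based sentence splitting described above might look like the following sketch. It assumes that the "preset parameters" amount to a set of tags whose content is discarded and that block-level html tags mark sentence boundaries; the names `SKIPPED`, `BLOCK_TAGS`, `ClauseSplitter` and `split_clauses` are hypothetical.

```python
from html.parser import HTMLParser

SKIPPED = {"script", "style"}  # assumed filter: content of these tags is dropped
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "h1", "h2", "h3", "br"}  # assumed boundaries


class ClauseSplitter(HTMLParser):
    """Split page text into clauses at block-level tags, skipping filtered tags."""

    def __init__(self):
        super().__init__()
        self.clauses = []
        self._parts = []
        self._skip_depth = 0

    def flush(self):
        # Join the collected text fragments and normalise whitespace.
        text = " ".join(" ".join(self._parts).split())
        if text:
            self.clauses.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)


def split_clauses(html):
    splitter = ClauseSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()  # emit any trailing text after the last block tag
    return splitter.clauses


sample = (
    "<div><p>Annual report 2022</p>"
    "<script>var page = 1;</script>"
    '<div class="pager"><a>previous page</a><a>1</a><a>2</a><a>next page</a></div></div>'
)
print(split_clauses(sample))
```

Because inline tags such as `a` do not end a clause here, the whole paging bar comes out as a single clause, which is the unit the later labeling step works on.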
With reference to the third possible implementation manner of the first aspect, the embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where before the step of labeling the paging condition in the second intermediate preprocessed text data to obtain the preprocessed text data, the method includes: inputting the second intermediate preprocessed text data into a preset dacanno text labeling system; and the step of labeling the paging condition in the second intermediate preprocessed text data to obtain the preprocessed text data includes: labeling, by the dacanno text labeling system, the paging condition in the second intermediate preprocessed text data to obtain the preprocessed text data.
With reference to the first possible implementation manner of the first aspect, the embodiment of the present invention provides a fifth possible implementation manner of the first aspect, wherein the step of training the initial LSTM model with the preset text as an input and the prediction label of the preset text as an output until a preset training end condition is met, and obtaining a trained paging prediction model includes: inputting the preset text into an embedded layer in an initial LSTM model to obtain a plurality of vector sequences; and training the initial LSTM model by taking the plurality of vector sequences as input and the prediction labels of the plurality of vector sequences as output until a preset training ending condition is met, so as to obtain a trained paging prediction model.
With reference to the fifth possible implementation manner of the first aspect, the embodiment of the present invention provides a sixth possible implementation manner of the first aspect, where before the step of training the initial LSTM model with the plurality of vector sequences as input and the prediction labels of the plurality of vector sequences as output until a preset training end condition is met to obtain a trained paging prediction model, the method includes: inputting the plurality of vector sequences into an LSTM unit of the initial LSTM model, and outputting prediction labels corresponding to the plurality of vector sequences; and constructing the initial LSTM model according to the plurality of vector sequences and the prediction labels corresponding to the plurality of vector sequences.
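For reference, the commonly used formulation of a single LSTM unit is given below. The patent does not state which LSTM variant is used, so this standard form is an assumption; $x_t$ denotes the embedded input vector at time step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The hidden state $h_t$ is what a subsequent output layer would map to a prediction label for time step $t$.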
In a second aspect, an embodiment of the present invention provides an apparatus for collecting paging website data, where the apparatus includes: the data acquisition module is used for acquiring a data acquisition task and webpage content of a website to be crawled; the preprocessing module is used for preprocessing the webpage content to obtain preprocessed data; the paging condition prediction module is used for inputting the preprocessing data into a pre-trained paging prediction model and outputting a paging prediction result of the preprocessing data; and the target data determining module is used for determining target data corresponding to the data acquisition task from the webpage content based on the paging prediction result.
In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes a processor and a memory, the memory stores machine executable instructions that are executable by the processor, and the processor executes the machine executable instructions to implement the method for collecting paging website data according to the first aspect or any one of the first to sixth possible implementation manners of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer storage medium, where the computer storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to perform the method for collecting paging website data according to the first aspect or any one of the first to sixth possible implementation manners of the first aspect.
The embodiment of the invention has the following beneficial effects:
The embodiment of the invention provides a method and a device for collecting paging website data and electronic equipment, wherein the method comprises the following steps: acquiring a data acquisition task and the webpage content of a website to be crawled; preprocessing the webpage content to obtain preprocessed data; inputting the preprocessed data into a pre-trained paging prediction model, and outputting a paging prediction result of the preprocessed data; and determining target data corresponding to the data acquisition task from the webpage content based on the paging prediction result. By predicting the paging condition of the webpage and determining the target data corresponding to the data acquisition task from the webpage content on that basis, the method improves the efficiency and accuracy of collecting paging website data.
Additional features and advantages of the present embodiments will be set forth in the description which follows, or in part will be obvious from the description, or may be learned by practice of the techniques of the present disclosure.
The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for collecting paging website data according to an embodiment of the present invention;
FIG. 2 is a flowchart of a training method of a paging prediction model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a device for collecting data of a paged website according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Reference numerals: 31-data acquisition module; 32-preprocessing module; 33-paging situation prediction module; 34-target data determination module; 41-memory; 42-processor; 43-bus; 44-communication interface.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
At present, a search engine uses a specific strategy to retrieve specified information from the Internet according to user requirements and a certain algorithm, organizes and processes that information, and feeds it back to the user. A crawler engine is a search engine that automatically browses the web and analyzes web content. During web crawling, websites with paging structures are often encountered. A paging structure typically implements jumps in one of two ways. In the first, the page tags carry links, and the crawler can directly obtain the link in order to scan the next page. In the second, a click event triggers a function that performs the jump; crawlers currently on the market cannot capture this directly, and the usual approach is to analyze the request interface and obtain the content link directly. However, if the paging structure of a website has no link-bearing page tags, the approach of analyzing the request interface has the following drawbacks: first, the content of the catalog page cannot be acquired, so scanned content is lost; second, each website requires its own specific analysis, which is time-consuming and labor-intensive.
Based on the above, the embodiment of the invention provides a method, a device and an electronic device for collecting paging website data, which can alleviate the technical problems and improve the efficiency and accuracy of collecting paging website data. For the convenience of understanding the embodiments of the present invention, a method for collecting paging website data disclosed in the embodiments of the present invention will be described in detail.
Example 1
Fig. 1 is a flowchart of a method for collecting paging website data according to an embodiment of the present invention. As seen in fig. 1, the method comprises the steps of:
Step S101: acquiring a data acquisition task and the webpage content of the website to be crawled.
In this embodiment, the webpage content includes webpage text data.
Step S102: preprocessing the webpage content to obtain preprocessed data.
In this embodiment, step S102 includes: cleaning the webpage content based on preset rules so as to remove garbled characters and meaningless text data.
Step S103: inputting the preprocessed data into a pre-trained paging prediction model, and outputting a paging prediction result of the preprocessed data.
In this embodiment, the paging prediction model is constructed in advance based on the LSTM algorithm.
Step S104: determining target data corresponding to the data acquisition task from the webpage content based on the paging prediction result.
The embodiment of the invention provides a method for collecting paging website data, which comprises the following steps: acquiring a data acquisition task and the webpage content of a website to be crawled; preprocessing the webpage content to obtain preprocessed data; inputting the preprocessed data into a pre-trained paging prediction model, and outputting a paging prediction result of the preprocessed data; and determining target data corresponding to the data acquisition task from the webpage content based on the paging prediction result. By predicting the paging condition of the webpage and determining the target data corresponding to the data acquisition task from the webpage content on that basis, the method improves the efficiency and accuracy of collecting paging website data.
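Steps S101 to S104 can be sketched end to end as follows. The patent does not spell out how the prediction result is turned into target data, so this sketch assumes that clauses predicted as paging bars are excluded from the result and only signal that further pages exist. The keyword rule in `stub_paging_model` merely stands in for the trained LSTM model, and every name here is hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class PagingPrediction:
    clause: str
    is_pager: bool


def preprocess(clauses):
    """S102: strip replacement characters and drop clauses with no letters or digits."""
    cleaned = []
    for clause in clauses:
        clause = clause.replace("\ufffd", "").strip()
        if re.search(r"\w", clause):
            cleaned.append(clause)
    return cleaned


def stub_paging_model(clause):
    """S103 placeholder: a keyword rule standing in for the trained paging prediction model."""
    return PagingPrediction(clause, "next page" in clause.lower())


def collect(task_keyword, clauses, model=stub_paging_model):
    """S101-S104: return clauses matching the task and whether a paging bar was predicted."""
    predictions = [model(clause) for clause in preprocess(clauses)]
    has_next_page = any(p.is_pager for p in predictions)
    target = [
        p.clause
        for p in predictions
        if not p.is_pager and task_keyword.lower() in p.clause.lower()
    ]
    return target, has_next_page


page = [
    "Tender notice: road repair",
    "\ufffd\ufffd",
    "Tender notice: bridge survey",
    "previous page 1 2 next page",
    "Contact us",
]
print(collect("tender", page))
```

A crawler built on this would use the second return value to decide whether to trigger the next-page control and repeat the process.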
Example 2
On the basis of the method shown in Fig. 1, the invention also provides another method for collecting paging website data, which mainly describes the training process of the paging prediction model used in step S103 of Example 1. Fig. 2 is a flowchart of a training method of a paging prediction model according to an embodiment of the present invention. As shown in Fig. 2, the paging prediction model is obtained through the following training steps:
Step S201: acquiring a preset training data set, where the training data set comprises a preset text and a prediction label of the preset text, and the prediction label comprises the paging condition of the clauses corresponding to the preset text.
Before step S201, the method includes the following steps A1-A3:
Step A1: acquiring webpage text data of a preset website.
Step A2: preprocessing the webpage text data to obtain preprocessed text data.
In this embodiment, step A2 includes: first, performing data filtering on the webpage text data based on preset parameters to obtain first intermediate preprocessed text data; then, splitting the first intermediate preprocessed text data into sentences according to the html tags corresponding to it, to obtain second intermediate preprocessed text data; and finally, labeling the paging condition in the second intermediate preprocessed text data to obtain the preprocessed text data.
In one embodiment, before the step of labeling the paging condition in the second intermediate preprocessed text data to obtain the preprocessed text data, the method includes: inputting the second intermediate preprocessed text data into a preset dacanno text labeling system; and the step of labeling the paging condition in the second intermediate preprocessed text data to obtain the preprocessed text data includes: labeling, by the dacanno text labeling system, the paging condition in the second intermediate preprocessed text data to obtain the preprocessed text data. For example, suppose the second intermediate preprocessed text data is the paging-bar sentence "39 items, previous page, 1, 2, next page, go to page 1, confirm". First, the sentence is labeled as a paging sentence. Then, the "next page" in the sentence is labeled as a valid segment. Finally, the "next page" is represented by a preset symbol or letter.
Step A3: determining the training data set according to the preprocessed text data.
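To make the outcome of steps A1 to A3 concrete, the sketch below shows one possible shape for a labeled training sample: a sentence-level paging flag plus per-token tags marking the valid segment. In the patent the labels come from the text labeling system; the programmatic rule, the BIO-style tag scheme, the symbol `NP` and the function name `label_clause` are assumptions used only to illustrate the data format.

```python
def label_clause(clause, segment="next page", symbol="NP"):
    """Return (tokens, tags, is_pager) for one clause.

    Tokens that form the valid segment get B-/I- tags built from the preset
    symbol; all other tokens get "O". The clause counts as a paging sentence
    if the segment occurs at least once.
    """
    tokens = clause.split()
    target = segment.lower().split()
    tags = ["O"] * len(tokens)
    is_pager = False
    for start in range(len(tokens) - len(target) + 1):
        window = [token.lower() for token in tokens[start:start + len(target)]]
        if window == target:
            tags[start] = "B-" + symbol
            for offset in range(1, len(target)):
                tags[start + offset] = "I-" + symbol
            is_pager = True
    return tokens, tags, is_pager


training_set = [
    label_clause(clause)
    for clause in ["previous page 1 2 next page", "Annual report 2022"]
]
for tokens, tags, is_pager in training_set:
    print(is_pager, list(zip(tokens, tags)))
```

Each tuple pairs the preset text (as tokens) with its prediction label, which is the input and output pairing the training step expects.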
Step S202: training an initial LSTM model by taking the preset text as input and the prediction label of the preset text as output until a preset training end condition is met, so as to obtain a trained paging prediction model.
In this embodiment, step S202 includes: first, inputting the preset text into an embedding layer of the initial LSTM model to obtain a plurality of vector sequences; then, training the initial LSTM model by taking the plurality of vector sequences as input and the prediction labels of the plurality of vector sequences as output until a preset training end condition is met, so as to obtain a trained paging prediction model.
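An embedding layer is essentially a lookup table from token ids to vectors. The following sketch shows that lookup in plain Python; whitespace tokenization, the reserved `<unk>` id, the embedding dimension and the random initialisation are all assumptions, and in a real model the table would be a trainable parameter.

```python
import random


def build_vocab(texts):
    """Map each distinct token to an integer id; id 0 is reserved for unknown tokens."""
    vocab = {"<unk>": 0}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def make_embedding(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def embed(text, vocab, table):
    """Turn a text into a sequence of vectors, one per token."""
    return [table[vocab.get(token, 0)] for token in text.lower().split()]


texts = ["previous page 1 2 next page", "annual report 2022"]
vocab = build_vocab(texts)
table = make_embedding(len(vocab), dim=4)
vectors = embed("next page 3", vocab, table)
print(len(vectors), len(vectors[0]))
```

The token "3" does not occur in the vocabulary, so it falls back to the row reserved for unknown tokens.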
Here, before step S202, the method further includes: first, the plurality of vector sequences are input into an LSTM unit of the initial LSTM model, and prediction labels corresponding to the plurality of vector sequences are output. And then, constructing an initial LSTM model according to the plurality of vector sequences and the prediction labels corresponding to the plurality of vector sequences.
Further, the plurality of vector sequences are input into two bidirectional LSTM units of the initial LSTM model; then, the forward and backward outputs at each time step are concatenated, mapped through a fully connected layer into a vector whose dimension equals the number of output labels, and normalized by Softmax into the probability of each label, thereby obtaining the initial LSTM model.
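The output stage described above (concatenation, fully connected layer, Softmax) can be sketched in plain Python. The forward and backward hidden states are taken as given inputs here rather than computed by LSTM units, and the function names, weights and sizes are illustrative assumptions.

```python
import math


def softmax(scores):
    """Normalise raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def output_layer(forward_states, backward_states, weights, bias):
    """Concatenate both directions per time step, apply a dense layer, then softmax."""
    probabilities = []
    for h_forward, h_backward in zip(forward_states, backward_states):
        joined = h_forward + h_backward  # list concatenation of the two hidden states
        scores = [
            sum(w * x for w, x in zip(row, joined)) + b
            for row, b in zip(weights, bias)
        ]
        probabilities.append(softmax(scores))
    return probabilities


# Two time steps, hidden size 2 per direction, three output labels.
forward_states = [[1.0, 0.0], [0.0, 1.0]]
backward_states = [[0.0, 0.0], [0.0, 0.0]]
weights = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
bias = [0.0, 0.0, 0.0]

for step_probs in output_layer(forward_states, backward_states, weights, bias):
    print([round(p, 3) for p in step_probs])
```

Each weight row corresponds to one label, so the dense layer's output dimension equals the number of labels, and the predicted label at each time step is the index with the highest probability.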
The embodiment of the invention provides a method for collecting paging website data, which comprises the following steps: acquiring a data acquisition task and the webpage content of a website to be crawled; preprocessing the webpage content to obtain preprocessed data; inputting the preprocessed data into a pre-trained paging prediction model, and outputting a paging prediction result of the preprocessed data, where the paging prediction model is obtained through the following training steps: acquiring a preset training data set, where the training data set comprises a preset text and a prediction label of the preset text, and the prediction label comprises the paging condition of the clauses corresponding to the preset text; and training an initial LSTM model by taking the preset text as input and the prediction label of the preset text as output until a preset training end condition is met, so as to obtain a trained paging prediction model; and determining target data corresponding to the data acquisition task from the webpage content based on the paging prediction result. By constructing an LSTM-based paging prediction model to predict the paging condition of the webpage, and determining the target data corresponding to the data acquisition task from the webpage content on that basis, the method further improves the efficiency and accuracy of collecting paging website data.
Example 3
The embodiment of the invention also provides a device for collecting paging website data. Fig. 3 is a schematic structural diagram of the device. As can be seen in Fig. 3, the device comprises:
The data acquisition module 31 is configured to acquire a data acquisition task and the webpage content of a website to be crawled.
The preprocessing module 32 is configured to preprocess the webpage content to obtain preprocessed data.
The paging situation prediction module 33 is configured to input the preprocessed data into a pre-trained paging prediction model, and output a paging prediction result of the preprocessed data.
The target data determining module 34 is configured to determine target data corresponding to the data acquisition task from the webpage content based on the paging prediction result.
The data acquisition module 31, the preprocessing module 32, the paging situation prediction module 33, and the target data determination module 34 are sequentially connected.
In one embodiment, the apparatus further includes a model building module coupled to the target data determination module 34; the model building module is configured to acquire a preset training data set, where the training data set comprises a preset text and a prediction label of the preset text, and the prediction label comprises the paging condition of the clauses corresponding to the preset text; and to train an initial LSTM model by taking the preset text as input and the prediction label of the preset text as output until a preset training end condition is met, so as to obtain a trained paging prediction model.
In one embodiment, the model building module is further configured to obtain web page text data of a preset website; preprocessing the webpage text data to obtain preprocessed text data; and determining the training data set according to the preprocessed text data.
In one embodiment, the model building module is further configured to perform data filtering on the web page text data based on preset parameters to obtain first intermediate preprocessed text data; performing sentence processing on the first intermediate preprocessing text data according to the html tag corresponding to the first intermediate preprocessing text data to obtain second intermediate preprocessing text data; and marking the paging condition in the second intermediate preprocessing text data to obtain the preprocessing text data.
In one embodiment, the model building module is further configured to input the second intermediate preprocessed text data into a preset dacanno text labeling system, and to label, by the dacanno text labeling system, the paging condition in the second intermediate preprocessed text data to obtain the preprocessed text data.
In one embodiment, the model building module is further configured to input the preset text into an embedding layer in an initial LSTM model to obtain a plurality of vector sequences; and training the initial LSTM model by taking the plurality of vector sequences as input and the prediction labels of the plurality of vector sequences as output until a preset training ending condition is met, so as to obtain a trained paging prediction model.
In one embodiment, the model building module is further configured to input the plurality of vector sequences into an LSTM unit of the initial LSTM model, and output prediction labels corresponding to the plurality of vector sequences; and constructing an initial LSTM model according to the plurality of vector sequences and the prediction labels corresponding to the plurality of vector sequences.
The prediction device for collecting paging website data provided by the embodiment of the invention has the same technical characteristics as the method for collecting paging website data provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved. It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding process in the foregoing method embodiment for the specific working process of the apparatus described above, which is not described herein again.
Example 4
The present embodiment provides an electronic device comprising a processor and a memory storing computer-executable instructions executable by the processor to perform the steps of a method of collecting paged website data.
Referring to Fig. 4, which is a schematic structural diagram of the electronic device, the electronic device includes a memory 41 and a processor 42; the memory stores a computer program that can run on the processor 42, and the processor implements the steps of the method for collecting paging website data when executing the computer program.
As shown in fig. 4, the apparatus further includes: a bus 43 and a communication interface 44, the processor 42, the communication interface 44 and the memory 41 being connected by the bus 43; the processor 42 is arranged to execute executable modules, such as computer programs, stored in the memory 41.
The memory 41 may include a high-speed random access memory (Random Access Memory, RAM), and may further include a non-volatile memory, such as at least one magnetic disk memory. The communication connection between the system network element and at least one other network element is implemented via at least one communication interface 44 (which may be wired or wireless), and may use the Internet, a wide area network, a local network, a metropolitan area network, etc.
The bus 43 may be an ISA bus, a PCI bus, an EISA bus, or the like. The buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, only one bi-directional arrow is shown in FIG. 4, but this does not mean that there is only one bus or one type of bus.
The memory 41 is used for storing a program, and the processor 42 executes the program after receiving an execution instruction. The method executed by the prediction apparatus for collecting paging website data disclosed in any of the above embodiments of the present invention may be applied to the processor 42 or implemented by the processor 42. The processor 42 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware or by instructions in the form of software in the processor 42. The processor 42 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component, and can implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 41, and the processor 42 reads the information in the memory 41 and, in combination with its hardware, performs the steps of the method described above.
Further, an embodiment of the present invention provides a computer storage medium storing a computer program, where the computer program includes program instructions, and the program instructions, when executed by the processor 42, cause the processor 42 to perform the method for collecting paging website data.
The prediction device for collecting the paging website data and the verification device for collecting the paging website data provided by the embodiment of the invention have the same technical characteristics, so that the same technical problems can be solved, and the same technical effects can be achieved.
In addition, in the description of embodiments of the present invention, unless explicitly stated and limited otherwise, the terms "mounted," "coupled," and "connected" are to be construed broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, or indirect through an intermediate medium; and it may be an internal communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings. They are used merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation or be configured and operated in a specific orientation; they should therefore not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Claims (4)
1. A method for collecting paging website data, comprising:
acquiring a data acquisition task and webpage content of a website to be crawled;
preprocessing the webpage content to obtain preprocessed data;
inputting the preprocessed data into a pre-trained paging prediction model, and outputting a paging prediction result of the preprocessed data;
wherein the paging prediction model is constructed by the following steps:
acquiring webpage text data of a preset website;
performing data filtering on the webpage text data based on preset parameters to obtain first intermediate preprocessing text data;
performing clause splitting on the first intermediate preprocessing text data according to the html tags corresponding to the first intermediate preprocessing text data to obtain second intermediate preprocessing text data;
inputting the second intermediate preprocessing text data into a preset dacanno text labeling system;
labeling the paging condition in the second intermediate preprocessing text data by the dacanno text labeling system to obtain preprocessed text data;
determining a training data set according to the preprocessed text data;
acquiring a preset training data set; the training data set comprises a preset text and a prediction label of the preset text; the prediction label comprises: the paging condition of the clause corresponding to the preset text;
inputting the preset text into an embedding layer in an initial LSTM model to obtain a plurality of vector sequences;
inputting the vector sequences into an LSTM unit of the initial LSTM model, and outputting prediction labels corresponding to the vector sequences;
constructing an initial LSTM model according to the plurality of vector sequences and the prediction labels corresponding to the plurality of vector sequences;
taking the plurality of vector sequences as input, taking prediction labels of the plurality of vector sequences as output, and training the initial LSTM model until a preset training ending condition is met, so as to obtain a trained paging prediction model;
and determining target data corresponding to the data acquisition task from the webpage content based on the paging prediction result.
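By way of illustration, the data filtering and tag-based clause splitting steps recited above can be sketched with the Python standard library. This is a minimal sketch under stated assumptions and not the claimed implementation: the "preset parameters" are assumed here to be a set of skipped tags and a minimum clause length, and the names `SKIPPED_TAGS`, `VOID_TAGS`, `MIN_CLAUSE_LENGTH`, `ClauseSplitter`, and `split_clauses` are hypothetical. Each resulting clause is paired with its enclosing html tag, so that it could then be passed to a labeling tool.

```python
from html.parser import HTMLParser

# Assumed "preset parameters" for filtering; the source does not specify them.
SKIPPED_TAGS = {"script", "style"}
MIN_CLAUSE_LENGTH = 2
# Tags without a closing tag, which must not be tracked as open elements.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClauseSplitter(HTMLParser):
    """Collects one clause per run of text between html tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.clauses = []      # list of (enclosing tag, text) pairs
        self._stack = []       # currently open tags
        self._skip_depth = 0   # greater than zero while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._stack.append(tag)
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        if tag in self._stack:
            # Pop up to and including the matching open tag.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if self._skip_depth == 0 and len(text) >= MIN_CLAUSE_LENGTH:
            tag = self._stack[-1] if self._stack else ""
            self.clauses.append((tag, text))


def split_clauses(html):
    """Return (tag, clause) pairs for the visible text in an html string."""
    parser = ClauseSplitter()
    parser.feed(html)
    parser.close()
    return parser.clauses
```

Keeping the enclosing tag alongside each clause is a design choice of this sketch: paging controls often sit in link or list elements, so the tag may be a useful signal during labeling.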
2. An apparatus for collecting paging website data, comprising:
the data acquisition module is used for acquiring a data acquisition task and webpage content of a website to be crawled;
the preprocessing module is used for preprocessing the webpage content to obtain preprocessed data;
the paging condition prediction module is used for inputting the preprocessed data into a pre-trained paging prediction model and outputting a paging prediction result of the preprocessed data; wherein the paging prediction model is constructed by the following steps:
acquiring webpage text data of a preset website;
performing data filtering on the webpage text data based on preset parameters to obtain first intermediate preprocessing text data;
performing clause splitting on the first intermediate preprocessing text data according to the html tags corresponding to the first intermediate preprocessing text data to obtain second intermediate preprocessing text data;
inputting the second intermediate preprocessing text data into a preset dacanno text labeling system;
labeling the paging condition in the second intermediate preprocessing text data by the dacanno text labeling system to obtain preprocessed text data;
determining a training data set according to the preprocessed text data;
acquiring a preset training data set; the training data set comprises a preset text and a prediction label of the preset text; the prediction label comprises: the paging condition of the clause corresponding to the preset text;
inputting the preset text into an embedding layer in an initial LSTM model to obtain a plurality of vector sequences;
inputting the vector sequences into an LSTM unit of the initial LSTM model, and outputting prediction labels corresponding to the vector sequences;
constructing an initial LSTM model according to the plurality of vector sequences and the prediction labels corresponding to the plurality of vector sequences;
taking the plurality of vector sequences as input, taking the prediction labels of the plurality of vector sequences as output, and training the initial LSTM model until a preset training ending condition is met, so as to obtain a trained paging prediction model;
and the target data determining module is used for determining target data corresponding to the data acquisition task from the webpage content based on the paging prediction result.
3. An electronic device comprising a processor and a memory storing computer-executable instructions executable by the processor to perform the method for collecting paging website data as claimed in claim 1.
4. A computer storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method for collecting paging website data as claimed in claim 1.
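As a purely illustrative aid, the data flow recited in the claims (preset text, embedding layer, vector sequence, LSTM unit, prediction label) can be sketched as a forward pass in plain Python. This is a minimal sketch under stated assumptions, not the claimed model: the weights are random and untrained, the training loop and the preset training ending condition are omitted, and the names `build_vocab`, `encode`, and `TinyLSTMClassifier`, as well as all dimensions, are hypothetical. A practical implementation would more likely use a deep learning framework.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def build_vocab(texts):
    """Map each whitespace token to an integer id; id 0 is reserved for unknowns."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def encode(text, vocab):
    """Turn a clause into a list of token ids."""
    return [vocab.get(token, 0) for token in text.lower().split()]


class TinyLSTMClassifier:
    """Embedding layer -> single LSTM unit -> sigmoid output (untrained)."""

    def __init__(self, vocab_size, embed_dim=4, hidden_dim=3, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_dim = hidden_dim
        self.embedding = matrix(vocab_size, embed_dim)
        # One weight matrix per gate (input, forget, output, candidate),
        # each applied to the concatenation [x_t ; h_{t-1}].
        self.weights = {g: matrix(hidden_dim, embed_dim + hidden_dim) for g in "ifog"}
        self.biases = {g: [0.0] * hidden_dim for g in "ifog"}
        self.out_w = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        self.out_b = 0.0

    def _gate(self, name, xh, activation):
        return [
            activation(sum(w * v for w, v in zip(row, xh)) + b)
            for row, b in zip(self.weights[name], self.biases[name])
        ]

    def predict_proba(self, token_ids):
        """Return the (untrained) probability that the clause is a paging clause."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for token_id in token_ids:
            xh = self.embedding[token_id] + h  # embedding lookup, then concatenation
            i = self._gate("i", xh, sigmoid)
            f = self._gate("f", xh, sigmoid)
            o = self._gate("o", xh, sigmoid)
            g = self._gate("g", xh, math.tanh)
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return sigmoid(sum(w * v for w, v in zip(self.out_w, h)) + self.out_b)
```

A possible usage, with made-up example clauses: build a vocabulary with `build_vocab(["next page", "price list"])`, create `TinyLSTMClassifier(vocab_size=len(vocab) + 1)`, and call `predict_proba(encode("next page", vocab))`. Until the weights are trained on labeled clauses, the returned value carries no meaning.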
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310053204.0A CN115827947B (en) | 2023-02-03 | 2023-02-03 | Method and device for collecting paging website data and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115827947A | 2023-03-21 |
CN115827947B | 2023-04-25 |
Family
ID=85520733
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310053204.0A Active CN115827947B (en) | 2023-02-03 | 2023-02-03 | Method and device for collecting paging website data and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115827947B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102211655B1 (en) * | 2019-12-26 | 2021-02-04 | 한양대학교 에리카산학협력단 | Proxy Server And Web Object Prediction Method Using Thereof |
CN115396237A (en) * | 2022-10-27 | 2022-11-25 | 浙江鹏信信息科技股份有限公司 | Webpage malicious tampering identification method and system and readable storage medium |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8260846B2 (en) * | 2008-07-25 | 2012-09-04 | Liveperson, Inc. | Method and system for providing targeted content to a surfer |
CN104281582B (en) * | 2013-07-02 | 2017-08-25 | 阿里巴巴集团控股有限公司 | Pagination Display control method and device |
CN110020310A (en) * | 2017-12-05 | 2019-07-16 | 广东欧珀移动通信有限公司 | Resource loading method, device, terminal and storage medium |
US10877892B2 (en) * | 2018-07-11 | 2020-12-29 | Micron Technology, Inc. | Predictive paging to accelerate memory access |
CN110008421A (en) * | 2018-11-09 | 2019-07-12 | 阿里巴巴集团控股有限公司 | Page processing method, device, equipment and computer readable storage medium |
CN109460816B (en) * | 2018-11-16 | 2020-09-18 | 焦点科技股份有限公司 | User behavior prediction method based on deep learning |
CN110442823A (en) * | 2019-08-06 | 2019-11-12 | 北京智游网安科技有限公司 | Website classification method, Type of website judgment method, storage medium and intelligent terminal |
CN112464618A (en) * | 2020-12-03 | 2021-03-09 | 北京明略软件系统有限公司 | Method and device for paging document data, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |