CN119003904A - Method and system for intelligently extracting and optimizing webpage content

Info

Publication number: CN119003904A
Application number: CN202410934428.7A
Authority: CN
Inventors: 林珊; 唐世洁; 陈誉
Original assignee: Shenzhen Ylink Computing System Co ltd
Current assignee: Shenzhen Ylink Computing System Co ltd
Priority date: 2024-07-12
Filing date: 2024-07-12
Publication date: 2024-11-22

Abstract

The present invention discloses a method and system for intelligent extraction and optimization of web page content, including: creating an expression array for storing matching web page content; crawling web page content, obtaining web page source data and preprocessing to obtain preprocessed data; judging whether the expression array corresponding to the source data is empty, if it is empty, calling an intelligent algorithm to generate an expression, if it is not empty, traversing the expression array to match the preprocessed data; deploying an exception capture mechanism and a logging mechanism, and performing parameterized configuration retry decisions. Calling a large model to automatically generate and optimize regular expressions reduces manual input and improves work efficiency; parameterized intelligent retry logic allows users to configure the maximum number of retries and retry intervals according to needs, and these parameters can be easily adjusted to adapt to different network conditions or server response times; improving the flexibility and configurability of the system as well as the stability and reliability of the system.

Description

Method and system for intelligently extracting and optimizing webpage content

Technical Field

The invention relates to the technical field of web information scraping, and in particular to a method and a system for intelligently extracting and optimizing webpage content.

Background

In the current digital age, the volume of information on the internet is growing exponentially, and effectively extracting and analyzing webpage content has become a key step in many business scenarios, such as big data analysis, competitive intelligence gathering, and content aggregation services.

With the rapid development of artificial intelligence technology, especially breakthroughs in natural language processing (NLP) and machine learning, new possibilities have opened up for this task. However, the market currently lacks a solution that efficiently combines artificial intelligence with traditional techniques (such as regular expressions) to automate the extraction of webpage content, especially for complex and varied webpage structures.

At present, for intelligent extraction of webpage content, the following technical problems exist:

(1) Webpage content extraction depends on regular expressions. Traditionally, writing and optimizing these expressions requires substantial manual effort, which raises labor costs and can lower efficiency; updating and expanding the regular expression library also tends to rely on manual collection and verification, which is inefficient and error-prone;

(2) Website structures change frequently, for example through page layout adjustments or element ID changes, and such changes may invalidate the original regular expressions, so that the required data can no longer be extracted correctly.

Disclosure of Invention

The invention aims to provide a method and a system for intelligently extracting and optimizing webpage content, so as to solve the problems described in the background: existing extraction methods increase labor costs, are inefficient, and rely on regular expressions that fail when the website structure changes.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

According to one aspect of the present invention, there is provided a method for intelligent extraction and optimization of web content, the method comprising:

Creating an expression array for storing expressions that match web page content; persisting the expressions using a database and a file system, and designing an interface that supports add, delete, and query operations;

Crawling web page content: setting the URL and crawling rules of a target website, starting a crawler to acquire the web page content, and preprocessing the web page content to obtain preprocessed data;

Determining whether the expression array corresponding to the web page content is empty; if so, calling an intelligent algorithm to generate an expression, and if not, traversing the expression array to match the preprocessed data;

If the matching succeeds, extracting the relevant content and storing the corresponding expression in the expression array; if the matching fails, calling the intelligent algorithm to generate a new expression;

Deploying an exception capture mechanism and a logging mechanism, and making parameterized retry decisions.

Based on the foregoing scheme, the preprocessing includes removing the HTML tags from the web page source data to obtain the preprocessed data.

Based on the foregoing scheme, calling the intelligent algorithm to generate a new expression includes calling the intelligent algorithm, passing in the preprocessed data and an indication of the target content, and returning a matching expression.

Based on the foregoing scheme, a parser is written to extract and verify the expression generated by the intelligent algorithm, and the expression is added to the corresponding expression array if verification succeeds.

Based on the foregoing scheme, the exception capture mechanism and the logging mechanism include recording exception data when an exception is captured, with the logging mechanism recording log data automatically; the log data includes detailed information on key operations together with the exception data; the exception data includes, but is not limited to, error type, timestamp, stack trace, operator identifier, task identifier, and processing duration.

Based on the foregoing scheme, an intelligent analysis engine is developed to automatically analyze the log data and predict anomalies so that early warnings can be issued.

Based on the foregoing scheme, the parameterized retry decision includes configuring a maximum number of retries and a retry interval; the retry decision includes deciding whether and when to retry according to the exception type and the historical retry success rate.

Based on the foregoing scheme, the method further comprises providing a configuration interface to customize the monitoring strategy, including setting an error threshold, logging options, and early-warning conditions.

Based on the foregoing scheme, the method further comprises providing a visual monitoring interface that displays key indicators, wherein the key indicators include, but are not limited to, system running status, success rate, and error distribution.

According to another aspect of the invention, a system for intelligently extracting and optimizing web content is provided, which comprises an expression management module, a web content crawling module, a web content matching module, an expression generating module, an exception capture and retry module, and a monitoring and statistics module;

The expression management module manages and maintains the expression array;

The web content crawling module acquires web page source data and preprocesses it to obtain preprocessed data;

The web content matching module traverses the expression array to match the preprocessed data;

The expression generating module calls an intelligent algorithm to generate a new expression;

The exception capture and retry module captures exceptions, records exception data, and makes parameterized retry decisions;

The monitoring and statistics module records and automatically analyzes the log data, provides a visual monitoring interface, and displays key indicators.

Compared with the prior art, the invention has at least the following advantages and positive effects:

(1) The intelligent algorithm is called to automatically generate and optimize regular expressions, which reduces manual effort and improves work efficiency;

(2) Parameterized intelligent retry logic allows a user to configure the maximum number of retries and the retry interval as required, and these parameters can be easily adjusted to accommodate different network conditions or server response times; this improves the flexibility and configurability of the system as well as its stability and reliability;

(3) The regular expressions are stored persistently, and the database and file system are updated whenever a new regular expression is generated; as the system is used more, the regular expression library grows richer and more accurate, so that the system can respond flexibly to structural changes across different websites, data extraction failures caused by web page redesigns are reduced, and the stability and continuity of data extraction are ensured.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:

FIG. 1 is a schematic diagram of a method for intelligently extracting and optimizing web page content according to the present invention;

FIG. 2 is a schematic diagram of a large model call to generate a regular expression in accordance with the present invention;

FIG. 3 is a diagram illustrating matching of web page source data and an expression array according to the present invention;

FIG. 4 is a schematic diagram of a system for intelligent extraction and optimization of web page content according to the present invention.

Detailed Description

For a more clear explanation of the objects, technical solutions and advantages of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments, and that the exemplary embodiments can be implemented in various forms and should not be construed as being limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the invention.

The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.

The invention will be described in detail with reference to specific examples below:

Example 1

As shown in FIG. 1, this embodiment provides a method for intelligently extracting and optimizing web content, which comprises the following specific steps:

S1: creating an expression array for storing expressions that match web page content; persisting the expressions using a database and a file system, and designing an interface that supports add, delete, and query operations;

In this embodiment, two arrays are created: one stores the expressions for detail pages, and list_page_regex_array stores the expressions for list pages;

Further, a suitable database and file system are selected to persist the expressions, so that historical data remains usable after the system restarts; the database table structure is designed to store the expressions and related information (e.g., creation time, update time, source, etc.); and an operation interface (e.g., add, delete, query) is implemented and kept synchronized with the expression array.

Further, a RESTful API or command-line interface is designed to allow users to manually add, edit, or delete expressions; the interface back-end logic is implemented to process user requests and update the expression array.
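By way of illustration only, step S1 might be sketched in Python as follows. The sketch assumes SQLite as the database and omits the file-system copy; the class name ExpressionStore, the table layout, and the "detail"/"list" keys are hypothetical choices made for this example and are not specified by the embodiment.

```python
import re
import sqlite3
from datetime import datetime, timezone


class ExpressionStore:
    """In-memory expression arrays kept in sync with a SQLite table."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS expressions ("
            " page_type TEXT NOT NULL,"
            " pattern TEXT NOT NULL,"
            " source TEXT NOT NULL,"
            " created_at TEXT NOT NULL,"
            " updated_at TEXT NOT NULL,"
            " PRIMARY KEY (page_type, pattern))"
        )
        # One array per page type, reloaded from the table on start-up so
        # that earlier expressions survive a restart.
        self.arrays = {"detail": [], "list": []}
        for page_type, pattern in self.conn.execute(
            "SELECT page_type, pattern FROM expressions ORDER BY created_at"
        ):
            self.arrays[page_type].append(pattern)

    def add(self, page_type, pattern, source="manual"):
        re.compile(pattern)  # raises re.error if the pattern is invalid
        if pattern in self.arrays[page_type]:
            return False
        now = datetime.now(timezone.utc).isoformat()
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO expressions VALUES (?, ?, ?, ?, ?)",
                (page_type, pattern, source, now, now),
            )
        self.arrays[page_type].append(pattern)
        return True

    def delete(self, page_type, pattern):
        if pattern not in self.arrays[page_type]:
            return False
        with self.conn:
            self.conn.execute(
                "DELETE FROM expressions WHERE page_type = ? AND pattern = ?",
                (page_type, pattern),
            )
        self.arrays[page_type].remove(pattern)
        return True

    def query(self, page_type):
        return list(self.arrays[page_type])
```

A RESTful or command-line front end would then call add, delete, and query, so that every manual change reaches both the array and the table.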

S2: crawling web page content: setting the URL and crawling rules of a target website, starting a crawler to acquire web page source data, and preprocessing it to obtain preprocessed data;

In this embodiment, the Scrapy framework for Python is used to crawl web page content: the target web page URL is supplied, the crawler is started, and the web page content, that is, the web page source data, is obtained; the HTML tags are then removed from the web page source data to obtain the preprocessed data.
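By way of illustration only, the tag-removal preprocessing might be sketched with the Python standard library as follows. The crawl itself is outside the sketch; the function name strip_html_tags and the decision to discard script and style content are assumptions of this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the content of script/style elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html_tags(source):
    """Return the visible text of an HTML document with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(source)
    parser.close()
    # Text nodes are joined with a space, so text split by inline tags
    # is separated by a space in the output.
    return " ".join(" ".join(parser.parts).split())
```

For example, strip_html_tags applied to a page body containing a heading "Title" and a paragraph "Hello &amp; welcome" yields the plain text "Title Hello & welcome", which is then passed to the matching step.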

S3: determining whether the expression array corresponding to the web page source data is empty; if so, calling an intelligent algorithm to generate an expression, and if not, traversing the expression array to match the preprocessed data;

In this embodiment, a preliminary check is first performed for the preprocessed data: if the array storing the expressions is empty, an intelligent algorithm, that is, an external service or an internal algorithm, must be called; in this embodiment a large model is called, which generates a corresponding regular expression from the given input, and the expression is added to the expression array. If the array is not empty, each element of the expression array is traversed: each regular expression in the array is applied to the web page content to be processed, the matching operation is executed, and the matching result of each regular expression is recorded or processed;

Further, the matched results are processed or analyzed according to business requirements; the results may need to be saved to a database or file, or sent to other systems over a network; if no suitable regular expression matches certain key parts of the web page content, the regular expression may be regenerated or adjusted.
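By way of illustration only, the branch in step S3 might be sketched as follows. The function name match_or_generate and the generate_expression callback, which stands in for the intelligent algorithm, are hypothetical; the embodiment does not fix an interface.

```python
import re


def match_or_generate(expressions, text, generate_expression):
    """Return (pattern, matches) for the first expression that matches text.

    expressions: mutable list of regex strings (the expression array).
    generate_expression: callable standing in for the intelligent algorithm;
        it takes the preprocessed text and returns a regex string.
    """
    if not expressions:
        # Empty array: ask the generator for a first expression and store it.
        expressions.append(generate_expression(text))

    # Non-empty array (possibly just filled): try each expression in turn.
    for pattern in expressions:
        matches = re.findall(pattern, text)
        if matches:
            return pattern, matches
    return None, []
```

The generator is called only when the array is empty; a failed match against a non-empty array returns (None, []) and is left to the handling described in step S4.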

Specifically, in this embodiment, as shown in FIG. 2, invoking the large model to generate the regular expression includes:

selecting or training a general-purpose large model (e.g., a Transformer-based model) capable of generating regular expressions; and taking the preprocessed data and the indication of the target content as input, and calling the large model to generate a regular expression. Further, a parser is written to extract the regular expression from the output of the large model; the generated regular expression is verified to ensure its validity and accuracy, and the verified regular expression is added to the corresponding expression array and synchronized to the database and the file system.
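By way of illustration only, the prompt construction, parser, and verification might be sketched as follows. The model call itself is omitted; the convention that the model wraps its answer in <regex></regex> tags, and all three function names, are assumptions of this example and are not specified by the embodiment.

```python
import re

# Assumed reply convention: the expression is wrapped in <regex></regex>.
_TAGGED = re.compile(r"<regex>(.*?)</regex>", re.DOTALL)


def build_prompt(preprocessed_text, target_hint, max_chars=2000):
    """Combine the preprocessed data and the target content indication."""
    return (
        "Write one Python regular expression that extracts "
        f"{target_hint} from the text below. "
        "Reply with the expression wrapped in <regex></regex> tags.\n\n"
        f"{preprocessed_text[:max_chars]}"
    )


def extract_expression(reply):
    """Pull the regex out of a model reply; fall back to the whole reply."""
    tagged = _TAGGED.search(reply)
    candidate = tagged.group(1) if tagged else reply
    return candidate.strip()


def validate_expression(pattern, sample_text):
    """Accept a pattern only if it compiles and matches the sample text."""
    if not pattern:
        return False
    try:
        compiled = re.compile(pattern)
    except re.error:
        return False
    return compiled.search(sample_text) is not None
```

Only an expression for which validate_expression returns True would be added to the expression array and synchronized to persistent storage.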

S4: if the matching succeeds, extracting the relevant content and storing the corresponding expression in the expression array; if the matching fails, calling an intelligent algorithm to generate a new expression;

With reference to FIG. 3, after the matching operation is performed in this embodiment, it is determined for each web page to be extracted whether the matching succeeded, and if so, whether the web page is a list page or a detail page; when the web page source data is a list page and the matching succeeds, the page is skipped; when the web page source data is a detail page and the matching succeeds, the relevant content is extracted and stored; for web page source data that fails to match, the large model is called to generate a new regular expression;

Specifically, if the page is a list page, the expression array is traversed and the matching operation is executed with each regular expression; if any regular expression matches, further processing of the current web page source data is skipped (since list pages typically do not require extraction of specific content); all regular expressions may fail to match, but this typically requires no immediate action, as the main purpose of a list page may simply be navigation or indexing.

If the page is a detail page, the expression array is likewise traversed and the matching operation is executed with each regular expression; if any regular expression matches, the relevant content is extracted according to the rule of that regular expression and stored in a database; if all regular expressions fail to match, the next step is performed.

For web page source data that fails to match, in this embodiment a large model is called to generate a new regular expression, which is added to the regular expression array; a policy may be required to decide whether the new expression replaces or supplements the existing regular expressions. Optionally, the web page source data that failed to match is immediately re-matched with the newly generated regular expression, or this task is deferred to a subsequent batch process; the step ends when all web page source data has been processed.
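By way of illustration only, the per-page handling of step S4 might be sketched as follows. The sketch adopts the "supplement" policy and the immediate re-match option, both of which the embodiment leaves open; the function name process_page, the status strings, and the saved list standing in for the database are hypothetical.

```python
import re


def process_page(page_type, text, arrays, generate_expression, saved):
    """Handle one preprocessed page and return a short status string.

    arrays: dict with "list" and "detail" expression arrays.
    generate_expression: stand-in for the large-model call.
    saved: list that stands in for the database of extracted content.
    """
    expressions = arrays[page_type]
    for pattern in expressions:
        match = re.search(pattern, text)
        if match is None:
            continue
        if page_type == "list":
            return "skipped"  # list pages serve navigation only
        saved.append(match.groups() or (match.group(0),))
        return "extracted"

    if page_type == "list":
        return "unmatched"  # usually needs no immediate action

    # Detail page with no match: generate, verify, supplement, re-match.
    new_pattern = generate_expression(text)
    try:
        match = re.search(new_pattern, text)
    except re.error:
        return "generation_failed"
    if match is None:
        return "generation_failed"
    if new_pattern not in expressions:
        expressions.append(new_pattern)
    saved.append(match.groups() or (match.group(0),))
    return "extracted_with_new_expression"
```

Deferring the re-match to a later batch would amount to queuing the page instead of calling re.search on the new pattern immediately.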

Further, when invoking the large model to generate a new regular expression, additional information about the page that failed to match (e.g., the URL of the page, a partial content sample, etc.) may need to be provided so that the large model can generate an applicable regular expression more accurately; the newly generated regular expression should be verified or tested to ensure that it is correct and valid; data cleansing and formatting may be required before the content is stored in the database, to ensure the consistency and readability of the data; and if large amounts of web page source data are processed, performance optimization and error handling mechanisms may need to be considered to ensure system stability and reliability.

S5: deploying an exception capture mechanism and a logging mechanism, and making parameterized retry decisions;

In this embodiment, an exception capture mechanism and a logging mechanism are also provided: exception capture logic is added to key operations such as content matching and intelligent algorithm calls, exception data is recorded when an exception is captured, and the logging mechanism records log data automatically; the log data includes detailed information on key operations together with the exception data; the exception data includes, but is not limited to, error type, timestamp, stack trace, operator identifier, task identifier, and processing duration. Preferably, an intelligent analysis engine is also developed in this embodiment to automatically analyze the log data and predict anomalies so that early warnings can be issued.
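By way of illustration only, the capture of exception data around a key operation might be sketched with the Python standard library as follows. The logger name, the function names, and the dictionary keys are hypothetical; they simply mirror the fields listed above.

```python
import logging
import time
import traceback
from datetime import datetime, timezone

logger = logging.getLogger("extractor")  # logger name is an assumption


def build_exception_record(exc, task_id, operator_id, started):
    """Collect the exception data fields listed in this embodiment."""
    return {
        "error_type": type(exc).__name__,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "operator_id": operator_id,
        "task_id": task_id,
        "duration_s": time.monotonic() - started,
    }


def run_logged(operation, task_id, operator_id):
    """Run a key operation; log exception data and re-raise on failure."""
    started = time.monotonic()
    try:
        return operation()
    except Exception as exc:
        record = build_exception_record(exc, task_id, operator_id, started)
        logger.error(
            "task %s failed: %s",
            task_id,
            record["error_type"],
            extra={"exception_data": record},
        )
        raise
```

Because the exception is re-raised after logging, a separate retry layer can still decide how to handle it.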

Further, the parameterized retry decision includes configuring a maximum number of retries and a retry interval; the retry decision includes deciding whether and when to retry according to the exception type and the historical retry success rate. The retry logic is implemented as follows: a maximum number of retries is set, a short delay is inserted before each retry to avoid overly frequent calls, and when the retry limit is reached the failure details are recorded and a skip strategy is executed. Specifically, whether and when to retry is decided according to the exception type and the historical retry success rate; for problems that are very likely transient, the system may choose to retry immediately; for problems that may require manual intervention, the detailed information is recorded and the task is skipped. Further, when the configured maximum number of retries is reached, the system performs the failure handling operations defined by the configured policy, including but not limited to recording a detailed failure log, sending alarm notifications, executing a fallback flow, or simply skipping the current task.
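By way of illustration only, a parameterized retry decision might be sketched as follows. The classification of TimeoutError and ConnectionError as transient, the min_success_rate and min_samples thresholds, and the names RetryPolicy and call_with_retry are assumptions of this example; the embodiment specifies only the maximum number of retries, the retry interval, and the use of exception type and historical success rate.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RetryPolicy:
    max_retries: int = 3
    interval_s: float = 1.0
    # Exception types treated as transient (an assumed classification).
    transient: tuple = (TimeoutError, ConnectionError)
    min_success_rate: float = 0.1
    min_samples: int = 5  # history is ignored until this many retries exist
    history: dict = field(default_factory=dict)  # name -> [successes, attempts]

    def success_rate(self, exc_type):
        successes, attempts = self.history.get(exc_type.__name__, (0, 0))
        if attempts < self.min_samples:
            return 1.0  # too little history to judge; assume retries help
        return successes / attempts

    def should_retry(self, exc, attempt):
        if attempt >= self.max_retries:
            return False
        if not isinstance(exc, self.transient):
            return False  # likely needs manual intervention: record and skip
        return self.success_rate(type(exc)) >= self.min_success_rate

    def record(self, exc_type, succeeded):
        stats = self.history.setdefault(exc_type.__name__, [0, 0])
        stats[0] += int(succeeded)
        stats[1] += 1


def call_with_retry(operation, policy, on_give_up=None, sleep=time.sleep):
    """Run operation under policy; return its result, or None after giving up."""
    attempt = 0
    last_exc_type = None  # exception type that triggered the current retry
    while True:
        try:
            result = operation()
        except Exception as exc:
            if last_exc_type is not None:
                policy.record(last_exc_type, False)
            if not policy.should_retry(exc, attempt):
                if on_give_up is not None:
                    on_give_up(exc, attempt)  # e.g. log, alert, fallback
                return None
            last_exc_type = type(exc)
            attempt += 1
            sleep(policy.interval_s)
        else:
            if last_exc_type is not None:
                policy.record(last_exc_type, True)
            return result
```

Setting interval_s to 0 gives the immediate retry mentioned above, and the on_give_up callback is where a configured failure policy (detailed log, alarm, fallback flow, or skip) would be attached.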

Preferably, the exception capture mechanism and the parameterized retry decision are implemented as a separate module, which can be easily integrated into other parts of the system or replaced and upgraded as required; the parameterized and modular design not only improves the flexibility and configurability of the system, but also enhances its stability and reliability, and ensures that the system responds reasonably when facing various abnormal conditions.

Preferably, in this embodiment, a logging framework is selected and configured, for example the logging module in the Python standard library or a third-party library such as Loguru; the log data includes, but is not limited to, error type, timestamp, and stack trace. Further, an intelligent analysis engine is developed to automatically analyze the log data, identify abnormal patterns and potential problems in the operation of the system, and predict anomalies so that early warnings can be issued; by using machine learning algorithms, the intelligent analysis engine can predict failure trends and issue warnings in advance.

Further, this embodiment provides a configuration interface that allows the user to customize the monitoring policy according to specific requirements, including setting an error threshold, logging options, and early-warning conditions; a user-friendly configuration interface lets the user set and manage these parameters easily.

Further, this embodiment integrates a closed-loop feedback mechanism, which ensures that detected problems are fed back to system maintenance personnel in a timely manner, and supports an automated workflow that triggers the problem-resolution process according to the monitoring results, including notifying the relevant personnel and recording the resolution status.

Furthermore, this embodiment also provides a visual monitoring interface that displays key performance indicators in real time and supports dynamic data updates and multidimensional data analysis, so that a user can understand the running condition of the system from different angles; the key performance indicators include, but are not limited to, system running status, success rate, and error distribution.
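By way of illustration only, the key indicators and a configurable error threshold might be aggregated as sketched below. The class name RunMetrics, the snapshot layout, and the interpretation of the error threshold as a maximum tolerated failure ratio are assumptions of this example.

```python
from collections import Counter


class RunMetrics:
    """Aggregates success rate and error distribution for a dashboard."""

    def __init__(self, error_threshold=0.2):
        # Assumed meaning: alert when the failure ratio exceeds this value.
        self.error_threshold = error_threshold
        self.successes = 0
        self.errors = Counter()

    def record_success(self):
        self.successes += 1

    def record_error(self, error_type):
        self.errors[error_type] += 1

    def snapshot(self):
        failures = sum(self.errors.values())
        total = self.successes + failures
        success_rate = self.successes / total if total else 1.0
        return {
            "total": total,
            "success_rate": success_rate,
            "error_distribution": dict(self.errors),
            "alert": total > 0 and (failures / total) > self.error_threshold,
        }
```

A monitoring interface could poll snapshot periodically and raise an early warning whenever the alert flag is set.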

Furthermore, this embodiment can also generate periodic data reports: it collects and analyzes system performance data over a period of time and presents it in forms including, but not limited to, charts, statistics, and trend analysis, providing decision support for users. The system also has strict security and access control mechanisms to protect sensitive data; role-based access control is implemented, ensuring that only authorized users can access or modify the monitoring configuration and log data.

Example 2

As shown in FIG. 4, this embodiment presents, by way of example, a system for intelligent extraction and optimization of web content, which includes an expression management module, a web content crawling module, a web content matching module, an expression generating module, an exception capture and retry module, and a monitoring and statistics module;

The expression management module manages and maintains the expression arrays; in this embodiment two expression arrays are created, which store the regular expressions for matching detail pages and list pages respectively; the web content crawling module acquires web page content and removes the HTML tags to obtain preprocessed data; the web content matching module traverses the expression array and matches the preprocessed data; the expression generating module calls an intelligent algorithm to generate a new expression; the exception capture and retry module captures exceptions, records exception data, and makes parameterized retry decisions; and the monitoring and statistics module records and automatically analyzes the log data, provides a visual monitoring interface, and displays key indicators.

Specifically, in this embodiment, two arrays are created in the expression management module: one stores the expressions for detail pages, and list_page_regex_array stores the expressions for list pages; a database and a file system are used to persist the expressions, and an operation interface is implemented. Further, database table structures are designed to store the regular expressions and related information (e.g., creation time, update time, source, etc.); an operation interface (e.g., add, delete, query) is implemented and kept synchronized with the expression array. A RESTful API or command-line interface is designed to allow a user to manually add, edit, or delete expressions; the interface back-end logic is implemented to process user requests and update the regular expression array.

The web content crawling module takes a target web page URL as input, starts the crawler, acquires the web page source data, removes the HTML tags, and obtains the preprocessed data.

The web content matching module first performs a preliminary check on the web page source data to determine whether the corresponding expression array is empty; if so, the web page source data is passed to the expression generating module, which in this embodiment calls a large model, generates a corresponding regular expression from the given input, and adds it to the expression array; if the array is not empty, each element of the expression array is traversed: each regular expression in the array is applied to the web page content to be processed, the matching operation is executed, and the matching result of each regular expression is recorded or processed.

Further, in the web content matching module, the type of each web page to be extracted is determined to be a list page or a detail page; if the page is a list page, the expression array is traversed and the matching operation is executed with each regular expression; if any regular expression matches, further processing of the current web page source data is skipped. If the page is a detail page, the expression array is likewise traversed and the matching operation is executed with each regular expression; if any regular expression matches, the relevant content is extracted according to the rule of that regular expression.

Web page source data that fails to match is passed to the expression generating module, which calls a large model to generate a new regular expression and adds it to the expression array.

The exception capture and retry module contains an exception capture and retry mechanism: exception capture logic is added at key steps, and an error log is recorded when an exception is captured; parameterized retry decisions are made by setting a maximum number of retries and a retry interval and deciding whether and when to retry according to the exception type and the historical retry success rate; for problems that are very likely transient, the system may choose to retry immediately; for problems that may require manual intervention, the detailed information is recorded and the task is skipped. Further, when the configured maximum number of retries is reached, the system performs the failure handling operations defined by the configured policy, including but not limited to recording a detailed failure log, sending alarm notifications, executing a fallback flow, or simply skipping the current task. Preferably, the exception capture and retry module is a stand-alone module that can be easily integrated into other parts of the system or replaced and upgraded as needed.

The monitoring and statistics module contains a logging mechanism that records log data automatically, wherein the log data includes detailed information on key operations together with exception data; the exception data includes, but is not limited to, error type, timestamp, stack trace, operator identifier, task identifier, and processing duration. Further, an intelligent analysis engine is developed to automatically analyze the log data and predict anomalies so that early warnings can be issued. Further, a configuration interface is provided to customize the monitoring policy, including setting error thresholds, logging options, and early-warning conditions. Further, a closed-loop feedback mechanism is integrated to ensure that detected problems are fed back to system maintenance personnel in a timely manner, and an automated workflow is supported that triggers the problem-resolution process according to the monitoring results, including notifying the relevant personnel and recording the resolution status. Further, a visual monitoring interface is provided that displays key indicators and supports dynamic data updates and multidimensional data analysis, so that a user can understand the running condition of the system from different angles; the key indicators include, but are not limited to, system running status, success rate, and error distribution. Further, the module can generate periodic data reports: it collects and analyzes system performance data over a period of time and presents it in forms including, but not limited to, charts, statistics, and trend analysis, providing decision support for users. Furthermore, the system also has strict security and access control mechanisms to protect sensitive data; role-based access control is implemented, ensuring that only authorized users can access or modify the monitoring configuration and log data.

Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims. It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. A method for intelligently extracting and optimizing web page content, characterized by comprising the following steps:

creating an expression array for storing expressions that match web page content; persistently storing the expressions using a database and a file system, and designing interfaces that support add, delete, and query operations;

crawling web page content: setting the URL and crawling rules of a target website, starting a crawling program to acquire web page source data, and preprocessing the web page source data to obtain preprocessed data;

judging whether the expression array corresponding to the web page source data is empty; if it is empty, calling an intelligent algorithm to generate an expression, and if it is not empty, traversing the expression array to match the preprocessed data;

deploying an exception capture mechanism and a logging mechanism, and performing parameterized configuration of retry decisions.

2. The method of claim 1, wherein the preprocessing includes removing HTML tags from the web page source data to obtain the preprocessed data.

3. The method for intelligently extracting and optimizing web page content according to claim 1, wherein calling the intelligent algorithm to generate a new expression comprises transmitting the preprocessed data and a target content indication to the intelligent algorithm, which returns the corresponding expression.

4. The method for intelligently extracting and optimizing web page content according to claim 3, wherein a parser extracts the expression generated by the intelligent algorithm and verifies it, and if the verification succeeds, the expression is added to the corresponding expression array.

5. The method for intelligently extracting and optimizing web page content according to claim 1, wherein the exception capture mechanism records exception data when an exception is captured, and the logging mechanism automatically records log data; the log data comprises detailed information on key operations and the exception data; the exception data includes, but is not limited to, error type, timestamp, stack trace, operator identification, task identification, and processing duration.

6. The method for intelligently extracting and optimizing web page content according to claim 5, wherein an intelligent analysis engine is developed to automatically parse the log data and predict anomalies so that early warnings can be issued.

7. The method for intelligently extracting and optimizing web page content according to claim 1, wherein the parameterized configuration of retry decisions comprises configuring a maximum number of retries and a retry interval; the retry decision includes deciding whether to retry, and when to retry, according to the exception type and the historical retry success rate.

8. The method of claim 1, further comprising providing a configuration interface for customizing the monitoring policy, including setting error thresholds, log entry selections, and early-warning conditions.

9. The method for intelligently extracting and optimizing web page content according to claim 1, further comprising providing a visual monitoring interface that displays key performance indicators including, but not limited to, system running status, success rate, and error distribution.

10. A system for intelligently extracting and optimizing web page content, characterized by comprising an expression management module, a web page content crawling module, a web page content matching module, an expression generation module, an exception capture and retry module, and a monitoring and statistics module;

the monitoring and statistics module records and automatically analyzes log data, provides a visual monitoring interface, and displays key performance indicators.
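
By way of non-limiting example only, and without altering the scope of the claims, the claimed flow might be sketched in Python as follows: preprocess by removing HTML tags, traverse a non-empty expression array, otherwise call a generator and verify its output before storing it, and wrap fetching in a parameterized retry. All names here (`strip_tags`, `extract`, `with_retry`, `generate_expression`) are hypothetical, and several points are assumptions of this sketch: expressions are taken to be regular expressions, "verification" is taken to mean that the expression compiles and matches the preprocessed data, the generator is a placeholder callable standing in for the large-model call, and the retry decision considers only the exception type, omitting the historical retry success rate.

```python
import re
import time


def strip_tags(html):
    """Naive preprocessing (assumption): drop anything shaped like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", html)


def extract(html, expressions, generate_expression, target_hint):
    """Match with stored expressions, or generate, verify, and store a new one."""
    text = strip_tags(html)
    if expressions:
        # Array is not empty: traverse it and return the first match.
        for pattern in expressions:
            match = re.search(pattern, text)
            if match:
                return match.group(0)
        return None
    # Array is empty: ask the (placeholder) generator for an expression.
    candidate = generate_expression(text, target_hint)
    try:
        compiled = re.compile(candidate)
    except re.error:
        return None  # verification failed: not a valid regular expression
    match = compiled.search(text)
    if match is None:
        return None  # verification failed: no match on the preprocessed data
    expressions.append(candidate)  # verified, so add it to the array
    return match.group(0)


def with_retry(func, max_retries=3, interval_s=1.0,
               retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func, retrying on retryable exception types up to max_retries times."""
    attempt = 0
    while True:
        try:
            return func()
        except retryable:
            if attempt >= max_retries:
                raise
            attempt += 1
            sleep(interval_s)
```

In use, a crawler call could be wrapped as `with_retry(lambda: fetch(url), max_retries=5, interval_s=2.0)`, where `fetch` is whatever crawling function the deployment provides, and its result passed to `extract` together with the expression array for that site.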