CN119003904A - Method and system for intelligently extracting and optimizing webpage content - Google Patents

Publication number
CN119003904A
Authority
CN
China
Prior art keywords
expression
data
retry
web page
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410934428.7A
Other languages
Chinese (zh)
Inventor
林珊
唐世洁
陈誉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ylink Computing System Co ltd
Original Assignee
Shenzhen Ylink Computing System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Ylink Computing System Co ltd
Priority to CN202410934428.7A
Publication of CN119003904A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a method and system for intelligent extraction and optimization of web page content, comprising: creating an expression array for storing expressions that match web page content; crawling web page content, obtaining web page source data, and preprocessing it to obtain preprocessed data; determining whether the expression array corresponding to the source data is empty and, if it is empty, calling an intelligent algorithm to generate an expression, or, if it is not empty, traversing the expression array to match the preprocessed data; and deploying an exception capture mechanism and a logging mechanism, and making retry decisions under a parameterized configuration. Calling a large model to automatically generate and optimize regular expressions reduces manual effort and improves work efficiency. The parameterized intelligent retry logic allows users to configure the maximum number of retries and the retry interval as needed, so these parameters can easily be adjusted to suit different network conditions or server response times. This improves the flexibility and configurability of the system as well as its stability and reliability.

Description

Method and system for intelligently extracting and optimizing webpage content
Technical Field
The invention relates to the technical field of web information crawling, and in particular to a method and a system for intelligently extracting and optimizing web page content.
Background
In the current digital age, the amount of information on the internet is growing exponentially, and the effective extraction and analysis of web page content has become a key link in numerous business scenarios, such as big data analysis, competitive intelligence gathering, and content aggregation services.
With the rapid development of artificial intelligence technology, especially the breakthrough in the fields of Natural Language Processing (NLP) and machine learning, new possibilities are provided for solving the problems. However, there is currently a lack of a solution in the market that efficiently fuses artificial intelligence with traditional techniques (such as regular expressions) to intelligently automate the extraction process of web page content, especially for complex and varied web page structures.
At present, for intelligent extraction of webpage content, the following technical problems exist:
(1) Web page content extraction depends on regular expressions, and traditional approaches require a large amount of manual work to write and optimize them, which increases labor cost and can lead to low efficiency; updating and expanding a regular expression library also often depends on manual collection and verification, which is inefficient and error-prone;
(2) Website structures change frequently, for example through page layout adjustments or element ID changes, and such changes may cause the original regular expressions to fail, so that the required data cannot be extracted correctly.
Disclosure of Invention
The invention aims to provide a method and a system for intelligently extracting and optimizing web page content, so as to solve the problems identified in the background: existing extraction methods increase labor cost, are inefficient, and rely on regular expressions that fail when the website structure changes.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
According to one aspect of the present invention, there is provided a method for intelligent extraction and optimization of web content, the method comprising:
Creating an expression array for storing expressions that match web page content; persisting the expressions using a database and a file system, and designing an interface that supports add, delete, and query operations;
Crawling web page content: setting the URL and crawling rules of a target website, starting a crawling program to acquire the web page content, and preprocessing the web page content to obtain preprocessed data;
Determining whether the expression array corresponding to the web page content is empty; if it is empty, calling an intelligent algorithm to generate an expression, and if it is not empty, traversing the expression array to match the preprocessed data;
If the matching succeeds, extracting the related content and storing the corresponding expression in the expression array; if the matching fails, calling the intelligent algorithm to generate a new expression;
Deploying an exception capture mechanism and a logging mechanism, and making retry decisions under a parameterized configuration.
On the basis of the foregoing scheme, the preprocessing comprises removing the HTML tags from the web page source data to obtain the preprocessed data.
On the basis of the foregoing scheme, calling the intelligent algorithm to generate a new expression comprises calling the intelligent algorithm, inputting the preprocessed data and an indication of the target content, and returning a matching expression.
On the basis of the foregoing scheme, a parser is written to extract and verify the expression generated by the intelligent algorithm, and the expression is added to the corresponding expression array if verification succeeds.
On the basis of the foregoing scheme, the exception capture mechanism records exception data when an exception is captured, and the logging mechanism records log data automatically; the log data comprises detailed information about key operations together with the exception data; the exception data includes, but is not limited to, error type, timestamp, stack trace, operator identification, task identification, and processing duration.
On the basis of the foregoing scheme, an intelligent analysis engine is developed to automatically analyze the log data and predict anomalies so as to issue early warnings.
On the basis of the foregoing scheme, the parameterized retry configuration comprises a maximum number of retries and a retry interval; the retry decision comprises deciding whether to retry, and when, according to the exception type and the historical retry success rate.
On the basis of the foregoing scheme, the method further comprises providing a configuration interface for customizing the monitoring strategy, including setting an error threshold, logging options, and early-warning conditions.
On the basis of the foregoing scheme, the method further comprises providing a visual monitoring interface that displays key indicators, the key indicators including but not limited to system running state, success rate, and error distribution.
According to another aspect of the invention, a system for intelligently extracting and optimizing web content is provided, which comprises an expression management module, a web content crawling module, a web content matching module, an expression generating module, an anomaly capturing and retrying module and a monitoring and statistics module;
The expression management module manages and maintains the expression array;
The web page content crawling module acquires web page source data and preprocesses it to obtain preprocessed data;
The web page content matching module traverses the expression array to match the preprocessed data;
The expression generating module calls an intelligent algorithm to generate a new expression;
The exception capture and retry module captures exceptions, records exception data, and makes retry decisions under a parameterized configuration;
The monitoring and statistics module records and automatically analyzes the log data, provides a visual monitoring interface, and displays key indicators.
Compared with the prior art, the invention has at least the following advantages and positive effects:
(1) The intelligent algorithm is called to automatically generate and optimize the regular expression, so that the labor investment is reduced, and the working efficiency is improved;
(2) Parameterized intelligent retry logic allows a user to configure maximum retry times and retry intervals as required, and these parameters can be easily adjusted to accommodate different network conditions or server response times; the flexibility and the configurability of the system are improved, and the stability and the reliability of the system are improved;
(3) The regular expressions are stored persistently, and the database and the file system are updated whenever a new regular expression is generated, so the regular expression library grows richer and more accurate the more it is used; the system can respond flexibly to changes in the structure of different websites, data extraction failures caused by web page revisions are reduced, and the stability and continuity of data extraction are ensured.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 is a schematic diagram of a method for intelligently extracting and optimizing web page content according to the present invention;
FIG. 2 is a schematic diagram of a large model call to generate a regular expression in accordance with the present invention;
FIG. 3 is a diagram illustrating matching of web page source data and an expression array according to the present invention;
FIG. 4 is a schematic diagram of a system for intelligent extraction and optimization of web page content according to the present invention.
Detailed Description
For a more clear explanation of the objects, technical solutions and advantages of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments, and that the exemplary embodiments can be implemented in various forms and should not be construed as being limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
The invention will be described in detail with reference to specific examples below:
Example 1
As shown in fig. 1, the embodiment provides a method for intelligently extracting and optimizing web content, which comprises the following specific steps:
S1: Creating an expression array for storing expressions that match web page content; persisting the expressions using a database and a file system, and designing an interface that supports add, delete, and query operations;
In this embodiment, two arrays are created: detail_page_regex_array, which stores the expressions for detail pages, and list_page_regex_array, which stores the expressions for list pages;
Further, a suitable database and file system are selected to store the expressions persistently, so that historical data can still be used after the system restarts; a database table structure is designed to store the expressions and related information (e.g., creation time, update time, source); and an operation interface (e.g., add, delete, query) is implemented and kept synchronized with the expression array.
Further, a RESTful API or command line interface is designed to allow users to manually add, edit, or delete expressions; the back-end logic of the interface is implemented to process user requests and update the expression array.
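By way of illustration only, the persistence and operation interface of step S1 could be sketched as follows. The sketch assumes SQLite as the database and covers only the database half of the "database and file system" persistence; the class name ExpressionStore, the table name, and the column names are hypothetical and do not appear in the specification.

```python
import sqlite3
from datetime import datetime, timezone


class ExpressionStore:
    """Keeps the regex arrays in memory and mirrors them to a SQLite table."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS expressions ("
            " page_type TEXT NOT NULL,"
            " pattern TEXT NOT NULL,"
            " source TEXT,"
            " created_at TEXT NOT NULL,"
            " PRIMARY KEY (page_type, pattern))"
        )
        # One in-memory array per page type, reloaded from the table on start
        # so that historical expressions survive a restart.
        self.arrays = {"detail": [], "list": []}
        rows = self.conn.execute(
            "SELECT page_type, pattern FROM expressions ORDER BY rowid"
        )
        for page_type, pattern in rows:
            self.arrays.setdefault(page_type, []).append(pattern)

    def add(self, page_type, pattern, source="manual"):
        """Add a pattern; return False if it is already stored."""
        if pattern in self.arrays[page_type]:
            return False
        now = datetime.now(timezone.utc).isoformat()
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO expressions VALUES (?, ?, ?, ?)",
                (page_type, pattern, source, now),
            )
        self.arrays[page_type].append(pattern)
        return True

    def delete(self, page_type, pattern):
        with self.conn:
            self.conn.execute(
                "DELETE FROM expressions WHERE page_type = ? AND pattern = ?",
                (page_type, pattern),
            )
        if pattern in self.arrays[page_type]:
            self.arrays[page_type].remove(pattern)

    def query(self, page_type):
        """Return a copy of the array for the given page type."""
        return list(self.arrays[page_type])
```

A RESTful or command line front end, as described above, would then only need to call add, delete, and query; every write goes to the table first and to the in-memory array second, which keeps the two synchronized.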
S2: Crawling web page content: setting the URL and crawling rules of a target website, starting a crawling program to acquire web page source data, and preprocessing it to obtain preprocessed data;
In this embodiment, the Scrapy framework for Python is used to crawl web page content: a target web page URL is input and the crawling program is started to obtain the web page content, that is, the web page source data; the web page source data is then preprocessed by removing its HTML tags to obtain the preprocessed data.
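The tag-removal preprocessing could, for example, be sketched with the HTMLParser class of the Python standard library. The crawling itself is omitted here; the function name strip_html_tags is hypothetical, and discarding the contents of script and style elements is an assumption of this sketch rather than a requirement stated in the specification.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the contents of script/style elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by the removed tags.
        return " ".join(" ".join(self._chunks).split())


def strip_html_tags(source):
    """Return the visible text of an HTML document as one normalized string."""
    parser = _TextExtractor()
    parser.feed(source)
    parser.close()
    return parser.text()
```

For instance, strip_html_tags("<h1>Title</h1><p>Hello &amp; bye</p>") yields the text "Title Hello & bye".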
S3: Determining whether the expression array corresponding to the web page source data is empty; if it is empty, calling an intelligent algorithm to generate an expression, and if it is not empty, traversing the expression array to match the preprocessed data;
In this embodiment, a preliminary check is first performed for the preprocessed data. If the array storing the expressions is empty, an intelligent algorithm, that is, an external service or an internal algorithm, must be called; in this embodiment a large model is called, which generates a corresponding regular expression from the given input, and the regular expression is added to the expression array. If the array is not empty, each element of the expression array is traversed: each regular expression in the array is applied to the web page content to be processed, a matching operation is executed, and the matching result of each regular expression is recorded or processed;
Further, the matched results are processed or analyzed further according to the business requirements; the results may need to be saved to a database or file, or sent to other systems over a network; if no regular expression is found that matches certain key parts of the web page content, an attempt may be made to regenerate or adjust the regular expressions.
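The traversal of step S3 can be illustrated with the following minimal sketch, which uses only the re module of the Python standard library. The function names match_expressions and first_successful are hypothetical, and treating an uncompilable stored pattern as a recorded failure (rather than raising) is a design assumption of the sketch.

```python
import re


def match_expressions(expressions, text):
    """Apply each regex in order and record its result.

    Returns a list of (pattern, found) pairs, where found is the list
    returned by re.findall, or None if the pattern is not a valid regex.
    """
    results = []
    for pattern in expressions:
        try:
            found = re.findall(pattern, text)
        except re.error:
            # An invalid stored pattern is recorded as a failure, not raised.
            results.append((pattern, None))
            continue
        results.append((pattern, found))
    return results


def first_successful(results):
    """Return the first (pattern, found) pair with a non-empty match, else None."""
    for pattern, found in results:
        if found:
            return pattern, found
    return None
```

An empty expressions list produces an empty result list, which corresponds to the branch in which the intelligent algorithm must be called to generate the first expression.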
Specifically, in this embodiment, as shown in FIG. 2, calling the large model to generate the regular expression includes:
Selecting or training a general-purpose large model (e.g., a Transformer-based model) capable of generating regular expressions, and calling the large model with the preprocessed data and an indication of the target content as input to generate a regular expression. Further, a parser is written to extract the regular expression from the output of the large model; the generated regular expression is verified to ensure its validity and accuracy, and the verified regular expression is added to the corresponding expression array and synchronized to the database and the file system.
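The parsing and verification step might look like the sketch below. The specification does not define the output format of the large model, so the sketch assumes, purely for illustration, that the model is prompted to answer with a line of the form "REGEX: <pattern>"; the function names and the verification criterion (the pattern must compile and must capture a known expected value from a sample) are likewise assumptions.

```python
import re

PREFIX = "REGEX:"  # assumed reply convention, not taken from the specification


def extract_pattern(model_reply):
    """Return the pattern from the first 'REGEX: <pattern>' line, or None."""
    for line in model_reply.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith(PREFIX):
            return stripped[len(PREFIX):].strip()
    return None


def verify_pattern(pattern, sample_text, expected):
    """Accept a pattern only if it compiles and finds the expected content.

    If the pattern has capture groups, the first group is compared with
    expected; otherwise the whole match is compared.
    """
    if not pattern:
        return False
    try:
        compiled = re.compile(pattern)
    except re.error:
        return False
    match = compiled.search(sample_text)
    if match is None:
        return False
    captured = match.group(1) if compiled.groups else match.group(0)
    return captured == expected
```

Only a pattern for which verify_pattern returns True would be added to the expression array and synchronized to persistent storage.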
S4: If the matching succeeds, extracting the related content and storing the corresponding expression in the expression array; if the matching fails, calling the intelligent algorithm to generate a new expression;
With reference to FIG. 3, after the matching operation is performed in this embodiment, whether the matching succeeded is determined for each web page to be extracted, and if it succeeded, the type of the web page is determined to be either a list page or a detail page. When the web page source data is a list page and the matching succeeds, the page is skipped; when the web page source data is a detail page and the matching succeeds, the related content is extracted and stored; for web page source data that fails to match, the large model is called to generate a new regular expression;
Specifically, if the page is a list page, the expression array is traversed and a matching operation is performed with each regular expression; if any regular expression matches, further processing of the current web page source data is skipped (since list pages typically do not need specific content to be extracted); it is also possible that all regular expressions fail to match, but this typically does not require immediate action, since the main purpose of a list page may simply be navigation or indexing.
If the page is a detail page, the expression array is likewise traversed and a matching operation is performed with each regular expression; if any regular expression matches, the relevant content is extracted according to that regular expression and stored in a database; if all regular expressions fail to match, the next step is performed.
For web page source data that fails to match, in this embodiment the large model is called to generate a new regular expression, and the new regular expression is added to the regular expression array; a policy may be required to decide whether it replaces or supplements the existing regular expressions. Optionally, the web page source data that failed to match is immediately re-matched using the newly generated regular expression, or this task is deferred to a subsequent batch process; when all the web page source data have been processed, this step ends.
Further, when the large model is called to generate a new regular expression, additional information about the page that failed to match (e.g., the URL of the page or a partial content sample) may need to be provided so that the large model can generate an applicable regular expression more accurately; the newly generated regular expression should be verified or tested to ensure its correctness and validity; data cleansing and formatting may be required before the content is stored in the database, to ensure the consistency and readability of the data; and if large amounts of web page source data are processed, performance optimization and error handling mechanisms may need to be considered to ensure system stability and reliability.
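The list-page and detail-page handling of step S4 can be summarized in the following sketch. It assumes that the stored patterns have already been verified (so re.findall does not raise), that a newly generated pattern supplements rather than replaces the existing ones, and that the re-match is attempted immediately; generate_expression stands in for the large-model call and, like the other names, is hypothetical.

```python
import re


def process_page(page_type, text, arrays, generate_expression):
    """Handle one page and return a (status, payload) pair.

    status is 'skipped' (list page matched), 'extracted' (detail page
    matched, payload holds the content) or 'unmatched'.
    """
    for pattern in arrays[page_type]:
        found = re.findall(pattern, text)
        if found:
            if page_type == "list":
                return "skipped", None      # list pages need no extraction
            return "extracted", found       # detail page: keep the content

    if page_type == "list":
        return "unmatched", None            # usually no immediate action

    # Detail page with no matching pattern: ask the generator for a new one.
    new_pattern = generate_expression(text)
    if not new_pattern:
        return "unmatched", None
    if new_pattern not in arrays[page_type]:
        arrays[page_type].append(new_pattern)   # supplement, do not replace
    found = re.findall(new_pattern, text)       # immediate re-match
    return ("extracted", found) if found else ("unmatched", None)
```

Deferring the re-match to a later batch, the other option mentioned above, would amount to returning 'unmatched' right after the append and queuing the page for a later run.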
S5: Deploying an exception capture mechanism and a logging mechanism, and making retry decisions under a parameterized configuration;
In this embodiment, an exception capture mechanism and a logging mechanism are also provided. Exception capture logic is added to key operations such as content matching and intelligent algorithm calls; when an exception is captured, the logging mechanism is triggered to record the exception data, and log data is recorded automatically. The log data comprises detailed information about key operations together with the exception data; the exception data includes, but is not limited to, error type, timestamp, stack trace, operator identification, task identification, and processing duration. Preferably, an intelligent analysis engine is also developed in this embodiment to automatically analyze the log data and predict anomalies so as to issue early warnings.
Further, the parameterized retry decision includes configuring a maximum number of retries and a retry interval; the retry decision includes deciding whether to retry, and when, according to the exception type and the historical retry success rate. The retry logic is implemented as follows: a maximum number of retries is set, a short delay is inserted before each retry to avoid overly frequent calls, and when the retry limit is reached, the failure details are recorded and a skip strategy is executed. Specifically, whether to retry and when are determined intelligently according to the exception type and the historical retry success rate; for problems that are very likely to be temporary, the system may choose to retry immediately, whereas for problems that may require manual intervention, the detailed information is recorded and the task is skipped. Further, when the configured maximum number of retries is reached, the system performs the corresponding failure handling operations according to the configured policy, including but not limited to recording detailed failure logs, sending alarm notifications, executing backup flows, or simply skipping the current task.
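One possible shape of the parameterized retry decision is sketched below. The specification names only the maximum number of retries and the retry interval as parameters; the set of exception types treated as temporary, the minimum success rate, and the minimum sample size before the historical rate is consulted are additional assumptions of this sketch, as are all class and function names.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RetryPolicy:
    max_retries: int = 3
    interval_seconds: float = 1.0
    # Assumption: only these exception types are considered temporary.
    transient: tuple = (TimeoutError, ConnectionError)
    min_success_rate: float = 0.2   # assumed threshold
    min_samples: int = 5            # ignore the rate until enough history
    history: dict = field(default_factory=dict)  # type name -> [wins, tries]

    def should_retry(self, exc, attempt):
        if attempt >= self.max_retries or not isinstance(exc, self.transient):
            return False
        wins, tries = self.history.get(type(exc).__name__, (0, 0))
        if tries < self.min_samples:
            return True
        return wins / tries >= self.min_success_rate

    def record(self, exc, succeeded):
        """Record whether a retry triggered by exc eventually succeeded."""
        stats = self.history.setdefault(type(exc).__name__, [0, 0])
        stats[0] += int(succeeded)
        stats[1] += 1


def run_with_retry(task, policy, sleep=time.sleep):
    """Run task(); return (result, None) on success or (None, exc) on give-up."""
    attempt = 0
    last_exc = None
    while True:
        try:
            result = task()
        except Exception as exc:
            if last_exc is not None:
                policy.record(last_exc, succeeded=False)  # previous retry failed
            if not policy.should_retry(exc, attempt):
                return None, exc    # caller logs the details and skips the task
            last_exc = exc
            attempt += 1
            sleep(policy.interval_seconds)  # short delay before each retry
        else:
            if last_exc is not None:
                policy.record(last_exc, succeeded=True)
            return result, None
```

Setting interval_seconds to 0 gives the "retry immediately" behaviour, and a non-transient exception such as a parsing error is returned to the caller at once, which corresponds to recording the details and skipping.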
Preferably, the exception capture mechanism and the parameterized retry decision are implemented as a separate module, which can easily be integrated into other parts of the system or replaced and upgraded as required; this parameterized and modular design not only improves the flexibility and configurability of the system but also enhances its stability and reliability, ensuring that the system responds reasonably to various abnormal conditions.
Preferably, in this embodiment, a logging framework is selected, for example the logging module in the Python standard library or a third-party library such as Loguru, and is configured to record log data including, but not limited to, error type, timestamp, and stack trace. Further, an intelligent analysis engine is developed to automatically analyze the log data, identify abnormal patterns and potential problems in the operation of the system, and predict anomalies so as to issue early warnings; by using a machine learning algorithm, the intelligent analysis engine can predict fault trends and issue warnings in advance.
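As an illustration of the exception data listed above, the following sketch uses the logging module of the Python standard library to emit one JSON record per captured exception. The logger name "extractor", the field names, and the helper run_logged are hypothetical; emitting JSON is a choice made here so that a later analysis engine could parse the records, not something the specification prescribes.

```python
import json
import logging
import time
import traceback
from datetime import datetime, timezone

logger = logging.getLogger("extractor")  # hypothetical logger name


def run_logged(task_id, operator, operation):
    """Run operation(); on failure, log one JSON record and re-raise."""
    started = time.monotonic()
    try:
        return operation()
    except Exception as exc:
        record = {
            "error_type": type(exc).__name__,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "stack_trace": traceback.format_exc(),
            "operator": operator,
            "task_id": task_id,
            "duration_seconds": round(time.monotonic() - started, 6),
        }
        logger.error(json.dumps(record))
        raise  # the retry logic, not the logger, decides what happens next
```

Re-raising after logging keeps the two concerns separate: this helper only records the exception data, while the retry decision remains with the caller.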
Further, the present embodiment provides a configuration interface that allows the user to customize the monitoring policy according to specific requirements, including setting an error threshold, logging options, and early-warning conditions; through a user-friendly configuration interface, the user can easily set and manage these parameters.
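A customizable monitoring policy of this kind could be represented, for example, as a small configuration object. The field names, default values, and the early-warning rule below (warn when the error rate exceeds the threshold on a sufficiently large sample) are assumptions made for this sketch and are not defined in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringPolicy:
    error_threshold: float = 0.3    # warn when the error rate exceeds this
    min_events: int = 10            # do not warn on very small samples
    log_stack_traces: bool = True   # example of a logging option


def should_warn(policy, failures, total):
    """Return True when the early-warning condition of the policy is met."""
    if total < policy.min_events:
        return False
    return failures / total > policy.error_threshold
```

A configuration interface would then create or replace a MonitoringPolicy instance from the values entered by the user, for example MonitoringPolicy(error_threshold=0.1, min_events=50).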
Further, the embodiment integrates a closed-loop feedback mechanism, ensures that the monitored problems can be timely fed back to system maintenance personnel, supports an automatic workflow, and can automatically trigger the problem solving process according to the monitoring result, including informing related personnel and recording the problem solving state.
Furthermore, the embodiment also provides a visual monitoring interface which can display key performance indexes in real time, support dynamic data update and multidimensional data analysis, and enable a user to know the running condition of the system from different angles; the key performance indicators include, but are not limited to, system operating status, success rate, error distribution.
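The success rate and error distribution shown on such an interface could be computed from recorded task outcomes roughly as follows. The event format, a (succeeded, error_type) pair per task, and the function name summarize are assumptions of this sketch.

```python
from collections import Counter


def summarize(events):
    """Aggregate task outcomes into key indicators.

    events is an iterable of (succeeded, error_type) pairs, where
    error_type is a string for failures and None for successes.
    """
    total = 0
    successes = 0
    errors = Counter()
    for succeeded, error_type in events:
        total += 1
        if succeeded:
            successes += 1
        else:
            errors[error_type or "unknown"] += 1
    rate = successes / total if total else 0.0
    return {
        "total": total,
        "success_rate": rate,
        "error_distribution": dict(errors),
    }
```

Recomputing the summary over a sliding window of recent events would be one way to support the dynamic data updates mentioned above.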
Furthermore, the embodiment can also generate periodic data reports: it collects and analyzes system performance data over a period of time and presents the results in forms including but not limited to charts, statistical data, and trend analysis, providing decision support for users. The system also has strict security and access control mechanisms to ensure the protection of sensitive data; role-based access control is implemented, ensuring that only authorized users can access or modify the monitoring configuration and log data.
Example 2
As shown in fig. 4, the present embodiment exemplarily presents a system for intelligent extraction and optimization of web content, which includes an expression management module, a web content crawling module, a web content matching module, an expression generating module, an anomaly capturing and retry module, and a monitoring and statistics module;
The expression management module is used for managing and maintaining expression arrays, and two expression arrays are created in the embodiment and are respectively used for storing regular expressions of the matching detail page and the list page; the webpage content crawling module acquires webpage content and performs preprocessing for removing HTML labels to obtain preprocessed data; the webpage content matching module traverses the expression array and matches the preprocessing data; the expression generating module calls an intelligent algorithm to generate a new expression; the anomaly capturing and retrying module captures anomalies, records anomaly data and parameterizes configuration retrying decisions; and the monitoring and statistics module is used for recording and automatically analyzing the log data, providing a visual monitoring interface and displaying key indexes.
Specifically, in this embodiment, two arrays are created in the expression management module: detail_page_regex_array, which stores the expressions for detail pages, and list_page_regex_array, which stores the expressions for list pages; the expressions are persisted using a database and a file system, and an operation interface is implemented. Further, a database table structure is designed to store the regular expressions and related information (e.g., creation time, update time, source); an operation interface (e.g., add, delete, query) is implemented and kept synchronized with the expression array. A RESTful API or command line interface is designed to allow users to manually add, edit, or delete expressions; the back-end logic of the interface is implemented to process user requests and update the regular expression array.
The webpage content crawling module inputs a target webpage URL, starts a crawling program, acquires webpage source data, performs preprocessing for removing an HTML tag, and acquires preprocessed data.
The web page content matching module first performs a preliminary check on the web page source data to determine whether the corresponding expression array is empty. If it is empty, the web page source data is passed to the expression generating module, which in this embodiment calls a large model, generates a corresponding regular expression from the given input, and adds it to the expression array. If it is not empty, each element of the expression array is traversed: each regular expression in the array is applied to the web page content to be processed, a matching operation is executed, and the matching result of each regular expression is recorded or processed.
Further, in the web page content matching module, the type of each web page to be extracted is determined to be either a list page or a detail page. If the page is a list page, the expression array is traversed and a matching operation is performed with each regular expression; if any regular expression matches, further processing of the current web page source data is skipped. If the page is a detail page, the expression array is likewise traversed and a matching operation is performed with each regular expression; if any regular expression matches, the relevant content is extracted according to that regular expression.
And for the web page source data with failed matching, transmitting the web page source data to an expression generating module, calling a large model to generate a new regular expression, and adding the new regular expression into an expression array.
An exception capturing and retrying mechanism is arranged in the exception capturing and retrying module, exception capturing logic is added in a key link, and an error log is recorded when an exception is captured; performing parameterized configuration retry decision, setting maximum retry times and retry intervals, and intelligently determining whether to perform retry and the retry time according to the anomaly type and the historical retry success rate; for some higher deterministic temporary problems, the system may choose to retry immediately; and for problems that may require manual intervention, the detailed information is recorded and skipped. Further, when the set maximum number of retries is reached, the system will perform corresponding failure handling operations according to the configured policies, including but not limited to recording detailed failure logs, sending alarm notifications, executing backup flows, or simply skipping the current task. Preferably, the anomaly capture and retry module is a stand-alone module that can be easily integrated into other parts of the system or replaced and upgraded as needed.
A log recording mechanism is arranged in the monitoring and statistics module to record log data automatically. The log data comprises detailed information about key operations together with exception data; the exception data includes, but is not limited to, error type, timestamp, stack trace, operator identification, task identification, and processing duration. Further, an intelligent analysis engine is developed to parse the log data automatically and predict anomalies so that early warnings can be issued. Further, a configuration interface is provided for customizing the monitoring policy, including setting error thresholds, selecting which items are logged, and setting early-warning conditions. Further, a closed-loop feedback mechanism is integrated to ensure that detected problems are reported to system maintenance personnel in a timely manner; it supports automated workflows that trigger the problem-solving process according to the monitoring results, including notifying the relevant personnel and recording the resolution status. Further, a visual monitoring interface is provided that displays key indicators and supports dynamic data updates and multidimensional data analysis, so that a user can view the running condition of the system from different angles; the key indicators include, but are not limited to, system operation status, success rate, and error distribution. Further, the module can generate periodic data reports: it collects and analyzes system performance data over a period of time and presents the results in forms including, but not limited to, charts, statistical data, and trend analysis, providing decision support for users. Furthermore, the system has strict security and access control mechanisms to protect sensitive data; role-based access control is implemented to ensure that only authorized users can access or modify the monitoring configuration and the log data.
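As an illustration of the exception data fields and key indicators listed above, the following sketch builds one structured log record and derives a success rate and error distribution from a batch of records. The field names, the function names `build_exception_record` and `summarize`, and the definition of success rate as the share of tasks with no recorded exception are assumptions, not details taken from the application.

```python
import time
import traceback
from collections import Counter
from datetime import datetime, timezone


def build_exception_record(exc: BaseException, operator_id: str, task_id: str,
                           started_at: float) -> dict:
    """Collect the exception data fields into one structured log record.

    started_at is a time.monotonic() reading taken when the task began.
    """
    return {
        "error_type": type(exc).__name__,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "operator_id": operator_id,
        "task_id": task_id,
        "duration_seconds": time.monotonic() - started_at,
    }


def summarize(records: list[dict], total_tasks: int) -> dict:
    """Derive two of the key indicators from the recorded exceptions."""
    failed_tasks = {record["task_id"] for record in records}
    success_rate = 1 - len(failed_tasks) / total_tasks if total_tasks else 0.0
    error_distribution = Counter(record["error_type"] for record in records)
    return {
        "success_rate": success_rate,
        "error_distribution": dict(error_distribution),
    }
```

Records in this shape can be serialized as JSON lines, which keeps them easy to parse for the later analysis, reporting, and visualization steps.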
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims. It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for intelligently extracting and optimizing web page content, characterized by comprising the following steps:
creating an expression array for storing the expressions that match web page content; persisting the expressions using a database and a file system, and designing interfaces that support add, delete, and query operations;
crawling web page content by setting the URL and crawling rules of a target website, starting a crawling program to acquire web page source data, and preprocessing the web page source data to obtain preprocessed data;
judging whether the expression array corresponding to the web page source data is empty; if so, calling an intelligent algorithm to generate an expression; if not, traversing the expression array to match the preprocessed data;
if the matching is successful, extracting the related content and storing the corresponding expression into the expression array; if the matching fails, calling the intelligent algorithm to generate a new expression;
deploying an exception capturing mechanism and a log recording mechanism, and making retry decisions based on parameterized configuration.
2. The method of claim 1, wherein the preprocessing includes removing HTML tags of the web page source data to obtain preprocessed data.
3. The method for intelligently extracting and optimizing web page content according to claim 1, wherein calling the intelligent algorithm to generate a new expression comprises passing in the preprocessed data and an indication of the target content, and receiving the corresponding expression in return.
4. The method for intelligently extracting and optimizing web page content according to claim 3, wherein a parser is written to extract the expression generated by the intelligent algorithm and verify it, and if the verification is successful, the expression is added into the corresponding expression array.
5. The method for intelligently extracting and optimizing web page content according to claim 1, wherein the exception capturing mechanism records exception data when an exception is captured, and the log recording mechanism automatically records log data; the log data comprises detailed information about key operations and the exception data; the exception data includes, but is not limited to, error type, timestamp, stack trace, operator identification, task identification, and processing duration.
6. The method for intelligently extracting and optimizing web page content according to claim 5, wherein an intelligent analysis engine is developed to automatically parse the log data and predict anomalies so as to issue early warnings.
7. The method for intelligently extracting and optimizing web page content according to claim 1, wherein the parameterized configuration of the retry decision comprises configuring a maximum number of retries and a retry interval; the retry decision comprises deciding whether to retry, and when, according to the exception type and the historical retry success rate.
8. The method of claim 1, further comprising providing a configuration interface to customize the monitoring policy, including setting error thresholds, log entry selections, and pre-warning conditions.
9. The method for intelligently extracting and optimizing web content according to claim 1, further comprising providing a visual monitoring interface to display key performance indicators including, but not limited to, system running status, success rate, error distribution.
10. A system for intelligently extracting and optimizing web page content, characterized by comprising an expression management module, a web page content crawling module, a web page content matching module, an expression generating module, an exception capturing and retrying module, and a monitoring and statistics module;
the expression management module manages and maintains an expression array;
the web page content crawling module acquires web page source data and preprocesses it to obtain preprocessed data;
the web page content matching module traverses the expression array to match the preprocessed data;
the expression generating module calls an intelligent algorithm to generate a new expression;
the exception capturing and retrying module captures exceptions, records exception data, and makes retry decisions based on parameterized configuration;
the monitoring and statistics module records and automatically analyzes log data, provides a visual monitoring interface, and displays key performance indicators.
CN202410934428.7A 2024-07-12 2024-07-12 Method and system for intelligently extracting and optimizing webpage content Pending CN119003904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410934428.7A CN119003904A (en) 2024-07-12 2024-07-12 Method and system for intelligently extracting and optimizing webpage content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410934428.7A CN119003904A (en) 2024-07-12 2024-07-12 Method and system for intelligently extracting and optimizing webpage content

Publications (1)

Publication Number Publication Date
CN119003904A true CN119003904A (en) 2024-11-22

Family

ID=93474790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410934428.7A Pending CN119003904A (en) 2024-07-12 2024-07-12 Method and system for intelligently extracting and optimizing webpage content

Country Status (1)

Country Link
CN (1) CN119003904A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119848046A (en) * 2024-12-31 2025-04-18 上海数字安全科技有限公司 Log analysis method using AI large model

Similar Documents

Publication Publication Date Title
CN111030857B (en) Network alarm method, device, system and computer readable storage medium
US7146350B2 (en) Static and dynamic assessment procedures
CN105071969B (en) System and method for customized real-time monitoring and automatic exception handling based on JMX
CN110245077A (en) A kind of response method and equipment of program exception
CN109284331B (en) Certificate making information acquisition method based on service data resources, terminal equipment and medium
CN119202708A (en) Fault diagnosis data labeling method and device
CN119003904A (en) Method and system for intelligently extracting and optimizing webpage content
CN117828515A (en) An intelligent log anomaly diagnosis system and method based on a low-code platform
CN117034255A (en) Application error reporting repair method, device, equipment and medium
CN113220585A (en) Automatic fault diagnosis method and related device
CN116069628A (en) Intelligent-treatment software automatic regression testing method, system and equipment
CN120743897A (en) Automatic testing method and device for big data, electronic equipment and storage medium
US20220244975A1 (en) Method and system for generating natural language content from recordings of actions performed to execute workflows in an application
CN115333923A (en) Fault point tracing analysis method, device, equipment and medium
CN111680974B (en) Method and device for positioning problems of electronic underwriting process
CN112270417A (en) Intelligent acquisition method and system for operation and maintenance data of domestic equipment
CN116996205A (en) Monitoring method, system, equipment and storage medium for preventing webpage from being tampered
CN118113622A (en) Detection and repair method, device and equipment applied to batch operation scheduling
CN112131090B (en) Service system performance monitoring method, device, equipment and medium
CN116126344A (en) File processing method, device, equipment and medium
CN120337599B (en) Emergency scene simulation method and device, storage medium and electronic equipment
CN120743601A (en) Root cause analysis method and device of database, computer equipment and storage medium
CN121233594A (en) A method and system for generating electronic data acquisition forms
CN121143903A (en) Module loading optimization methods, devices, equipment and media
CN121501625A (en) An intelligent code review method and system integrating embedded systems and AI

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination