CN117076773B - Data source screening and optimizing method based on internet information - Google Patents
- Publication number
- CN117076773B
- Authority
- CN
- China
- Prior art keywords
- content resource
- website
- information
- value
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data source screening and optimization method based on internet information, comprising the following steps. S1: the first n search results returned by each search engine are selected and placed in a content resource list, which is deduplicated and then used as the input for screening and optimization. S2: the weights of the search results in the content resource list are initialized. S3: each content resource website in the content resource list is given a weighted score according to the scoring rule, yielding a dictionary whose keys are the content resource websites and whose values (Value) are the weight scores. S4: the dictionary is sorted by Value from high to low, and the top m content resource websites are output to a result list as the screened and optimized data sources. By screening and optimizing data sources while crawling internet information, the method obtains data of high value, high matching degree and high reliability, addresses the heterogeneity and low value density of internet information, and provides data support and data sources for agricultural production.
Description
Technical Field
The invention belongs to the field of big data, and particularly relates to a data source screening and optimizing method based on internet information.
Background
To better promote the development of smart agriculture and intelligent agriculture, obtaining data of high value, high matching degree and high reliability is particularly important. The internet is one of the most important means of acquiring information, but its information is enormous in volume, varied in type, heterogeneous, and low in value density. Acquiring the most useful data therefore often requires a great deal of manual screening. Moreover, because each search engine uses a different built-in search algorithm, the results of any single search engine are limited, so relevant results are missed and important data is lost.
Object of the Invention
To solve the above technical problems, the invention discloses a data source screening and optimization method based on internet information. The method screens and optimizes data sources while crawling internet information so as to obtain data of high value, high matching degree and high reliability, thereby addressing the heterogeneity and low value density of internet information and providing data support and data sources for agricultural production.
The specific technical scheme of the invention is as follows:
A data source screening optimization method based on Internet information comprises the following specific steps:
S1: search for the keywords in different internet search engines, select the first n search results returned by each search engine, place them in a content resource list, and deduplicate the list to obtain the input for screening and optimization;
S2: initialize the weights of the search results in the content resource list;
S3: give each content resource website in the content resource list a weighted score according to the scoring rule, obtaining a dictionary whose keys are the content resource websites and whose values (Value) are the weight scores;
S4: sort the dictionary by Value from high to low, and output the top m content resource websites to a result list as the screened and optimized data sources.
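Steps S1 and S2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, engine names and URLs are hypothetical placeholders, and duplicates are assumed to be exact URL matches, which the patent text does not spell out.

```python
from typing import Dict, List


def build_content_resource_list(results_per_engine: Dict[str, List[str]], n: int) -> List[str]:
    """S1: keep the first n results of each engine, then drop exact duplicate URLs."""
    merged: List[str] = []
    for urls in results_per_engine.values():
        merged.extend(urls[:n])
    # dict.fromkeys keeps the first occurrence of each URL and preserves order.
    return list(dict.fromkeys(merged))


def initialize_weights(content_resources: List[str]) -> Dict[str, float]:
    """S2: every content resource website starts with a Value of 0."""
    return {url: 0.0 for url in content_resources}


# Hypothetical search results; engine names and URLs are placeholders.
results = {
    "engine_a": ["https://a.example/1", "https://b.example/2", "https://c.example/3"],
    "engine_b": ["https://b.example/2", "https://d.example/4", "https://e.example/5"],
}
content_resources = build_content_resource_list(results, n=2)
weights = initialize_weights(content_resources)
```

With n = 2, the URL returned by both engines appears only once in `content_resources`, and `weights` is the dictionary that step S3 would then fill in.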
Preferably, the result list obtained in step S4 is further verified and evaluated as follows:
The content of the content resource websites in the result list is crawled according to the expected data information items, and the ratio of the crawled data information items to the expected data information items is calculated. This ratio measures the value of the content resource website:
Content resource website value = number of data information items crawled from the website / number of expected data information items.
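The ratio above might be computed as in the following sketch. It assumes that only crawled items that also appear in the expected list count toward the numerator; the function name and the item names are hypothetical and are not taken from the patent.

```python
from typing import Iterable


def website_value(crawled_items: Iterable[str], expected_items: Iterable[str]) -> float:
    """Share of the expected data information items that were actually crawled."""
    expected = set(expected_items)
    if not expected:
        raise ValueError("expected_items must not be empty")
    # Assumption: crawled items outside the expected list are ignored.
    found = expected & set(crawled_items)
    return len(found) / len(expected)


# Hypothetical item names, for illustration only.
expected = ["variety name", "growth period", "plant height", "yield"]
crawled = ["variety name", "yield", "breeder"]
value = website_value(crawled, expected)  # 2 of the 4 expected items were found
```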
Preferably, in step S3, the weight score of each content resource website is calculated along three dimensions: credibility, matching degree and generality.
Preferably, in step S3, the content resource websites are weighted and scored according to formula (1):
$$\mathrm{Value} = V_1 a_1 + V_2 a_2 + \dots + V_n a_n \tag{1}$$
where $V_n$ is the score of the content resource website in the $n$-th dimension, $a_n$ is the weight ratio of the $n$-th dimension, and $a_1 + a_2 + \dots + a_n = 1$.
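Formula (1) is a plain weighted sum, which could be written as in the sketch below. The function name and the example scores and weight ratios are hypothetical; the check that the weight ratios sum to 1 simply mirrors the constraint stated with the formula.

```python
import math
from typing import Sequence


def weighted_value(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Value = V1*a1 + V2*a2 + ... + Vn*an, with a1 + a2 + ... + an = 1."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weight ratios must sum to 1")
    return sum(v * a for v, a in zip(scores, weights))


# Hypothetical scores and weight ratios for a three-dimension case (n = 3).
example_value = weighted_value([10, 8, 6], [0.5, 0.3, 0.2])
```

Because the weight ratios sum to 1, the resulting Value stays on the same scale as the per-dimension scores, which keeps websites comparable when the dictionary is sorted.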
Preferably, in step S3, the credibility weight is assigned according to the type of information publishing website, the matching degree weight according to the information matching type, and the generality weight according to the type of standard applicable to the information.
Preferably, the information publishing website types include ministry-level official publications, publications by units subordinate to ministries, provincial and local official data publications, publications by local units, official websites of industry-leading enterprises, official websites of ordinary enterprises in the industry, third-party statistics websites, and e-commerce websites.
Preferably, the information matching type includes keyword matching, category matching, domain matching and industry matching.
Preferably, the information applicable standard types include national standards, industry standards, local standards, and enterprise standards.
Beneficial effects: the data source screening and optimization method based on internet information disclosed by the invention has the following advantages:
(1) The invention uses the results of the built-in search algorithms and ranking rules of different search engines as the input data for preliminary screening. Making full use of each search engine in this way improves the comprehensiveness of the input data, reduces the amount of data to be screened and optimized later, and so improves screening and optimization efficiency;
(2) The invention scores and selects content resource websites along three dimensions (credibility, matching degree and generality) and outputs the high-scoring websites, thereby screening and optimizing data sources and improving the value, reliability and matching degree of the search results;
(3) The invention crawls the content of the screened content resource websites and compares it with the preset expected data information items, so that the optimization result is further verified and evaluated and the value, reliability and matching degree of the search results are further ensured.
Drawings
FIG. 1 is a schematic diagram of a data source screening optimization method according to the present invention.
Detailed Description
The invention is further described below with reference to the drawings; improvements and modifications made to it are also to be regarded as falling within its scope of protection.
Example 1
Taking the crawling of rice seedling data in agricultural production as an example, and as shown in FIG. 1, data sources are screened and optimized based on internet information through the following steps:
Step 1: the input keyword is set to "rice seed information". In this embodiment the keyword is entered into four different search engines (Baidu, Sogou, 360 and Bing), the top 20 search results from each engine are placed in a content resource list, and the list is deduplicated to obtain the following content resource list:
[https://ricedata.cn/,https://www.ricedata.cn/variety/,https://www.cgris.net/,https://zhuanlan.zhihu.com/p/374483809,https://baike.baidu.hk/item/%E6%B0%B4%E7%A8%BB/21285,http://www.zys.moa.gov.cn/mhsh/202104/t20210422_6366373.htm,https://baike.baidu.com/item/%E7%A8%BB/4417005,https://www.ricedata.cn/variety/superice.htm,https://www.gov.cn/xinwen/2022-12/05/content_5730461.htm,http://www.jiangdu.gov.cn/jdqxxgk/nyncj/202304/9585364ff7644872a192aa4e764acbd2.shtml,...,https://www.baidu.com/linkurl=mqtDoXWwXYVLdKcQWTGUgzJODBEum5ZwKuGHls3NrfKKlgdy2N-5kfUU9Abxpw4w&wd=&eqid=8e799a1c00002480000000046497f78d,https://www.baidu.com/linkurl=mqtDoXWwXYVLdKcQWTGUgrK3K0aILqMtbYseQAn6vP2-5lVLOgsNpBv4RoklwWfcvNVoWN6OXLGcq3BtRJP_oWtzZritn37lyIlYvPn4fYDFgtxTvg7uqrzcMgWV3bkyRkgqVZEObUtkqLB3m1iUwWAzK3wAnFZXppTYghXeYDUC3pLMHonrqWLeRDJ7KcXKiqTtTRhJtZfzExYxI3mSVr4e8vLxhUSCsuL9doVU6TB0VeGXmp8QLVmkB8-HGBHCwxOUKVFM4f56y-lExxW4U_&wd=&eqid=8e799a1c00002480000000046497f78d,https://www.baidu.com/linkurl=mMw2X75qEAIbS7UaWryrE30mmDQC2vfgEAU1SUVbxG9FcbNBsXgj8I8_2eBtePgQGUP49x7a0L1-uFMfzuAXOw77M9u0awzhoN6a0gmyGqy&wd=&eqid=8e799a1c00002480000000046497f78d,https://www.baidu.com/linkurl=mMw2X75qEAIbS7UaWryrEEZJFrDq5Q8gbyA3LHePwBA6AkxTlgFSzbpcesUaRiFHhXCXi-xOUgwhJ__3SS16zZonqACOiHu99BsG9XVxrGS&wd=&eqid=8e799a1c00002480000000046497f78d].
The search engines used in the invention are not limited to those above; any existing search engine capable of information retrieval is applicable.
Step 2: the content resource list is taken as the input for screening and optimization, and the weights of the search results in it are initialized, i.e. the Value corresponding to each content resource website key in the dictionary is initialized to 0. Each content resource website in the list is then given a weighted score according to the scoring rule, yielding a dictionary whose keys are the content resource websites and whose values are the weight scores, as shown below:
[https://www.ricedata.cn/variety/:9,https://ricedata.cn/:8.4,https://www.ricedata.cn/variety/superice.htm:8,https://www.cgris.net/:7.8,http://www.zys.moa.gov.cn/mhsh/202104/t20210422_6366373.html:7.2,...,https://baike.baidu.com/item/%E7%A8%BB/4417005:5.8,https://baike.baidu.hk/item/%E6%B0%B4%E7%A8%BB/21285:5.8,https://zhuanlan.zhihu.com/p/374483809:4.6].
The dictionary is sorted by Value from high to low, and the top 20 content resource websites are output to the result list.
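The sorting and truncation step could look like the sketch below. It uses four of the URL and Value pairs listed in the dictionary above; the function name `top_m` and the choice of m = 3 are assumptions made to keep the example short.

```python
from typing import Dict, List

# A subset of the scored dictionary from this embodiment, in arbitrary order.
scores: Dict[str, float] = {
    "https://www.cgris.net/": 7.8,
    "https://zhuanlan.zhihu.com/p/374483809": 4.6,
    "https://www.ricedata.cn/variety/": 9,
    "https://ricedata.cn/": 8.4,
}


def top_m(scored: Dict[str, float], m: int) -> List[str]:
    """Sort by Value from high to low and keep the first m websites."""
    # sorted() is stable, so websites with equal Values keep their original order.
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _ in ranked[:m]]


result_list = top_m(scores, m=3)
```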
In the invention, the weight score is calculated according to formula (1):
$$\mathrm{Value} = V_1 a_1 + V_2 a_2 + \dots + V_n a_n \tag{1}$$
where $V_n$ is the score of the content resource website in the $n$-th dimension, $a_n$ is the weight ratio of the $n$-th dimension, and $a_1 + a_2 + \dots + a_n = 1$.
The scoring rule in Embodiment 1 is as follows: the weight score of each content resource website is calculated along three dimensions (credibility, matching degree and generality), i.e. n = 3. The weight score tables for each dimension are designed as follows:
Table 1. Credibility weight distribution table
Table 2. Matching degree weight distribution table
| | Keyword matching | Category matching | Domain matching | Industry matching | Weight ratio |
|---|---|---|---|---|---|
| Matching degree | 10 | 8 | 6 | 4 | 0.3 |
Table 3. Generality weight distribution table
| | National standard | Industry standard | Local standard | Enterprise standard | Weight ratio |
|---|---|---|---|---|---|
| Generality | 10 | 8 | 6 | 4 | 0.2 |
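The table-driven scoring can be sketched as a pair of lookups feeding formula (1). The scores and weight ratios for Tables 2 and 3 are taken from the tables above. Two things are assumptions: the credibility weight ratio of 0.5 is inferred only from the requirement that the three weight ratios sum to 1, and the credibility score is passed in as a plain number because the per-type scores of Table 1 are not reproduced in this text. All identifiers are hypothetical.

```python
# Scores from Table 2 (matching degree) and Table 3 (third dimension).
MATCH_SCORES = {"keyword": 10, "category": 8, "domain": 6, "industry": 4}
GENERALITY_SCORES = {"national": 10, "industry": 8, "local": 6, "enterprise": 4}

MATCH_WEIGHT = 0.3        # weight ratio given in Table 2
GENERALITY_WEIGHT = 0.2   # weight ratio given in Table 3
CREDIBILITY_WEIGHT = 0.5  # assumption: inferred so that the three ratios sum to 1


def score_website(credibility_score: float, match_type: str, standard_type: str) -> float:
    """Apply formula (1) with n = 3 using the lookup tables above."""
    return (
        credibility_score * CREDIBILITY_WEIGHT
        + MATCH_SCORES[match_type] * MATCH_WEIGHT
        + GENERALITY_SCORES[standard_type] * GENERALITY_WEIGHT
    )


# Hypothetical website: credibility score 10, keyword match, industry standard.
example = score_website(10, "keyword", "industry")  # 10*0.5 + 10*0.3 + 8*0.2
```

Keeping the per-type scores in dictionaries makes it straightforward to add a dimension or subdivide an index later without touching the scoring function's structure.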
Step 3: expected data information items are set according to the rice seedling data; there are 22 data items in total, as listed in the following table:
Table 4. Expected data information items
Step 4: according to the expected data information items, the content of the content resource websites in the result list obtained in Step 2 is crawled, and the ratio of the crawled data information items to the expected data information items is calculated to obtain the value of each content resource website, which is used to evaluate the quality of the screening and optimization method. The calculation formula is: content resource website value = number of data information items crawled from the website / number of expected data information items.
The evaluation criteria for the data source screening and optimization method can be set according to the user's actual needs. For example, taking the content resource website value as the measure, a value above 85% can be considered excellent, 75%-85% good, 60%-75% fair, and below 60% poor.
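The four example bands can be expressed as a small grading function. This is a sketch under assumptions: the labels "excellent", "good", "fair" and "poor" are an assumed reading of the four bands, the handling of the exact boundary values (85%, 75%, 60%) is not specified in the text and is chosen here arbitrarily, and the crawled-item counts are hypothetical.

```python
def grade(value: float) -> str:
    """Map a content resource website value (0.0 to 1.0) to an example rating band."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be between 0 and 1")
    if value > 0.85:
        return "excellent"
    if value >= 0.75:  # assumption: the boundaries belong to the higher band
        return "good"
    if value >= 0.60:
        return "fair"
    return "poor"


# With 22 expected data items, hypothetical crawl counts map to bands like this.
EXPECTED_ITEM_COUNT = 22
ratings = {found: grade(found / EXPECTED_ITEM_COUNT) for found in (19, 17, 14, 11)}
```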
If the evaluation result is poor, the screening and optimization method needs to be adjusted, for example by adding dimensions or by further subdividing the measurement indicators of each dimension.
The foregoing is merely illustrative of the present invention and is a preferred embodiment thereof. It should be noted that modifications and adaptations to the invention may occur to one skilled in the art and are intended to be within the scope of the present invention.
Claims (1)
1. A data source screening and optimization method based on internet information, characterized by comprising the following steps:
S1: searching for keywords in different internet search engines, selecting the first n search results returned by each search engine, placing them in a content resource list, and deduplicating the list to obtain the input for screening and optimization;
S2: initializing the weights of the search results in the content resource list;
S3: giving each content resource website in the content resource list a weighted score along three dimensions (credibility, matching degree and generality) according to the scoring rule shown in formula (1), to obtain a dictionary whose keys are the content resource websites and whose values (Value) are the weight scores, wherein formula (1) is:
$$\mathrm{Value} = V_1 a_1 + V_2 a_2 + \dots + V_n a_n \tag{1}$$
wherein $V_n$ is the score of the content resource website in the $n$-th dimension, $a_n$ is the weight ratio of the $n$-th dimension, and $a_1 + a_2 + \dots + a_n = 1$;
the credibility weight is assigned according to information publishing website types, which comprise ministry-level official publications, publications by units subordinate to ministries, provincial and local official data publications, publications by local units, official websites of industry-leading enterprises, official websites of ordinary enterprises in the industry, third-party statistics websites, and e-commerce websites;
the matching degree weight is assigned according to information matching types, which comprise keyword matching, category matching, domain matching and industry matching;
the generality weight is assigned according to information applicable standard types, which comprise national standards, industry standards, local standards and enterprise standards;
S4: sorting the dictionary by Value from high to low, and outputting the top m content resource websites to a result list as the screened and optimized data sources;
the result list obtained in step S4 is further verified and evaluated as follows:
crawling the content of the content resource websites in the result list according to expected data information items, the expected data information items being set according to the keywords, and calculating the ratio of the crawled data information items to the expected data information items, the ratio being used to measure the value of the content resource website, with the calculation formula:
content resource website value = number of data information items crawled from the website / number of expected data information items.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311063341.9A CN117076773B (en) | 2023-08-23 | 2023-08-23 | Data source screening and optimizing method based on internet information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311063341.9A CN117076773B (en) | 2023-08-23 | 2023-08-23 | Data source screening and optimizing method based on internet information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117076773A (en) | 2023-11-17 |
CN117076773B (en) | 2024-05-28 |
Family
ID=88714825
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311063341.9A Active CN117076773B (en) | 2023-08-23 | 2023-08-23 | Data source screening and optimizing method based on internet information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117076773B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101639838A (en) * | 2008-07-31 | 2010-02-03 | 深圳龙媒网络技术有限公司 | Method and system for searching resource |
CN102023996A (en) * | 2009-09-21 | 2011-04-20 | 英业达股份有限公司 | System and method for sorting websites according to content of website articles |
CN104008210A (en) * | 2014-06-20 | 2014-08-27 | 李玉坤 | Web information retrieval method based on multiple search engines |
CN104111888A (en) * | 2014-07-03 | 2014-10-22 | 曹建楠 | Code evaluation method, device and system for teaching |
WO2015070673A1 (en) * | 2013-11-15 | 2015-05-21 | 北京奇虎科技有限公司 | Method for browser-side network search and browser |
WO2015089860A1 (en) * | 2013-12-18 | 2015-06-25 | 孙燕群 | Search engine ranking method based on user participation |
CN110175280A (en) * | 2019-04-30 | 2019-08-27 | 广东鼎义互联科技股份有限公司 | A kind of crawler analysis platform based on government affairs big data |
CN110968511A (en) * | 2019-11-29 | 2020-04-07 | 车智互联(北京)科技有限公司 | Recommendation engine testing method, device, computing equipment and system |
CN111177514A (en) * | 2019-12-31 | 2020-05-19 | 沈阳航空航天大学 | Information source evaluation method, device, storage device and program based on website feature analysis |
CN112417299A (en) * | 2020-12-08 | 2021-02-26 | 西安联乘智能科技有限公司 | Webpage recommendation method, computer storage medium and computing device |
CN113722572A (en) * | 2021-10-11 | 2021-11-30 | 上海易路软件有限公司 | Distributed deep crawling method, device and medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8650191B2 (en) * | 2010-08-23 | 2014-02-11 | Vistaprint Schweiz Gmbh | Search engine optimization assistant |
US11693910B2 (en) * | 2018-12-13 | 2023-07-04 | Microsoft Technology Licensing, Llc | Personalized search result rankings |
Non-Patent Citations (2)
Title |
---|
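Research on credibility evaluation for web page information screening; Jin Jialin, Wang Yuefen, Zheng Xiaochang; Information Studies: Theory & Application (情报理论与实践); 2017-05-15; Vol. 40, No. 5; pp. 116-121 * |
Jin Jialin; Wang Yuefen; Zheng Xiaochang. Research on credibility evaluation for web page information screening. Information Studies: Theory & Application (情报理论与实践). 2017, 40(5), 116-121. * |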
Also Published As
Publication number | Publication date |
---|---|
CN117076773A (en) | 2023-11-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN100440224C (en) | An automatic processing method for search engine performance evaluation | |
CN100507920C (en) | A method for reordering search engine retrieval results based on user behavior information | |
US7756867B2 (en) | Ranking documents | |
Singh et al. | A comparative study of page ranking algorithms for information retrieval | |
CN100433007C (en) | Method for providing research result | |
CN107977452A (en) | A kind of information retrieval system and method based on big data | |
Pavani et al. | A novel web crawling method for vertical search engines | |
Choudhary et al. | Role of ranking algorithms for information retrieval | |
Alghamdi et al. | Extended user preference based weighted page ranking algorithm | |
CN117076773B (en) | Data source screening and optimizing method based on internet information | |
CN103257981B (en) | Deep Web data surfacing method based on query interface attribute characteristics | |
Batra et al. | Comparative study of page rank algorithm with different ranking algorithms adopted by search engine for website ranking | |
Lei et al. | Improved relevance ranking in WebGather | |
Batra et al. | Content based hidden web ranking algorithm (CHWRA) | |
Kadam | Search Engine Optimization Techniques and Tools | |
Yerma et al. | Updated page rank of dynamically generated research authors' pages: A new idea | |
Liang et al. | R-SpamRank: a spam detection algorithm based on link analysis | |
Zeraatkar et al. | Improvement of Page Ranking Algorithm by Negative Score of Spam Pages. | |
WO2005024661A2 (en) | Improved search engine optimisation | |
Yan et al. | An improved PageRank method based on genetic algorithm for web search | |
CN102982094B (en) | A kind for the treatment of method and apparatus for network address | |
Bama et al. | Improved pagerank algorithm for web structure mining | |
Zubi | Ranking webpages using web structure mining concepts | |
Rashmi et al. | Deep web crawler: exploring and re-ranking of web forms | |
CN109948019B (en) | Deep network data acquisition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||