CN117076773B - Data source screening and optimizing method based on internet information - Google Patents

Data source screening and optimizing method based on internet information

Info

Publication number
CN117076773B
Authority
CN
China
Prior art keywords
content resource
website
information
value
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311063341.9A
Other languages
Chinese (zh)
Other versions
CN117076773A (en)
Inventor
闫磊
潘俊峰
梁雷
聂磊
董曙光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Languiqi Technology Development Co ltd
Original Assignee
Shanghai Languiqi Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Languiqi Technology Development Co ltd
Priority to CN202311063341.9A
Publication of CN117076773A
Application granted
Publication of CN117076773B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data source screening and optimizing method based on internet information, comprising the following steps: S1: the first n search results returned by each search engine are selected, placed in a content resource list, and de-duplicated to serve as the input for screening and optimization; S2: the weight of each search result in the content resource list is initialized; S3: each content resource website in the content resource list is given a weighted score according to the scoring rule, yielding a dictionary with the content resource website as the key and the weight score as the Value; S4: the dictionary is sorted by Value from high to low, and the top m content resource websites are output to a result list as the screened and optimized data sources. By screening and optimizing data sources while crawling internet information, the method obtains data of high value, high matching degree and high reliability, thereby addressing the heterogeneity and low value density of internet information and providing data support and data sources for agricultural production.

Description

Data source screening and optimizing method based on internet information
Technical Field
The invention belongs to the field of big data, and particularly relates to a data source screening and optimizing method based on internet information.
Background
To better promote the development of intelligent agriculture, obtaining data of high value, high matching degree and high reliability is particularly important. The internet is one of the important means of acquiring information, but it is characterized by an enormous volume of information, great variety, heterogeneous content and low value density. Consequently, obtaining the most useful data usually requires a great deal of manual screening. Moreover, because each search engine uses a different built-in search algorithm, the results of any single search engine are limited, so relevant results are missed and important data is lost.
Object of the Invention
To solve the above technical problems, the invention discloses a data source screening and optimizing method based on internet information. The method screens and optimizes data sources while crawling internet information to obtain data of high value, high matching degree and high reliability, thereby addressing the heterogeneity and low value density of internet information and providing data support and data sources for agricultural production.
The specific technical scheme of the invention is as follows:
A data source screening optimization method based on Internet information comprises the following specific steps:
S1: searching for the keywords in several different internet search engines, selecting the first n search results returned by each search engine, placing them in a content resource list, and de-duplicating the list to obtain the input for screening and optimization;
S2: initializing the weight of each search result in the content resource list;
S3: giving each content resource website in the content resource list a weighted score according to the scoring rule, yielding a dictionary with the content resource website as the key and the weight score as the Value;
S4: sorting the dictionary by Value from high to low, and outputting the top m content resource websites to a result list as the screened and optimized data sources.
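Steps S1 and S2 can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the function names, the engine labels and the hand-written result lists are hypothetical, de-duplication is assumed to mean exact URL-string comparison, and the initial weight of 0 follows the value used in Example 1.

```python
from typing import Dict, List


def build_content_resource_list(results_by_engine: Dict[str, List[str]], n: int) -> List[str]:
    """S1: take the first n results of each engine, merge them, drop duplicate URLs."""
    seen = set()
    merged = []
    for urls in results_by_engine.values():
        for url in urls[:n]:
            if url not in seen:  # exact-string de-duplication; first occurrence wins
                seen.add(url)
                merged.append(url)
    return merged


def init_weights(resources: List[str]) -> Dict[str, float]:
    """S2: every content resource website starts with a weight score of 0."""
    return {url: 0.0 for url in resources}


# Hypothetical result lists standing in for real search-engine output.
results = {
    "engine_a": ["https://ricedata.cn/", "https://www.cgris.net/", "https://example.org/x"],
    "engine_b": ["https://www.cgris.net/", "https://ricedata.cn/"],
}
resources = build_content_resource_list(results, n=2)
weights = init_weights(resources)
```

With n=2 the third result of `engine_a` is never considered, and the two URLs returned by both engines appear only once in `resources`.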
Preferably, the result list obtained in step S4 is further verified and evaluated as follows:
The content of the content resource websites in the result list is crawled according to the expected data information items, and the ratio of the crawled data information items to the expected data information items is calculated. This ratio measures the value of the content resource website, and is calculated as follows:
Content resource website value = number of data information items crawled from the website / number of expected data information items.
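The ratio above can be sketched in a few lines. The function name and the item names are hypothetical, and counting only those crawled items that also appear among the expected items is an assumption; the text does not say how unexpected items are treated.

```python
from typing import Iterable


def website_value(crawled_items: Iterable[str], expected_items: Iterable[str]) -> float:
    """Share of the expected data information items that were actually crawled."""
    expected = set(expected_items)
    if not expected:
        raise ValueError("expected_items must not be empty")
    found = expected & set(crawled_items)  # ignore crawled items nobody asked for
    return len(found) / len(expected)


# Hypothetical item names, for illustration only.
expected = ["variety name", "breeder", "growth period", "yield"]
crawled = ["variety name", "yield", "price"]
value = website_value(crawled, expected)
```

Here two of the four expected items were found, so `value` is 0.5.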
Preferably, in step S3, the weight score of each content resource website is calculated along three dimensions: credibility, matching degree and generality.
Preferably, in step S3, the content resource websites are scored according to formula (1):
$\mathrm{Value} = V_1 a_1 + V_2 a_2 + \dots + V_n a_n \quad (1)$
where $V_n$ is the score of the content resource website in the nth dimension, $a_n$ is the weight ratio of the nth dimension, and $a_1 + a_2 + \dots + a_n = 1$.
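Formula (1) is a plain weighted sum, which can be sketched as below. The function name and the sample scores and weights are illustrative assumptions; the check that the weights sum to 1 mirrors the constraint stated with the formula.

```python
import math
from typing import Sequence


def weighted_value(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Value = V1*a1 + V2*a2 + ... + Vn*an, with the weights required to sum to 1."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(v * a for v, a in zip(scores, weights))


# Illustrative scores and weights for three dimensions.
value = weighted_value([8, 10, 6], [0.5, 0.3, 0.2])
```

For these inputs the sum is 4 + 3 + 1.2, so `value` is approximately 8.2 (up to floating-point rounding).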
Preferably, in step S3, the credibility weight is assigned according to the type of information publishing website, the matching degree weight is assigned according to the information matching type, and the generality weight is assigned according to the type of standard applicable to the information.
Preferably, the information publishing website types include ministry-level official publications, publications by units subordinate to a ministry, provincial and local official data publications, publications by local units, official websites of industry-leading enterprises, official websites of ordinary enterprises in the industry, third-party statistics websites, and e-commerce websites.
Preferably, the information matching types include keyword matching, category matching, domain matching and industry matching.
Preferably, the applicable standard types include national standards, industry standards, local standards and enterprise standards.
The beneficial effects are that: the invention discloses a data source screening and optimizing method based on internet information, which has the following advantages:
(1) The invention uses the results produced by the built-in search algorithms and ranking rules of different search engines as the input for preliminary screening. It thereby makes full use of each search engine for preliminary screening, which both improves the comprehensiveness of the input data and reduces the amount of data to be screened and optimized later, improving the efficiency of screening and optimization;
(2) The invention scores the content resource websites along three dimensions (credibility, matching degree and generality) and outputs the websites with high scores, thereby screening and optimizing the data sources and improving the value, reliability and matching degree of the search results;
(3) The invention compares the content crawled from the screened content resource websites with the preset expected data information items, so that the optimization result is further verified and evaluated, and the value, reliability and matching degree of the search results are further ensured.
Drawings
FIG. 1 is a schematic diagram of a data source screening optimization method according to the present invention.
Detailed Description
The invention is further described below with reference to the drawings. Improvements and modifications made on this basis are also to be regarded as falling within the scope of protection.
Example 1
Taking the crawling of rice seedling data in agricultural production as an example, and as shown in FIG. 1, data sources are screened and optimized based on internet information through the following steps:
Step 1: the input keyword is set to 'rice seed information'. In this embodiment the keyword is entered into four different search engines (Baidu, Sogou, 360 and Bing), the top 20 search results of each search engine are placed in a content resource list, and the list is de-duplicated to obtain the following content resource list:
[https://ricedata.cn/,https://www.ricedata.cn/variety/,https://www.cgris.net/,https://zhuanlan.zhihu.com/p/374483809,https://baike.baidu.hk/item/%E6%B0%B4%E7%A8%BB/21285,http://www.zys.moa.gov.cn/mhsh/202104/t20210422_6366373.htm,https://baike.baidu.com/item/%E7%A8%BB/4417005,https://www.ricedata.cn/variety/superice.htm,https://www.gov.cn/xinwen/2022-12/05/content_5730461.htm,http://www.jiangdu.gov.cn/jdqxxgk/nyncj/202304/9585364ff7644872a192aa4e764acbd2.shtml,...,https://www.baidu.com/linkurl=mqtDoXWwXYVLdKcQWTGUgzJODBEum5ZwKuGHls3NrfKKlgdy2N-5kfUU9Abxpw4w&wd=&eqid=8e799a1c00002480000000046497f78d,https://www.baidu.com/linkurl=mqtDoXWwXYVLdKcQWTGUgrK3K0aILqMtbYseQAn6vP2-5lVLOgsNpBv4RoklwWfcvNVoWN6OXLGcq3BtRJP_oWtzZritn37lyIlYvPn4fYDFgtxTvg7uqrzcMgWV3bkyRkgqVZEObUtkqLB3m1iUwWAzK3wAnFZXppTYghXeYDUC3pLMHonrqWLeRDJ7KcXKiqTtTRhJtZfzExYxI3mSVr4e8vLxhUSCsuL9doVU6TB0VeGXmp8QLVmkB8-HGBHCwxOUKVFM4f56y-lExxW4U_&wd=&eqid=8e799a1c00002480000000046497f78d,https://www.baidu.com/linkurl=mMw2X75qEAIbS7UaWryrE30mmDQC2vfgEAU1SUVbxG9FcbNBsXgj8I8_2eBtePgQGUP49x7a0L1-uFMfzuAXOw77M9u0awzhoN6a0gmyGqy&wd=&eqid=8e799a1c00002480000000046497f78d,https://www.baidu.com/linkurl=mMw2X75qEAIbS7UaWryrEEZJFrDq5Q8gbyA3LHePwBA6AkxTlgFSzbpcesUaRiFHhXCXi-xOUgwhJ__3SS16zZonqACOiHu99BsG9XVxrGS&wd=&eqid=8e799a1c00002480000000046497f78d].
The search engines used in the invention are not limited to those listed above; any existing search engine capable of information retrieval is applicable.
Step 2: the content resource list is taken as the input for screening and optimization, and the weight of each search result in the list is initialized, that is, the Value corresponding to each content resource website key in the dictionary is initialized to 0. Each content resource website in the list is then given a weighted score according to the scoring rule, yielding a dictionary with the content resource website as the key and the weight score as the Value, as shown below:
[https://www.ricedata.cn/variety/:9,https://ricedata.cn/:8.4,https://www.ricedata.cn/variety/superice.htm:8,https://www.cgris.net/:7.8,http://www.zys.moa.gov.cn/mhsh/202104/t20210422_6366373.html:7.2,...,https://baike.baidu.com/item/%E7%A8%BB/4417005:5.8,https://baike.baidu.hk/item/%E6%B0%B4%E7%A8%BB/21285:5.8,https://zhuanlan.zhihu.com/p/374483809:4.6].
The dictionary is sorted by Value from high to low, and the top 20 content resource websites are output to the result list.
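The sort-and-truncate step can be sketched as follows, using a few of the scores quoted in the dictionary above. The function name `top_m` is hypothetical, and m is set to 3 only to keep the illustration short (the embodiment uses 20).

```python
from typing import Dict, List, Tuple


def top_m(scored: Dict[str, float], m: int) -> List[Tuple[str, float]]:
    """Sort (website, Value) pairs by Value, high to low, and keep the first m."""
    # sorted() is stable, so websites with equal Value keep their insertion order.
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:m]


# A subset of the scores listed in Example 1, deliberately out of order.
scored = {
    "https://zhuanlan.zhihu.com/p/374483809": 4.6,
    "https://www.cgris.net/": 7.8,
    "https://www.ricedata.cn/variety/": 9,
    "https://ricedata.cn/": 8.4,
}
result_list = [url for url, _ in top_m(scored, m=3)]
```

The lowest-scoring website (4.6) falls outside the top three and is dropped from `result_list`.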
In the invention, the weight score is calculated according to formula (1):
$\mathrm{Value} = V_1 a_1 + V_2 a_2 + \dots + V_n a_n \quad (1)$
where $V_n$ is the score of the content resource website in the nth dimension, $a_n$ is the weight ratio of the nth dimension, and $a_1 + a_2 + \dots + a_n = 1$.
The scoring rule in this Example 1 is as follows: the weight score of each content resource website is calculated along three dimensions (credibility, matching degree and generality), that is, n = 3. The weight score table for each dimension is designed as follows:
Table 1. Credibility weight distribution table
Table 2. Matching degree weight distribution table
                | Keyword matching | Category matching | Domain matching | Industry matching | Weight ratio
Matching degree | 10               | 8                 | 6               | 4                 | 0.3
Table 3. Generality weight distribution table
                | National standard | Industry standard | Local standard | Enterprise standard | Weight ratio
Generality      | 10                | 8                 | 6              | 4                   | 0.2
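The values in Tables 2 and 3 can be turned into lookup tables, as in the sketch below. Two points are assumptions rather than statements from the text: the credibility weight ratio of 0.5 is only inferred from the constraint that the weight ratios sum to 1 (the contents of Table 1 are not reproduced here), and the credibility score of 8 passed in the usage line is a made-up input. All names are hypothetical.

```python
# Scores and weight ratios taken from Tables 2 and 3.
MATCH_SCORES = {"keyword": 10, "category": 8, "domain": 6, "industry": 4}
STANDARD_SCORES = {"national": 10, "industry": 8, "local": 6, "enterprise": 4}
MATCH_WEIGHT = 0.3
STANDARD_WEIGHT = 0.2
# Inferred, not quoted: the three weight ratios must sum to 1.
CREDIBILITY_WEIGHT = 1.0 - MATCH_WEIGHT - STANDARD_WEIGHT


def score_site(credibility_score: float, match_type: str, standard_type: str) -> float:
    """Weighted sum over the three dimensions, using the table lookups above."""
    return (
        credibility_score * CREDIBILITY_WEIGHT
        + MATCH_SCORES[match_type] * MATCH_WEIGHT
        + STANDARD_SCORES[standard_type] * STANDARD_WEIGHT
    )


# Hypothetical website: credibility score 8, keyword match, national standard.
value = score_site(8, "keyword", "national")
```

Under these assumptions the score works out to roughly 8 x 0.5 + 10 x 0.3 + 10 x 0.2 = 9, up to floating-point rounding.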
Step 3: the expected data information items are set according to the rice seedling data; 22 data items are obtained in total, as shown in the following table:
Table 4. Expected data information items
Step 4: according to the expected data information items, the content of the content resource websites in the result list obtained in step 2 is crawled, and the ratio of the crawled data information items to the expected data information items is calculated to obtain the value of each content resource website. This value is used to evaluate the quality of the screening and optimizing method, and is calculated as follows: content resource website value = number of data information items crawled from the website / number of expected data information items.
The evaluation criteria of the data source screening and optimizing method can be set according to the user's actual needs. For example, taking the content resource website value as the metric, a value above 85% can be considered excellent, 75%-85% good, 60%-75% average, and below 60% poor.
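One way to apply these example thresholds is sketched below, reading the four bands as above 85%, 75%-85%, 60%-75% and below 60%. The function name, the English band labels and the handling of the exact boundary values are assumptions, since the text does not say which band a boundary belongs to.

```python
def rate_value(ratio: float) -> str:
    """Map a content resource website value (0.0 to 1.0) to an evaluation band."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie between 0 and 1")
    if ratio > 0.85:
        return "excellent"
    if ratio >= 0.75:  # assumption: exactly 85% and 75% both count as "good"
        return "good"
    if ratio >= 0.60:  # assumption: exactly 60% counts as "average"
        return "average"
    return "poor"


# With the 22 expected items of this example, 19 crawled items give about 86%.
rating = rate_value(19 / 22)
```

A website that yields 19 of the 22 expected items therefore lands in the top band, while one that yields 13 of 22 (about 59%) would be rated poor.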
If the evaluation result is poor, the screening and optimizing method needs to be adjusted, for example by adding dimensions or by further subdividing the measurement indices of each dimension.
The foregoing is merely a preferred embodiment illustrating the present invention. It should be noted that modifications and adaptations made by those skilled in the art are also intended to fall within the scope of protection of the present invention.

Claims (1)

1. The data source screening and optimizing method based on the Internet information is characterized by comprising the following specific steps:
S1: searching keywords in different search engines in the Internet respectively, selecting the first n search results obtained by each search engine, putting the search results into a content resource list, and performing duplication removal processing to obtain the input of screening optimization;
S2: initializing a weight of a search result in a content resource list;
S3: weighting and scoring each content resource website in the content resource list from three dimensions of credibility, matching degree and generality according to a scoring rule shown in formula (1), to obtain a dictionary taking the content resource website as a key and taking the weight score as a Value, wherein formula (1) is as follows:
$\mathrm{Value} = V_1 a_1 + V_2 a_2 + \dots + V_n a_n \quad (1)$;
wherein $V_n$ represents a score value of the content resource website in the nth dimension, $a_n$ represents a weight ratio of the nth dimension, and $a_1 + a_2 + \dots + a_n = 1$;
the weight of the credibility is distributed according to information publishing website types, wherein the information publishing website types comprise ministry-level official publications, publications by units subordinate to a ministry, provincial and local official data publications, publications by local units, official websites of industry-leading enterprises, official websites of ordinary enterprises in the industry, third-party statistics websites and e-commerce websites;
the weight of the matching degree is distributed according to information matching types, wherein the information matching types comprise keyword matching, category matching, domain matching and industry matching;
the weight of the generality is distributed according to information applicable standard types, wherein the information applicable standard types comprise national standards, industry standards, local standards and enterprise standards;
S4: ordering the dictionary from high Value to low Value, and outputting the content resource websites with the top m rank to a result list as a data source after screening and optimizing;
the result list obtained in step S4 is further verified and evaluated, wherein the specific method is as follows:
crawling content information of the content resource websites in the result list according to expected data information items, wherein the expected data information items are set according to the keywords, and calculating the ratio of the crawled data information items to the expected data information items, wherein the ratio is used for measuring the value of the content resource websites, and the calculation formula is as follows:
content resource website value = number of data information items crawled from the website / number of expected data information items.
CN202311063341.9A 2023-08-23 2023-08-23 Data source screening and optimizing method based on internet information Active CN117076773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311063341.9A CN117076773B (en) 2023-08-23 2023-08-23 Data source screening and optimizing method based on internet information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311063341.9A CN117076773B (en) 2023-08-23 2023-08-23 Data source screening and optimizing method based on internet information

Publications (2)

Publication Number Publication Date
CN117076773A CN117076773A (en) 2023-11-17
CN117076773B CN117076773B (en) 2024-05-28

Family

ID=88714825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311063341.9A Active CN117076773B (en) 2023-08-23 2023-08-23 Data source screening and optimizing method based on internet information

Country Status (1)

Country Link
CN (1) CN117076773B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101639838A (en) * 2008-07-31 2010-02-03 深圳龙媒网络技术有限公司 Method and system for searching resource
CN102023996A (en) * 2009-09-21 2011-04-20 英业达股份有限公司 System and method for sorting websites according to content of website articles
CN104008210A (en) * 2014-06-20 2014-08-27 李玉坤 Web information retrieval method based on multiple search engines
CN104111888A (en) * 2014-07-03 2014-10-22 曹建楠 Code evaluation method, device and system for teaching
WO2015070673A1 (en) * 2013-11-15 2015-05-21 北京奇虎科技有限公司 Method for browser-side network search and browser
WO2015089860A1 (en) * 2013-12-18 2015-06-25 孙燕群 Search engine ranking method based on user participation
CN110175280A (en) * 2019-04-30 2019-08-27 广东鼎义互联科技股份有限公司 A kind of crawler analysis platform based on government affairs big data
CN110968511A (en) * 2019-11-29 2020-04-07 车智互联(北京)科技有限公司 Recommendation engine testing method, device, computing equipment and system
CN111177514A (en) * 2019-12-31 2020-05-19 沈阳航空航天大学 Information source evaluation method, device, storage device and program based on website feature analysis
CN112417299A (en) * 2020-12-08 2021-02-26 西安联乘智能科技有限公司 Webpage recommendation method, computer storage medium and computing device
CN113722572A (en) * 2021-10-11 2021-11-30 上海易路软件有限公司 Distributed deep crawling method, device and medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8650191B2 (en) * 2010-08-23 2014-02-11 Vistaprint Schweiz Gmbh Search engine optimization assistant
US11693910B2 (en) * 2018-12-13 2023-07-04 Microsoft Technology Licensing, Llc Personalized search result rankings

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
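Research on credibility evaluation for web page information screening (面向网页信息筛选的可信度评估研究); 靳嘉林, 王曰芬, 郑小昌; 情报理论与实践; 20170515; Vol. 40 (No. 5); 116-121 *
靳嘉林; 王曰芬; 郑小昌. Research on credibility evaluation for web page information screening (面向网页信息筛选的可信度评估研究). 情报理论与实践. 2017, 40(5), 116-121. *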

Also Published As

Publication number Publication date
CN117076773A (en) 2023-11-17

Similar Documents

Publication Publication Date Title
CN100440224C (en) An automatic processing method for search engine performance evaluation
CN100507920C (en) A method for reordering search engine retrieval results based on user behavior information
US7756867B2 (en) Ranking documents
Singh et al. A comparative study of page ranking algorithms for information retrieval
CN100433007C (en) Method for providing research result
CN107977452A (en) A kind of information retrieval system and method based on big data
Pavani et al. A novel web crawling method for vertical search engines
Choudhary et al. Role of ranking algorithms for information retrieval
Alghamdi et al. Extended user preference based weighted page ranking algorithm
CN117076773B (en) Data source screening and optimizing method based on internet information
CN103257981B (en) Deep Web data surfacing method based on query interface attribute characteristics
Batra et al. Comparative study of page rank algorithm with different ranking algorithms adopted by search engine for website ranking
Lei et al. Improved relevance ranking in WebGather
Batra et al. Content based hidden web ranking algorithm (CHWRA)
Kadam Search Engine Optimization Techniques and Tools
Yerma et al. Updated page rank of dynamically generated research authors' pages: A new idea
Liang et al. R-SpamRank: a spam detection algorithm based on link analysis
Zeraatkar et al. Improvement of Page Ranking Algorithm by Negative Score of Spam Pages.
WO2005024661A2 (en) Improved search engine optimisation
Yan et al. An improved PageRank method based on genetic algorithm for web search
CN102982094B (en) A kind for the treatment of method and apparatus for network address
Bama et al. Improved pagerank algorithm for web structure mining
Zubi Ranking webpages using web structure mining concepts
Rashmi et al. Deep web crawler: exploring and re-ranking of web forms
CN109948019B (en) Deep network data acquisition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant