CN117076773B

CN117076773B - Data source screening and optimizing method based on internet information

Info

Publication number: CN117076773B
Application number: CN202311063341.9A
Authority: CN
Inventors: 闫磊; 潘俊峰; 梁雷; 聂磊; 董曙光
Original assignee: Shanghai Languiqi Technology Development Co ltd
Current assignee: Shanghai Languiqi Technology Development Co ltd
Priority date: 2023-08-23
Filing date: 2023-08-23
Publication date: 2024-05-28
Anticipated expiration: 2043-08-23
Also published as: CN117076773A

Abstract

The invention discloses a data source screening and optimizing method based on internet information, comprising the following steps: S1: the first n search results returned by each search engine are selected, placed in a content resource list, and deduplicated to form the input for screening and optimization; S2: the weight of each search result in the content resource list is initialized; S3: each content resource website in the content resource list is given a weighted score according to the scoring rule, yielding a dictionary with the content resource website as the key and the weight score as the Value; S4: the dictionary is sorted by Value from high to low, and the top m content resource websites are output to a result list as the screened and optimized data sources. By screening and optimizing data sources while crawling internet information, the method obtains data of high value, high matching degree and high reliability, which addresses the heterogeneity and low value density of internet information and provides data support and data sources for agricultural production.

Description

Data source screening and optimizing method based on internet information

Technical Field

The invention belongs to the field of big data, and particularly relates to a data source screening and optimizing method based on internet information.

Background

To better promote the development of smart and intelligent agriculture, obtaining data of high value, high matching degree and high reliability is particularly important. The internet is an important means of acquiring information, but it has drawbacks: the volume of information is huge, the variety is rich, the information is heterogeneous, and the value density is low. Consequently, obtaining the most useful data often requires a large amount of manual screening. Moreover, because the built-in search algorithms of search engines differ, the results of any single search engine are limited, so relevant results can be missed and important data lost.

Object of the Invention

To solve the above technical problems, the invention discloses a data source screening and optimizing method based on internet information. The method screens and optimizes data sources while crawling internet information so as to obtain data of high value, high matching degree and high reliability, thereby addressing the heterogeneity and low value density of internet information and providing data support and data sources for agricultural production.

The specific technical scheme of the invention is as follows:

A data source screening optimization method based on Internet information comprises the following specific steps:

S1: keywords are searched in several different internet search engines, the first n search results returned by each search engine are selected and placed in a content resource list, and the list is deduplicated to form the input for screening and optimization;

S2: the weight of each search result in the content resource list is initialized;

S3: each content resource website in the content resource list is given a weighted score according to the scoring rule, yielding a dictionary with the content resource website as the key and the weight score as the Value;

S4: the dictionary is sorted by Value from high to low, and the top m content resource websites are output to a result list as the screened and optimized data sources.
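Steps S1 and S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `build_resource_list` and `init_weights`, the engine labels, and the choice to treat two results as duplicates only when their URL strings are identical are all hypothetical, since the text does not say how duplicates are detected.

```python
from typing import Dict, List


def build_resource_list(results_by_engine: Dict[str, List[str]], n: int) -> List[str]:
    """S1 (sketch): merge the first n results of each engine, dropping exact duplicate URLs."""
    seen = set()
    merged: List[str] = []
    for urls in results_by_engine.values():
        for url in urls[:n]:
            if url not in seen:  # assumption: exact string match defines a duplicate
                seen.add(url)
                merged.append(url)
    return merged


def init_weights(resource_list: List[str]) -> Dict[str, float]:
    """S2 (sketch): initialize the Value of every content resource website to 0."""
    return {url: 0.0 for url in resource_list}


# Hypothetical assignment of URLs to engines, for illustration only.
results = {
    "engine_a": ["https://ricedata.cn/", "https://www.cgris.net/"],
    "engine_b": ["https://www.cgris.net/", "https://www.ricedata.cn/variety/"],
}
resource_list = build_resource_list(results, n=20)
weights = init_weights(resource_list)
```

Keeping first-seen order makes the merged list deterministic; a real deployment would probably also normalize URLs before comparing them.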

Preferably, the result list obtained in step S4 is further verified and evaluated, and the specific method is as follows:

The content of the content resource websites in the result list is crawled according to the expected data information items, and the ratio of crawled data information items to expected data information items is calculated. This ratio measures the value of a content resource website and is calculated as follows:

Content resource website value = number of information items crawled from the website / number of expected data information items.
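The ratio above can be sketched in a few lines. The function name `website_value` and the item names in the example are hypothetical, and the sketch assumes that only crawled items that appear among the expected items count toward the numerator, which the text does not state explicitly.

```python
from typing import Iterable


def website_value(crawled_items: Iterable[str], expected_items: Iterable[str]) -> float:
    """Ratio of expected data items actually found on a website (sketch)."""
    expected = set(expected_items)
    if not expected:
        raise ValueError("expected_items must not be empty")
    # Assumption: crawled items outside the expected set are ignored.
    found = expected & set(crawled_items)
    return len(found) / len(expected)


# Hypothetical item names, for illustration only.
expected = ["variety name", "growth period", "plant height", "yield"]
crawled = ["variety name", "yield", "price"]
value = website_value(crawled, expected)  # 2 of the 4 expected items were found
```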

Preferably, in step S3, the weight score of each content resource website is calculated from three dimensions: credibility, matching degree and generality.

Preferably, in the step S3, the content resource websites are weighted according to the formula (1):

Value = V₁*a₁ + V₂*a₂ + … + Vₙ*aₙ (1);

where Vₙ represents the score of the content resource website in the nth dimension, aₙ represents the weight ratio of the nth dimension, and a₁ + a₂ + … + aₙ = 1.
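Formula (1) is a plain weighted sum, which can be sketched as below. The function name `weighted_value` and the numbers in the example are illustrative assumptions; only the formula and the constraint that the weight ratios sum to 1 come from the text.

```python
import math
from typing import Sequence


def weighted_value(scores: Sequence[float], weight_ratios: Sequence[float]) -> float:
    """Value = V1*a1 + V2*a2 + ... + Vn*an, with a1 + ... + an = 1 (sketch)."""
    if len(scores) != len(weight_ratios):
        raise ValueError("scores and weight_ratios must have the same length")
    if not math.isclose(sum(weight_ratios), 1.0):
        raise ValueError("weight ratios must sum to 1")
    return sum(v * a for v, a in zip(scores, weight_ratios))


# Illustrative numbers only: three dimensions with weight ratios 0.5, 0.3 and 0.2.
example = weighted_value([10, 8, 6], [0.5, 0.3, 0.2])  # 5 + 2.4 + 1.2
```

Checking the constraint up front catches a mistyped weight table before it silently skews every score.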

Preferably, in step S3, the credibility weight is assigned according to the type of information publishing website, the matching degree weight is assigned according to the type of information matching, and the generality weight is assigned according to the type of applicable information standard.

Preferably, the information publishing website types include official publications of national ministries, publications of units subordinate to ministries, official data publications of provincial and local governments, publications of local units, official websites of leading industry enterprises, official websites of ordinary industry enterprises, third-party statistics websites, and e-commerce websites.

Preferably, the information matching type includes keyword matching, category matching, domain matching and industry matching.

Preferably, the information applicable standard types include national standards, industry standards, local standards, and enterprise standards.

Beneficial effects: the data source screening and optimizing method based on internet information disclosed by the invention has the following advantages:

(1) The invention uses the built-in search algorithms and ranking rules of different search engines to produce the input for preliminary screening. Drawing on several search engines in this way improves the comprehensiveness of the input data and also reduces the amount of data to be processed in the later screening and optimization, which improves screening and optimization efficiency.

(2) The invention scores content resource websites from three dimensions (credibility, matching degree and generality) and outputs the high-scoring websites, thereby screening and optimizing the data sources and improving the value, reliability and matching degree of the search results.

(3) The invention crawls the content of the screened content resource websites and compares it with preset expected data information items, so that the optimization result is further verified and evaluated and the value, reliability and matching degree of the search results are further ensured.

Drawings

FIG. 1 is a schematic diagram of a data source screening optimization method according to the present invention.

Detailed Description

The invention is further described below with reference to the drawings. Improvements and modifications to it are also to be regarded as falling within the scope of protection.

Example 1

Taking the crawling of rice seedling data in agricultural production as an example, and as shown in FIG. 1, data sources are screened and optimized based on internet information through the following steps:

Step 1: The input keyword is set to 'rice seed information'. In this embodiment the keyword is entered into four different search engines (Baidu, Sogou, 360 and Bing); the top 20 results from each search engine are placed in a content resource list, which is deduplicated to obtain the following content resource list:

[https://ricedata.cn/,https://www.ricedata.cn/variety/,https://www.cgris.net/,https://zhuanlan.zhihu.com/p/374483809,https://baike.baidu.hk/item/％E6％B0％B4％E7％A8％BB/21285,http://www.zys.moa.gov.cn/mhsh/202104/t20210422_6366373.htm,https://baike.baidu.com/item/％E7％A8％BB/4417005,https://www.ricedata.cn/variety/superice.htm,https://www.gov.cn/xinwen/2022-12/05/content_5730461.htm,http://www.jiangdu.gov.cn/jdqxxgk/nyncj/202304/9585364ff7644872a192aa4e764acbd2.shtml,...,https://www.baidu.com/linkurl＝mqtDoXWwXYVLdKcQWTGUgzJODBEum5ZwKuGHls3NrfKKlgdy2N-5kfUU9Abxpw4w&wd＝&eqid＝8e799a1c00002480000000046497f78d,https://www.baidu.com/linkurl＝mqtDoXWwXYVLdKcQWTGUgrK3K0aILqMtbYseQAn6vP2-5lVLOgsNpBv4RoklwWfcvNVoWN6OXLGcq3BtRJP_oWtzZritn37lyIlYvPn4fYDFgtxTvg7uqrzcMgWV3bkyRkgqVZEObUtkqLB3m1iUwWAzK3wAnFZXppTYghXeYDUC3pLMHonrqWLeRDJ7KcXKiqTtTRhJtZfzExYxI3mSVr4e8vLxhUSCsuL9doVU6TB0VeGXmp8QLVmkB8-HGBHCwxOUKVFM4f56y-lExxW4U_&wd＝&eqid＝8e799a1c00002480000000046497f78d,https://www.baidu.com/linkurl＝mMw2X75qEAIbS7UaWryrE30mmDQC2vfgEAU1SUVbxG9FcbNBsXgj8I8_2eBtePgQGUP49x7a0L1-uFMfzuAXOw77M9u0awzhoN6a0gmyGqy&wd＝&eqid＝8e799a1c00002480000000046497f78d,https://www.baidu.com/linkurl＝mMw2X75qEAIbS7UaWryrEEZJFrDq5Q8gbyA3LHePwBA6AkxTlgFSzbpcesUaRiFHhXCXi-xOUgwhJ__3SS16zZonqACOiHu99BsG9XVxrGS&wd＝&eqid＝8e799a1c00002480000000046497f78d].

The search engines used in the invention are not limited to those listed above; any existing search engine capable of information retrieval is applicable.

Step 2: The content resource list is taken as the input for screening and optimization, and the weight of each search result in the list is initialized, that is, the Value for each content resource website used as a dictionary key is set to 0. Each content resource website in the list is then given a weighted score according to the scoring rule, yielding a dictionary with the content resource website as the key and the weight score as the Value, as shown below:

[https://www.ricedata.cn/variety/:9,https://ricedata.cn/:8.4,https://www.ricedata.cn/variety/superice.htm:8,https://www.cgris.net/:7.8,http://www.zys.moa.gov.cn/mhsh/202104/t20210422_6366373.html:7.2,...,https://baike.baidu.com/item/％E7％A8％BB/4417005:5.8,https://baike.baidu.hk/item/％E6％B0％B4％E7％A8％BB/21285:5.8,https://zhuanlan.zhihu.com/p/374483809:4.6].

The dictionary is ordered from high to low according to the Value, and the top 20 content resource websites are output to the result list.
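The sort-and-truncate step can be sketched as follows, using a few of the key/Value pairs listed in the dictionary above. The function name `top_m` is hypothetical, and the handling of ties (Python's stable sort keeps insertion order) is an assumption, since the text does not say how equal Values are ordered.

```python
from typing import Dict, List


def top_m(scores: Dict[str, float], m: int) -> List[str]:
    """Sort websites by Value from high to low and keep the first m (sketch)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _ in ranked[:m]]


# A subset of the embodiment's dictionary, deliberately given out of order.
scores = {
    "https://zhuanlan.zhihu.com/p/374483809": 4.6,
    "https://ricedata.cn/": 8.4,
    "https://www.cgris.net/": 7.8,
    "https://www.ricedata.cn/variety/": 9,
    "https://www.ricedata.cn/variety/superice.htm": 8,
}
result_list = top_m(scores, m=3)
```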

In the invention, the weight scoring calculation is shown in the formula (1):

Value = V₁*a₁ + V₂*a₂ + … + Vₙ*aₙ (1);

The scoring rule in Example 1 is as follows: the weight score of each content resource website is calculated from three dimensions, namely credibility, matching degree and generality, so n is 3. The weight score table for each dimension is designed as follows:

Table 1. Credibility weight distribution table

Table 2. Matching degree weight distribution table

	Keyword matching	Category matching	Domain matching	Industry matching	Weight ratio
Matching degree	10	8	6	4	0.3

Table 3. Generality weight distribution table

	National standard	Industry standard	Local standard	Enterprise standard	Weight ratio
Generality	10	8	6	4	0.2
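The two tables above can be encoded as lookup tables and combined with formula (1), as in the sketch below. Two things are assumptions rather than facts from the text: the contents of Table 1 are not reproduced here, so the credibility score is passed in as a plain number, and the credibility weight ratio of 0.5 is inferred from the constraint that the three weight ratios sum to 1. The function name and dictionary keys are hypothetical.

```python
# Scores and weight ratios taken from Tables 2 and 3.
MATCHING_SCORES = {"keyword": 10, "category": 8, "domain": 6, "industry": 4}
STANDARD_SCORES = {"national": 10, "industry": 8, "local": 6, "enterprise": 4}
MATCHING_WEIGHT = 0.3
STANDARD_WEIGHT = 0.2
# Assumption: inferred as 1 - 0.3 - 0.2, because Table 1 is not available here.
CREDIBILITY_WEIGHT = 0.5


def score_website(credibility_score: float, matching_type: str, standard_type: str) -> float:
    """Apply formula (1) with n = 3 dimensions (sketch)."""
    value = (
        credibility_score * CREDIBILITY_WEIGHT
        + MATCHING_SCORES[matching_type] * MATCHING_WEIGHT
        + STANDARD_SCORES[standard_type] * STANDARD_WEIGHT
    )
    return round(value, 2)


# Hypothetical credibility score of 8; keyword matching; national standard.
example_value = score_website(8, "keyword", "national")  # 4 + 3 + 2
```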

Step 3: Based on the rice seedling data information, the expected data information items are set; there are 22 data items in total, as shown in the following table:

Table 4. Expected data information items

Step 4: According to the expected data information items, the content of the content resource websites in the result list obtained in step 2 is crawled, and the ratio of crawled data information items to expected data information items is calculated to obtain the value of each content resource website. This value is used to evaluate the quality of the screening and optimizing method. The calculation formula is: content resource website value = number of information items crawled from the website / number of expected data information items.

The evaluation criteria of the data source screening and optimizing method can be set according to the user's actual needs. For example, taking the content resource website value as the measure, a value above 85% may be rated excellent, 75% to 85% good, 60% to 75% fair, and below 60% poor.
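The example thresholds can be expressed as a small rating function. This is only a sketch: the function name `rate_value` and the English rating labels are assumed renderings, and the text does not say which band the boundary values 60%, 75% and 85% belong to, so the boundary handling below is an assumption.

```python
def rate_value(value: float) -> str:
    """Map a content resource website value (0 to 1) to a rating band (sketch)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be between 0 and 1")
    if value > 0.85:
        return "excellent"
    if value >= 0.75:  # assumption: 75% and 85% themselves fall in this band
        return "good"
    if value >= 0.60:  # assumption: 60% itself falls in this band
        return "fair"
    return "poor"


# Example: 18 of the 22 expected data items were crawled.
rating = rate_value(18 / 22)
```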

If the evaluation result is poor, the screening and optimizing method needs to be adjusted, for example by adding dimensions or by further subdividing the measurement indices of each dimension.

The foregoing is merely illustrative of the present invention and is a preferred embodiment thereof. It should be noted that modifications and adaptations to the invention may occur to one skilled in the art and are intended to be within the scope of the present invention.

Claims

1. The data source screening and optimizing method based on the Internet information is characterized by comprising the following specific steps:

S2: initializing the weight of each search result in the content resource list;

S3: according to the scoring rule shown in formula (1), giving each content resource website in the content resource list a weighted score from the three dimensions of credibility, matching degree and generality, to obtain a dictionary with the content resource website as the key and the weight score as the Value, wherein formula (1) is as follows:

Value = V₁*a₁ + V₂*a₂ + … + Vₙ*aₙ (1);

wherein Vₙ represents the score of the content resource website in the nth dimension, aₙ represents the weight ratio of the nth dimension, and a₁ + a₂ + … + aₙ = 1;

the credibility weight is assigned according to information publishing website types, wherein the information publishing website types comprise official publications of national ministries, publications of units subordinate to ministries, official data publications of provincial and local governments, publications of local units, official websites of leading industry enterprises, official websites of ordinary industry enterprises, third-party statistics websites, and e-commerce websites;

the matching degree weight is assigned according to information matching types, wherein the information matching types comprise keyword matching, category matching, domain matching and industry matching;

the generality weight is assigned according to applicable information standard types, wherein the applicable information standard types comprise national standards, industry standards, local standards and enterprise standards;

S4: sorting the dictionary by Value from high to low, and outputting the top m content resource websites to a result list as the screened and optimized data sources;

the result list obtained in step S4 is further verified and evaluated, the specific method being as follows:

Crawling content information of the content resource websites in the result list according to expected data information items, wherein the expected data items are obtained by setting according to keywords, and calculating the ratio of the crawled data information items to the expected data information items, wherein the ratio is used for measuring the value of the content resource websites, and the calculation formula is as follows: