CN108647249B - Public opinion data prediction method, device, terminal and storage medium - Google Patents

Public opinion data prediction method, device, terminal and storage medium

Info

Publication number
CN108647249B
Authority
CN
China
Prior art keywords
data
public opinion
diseases
disease
factors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810351128.0A
Other languages
Chinese (zh)
Other versions
CN108647249A (en)
Inventor
阮晓雯
徐亮
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201810351128.0A
Priority to PCT/CN2018/100229 (WO2019200786A1)
Publication of CN108647249A
Application granted
Publication of CN108647249B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/80 ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics, e.g. flu

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A public opinion data prediction method comprises the following steps: receiving at least one keyword of a disease input by a user; determining a data source related to the keyword on the Internet, and crawling disease data related to the keyword from the data source by using a crawler program; analyzing the disease data to obtain public opinion factors of the disease; performing data cleaning and abnormal value processing on the public opinion factors of the disease; performing data standardization on the public opinion factors after data cleaning and abnormal value processing to obtain new disease data; and calculating derivative variables of the public opinion factors of the disease according to the new disease data, and predicting the disease according to the derivative variables. The invention also provides a public opinion data prediction device, a terminal and a storage medium. The method can crawl more comprehensive disease data and then sort, analyze in depth and perform calculations on that data, so that basic data is turned into data that supports decision-making and a reference basis is provided for disease prediction.

Description

Public opinion data prediction method, device, terminal and storage medium
Technical Field
The invention relates to the technical field of data prediction, in particular to a public opinion data prediction method, a public opinion data prediction device, a public opinion data prediction terminal and a storage medium.
Background
With the rapid development of the Internet, computer technology has made life more convenient across many industries, and the medical field is no exception. A large amount of professional disease data and user inquiry records is scattered across the network, but this data is neither systematic nor complete. When an epidemic breaks out rapidly, website information is not updated in time and information entry lags behind, so users cannot learn the latest information promptly and cannot take preventive measures before the disease spreads.
At present, public opinion data about diseases is crawled with web crawler technology, but existing approaches have several shortcomings. First, the crawling method is simplistic: a single, simple crawler is used. Second, the crawled data is not inspected effectively or in a timely manner. In addition, the same data cleaning and filling method is applied to data with different distributions, so the data processing results are poor.
Disclosure of Invention
In view of the above, it is necessary to provide a public opinion data prediction method, apparatus, terminal and storage medium, which can crawl disease data in different data sources and adopt different data inspection, cleaning and abnormal value processing methods.
The first aspect of the present invention provides a public opinion data prediction method, including:
receiving at least one keyword of a disease input by a user;
determining a data source related to the keyword in the Internet, and crawling disease data related to the keyword from the data source by using a crawler program;
analyzing the disease data to obtain public opinion factors of the diseases;
performing data cleaning and abnormal value processing on the public opinion factors of the diseases;
performing data standardization on the public opinion factors of the diseases after data cleaning and abnormal value processing to obtain new disease data; and
calculating derivative variables of the public opinion factors of the diseases according to the new disease data, and predicting the diseases according to the derivative variables.
According to a preferred embodiment of the present invention, the determining a data source related to the keyword in the internet and using a crawler program to crawl disease data related to the keyword from the data source comprises:
determining a data source related to the keyword in the Internet, and classifying the data source according to the type of the data source;
setting up a multithreaded crawler program whose number of threads equals the number of categories obtained by classifying the data source; and
using the multithreaded crawler program to crawl disease data related to the keywords from the corresponding data sources respectively.
According to a preferred embodiment of the invention, the method further comprises:
making a chart according to the calculated derivative variables for visual display, wherein the derivative variables comprise: maximum, minimum, mean, variance, standard deviation, covariance, range, median, mode, quartile.
According to a preferred embodiment of the invention, the data normalization comprises one or a combination of several of the following:
sum normalization, standard deviation normalization, maximum normalization, or range normalization.
According to a preferred embodiment of the present invention, the crawling, by the crawler, the disease data related to the keyword from the data source comprises:
crawling disease data related to the keywords from the data source within a preset crawler time period by using a crawler program.
According to a preferred embodiment of the present invention, the analyzing the disease data to obtain a public opinion factor of a disease comprises:
calculating the sum of the number of all the sub public opinion factors of the disease, calculating the percentage of each sub public opinion factor in the sum, wherein the percentage is the weight of the corresponding sub public opinion factor, and determining the sub public opinion factor with the weight larger than a preset weight threshold value as the public opinion factor of the disease.
According to a preferred embodiment of the present invention, the data cleaning and abnormal value processing of the public opinion factors of the diseases comprises:
performing data cleaning on the public opinion factors of the diseases according to the types of the public opinion factors of the diseases;
performing missing value replacement on the public opinion factors of the diseases according to the distribution of the public opinion factors of the diseases; or
directly discarding abnormal public opinion factors of the diseases.
A second aspect of the present invention provides a public opinion data prediction apparatus, the apparatus comprising:
the receiving module is used for receiving at least one keyword of a disease input by a user;
the crawling module is used for determining a data source related to the keyword in the Internet and crawling disease data related to the keyword from the data source by utilizing a crawler program;
the analysis module is used for analyzing the disease data to obtain public opinion factors of the diseases;
the cleaning module is used for performing data cleaning and abnormal value processing on the public opinion factors of the diseases;
the standardization module is used for performing data standardization on the public opinion factors of the diseases after data cleaning and abnormal value processing to obtain new disease data; and
the prediction module is used for calculating derivative variables of the public opinion factors of the diseases according to the new disease data and predicting the diseases according to the derivative variables.
A third aspect of the present invention provides a terminal, where the terminal includes a processor and a memory, and the processor is configured to implement the public opinion data prediction method when executing a computer program stored in the memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the public opinion data prediction method.
According to the public opinion data prediction method, device, terminal and storage medium, different crawler programs are set up for different types of data sources, and the multithreaded crawler program crawls disease data related to the input keywords from the corresponding data sources. Crawling in parallel improves crawling efficiency, the crawled disease data has a uniform format, and the problem of data being difficult to crawl or impossible to parse because of the differing storage formats of the data sources, or for other reasons, is avoided. The public opinion factors of the diseases are sorted, analyzed in depth and used in calculations, and the crawled disease data is refined and then presented as charts or tables, so that results are displayed more clearly and problems can be analyzed visually. In addition, multiple variables are derived from the public opinion factors of the diseases, which increases the number of data indicators and provides a reference basis for disease prediction, so that prediction no longer relies blindly on experience and the prediction result is more accurate.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of a public opinion data prediction method according to a first embodiment of the present invention.
Fig. 2 is a flowchart of a public opinion data prediction method according to a second embodiment of the present invention.
Fig. 3 is a structural diagram of a public opinion data prediction device according to a third embodiment of the present invention.
Fig. 4 is a structural diagram of a public opinion data prediction device according to a fourth embodiment of the present invention.
Fig. 5 is a structural diagram of a terminal according to a fifth embodiment of the present invention.
The following detailed description will further illustrate the invention in conjunction with the above-described figures.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The public opinion data prediction method is applied to one or more terminals. The public opinion data prediction method can also be applied to a hardware environment consisting of a terminal and a server connected with the terminal through a network. Networks include, but are not limited to: a wide area network, a metropolitan area network, or a local area network. The public opinion data prediction method can be executed by a server or a terminal; or may be performed by both the server and the terminal.
For a terminal that needs public opinion data prediction, the function provided by the method can be integrated directly on the terminal, or a client implementing the method can be installed. As another example, the method provided by the present invention may also run on a device such as a server in the form of a Software Development Kit (SDK); an interface to the public opinion data prediction function is provided in the form of the SDK, and the terminal or other devices can predict public opinion data through the provided interface.
Example one
Fig. 1 is a flowchart of a public opinion data prediction method according to a first embodiment of the present invention. The execution sequence in the flowchart may be changed and some steps may be omitted according to different requirements.
Step 11: receiving at least one keyword of the disease input by the user.
The keywords are words related to the symptoms of a disease. For example, when the disease is a cold, the keywords may include: sneezing, runny nose, nasal congestion, headache, dizziness, dry cough, sore throat, etc. As another example, when the disease is hand, foot and mouth disease, the keywords may include: stomachache, loss of appetite, low fever, small blisters on the hands, small ulcers in the mouth, etc.
So that more data related to a disease can be crawled later, a user may enter several keywords for the disease. The keywords may be symptoms that the user knows from personal experience, or symptoms collected from disease specialists.
In this embodiment, the terminal presets a function for the user to input keywords of a disease, for example, the terminal provides a text input box through which the user can input at least one keyword. Alternatively, the terminal provides the function of a voice assistant through which the user can input at least one keyword.
Step 12: determining a data source related to the keyword on the Internet, and crawling disease data related to the keyword from the data source by using a crawler program.
Data sources on the Internet that are related to the keywords may include, but are not limited to: Baidu, Google, Tencent News, microblog, hot search lists, Zhihu, and any other website that supports user search and access. The disease data related to the keywords that the crawler program crawls from these data sources may include: Baidu Index, Google Trends, Tencent analysis data, news information, advertisement data, channel data, microblog popularity, forum public opinion information and the like.
In this embodiment, the user determines the Uniform Resource Locator (URL) of a data source on the Internet, and the crawler program crawls the disease data related to the keyword according to the URL.
Step 13: analyzing the disease data to obtain the public opinion factors of the diseases.
Specific analysis work, including public opinion analysis, is performed on the disease data to obtain the public opinion factors of the disease. This work includes text processing, text analysis, word frequency statistics, relevance analysis and the like.
In this embodiment, the public opinion factor of the disease may include a plurality of sub-public opinion factors, such as a first sub-public opinion factor, a second sub-public opinion factor, a third sub-public opinion factor, a fourth sub-public opinion factor, and the like.
For example, the first sub-public opinion factor may be headache, the second sub-public opinion factor may be runny nose, the third sub-public opinion factor may be fever, and the fourth sub-public opinion factor may be cough.
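The word frequency statistics mentioned in step 13 can be illustrated with a minimal sketch. The function name `count_sub_factors`, the sample posts and the term list are hypothetical, and a plain substring count stands in for whatever text processing the method actually uses, which the description does not specify.

```python
from collections import Counter

def count_sub_factors(texts, symptom_terms):
    """Count mentions of each symptom term across crawled texts (plain substring count)."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for term in symptom_terms:
            counts[term] += lowered.count(term.lower())
    return counts

# Hypothetical crawled posts, for illustration only.
posts = [
    "Headache and cough since Monday",
    "My cough is getting worse",
    "fever and headache",
]
counts = count_sub_factors(posts, ["headache", "runny nose", "fever", "cough"])
```

Each count can then be treated as the value of the corresponding sub-public opinion factor.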
Step 14: performing data cleaning and abnormal value processing on the public opinion factors of the diseases.
Data cleaning and abnormal value processing remove redundant data from the public opinion factors of the diseases and yield disease data in a consistent standard format, so that the processed public opinion factors are usable and better suited to subsequent analysis.
In this embodiment, the data cleaning of the public opinion factors of the diseases includes: performing data cleaning on the public opinion factors of the diseases according to their types.
Types of public opinion factors of the disease include, but are not limited to: public opinion factors containing noise, public opinion factors that do not conform to common sense, public opinion factors containing repeated information, public opinion factors with unbalanced data, inconsistent public opinion factors, incomplete public opinion factors, etc.
Public opinion factors containing noise are cleaned by removing extremely large values and negative values; those that do not conform to common sense are cleaned by removing abnormal values; those containing repeated information are cleaned by deleting duplicate entries; those with unbalanced data are cleaned by data denoising; inconsistent ones are cleaned by classifying them according to data type; and incomplete ones are cleaned by establishing relevant standard reference values.
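Two of the cleaning rules above (removing extremely large or negative values, and deleting duplicate entries) might look like the following sketch. The record layout `(date, value)`, the function name `clean_records` and the `upper_limit` parameter are assumptions made for illustration; the description does not define how "extremely large" is decided.

```python
def clean_records(records, upper_limit):
    """Drop negative or oversized values and repeated (date, value) records, keeping order."""
    seen = set()
    cleaned = []
    for date, value in records:
        if value < 0 or value > upper_limit:
            continue  # treated as noise
        if (date, value) in seen:
            continue  # repeated information
        seen.add((date, value))
        cleaned.append((date, value))
    return cleaned

# Hypothetical daily mention counts, for illustration only.
raw = [
    ("2018-04-01", 50),
    ("2018-04-01", 50),
    ("2018-04-02", -3),
    ("2018-04-03", 9999),
    ("2018-04-04", 52),
]
cleaned = clean_records(raw, upper_limit=1000)
```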
In this embodiment, the abnormal value processing of the public opinion factors of the diseases includes: replacing missing values of the public opinion factors of the diseases according to the distribution of the public opinion factors.
In this embodiment, the distribution of the public opinion factors of the disease includes, but is not limited to: stable and severe (sharply fluctuating). Stably distributed public opinion factors follow a relatively steady trend, for example 50, 53, 52, 49, 51. Severely distributed public opinion factors change sharply and with large amplitude, for example 50, 100, 43, 89, 4.
For stably distributed public opinion factors of the diseases, the K-nearest neighbor method can be used: the K samples nearest to the sample with the missing public opinion factor are determined by Euclidean distance or correlation analysis, and the missing data of that sample is estimated as the weighted average of the public opinion factor values of those K samples. Alternatively, for stably distributed public opinion factors, a prediction model can be used to predict each missing public opinion factor; a missing numerical public opinion factor can be filled with the mean, and a missing non-numerical one can be filled with the mode.
For severely distributed public opinion factors of the diseases, the missing public opinion factors can be replaced by the mean value method.
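The missing value strategies above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function names, the inverse-distance weighting and the default `k=2` are all hypothetical choices, since the description only says "weighted average" without fixing the weights.

```python
import math
from statistics import mean, mode

def knn_impute(target_features, complete_samples, k=2):
    """Estimate a missing value as the inverse-distance weighted average
    of the values of the k nearest complete samples (Euclidean distance).

    complete_samples: list of (features, value) pairs with no missing data.
    """
    nearest = sorted(
        (math.dist(target_features, features), value)
        for features, value in complete_samples
    )[:k]
    weights = [1.0 / (distance + 1e-9) for distance, _ in nearest]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, nearest))
    return weighted_sum / sum(weights)

def simple_fill(observed):
    """Fill value for a missing entry: mean for numerical data, mode otherwise."""
    if all(isinstance(x, (int, float)) for x in observed):
        return mean(observed)
    return mode(observed)

# Hypothetical samples: (feature vector, public opinion factor value).
samples = [((1.0, 1.0), 50), ((1.0, 3.0), 54), ((10.0, 10.0), 100)]
estimate = knn_impute((1.0, 2.0), samples, k=2)
```

Here the two nearest samples are equally distant from the target, so the estimate is simply their average.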
Preferably, because replacing missing public opinion factors with the mean assumes that data is missing completely at random and reduces the variance and standard deviation of the public opinion factors, the method may further include: multiplying the public opinion factors obtained after mean substitution by a preset expansion coefficient to obtain new public opinion factors, which serve as the final public opinion factors of the disease.
The expansion coefficient is set in advance and is greater than 1.
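A minimal sketch of mean substitution followed by the expansion step is given below. It assumes one reading of the text, namely that every value in the series is multiplied by the coefficient after substitution; the function name, the use of `None` to mark missing entries and the default coefficient are hypothetical.

```python
from statistics import mean

def mean_fill_and_expand(values, expansion=1.1):
    """Replace None entries with the mean of the observed values,
    then multiply every value by the expansion coefficient."""
    if expansion <= 1:
        raise ValueError("expansion coefficient must be greater than 1")
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    return [v * expansion for v in filled]

expanded = mean_fill_and_expand([50, None, 100], expansion=2)
```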
In other embodiments, the abnormal value processing of the public opinion factors of the diseases further includes: directly discarding abnormal public opinion factors of the diseases. Discarding them keeps the remaining public opinion factors clean and avoids interference during subsequent analysis.
Step 15: performing data standardization on the public opinion factors of the diseases after data cleaning and abnormal value processing to obtain new disease data.
Data standardization converts the cleaned and processed public opinion factors of the diseases into dimensionless pure numerical values, so that indicators with different units or orders of magnitude can be compared and weighted.
In this embodiment, the data normalization method includes, but is not limited to: sum normalization, standard deviation normalization, maximum normalization, range normalization, etc. Preferably, range normalization is used: after range normalization, the maximum value of the new data is 1, the minimum value is 0, and the remaining values lie between 0 and 1.
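The four normalization methods named above are commonly defined as in the sketch below. The function names are hypothetical, the population standard deviation is an assumed choice, and the code does not guard against constant or all-zero input, which would cause a division by zero.

```python
from statistics import mean, pstdev

def sum_normalize(xs):
    """Divide each value by the sum, so the results add up to 1."""
    total = sum(xs)
    return [x / total for x in xs]

def std_normalize(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def max_normalize(xs):
    """Divide each value by the maximum, so the largest result is 1."""
    peak = max(xs)
    return [x / peak for x in xs]

def range_normalize(xs):
    """Map the minimum to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

scaled = range_normalize([50, 100, 43, 89, 4])
```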
Step 16: calculating derivative variables of the public opinion factors of the diseases according to the new disease data, and predicting the diseases according to the derivative variables.
In this embodiment, the derivative variables include: maximum, minimum, mean, variance, standard deviation, covariance, range (maximum minus minimum), median, mode and quartiles. The mean, median, mode and quartiles describe the degree of concentration of the public opinion factors of the disease; the greater the degree of concentration, the more serious the disease is predicted to be. The range, variance and standard deviation describe the degree of dispersion of the public opinion factors of the disease; the smaller the degree of dispersion, the more serious the disease is predicted to be.
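Most of the derivative variables listed above can be computed with Python's standard `statistics` module, as in this sketch. The function name and dictionary keys are hypothetical; covariance is left out because it needs two series, and population variance and standard deviation are an assumed choice.

```python
import statistics

def derivative_variables(xs):
    """Summary statistics of one normalized public opinion factor series."""
    q1, q2, q3 = statistics.quantiles(xs, n=4)  # three quartile cut points
    return {
        "max": max(xs),
        "min": min(xs),
        "mean": statistics.mean(xs),
        "variance": statistics.pvariance(xs),
        "std": statistics.pstdev(xs),
        "range": max(xs) - min(xs),
        "median": statistics.median(xs),
        "mode": statistics.mode(xs),
        "quartiles": (q1, q2, q3),
    }

summary = derivative_variables([0.0, 0.25, 0.25, 0.5, 1.0])
```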
In summary, the public opinion data prediction method receives at least one keyword of a disease input by a user, determines a data source related to the keyword on the Internet, crawls disease data related to the keyword from the data source by using a crawler program, analyzes the disease data to obtain the public opinion factors of the disease, performs data cleaning and abnormal value processing on the public opinion factors, standardizes the processed public opinion factors to obtain new disease data, and calculates derivative variables of the public opinion factors from the new disease data so as to predict the disease according to the derivative variables. The user only needs to enter rough keywords related to the disease, and the crawler program crawls the disease data related to those keywords to obtain comprehensive public opinion factors of the disease. Sorting, analyzing in depth and performing calculations on the public opinion factors, together with refined processing of the crawled disease data, turns basic data into data that supports decision-making, which provides a reference basis for disease prediction and makes the prediction result more accurate.
Example two
Fig. 2 is a flowchart of a public opinion data prediction method according to a second embodiment of the present invention. The execution sequence in the flowchart may be changed and some steps may be omitted according to different requirements.
Step 21: receiving at least one keyword of the disease input by the user.
Step 21 in this embodiment is the same as step 11 in the first embodiment, and details are not repeated here.
Step 22: determining a data source related to the keyword on the Internet, and classifying the data source according to the type of the data source.
In this embodiment, the data sources related to the keywords may be divided into two categories according to their types: the first category is index data sources, and the second category is public opinion data sources. Index data sources include, but are not limited to: Baidu, Google, 360, etc. Public opinion data sources include, but are not limited to: microblogs, forums, WeChat, hot search lists, and the like.
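Classifying data sources by type could be as simple as a lookup on the host name, as in the sketch below. The mapping `SOURCE_CATEGORIES`, the category labels and both function names are hypothetical and only illustrate the two-category split described above.

```python
from urllib.parse import urlparse

# Hypothetical mapping from host name to data-source category (illustration only).
SOURCE_CATEGORIES = {
    "index.baidu.com": "index",
    "trends.google.com": "index",
    "weibo.com": "public_opinion",
}

def classify_source(url):
    """Return the category of a data-source URL, or 'uncategorized' for unknown hosts."""
    host = urlparse(url).hostname or ""
    return SOURCE_CATEGORIES.get(host, "uncategorized")

def group_by_category(urls):
    """Group URLs by category so each category can be handed to its own crawler."""
    groups = {}
    for url in urls:
        groups.setdefault(classify_source(url), []).append(url)
    return groups

groups = group_by_category([
    "https://index.baidu.com/v2/index.html",
    "https://weibo.com/search?q=flu",
    "https://trends.google.com/trends/",
])
```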
Step 23: setting up a multithreaded crawler program whose number of threads equals the number of categories obtained by classifying the data source.
Setting up a different crawler program for each type of data source makes it easier to crawl data sources of that type smoothly, and avoids data being difficult to crawl or impossible to parse because of the differing storage formats of the data sources or for other reasons.
In this embodiment, if the data sources are divided into two categories, a two-thread crawler program is set up accordingly. For example, Baidu and the microblog are two different types of data sources, each with its own text storage format; a first crawler program is dedicated to crawling disease data related to the keyword from Baidu, and a second crawler program is dedicated to crawling disease data related to the keyword from the microblog.
In other embodiments, the data sources related to the keywords in the internet may be subdivided into multiple categories according to actual needs, and a corresponding crawler program is set for each category of data source.
Step 24: using the multithreaded crawler program to crawl disease data related to the keywords from the corresponding data sources respectively.
In this embodiment, the URLs of the data sources corresponding to the crawler programs are placed in a crawling queue, and the multithreading crawler programs crawl disease data related to the keywords from the data sources in parallel.
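The parallel crawl described in steps 23 and 24 can be sketched with one worker thread per data-source category, each draining its own URL queue. This is an illustration under assumptions: the function name `crawl_in_parallel`, the per-category queues and the injected `fetch` callable are hypothetical. A stand-in `fetch` is used so the sketch runs without network access; a real one would download and parse the page.

```python
import queue
import threading

def crawl_in_parallel(urls_by_category, fetch):
    """Run one worker thread per category; each thread drains its own URL queue."""
    results = {category: [] for category in urls_by_category}

    def worker(category, url_queue):
        while True:
            try:
                url = url_queue.get_nowait()
            except queue.Empty:
                return  # queue drained, thread finishes
            # Only this thread appends to results[category].
            results[category].append(fetch(url))

    threads = []
    for category, urls in urls_by_category.items():
        url_queue = queue.Queue()
        for url in urls:
            url_queue.put(url)
        thread = threading.Thread(target=worker, args=(category, url_queue))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return results

def fake_fetch(url):
    """Stand-in for a real downloader (hypothetical, no network access)."""
    return "page:" + url

pages = crawl_in_parallel(
    {
        "index": ["https://index.example/flu"],
        "public_opinion": ["https://forum.example/flu", "https://blog.example/flu"],
    },
    fake_fetch,
)
```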
Step 25: analyzing the disease data to obtain the public opinion factors of the diseases.
Step 26: performing data cleaning and abnormal value processing on the public opinion factors of the diseases.
Step 27: performing data standardization on the public opinion factors of the diseases after data cleaning and abnormal value processing to obtain new disease data.
Steps 25 to 27 in this embodiment correspond to steps 13 to 15 in the first embodiment, respectively, and are not described in detail herein.
Step 28: calculating derivative variables of the public opinion factors of the diseases according to the new disease data, and making a chart from the calculated derivative variables for visual display.
Preferably, step 24 may further include: classifying and storing the crawled disease data.
The disease data is stored in a local database, in a storage server, or in the cloud. For example, disease data crawled from Baidu is stored in a first storage location in the local database, and disease data crawled from the microblog is stored in a second storage location in the local database. The first and second storage locations may be under the same root directory in the local database or under different root directories, and may be distinguished by different names. Storing data crawled from different data sources separately makes it easier to analyze the data from each source.
Preferably, to ensure that the crawled disease data is up to date, the disease data needs to be updated periodically, and the method may further include: crawling disease data related to the keywords from the data source within a preset crawler time period by using a crawler program.
The crawler time period is set in advance, for example from 24:00 to 3:00 every night. Few users generally access the server of the data source during this period, so crawling does not put heavy access pressure on the server, which helps the server run smoothly and can also improve crawling efficiency.
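A check for the preset crawler time period might look like the sketch below. It assumes the example window of 24:00 to 3:00 (expressed as 00:00 to 03:00) and also handles windows that cross midnight; the function name and the default values are illustrative assumptions.

```python
from datetime import datetime, time


def in_crawl_window(now, start=time(0, 0), end=time(3, 0)):
    """Return True if `now` falls inside the preset crawler time period.

    The window is half-open: `start` is included and `end` is excluded.
    A window with start > end is treated as crossing midnight.
    """
    current = now.time()
    if start <= end:
        return start <= current < end
    return current >= start or current < end


if __name__ == "__main__":
    if in_crawl_window(datetime.now()):
        print("inside the crawler time period, crawling may start")
    else:
        print("outside the crawler time period, crawling is skipped")
```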
Preferably, after crawling disease data related to the keyword from the data source within a preset crawler time period by using a crawler program, and parsing the disease data to obtain the public opinion factors of the disease, the method may further include: quantifying each sub-public opinion factor of the disease to obtain its weight, and determining the sub-public opinion factors whose weight is greater than a preset weight threshold as the public opinion factors of the disease.
The specific process of quantifying each sub-public opinion factor of the disease to obtain its weight is as follows: calculating the total count of all sub-public opinion factors of the disease, and calculating the percentage of that total accounted for by each sub-public opinion factor; this percentage is the weight of the corresponding sub-public opinion factor.
The weight threshold is set in advance. When the weight of a sub-public opinion factor is greater than the preset weight threshold, the sub-public opinion factor is determined as a public opinion factor of the disease. In this way, sub-public opinion factors with smaller weights are effectively screened out, which reduces the amount of computation and shortens the disease prediction time, while the sub-public opinion factors with smaller weights do not affect the disease prediction result.
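The weight calculation and threshold screening can be illustrated with a short sketch. It assumes that each sub-public opinion factor is represented by a count, and it expresses the weight as a fraction of the total rather than as a percentage; the function name, the sample counts and the threshold of 0.1 are assumptions chosen for illustration.

```python
def select_factors(counts, weight_threshold):
    """Keep the sub-factors whose share of the total count exceeds the threshold.

    `counts` maps a sub-factor name to its count. The weight of a sub-factor
    is its count divided by the sum of all counts.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    weights = {name: count / total for name, count in counts.items()}
    return {name: w for name, w in weights.items() if w > weight_threshold}


if __name__ == "__main__":
    counts = {"headache": 50, "runny nose": 30, "fever": 15, "cough": 5}
    for name, weight in select_factors(counts, 0.1).items():
        print(f"{name}: {weight:.2f}")
```

With these sample counts, "cough" has a weight of 0.05 and is screened out, while the other three sub-factors are kept.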
In summary, the public opinion data prediction method receives at least one keyword of a disease input by a user, determines the data sources related to the keyword in the internet, and classifies the data sources according to their type. Multithreaded crawler programs equal in number to the categories obtained by the classification are set, and are used to crawl the disease data related to the keyword from the corresponding data sources. The disease data is parsed to obtain the public opinion factors of the disease, data cleaning and abnormal value processing are performed on the public opinion factors, and the cleaned and processed public opinion factors are standardized to obtain new disease data. Derived variables of the public opinion factors are then calculated from the new disease data and made into a chart for visual display, so that the disease is predicted. Because different crawler programs are set for different types of data sources and the multithreaded crawler programs crawl the disease data related to the input keywords in parallel, crawling efficiency is improved, the data format of the crawled disease data is uniform, and difficulties in crawling, or crawled data that cannot be parsed, caused by the different storage formats of different data sources or by other problems are avoided. In addition, the crawled disease data is refined and then presented as charts or tables, which makes the results clearer and problem analysis more convenient and intuitive, and provides a reference basis for disease prediction, so that the prediction result is accurate.
The above description is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and it will be apparent to those skilled in the art that modifications may be made without departing from the inventive concept of the present invention, and these modifications are within the scope of the present invention.
The following describes functional modules and hardware structures of a terminal implementing the public opinion data prediction method with reference to fig. 3 to 5.
Example three
Fig. 3 is a functional block diagram of a public opinion data prediction apparatus according to a third embodiment of the present invention.
In some embodiments, the public opinion data prediction apparatus 30 runs in a terminal. The public opinion data prediction apparatus 30 may include a plurality of functional modules composed of program code segments. The program code of each program segment of the public opinion data prediction apparatus 30 may be stored in a memory and executed by at least one processor to perform the prediction of public opinion data (see fig. 1 and the related description).
In this embodiment, the public opinion data prediction apparatus 30 of the terminal may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a receiving module 301, a crawling module 302, a parsing module 303, a cleaning module 304, an expanding module 305, a normalizing module 306, and a predicting module 307. A module as referred to herein is a series of computer program segments that are stored in the memory, can be executed by at least one processor, and perform a fixed function. The functions of the modules are described in detail below.
The receiving module 301 is configured to receive at least one keyword of a disease input by a user.
The keywords are words related to the symptoms of a disease. For example, when the disease is a cold, the keywords may include: sneezing, runny nose, nasal congestion, headache, dizziness, dry cough, sore throat, etc. As another example, when the disease is hand, foot and mouth disease, the keywords may include: stomachache, loss of appetite, low fever, small blisters on the hands, small ulcers in the mouth, etc.
To facilitate subsequent crawling to more data related to a disease, a user may enter a number of keywords for the disease. The keywords may be symptoms of a disease obtained by the user based on his own experience, or may be symptoms of a disease collected from a disease specialist.
In this embodiment, the terminal presets a function for the user to input keywords of a disease, for example, the terminal provides a text input box through which the user can input at least one keyword. Alternatively, the terminal provides the function of a voice assistant through which the user can input at least one keyword.
A crawling module 302, configured to determine a data source in the internet related to the keyword, and crawl disease data related to the keyword from the data source by using a crawler program.
Data sources in the internet that are related to the keywords may include, but are not limited to: Baidu, Google, Tencent News, microblogs, hot searches, Zhihu, and any other website that supports user search and access. The disease data related to the keywords that is crawled from the various data sources by the crawler program may include: Baidu Index, Google Trends, Tencent Analytics, news information, advertisement data, channel data, microblog popularity, forum public opinion information, and the like.
In this embodiment, the user determines a Uniform Resource Locator (URL) of a data source in the internet, and the crawler program crawls disease data related to the keyword according to the URL.
The parsing module 303 is configured to parse the disease data to obtain the public opinion factors of the disease.
Public opinion analysis is performed on the disease data; the specific analysis work includes text processing, text analysis, word-frequency statistics, relevance analysis, and the like, so as to obtain the public opinion factors of the disease.
In this embodiment, the public opinion factors of the disease may include a plurality of sub-public opinion factors, such as a first sub-public opinion factor, a second sub-public opinion factor, a third sub-public opinion factor, a fourth sub-public opinion factor, and the like.
For example, the first sub-public opinion factor may be headache, the second sub-public opinion factor may be runny nose, the third sub-public opinion factor may be fever, and the fourth sub-public opinion factor may be cough.
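As an illustration of the word-frequency statistics mentioned above, the sketch below counts how often each candidate symptom term appears in a set of crawled texts. It is a simplified assumption of how sub-public opinion factors could be counted: it uses plain case-insensitive substring matching on English text, and the function name and sample texts are hypothetical. Text in a language without word spacing would first need word segmentation, which is not shown.

```python
from collections import Counter


def count_symptom_mentions(texts, terms):
    """Count case-insensitive occurrences of each term across all texts."""
    counts = Counter({term: 0 for term in terms})
    for text in texts:
        lowered = text.lower()
        for term in terms:
            counts[term] += lowered.count(term.lower())
    return counts


if __name__ == "__main__":
    texts = [
        "Headache and fever since Monday",
        "Fever again, with a mild cough",
        "No headache today",
    ]
    terms = ["headache", "runny nose", "fever", "cough"]
    for term, count in count_symptom_mentions(texts, terms).items():
        print(term, count)
```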
The cleaning module 304 is configured to perform data cleaning and abnormal value processing on the public opinion factors of the disease.
Data cleaning and abnormal value processing are performed on the public opinion factors of the disease in order to eliminate redundant data and obtain disease data in a consistent standard format, so that the cleaned and processed public opinion factors are usable and better suited to the subsequent analysis work.
The cleaning module 304 is further configured to perform data cleaning on the public opinion factors of the disease according to their types.
The types of public opinion factors of the disease include, but are not limited to: public opinion factors containing noise, public opinion factors that do not conform to common sense, public opinion factors containing repeated information, public opinion factors with unbalanced data, inconsistent public opinion factors, incomplete public opinion factors, etc.
For public opinion factors containing noise, data cleaning is performed by removing extremely large values and negative values; for public opinion factors that do not conform to common sense, by removing abnormal values; for public opinion factors containing repeated information, by deleting duplicate items; for public opinion factors with unbalanced data, by data denoising; for inconsistent public opinion factors, by classifying according to data type; and for incomplete public opinion factors, by establishing relevant standard reference values.
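Two of the cleaning rules above, removing extremely large and negative values and deleting duplicate items, can be sketched as follows. The upper limit that defines an "extremely large" value is not specified in the embodiment, so it is passed in as an assumed parameter; the function names and the tuple record layout are likewise illustrative.

```python
def remove_noise(values, upper_limit):
    """Drop negative values and values above an assumed upper limit."""
    return [v for v in values if 0 <= v <= upper_limit]


def drop_duplicates(records):
    """Delete repeated records while keeping the first occurrence and the order.

    Records must be hashable, for example tuples of (date, value).
    """
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


if __name__ == "__main__":
    print(remove_noise([50, -3, 52, 100000, 49], upper_limit=1000))
    print(drop_duplicates([("day1", 50), ("day2", 53), ("day1", 50)]))
```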
The cleaning module 304 is further configured to replace missing values of the disease public opinion factors according to distribution of the disease public opinion factors.
In this embodiment, the distribution of the public opinion factors of the disease includes, but is not limited to: stable and volatile. Stably distributed public opinion factors are those whose trend is relatively steady, for example 50, 53, 52, 49, 51, etc. Volatile public opinion factors are those whose trend changes sharply and with large amplitude, for example 50, 100, 43, 89, 4, etc.
For stably distributed public opinion factors of the disease, a K-nearest-neighbor method may be adopted: the K samples nearest to the sample with the missing public opinion factor are determined according to Euclidean distance or correlation analysis, and the missing data of that sample is estimated as the weighted average of the public opinion factor values of those K samples. Alternatively, for stably distributed public opinion factors, a prediction model may be used to predict each missing public opinion factor; if the missing public opinion factor is numerical, it may be filled with the average value, and if it is non-numerical, it may be filled with the mode.
For volatile public opinion factors of the disease, a mean value method may be adopted to replace the missing public opinion factors.
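The K-nearest-neighbor estimate can be sketched as below. The embodiment only states that a weighted average of the K nearest samples is used; the inverse-distance weighting, the representation of samples as equal-length lists, and the use of None to mark a missing value are assumptions made for this illustration.

```python
import math


def knn_impute(target, samples, k=2):
    """Fill the None entries of `target` from its k nearest complete samples.

    Distance is the Euclidean distance over the positions observed in
    `target`. Neighbors are weighted by inverse distance (an assumption).
    """
    missing = [i for i, v in enumerate(target) if v is None]
    observed = [i for i, v in enumerate(target) if v is not None]

    def distance(sample):
        return math.sqrt(sum((target[i] - sample[i]) ** 2 for i in observed))

    neighbors = sorted(samples, key=distance)[:k]
    # A small constant avoids division by zero for identical samples.
    weights = [1.0 / (distance(s) + 1e-9) for s in neighbors]

    filled = list(target)
    for i in missing:
        weighted_sum = sum(w * s[i] for w, s in zip(weights, neighbors))
        filled[i] = weighted_sum / sum(weights)
    return filled


if __name__ == "__main__":
    complete = [[50, 53, 52], [50, 56, 46], [90, 95, 99]]
    print(knn_impute([50, 54, None], complete, k=2))
```

In this example the two nearest samples lie at distances 1 and 2, so the closer one contributes twice the weight of the other to the estimate.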
The cleaning module 304 is further configured to directly discard abnormal public opinion factors of the disease. Directly discarding the abnormal public opinion factors keeps the obtained public opinion factors clean and avoids interference when they are analyzed.
The expanding module 305 is configured to perform an arithmetic operation on the public opinion factor of the disease obtained through mean value replacement and a preset expansion coefficient, and to take the resulting new public opinion factor as the final public opinion factor of the disease. Replacing missing public opinion factors by the mean value method rests on the assumption that data is missing completely at random, and it makes the variance and standard deviation of the public opinion factors smaller. The expansion coefficient is set in advance and is greater than 1.
The normalizing module 306 is configured to perform data standardization on the public opinion factors of the disease after data cleaning and abnormal value processing to obtain new disease data.
Data standardization is performed on the cleaned and processed public opinion factors of the disease in order to convert them into dimensionless pure numerical values, so that indexes of different units or orders of magnitude can be compared and weighted.
In this embodiment, the data standardization methods include, but are not limited to: sum normalization, standard deviation normalization, maximum normalization, range normalization, etc. Preferably, range normalization is used: after range normalization, the maximum and minimum values of the new data are 1 and 0 respectively, and the remaining values lie between 0 and 1.
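Range normalization maps each value x to (x - min) / (max - min). A minimal sketch follows; the function name is illustrative, and returning all zeros for a constant series is an assumption, since the embodiment does not say how that case is handled.

```python
def range_normalize(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumed convention for a constant series: every value maps to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


if __name__ == "__main__":
    print(range_normalize([50, 100, 43, 89, 4]))
```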
The predicting module 307 is configured to calculate derived variables of the public opinion factors of the disease according to the new disease data, and to predict the disease according to the derived variables.
In this embodiment, the derived variables include: maximum, minimum, mean, variance, standard deviation, covariance, range (maximum minus minimum), median, mode, and quartiles. The mean, median, mode and quartiles describe the degree of concentration of the public opinion factors of the disease; the greater the degree of concentration, the more serious the disease is predicted to be. The range, variance and standard deviation describe the degree of dispersion of the public opinion factors; the smaller the degree of dispersion, the more serious the disease is predicted to be.
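Most of the derived variables listed above can be computed with the Python standard library, as in the sketch below (Python 3.8 or later is assumed for `statistics.quantiles`). The choice of population variance and standard deviation is an assumption, since the embodiment does not say whether population or sample statistics are meant. Covariance is omitted because it requires two series.

```python
import statistics


def derived_variables(values):
    """Compute summary statistics of one normalized public opinion factor series."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "maximum": max(values),
        "minimum": min(values),
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),  # population variance (assumed)
        "standard_deviation": statistics.pstdev(values),
        "range": max(values) - min(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "quartiles": (q1, q2, q3),
    }


if __name__ == "__main__":
    for name, value in derived_variables([0.0, 0.25, 0.5, 0.5, 1.0]).items():
        print(name, value)
```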
The public opinion data prediction apparatus 30 receives at least one keyword of a disease input by a user through the receiving module 301; the crawling module 302 determines a data source related to the keyword in the internet and crawls disease data related to the keyword from the data source by using a crawler program; the parsing module 303 parses the disease data to obtain the public opinion factors of the disease; the cleaning module 304 performs data cleaning and abnormal value processing on the public opinion factors; the normalizing module 306 performs data standardization on the cleaned and processed public opinion factors to obtain new disease data; and the predicting module 307 calculates derived variables of the public opinion factors according to the new disease data and predicts the disease according to the derived variables. The user only needs to input rough keywords related to the disease, and the crawler program crawls the disease data related to the input keywords to obtain comprehensive public opinion factors of the disease. Sorting, deeply analyzing and calculating the public opinion factors, and thereby refining the crawled disease data, serves both basic data display and decision data display, providing a reference basis for disease prediction so that the prediction result is accurate.
Example four
Fig. 4 is a functional block diagram of a public opinion data prediction apparatus according to a fourth embodiment of the present invention.
In some embodiments, the public opinion data prediction apparatus 40 runs in a terminal. The public opinion data prediction apparatus 40 may include a plurality of functional modules composed of program code segments. The program code of each program segment of the public opinion data prediction apparatus 40 may be stored in a memory and executed by at least one processor to perform the prediction of public opinion data (see fig. 2 and the related description).
In this embodiment, the public opinion data prediction apparatus 40 of the terminal may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a receiving module 401, a classification module 402, a setting module 403, a crawling module 404, a parsing module 405, a cleaning module 406, a normalization module 407, a visualization module 408, a storage module 409, and a quantification module 410. A module as referred to herein is a series of computer program segments that are stored in the memory, can be executed by at least one processor, and perform a fixed function. The functions of the modules are described in detail below.
The receiving module 401 is configured to receive at least one keyword of a disease input by a user.
A classification module 402, configured to determine a data source in the internet related to the keyword, and classify the data source according to a type of the data source.
In this embodiment, the data sources related to the keywords may be divided into two categories according to their type: the first category is index data sources, and the second category is public opinion data sources. The index data sources include, but are not limited to: Baidu, Google, 360, etc. The public opinion data sources include, but are not limited to: microblogs, forums, WeChat, hot searches, and the like.
The setting module 403 is configured to set multithreaded crawler programs equal in number to the categories obtained by classifying the data sources.
Setting different crawler programs for different types of data sources makes crawling each type of data source smoother, and avoids difficulties in crawling, or crawled data that cannot be parsed, caused by the different storage formats of different data sources or by other problems.
In this embodiment, if the data sources are divided into two categories, a dual-thread crawler program is set accordingly. For example, Baidu and microblogs are two different types of data sources, each with its own text storage format; a first crawler program is set to be dedicated to crawling the disease data related to the keyword from Baidu, and a second crawler program is set to be dedicated to crawling the disease data related to the keyword from microblogs.
In other embodiments, the data sources related to the keywords in the internet may be subdivided into multiple categories according to actual needs, and a corresponding crawler program is set for each category of data source.
A crawling module 404, configured to crawl, by using the multithreading crawler program, disease data related to the keyword from the corresponding data sources respectively.
In this embodiment, URLs of data sources corresponding to crawlers are placed in a crawling queue, and the multithreading crawlers crawl disease data related to the keywords from the data sources in parallel.
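The crawling queue and the parallel, per-category crawlers can be sketched with the standard `threading` and `queue` modules. The sketch runs one thread per data source category, each draining its own URL queue. To stay self-contained it uses stand-in crawler functions that return a record instead of performing a network request; the function names, the category names and the example.com URLs are placeholders, not part of the embodiment.

```python
import queue
import threading


def crawl_index_source(url, keyword):
    """Stand-in for an index-source crawler; a real one would fetch `url`."""
    return {"category": "index", "url": url, "keyword": keyword}


def crawl_opinion_source(url, keyword):
    """Stand-in for a public-opinion-source crawler."""
    return {"category": "opinion", "url": url, "keyword": keyword}


CRAWLERS = {"index": crawl_index_source, "opinion": crawl_opinion_source}


def crawl_all(sources, keyword):
    """Crawl every URL in `sources`, using one thread per category.

    `sources` maps a category name to the list of URLs in that category.
    """
    results = queue.Queue()  # thread-safe collection of crawled records

    def worker(category, url_queue):
        crawler = CRAWLERS[category]
        while True:
            try:
                url = url_queue.get_nowait()
            except queue.Empty:
                return
            results.put(crawler(url, keyword))

    threads = []
    for category, urls in sources.items():
        url_queue = queue.Queue()  # crawling queue for this category
        for url in urls:
            url_queue.put(url)
        thread = threading.Thread(target=worker, args=(category, url_queue))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()

    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected


if __name__ == "__main__":
    sources = {
        "index": ["https://example.com/index"],
        "opinion": ["https://example.com/forum", "https://example.com/posts"],
    }
    for record in crawl_all(sources, "fever"):
        print(record["category"], record["url"])
```

Because each category has its own crawler function, format-specific parsing can live inside that function, which matches the stated reason for setting a dedicated crawler program per category.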
The parsing module 405 is configured to parse the disease data to obtain the public opinion factors of the disease.
The cleaning module 406 is configured to perform data cleaning and abnormal value processing on the public opinion factors of the disease.
The normalization module 407 is configured to perform data standardization on the public opinion factors of the disease after data cleaning and abnormal value processing to obtain new disease data.
The visualization module 408 is configured to calculate derived variables of the public opinion factors of the disease according to the new disease data, and to make a chart from the calculated derived variables for visual display.
The storage module 409 is configured to classify and store the crawled disease data.
The disease data may be stored in a local database, on a storage server, or in the cloud. For example, the disease data crawled from Baidu is stored in a first storage location in the local database, and the disease data crawled from microblogs is stored in a second storage location in the local database. The first storage location and the second storage location may be under the same root directory of the local database or under different root directories, and may be distinguished by different names. Classifying and storing the data crawled from different data sources makes it convenient to analyze the data of the same data source together.
Preferably, to ensure that the crawled disease data is up to date, the disease data needs to be updated periodically, and the crawling module 404 is further configured to crawl the disease data related to the keyword from the data source within a preset crawler time period by using a crawler program.
The crawler time period is set in advance, for example from 24:00 to 3:00 every night. Few users generally access the server of the data source during this period, so crawling does not put heavy access pressure on the server, which helps the server run smoothly and can also improve crawling efficiency.
Preferably, after the crawler program is used to crawl the disease data related to the keyword from the data source within a preset crawler time period, and the disease data is parsed to obtain the public opinion factors of the disease, the public opinion data prediction apparatus 40 may further include a quantification module 410, configured to quantify each sub-public opinion factor of the disease to obtain its weight, and to determine the sub-public opinion factors whose weight is greater than a preset weight threshold as the public opinion factors of the disease.
The specific process of quantifying each sub-public opinion factor of the disease to obtain its weight is as follows: calculating the total count of all sub-public opinion factors of the disease, and calculating the percentage of that total accounted for by each sub-public opinion factor; this percentage is the weight of the corresponding sub-public opinion factor.
The weight threshold is set in advance. When the weight of a sub-public opinion factor is greater than the preset weight threshold, the sub-public opinion factor is determined as a public opinion factor of the disease. In this way, sub-public opinion factors with smaller weights are effectively screened out, which reduces the amount of computation and shortens the disease prediction time, while the sub-public opinion factors with smaller weights do not affect the disease prediction result.
In summary, the public opinion data prediction apparatus 40 receives at least one keyword of a disease input by a user through the receiving module 401; the classification module 402 determines the data sources related to the keyword in the internet and classifies them according to their type; the setting module 403 sets multithreaded crawler programs equal in number to the categories obtained by classifying the data sources; the crawling module 404 uses the multithreaded crawler programs to crawl the disease data related to the keyword from the corresponding data sources; the parsing module 405 parses the disease data to obtain the public opinion factors of the disease; the cleaning module 406 performs data cleaning and abnormal value processing on the public opinion factors; the normalization module 407 performs data standardization on the cleaned and processed public opinion factors to obtain new disease data; and the visualization module 408 calculates derived variables of the public opinion factors according to the new disease data and makes a chart from the calculated derived variables for visual display, so that the disease is predicted.
Because different crawler programs are set for different types of data sources and the multithreaded crawler programs crawl the disease data related to the input keywords from the corresponding data sources in parallel, crawling efficiency is improved, the data format of the crawled disease data is uniform, and difficulties in crawling, or crawled data that cannot be parsed, caused by the different storage formats of different data sources or by other problems are avoided. In addition, the crawled disease data is refined and then presented as charts or tables, which makes the results clearer and problem analysis more convenient and intuitive, and provides a reference basis for disease prediction, so that the prediction result is accurate. The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a dual-screen device, or a network device) or a processor to execute parts of the methods according to the embodiments of the present invention.
Example five
Fig. 5 is a schematic diagram of a terminal according to a fifth embodiment of the present invention.
The terminal 5 includes: a memory 51, at least one processor 52, a computer program 53 stored in said memory 51 and executable on said at least one processor 52, at least one communication bus 54.
The at least one processor 52 executes the computer program 53 to implement the steps in the above-mentioned public opinion data prediction method embodiment, or the at least one processor 52 executes the computer program 53 to implement the functions of the modules/units in the above-mentioned device embodiment.
Illustratively, the computer program 53 may be divided into one or more modules/units, which are stored in the memory 51 and executed by the at least one processor 52 to carry out the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 53 in the terminal 5.
The terminal 5 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or another computing device. It will be appreciated by those skilled in the art that the schematic diagram of fig. 5 is merely an example of the terminal 5 and does not constitute a limitation of the terminal 5, which may include more or fewer components than those shown, or combine some components, or have different components; for example, the terminal 5 may further include input and output devices, network access devices, buses, etc.
The at least one processor 52 may be a central processing unit, or may be another general purpose processor, a digital signal processor, an application specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The processor 52 may be a microprocessor or any conventional processor or the like. The processor 52 is the control center of the terminal 5 and connects the various parts of the whole terminal 5 through various interfaces and lines.
The memory 51 may be used for storing the computer program 53 and/or the modules/units, and the processor 52 implements various functions of the terminal 5 by running or executing the computer program and/or the modules/units stored in the memory 51 and calling data stored in the memory 51. The memory 51 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the terminal 5 (such as audio data, a phonebook, etc.), and the like. Further, the memory 51 may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a smart memory card, a secure digital card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or other non-volatile solid state storage device.
The modules/units integrated with the terminal 5, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer memory, read-only memory, random access memory, electrical carrier signal, telecommunications signal, software distribution medium, etc. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
In the embodiments provided in the present invention, it should be understood that the disclosed terminal and method can be implemented in other manners. For example, the above-described terminal embodiment is only illustrative, for example, the division of the unit is only one logical function division, and there may be another division manner in actual implementation.
In addition, functional units in the embodiments of the present invention may be integrated into the same processing unit, or each unit may exist alone physically, or two or more units are integrated into the same unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in the claims should not be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other elements, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention is described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit and scope of those technical solutions.

Claims (10)

1. A public opinion data prediction method, characterized in that the method comprises the following steps:
receiving at least one keyword of a disease input by a user;
determining a data source related to the keyword in the Internet, and crawling disease data related to the keyword from the data source by using a crawler program;
analyzing the disease data to obtain public opinion factors of the diseases;
performing data cleaning and abnormal value processing on the public opinion factors of the diseases, wherein the abnormal value processing comprises: performing missing value replacement on the public opinion factors of the diseases according to the distribution of the public opinion factors of the diseases, wherein the distribution of the public opinion factors of the diseases is either stable or severe; if the distribution of the public opinion factors of the diseases is stable, estimating the missing public opinion factors by a K-nearest neighbor method; if the distribution of the public opinion factors of the diseases is severe, calculating a replacement value by an averaging method, replacing the missing public opinion factors of the diseases with it, and multiplying the replaced public opinion factors of the diseases by a preset expansion coefficient to obtain new public opinion factors as the final public opinion factors of the diseases;
performing data standardization on the public opinion factors of the diseases after the data cleaning and abnormal value processing to obtain new disease data; and
calculating derivative variables of the public opinion factors of the diseases according to the new disease data, and predicting the diseases according to the derivative variables.
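The missing-value replacement step of claim 1 can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the claim does not say how a distribution is judged stable or severe, so the sketch assumes a coefficient-of-variation threshold (`cv_threshold`); it takes "K-nearest neighbor" to mean the `k` observed values closest in position within the series; and it applies the expansion coefficient only in the severe branch, which is one possible reading of the claim. The function name and all parameter defaults are hypothetical.

```python
from statistics import mean, pstdev


def fill_missing(values, k=2, cv_threshold=0.5, expansion=1.1):
    """Replace None entries in a series of public opinion factor values."""
    observed = [(i, v) for i, v in enumerate(values) if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    observed_values = [v for _, v in observed]
    avg = mean(observed_values)

    # Assumption: "stable" means a low coefficient of variation (std / |mean|).
    stable = avg != 0 and pstdev(observed_values) / abs(avg) <= cv_threshold

    filled = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(v)
        elif stable:
            # Stable branch: average the k observed points nearest in position.
            nearest = sorted(observed, key=lambda p: abs(p[0] - i))[:k]
            filled.append(mean(val for _, val in nearest))
        else:
            # Severe branch: mean replacement, scaled by the expansion coefficient.
            filled.append(avg * expansion)
    return filled
```

For example, `fill_missing([10, None, 12, 11])` takes the stable branch and averages the two neighbouring observations, while a series such as `[1, None, 100, 1]` takes the severe branch and uses the scaled mean.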
2. The method of claim 1, wherein the determining a data source in the internet that is relevant to the keyword and using a crawler to crawl disease data from the data source that is relevant to the keyword comprises:
determining a data source related to the keyword in the Internet, and classifying the data source according to the type of the data source;
setting up multithreaded crawler programs equal in number to the categories obtained by classifying the data source;
and crawling, by using the multithreaded crawler programs, disease data related to the keyword from the respective corresponding data sources.
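One way to picture claim 2 is a thread pool with one worker per data-source category. The sketch below is a minimal illustration under assumptions: sources are given as hypothetical `(url, source_type)` pairs, and the actual page download is delegated to a caller-supplied `fetch(url, keyword)` function, because the patent does not specify the crawler itself.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def classify_sources(sources):
    """Group (url, source_type) pairs by source type."""
    groups = defaultdict(list)
    for url, source_type in sources:
        groups[source_type].append(url)
    return dict(groups)


def crawl_by_category(sources, keyword, fetch):
    """Crawl each category of data source in its own thread.

    `fetch` is a hypothetical callable (url, keyword) -> data supplied by the caller.
    """
    groups = classify_sources(sources)
    if not groups:
        return {}

    def crawl_group(urls):
        return [fetch(url, keyword) for url in urls]

    # One worker thread per category, matching the number of categories.
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        futures = {cat: pool.submit(crawl_group, urls) for cat, urls in groups.items()}
        return {cat: future.result() for cat, future in futures.items()}
```

Passing `fetch` in as a parameter keeps the threading logic independent of any particular HTTP library and makes it easy to exercise with a stub.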
3. The method of claim 1, wherein the method further comprises:
making a chart according to the calculated derivative variables for visual display, wherein the derivative variables comprise: maximum, minimum, mean, variance, standard deviation, covariance, range, median, mode, quartile.
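The derivative variables listed in claim 3 are standard descriptive statistics. The sketch below computes them with Python's `statistics` module; it is an illustration only, and the choice of sample (rather than population) variance and covariance is an assumption, as are the function names. Charting is omitted; the resulting dictionary could be handed to any plotting tool.

```python
import statistics


def derived_variables(values):
    """Compute the single-series statistics named in claim 3."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),  # sample variance (assumption)
        "std_dev": statistics.stdev(values),      # sample standard deviation
        "range": max(values) - min(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "quartiles": statistics.quantiles(values, n=4),
    }


def sample_covariance(xs, ys):
    """Sample covariance between two equally long series of factor values."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
```

Covariance is kept separate because it relates two public opinion factors, whereas the other variables describe a single series.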
4. The method of claim 1, wherein the data normalization comprises one or a combination of:
sum normalization, standard deviation normalization, maximum normalization, or range normalization.
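The four normalization schemes named in claim 4 can be illustrated as below. The formulas are the conventional definitions (division by the sum, z-score, division by the maximum, and min-max scaling); the patent does not spell them out, so treat them as assumptions. The function names are hypothetical, and the z-score uses the sample standard deviation.

```python
from statistics import mean, stdev


def sum_normalize(values):
    """Divide each value by the sum of all values."""
    total = sum(values)
    if total == 0:
        raise ValueError("sum is zero")
    return [v / total for v in values]


def std_normalize(values):
    """Z-score: subtract the mean and divide by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero")
    return [(v - mu) / sigma for v in values]


def max_normalize(values):
    """Divide each value by the maximum value."""
    largest = max(values)
    if largest == 0:
        raise ValueError("maximum is zero")
    return [v / largest for v in values]


def range_normalize(values):
    """Min-max scaling to the interval [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("range is zero")
    return [(v - low) / (high - low) for v in values]
```

Each function raises `ValueError` rather than dividing by zero, so degenerate series (all zeros, or all identical values) are reported explicitly.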
5. The method of claim 1, wherein the crawling disease data related to the keyword from the data source using a crawler comprises:
crawling, by using a crawler program, disease data related to the keyword from the data source within a preset crawling time period.
6. The method of claim 1, wherein the parsing the disease data to obtain a public opinion factor for a disease comprises:
calculating the sum of the number of all the sub public opinion factors of the disease, calculating the percentage of each sub public opinion factor in the sum, wherein the percentage is the weight of the corresponding sub public opinion factor, and determining the sub public opinion factor with the weight larger than a preset weight threshold value as the public opinion factor of the disease.
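The weighting rule of claim 6 amounts to a share-of-total calculation followed by a threshold filter. A minimal sketch, assuming the sub public opinion factors arrive as a hypothetical mapping from factor name to occurrence count:

```python
def select_factors(counts, weight_threshold):
    """Keep sub-factors whose share of the total count exceeds the threshold.

    `counts` maps a sub public opinion factor name to its number of occurrences.
    Returns a mapping from the retained factor names to their weights.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    weights = {name: count / total for name, count in counts.items()}
    return {name: w for name, w in weights.items() if w > weight_threshold}
```

With the illustrative counts `{"fever": 50, "cough": 30, "rash": 20}` and a threshold of 0.25, the weights are 0.5, 0.3, and 0.2, so only `fever` and `cough` are retained.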
7. The method of claim 1, wherein the data cleaning of the public opinion factors of the diseases comprises:
performing data cleaning on the public opinion factors of the diseases according to the types of the public opinion factors of the diseases; or
directly discarding the abnormal public opinion factors of the diseases.
8. A public opinion data prediction apparatus, characterized in that the apparatus comprises:
the receiving module is used for receiving at least one keyword of a disease input by a user;
the crawling module is used for determining a data source related to the keyword in the Internet and crawling disease data related to the keyword from the data source by utilizing a crawler program;
the analysis module is used for analyzing the disease data to obtain public opinion factors of the diseases;
a cleaning module, configured to perform data cleaning and abnormal value processing on the public opinion factors of the diseases, wherein the abnormal value processing comprises: performing missing value replacement on the public opinion factors of the diseases according to the distribution of the public opinion factors of the diseases, wherein the distribution of the public opinion factors of the diseases is either stable or severe; if the distribution of the public opinion factors of the diseases is stable, estimating the missing public opinion factors of the diseases by a K-nearest neighbor method; if the distribution of the public opinion factors of the diseases is severe, replacing the missing public opinion factors of the diseases by a mean method, and multiplying the replaced public opinion factors of the diseases by a preset expansion coefficient to obtain new public opinion factors as the final public opinion factors of the diseases;
the standardization module is used for performing data standardization on the public opinion factors of the diseases after the data cleaning and abnormal value processing to obtain new disease data; and
the prediction module is used for calculating derivative variables of the public opinion factors of the diseases according to the new disease data and predicting the diseases according to the derivative variables.
9. A terminal, characterized in that the terminal comprises a processor and a memory, wherein the processor is used for implementing the public opinion data prediction method according to any one of claims 1 to 7 when executing the computer program stored in the memory.
10. A computer-readable storage medium on which a computer program is stored, the computer program, when being executed by a processor, implementing the public opinion data prediction method according to any one of claims 1 to 7.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810351128.0A CN108647249B (en) 2018-04-18 2018-04-18 Public opinion data prediction method, device, terminal and storage medium
PCT/CN2018/100229 WO2019200786A1 (en) 2018-04-18 2018-08-13 Method for forecasting public sentiment data, device, terminal, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810351128.0A CN108647249B (en) 2018-04-18 2018-04-18 Public opinion data prediction method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN108647249A CN108647249A (en) 2018-10-12
CN108647249B true CN108647249B (en) 2022-08-02

Family

ID=63746630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810351128.0A Active CN108647249B (en) 2018-04-18 2018-04-18 Public opinion data prediction method, device, terminal and storage medium

Country Status (2)

Country Link
CN (1) CN108647249B (en)
WO (1) WO2019200786A1 (en)

Also Published As

Publication number Publication date
CN108647249A (en) 2018-10-12
WO2019200786A1 (en) 2019-10-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant