CN105447081A - A cloud platform-oriented government public opinion monitoring method - Google Patents

A cloud platform-oriented government public opinion monitoring method

Info

Publication number
CN105447081A
Authority
CN
China
Prior art keywords
data
text
analysis
early warning
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201510746977.2A
Other languages
Chinese (zh)
Inventor
侯朋
李勇波
季统凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
G Cloud Technology Co Ltd
Original Assignee
G Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-11-04
Filing date
2015-11-04
Publication date
2016-03-30
Application filed by G Cloud Technology Co Ltd
Priority to CN201510746977.2A
Publication of CN105447081A
Legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to the field of cloud computing, and in particular to a cloud platform-oriented method for monitoring public opinion on government affairs. The method comprises data collection, data preprocessing, and data analysis and early warning. The system runs on a distributed cluster consisting of one crawler server acting as the master node and multiple crawler clients acting as slave nodes; the master node is responsible for task allocation, the slave nodes are responsible for task execution, and the master and slave nodes communicate using encrypted heartbeat packets. Each slave node contains data collection, preprocessing, and analysis and early warning modules. The collection module crawls forums, news sites, Tieba (post bars), blogs and similar sources according to the user configuration and a knowledge base, automatically filters duplicate data, and builds topic databases. The data preprocessing module extracts body text using a hybrid of rule-based and automatic methods. The data analysis and early warning module applies machine learning to the cleaned text for clustering, sentiment analysis and hot-topic analysis, and issues early warnings based on the analysis results. The invention addresses users' needs for online public opinion monitoring and can be applied to monitoring public opinion on government affairs.

Description

A cloud platform-oriented government public opinion monitoring method

Technical field

The invention relates to the field of cloud computing, and in particular to a cloud platform-oriented method for monitoring public opinion on government affairs.

Background

A distributed, real-time intelligent monitoring method based on a cloud database integrates Internet information collection with intelligent information processing. By automatically crawling massive amounts of Internet information and applying automatic classification and clustering, topic detection and topic focusing, it meets users' information needs such as online public opinion monitoring and news topic tracking. It produces analysis results such as briefings, reports and charts, giving customers an analytical basis for fully understanding public sentiment and guiding public opinion appropriately.

Summary of the invention

The technical problem solved by the invention is to provide a cloud platform-oriented method for monitoring public opinion on government affairs.

The technical solution by which the invention solves this problem is as follows:

The method comprises data collection, data preprocessing, and data analysis and early warning. The system runs on a distributed cluster consisting of one crawler server acting as the master node and multiple crawler clients acting as slave nodes; the master node is responsible for task allocation, the slave nodes are responsible for task execution, and the master and slave nodes communicate using encrypted heartbeat packets. Each slave node contains data collection, preprocessing, and analysis and early warning modules. The collection module crawls forums, news sites, Tieba (post bars), blogs and similar sources according to the user configuration and a knowledge base, automatically filters duplicate data, and builds topic databases. The data preprocessing module extracts body text using a hybrid of rule-based and automatic methods. The data analysis and early warning module applies machine learning to the cleaned text for clustering, sentiment analysis and hot-topic analysis, and issues early warnings based on the analysis results.

The communication between the master and slave nodes comprises the following steps:

Step 1: the user starts a collection task;

Step 2: the master node saves the task information to the metadata repository;

Step 3: the master node initializes the task according to the user configuration;

Step 4: the master node allocates tasks according to indicators of the slave nodes such as CPU, memory and current number of tasks;

Step 5: a slave node receives the task;

Step 6: the slave node sends a message to the master node confirming that the task was received;

Step 7: the master node writes the task information to the metadata database;

Step 8: the slave node starts executing the task;

Step 9: if the master node fails to receive a slave node's heartbeat packet N times, it treats that slave node as down, records the event in the log system, and reassigns the task to other nodes.
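The failure-detection rule in Step 9 can be illustrated with a minimal sketch. The class name `HeartbeatMonitor`, the default N of 3, and the policy of handing orphaned tasks to the node with the fewest tasks are assumptions made for illustration; the text does not specify them, and encryption of the heartbeat packets is omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical master-side tracker of missed heartbeats per slave node."""
    max_missed: int = 3                              # the "N" of Step 9 (assumed value)
    missed: dict = field(default_factory=dict)       # node -> consecutive misses
    tasks: dict = field(default_factory=dict)        # node -> list of task ids
    seen: set = field(default_factory=set)           # nodes heard from this interval
    unassigned: list = field(default_factory=list)   # tasks with no surviving node
    log: list = field(default_factory=list)          # stand-in for the log system

    def register(self, node, tasks=()):
        self.missed[node] = 0
        self.tasks[node] = list(tasks)

    def beat(self, node):
        """Record that a heartbeat packet arrived from `node`."""
        self.seen.add(node)

    def tick(self):
        """Run once per heartbeat interval; return the nodes declared down."""
        down = []
        for node in self.missed:
            if node in self.seen:
                self.missed[node] = 0
            else:
                self.missed[node] += 1
                if self.missed[node] >= self.max_missed:
                    down.append(node)
        self.seen.clear()
        for node in down:
            self._fail_over(node)
        return down

    def _fail_over(self, node):
        """Log the failure and move the node's tasks to the least-busy survivor."""
        orphaned = self.tasks.pop(node)
        del self.missed[node]
        self.log.append(f"node {node} down, reassigning {len(orphaned)} task(s)")
        for task in orphaned:
            if not self.tasks:
                self.unassigned.append(task)
                continue
            target = min(self.tasks, key=lambda n: len(self.tasks[n]))
            self.tasks[target].append(task)
```

Counting consecutive misses rather than a single timeout keeps one dropped packet from triggering a reassignment.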

The processing flow of the collection module is as follows:

Step 1: obtain a URL to be collected;

Step 2: filter the URL through the data router;

Step 3: fetch the page data;

Step 4: extract the text and links from the fetched data, and add the extracted links to the set of URLs to be collected;

Step 5: automatically extract text features and generate a web page fingerprint;

Step 6: check whether an identical article already exists;

Step 7: if an identical article exists, abandon the crawl and return to Step 1; otherwise, perform word segmentation on the body text;

Step 8: extract N keywords using the TF-IDF algorithm;

Step 9: find the m articles with the highest degree of overlap with it;

Step 10: if the degree of overlap is greater than c, assign the article to the corresponding topic database;

Step 11: build an inverted index for use by other modules.
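Steps 8 to 10 can be sketched as follows. The text names TF-IDF but does not define the overlap measure or how the m nearest articles determine the topic, so the Jaccard overlap, the smoothed IDF variant, the majority vote, and the function names below are all assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, n):
    """Return the n tokens of doc_tokens with the highest TF-IDF weight.

    corpus is a list of token lists (or sets) for previously stored documents.
    """
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    num_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for other in corpus if term in other)
        idf = math.log((1 + num_docs) / (1 + df)) + 1  # smoothed IDF (assumed variant)
        scores[term] = (count / total) * idf
    ranked = sorted(scores, key=lambda t: (-scores[t], t))  # ties broken alphabetically
    return set(ranked[:n])


def overlap(a, b):
    """Jaccard overlap of two keyword sets (an assumed overlap measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def assign_topic(keywords, topic_db, m, c):
    """Pick a topic from the m most-overlapping stored articles.

    topic_db is a list of (topic, keyword_set) pairs. Only neighbours whose
    overlap exceeds c vote; None means no topic database matches.
    """
    nearest = sorted(topic_db, key=lambda item: overlap(keywords, item[1]),
                     reverse=True)[:m]
    votes = Counter(topic for topic, kws in nearest if overlap(keywords, kws) > c)
    return votes.most_common(1)[0][0] if votes else None
```

In a real deployment the tokens would come from the word segmentation of Step 7; here they are plain lists of strings.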

The processing flow of the data analysis and early warning module is as follows:

Step 1: restructure the topic database and select representative data;

Step 2: perform sentiment analysis on each document and compute a score Tendency ∈ [-1, 1];

Step 3: record the analysis results in the early warning database;

Step 4: calculate the warning level, where degree_i denotes the heat of the i-th document and is computed as:

degree_i = (praise_i × 0.3 + comment_i × 0.7) / (hour_i + 2)

where praise_i is the number of likes, comment_i is the number of comments, and hour_i is the time elapsed from posting to the present;

Step 5: send the corresponding warning, for example by email or SMS, according to the warning policy and warning level.
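The heat formula of Step 4 translates directly into code. The sketch below computes only degree_i; the expression that combines it into a warning level is not given in the text and is not guessed at here. Treating hour_i as a non-negative number of hours is an assumption based on the variable name.

```python
def degree(praise, comment, hours):
    """Heat of one document: comments weigh more than likes, and age lowers the score.

    hours is the time since posting (assumed to be in hours, non-negative).
    """
    if hours < 0:
        raise ValueError("hours since posting must be non-negative")
    return (praise * 0.3 + comment * 0.7) / (hours + 2)


# The same engagement scores lower as the post ages.
fresh = degree(praise=10, comment=20, hours=1)
stale = degree(praise=10, comment=20, hours=10)
print(fresh > stale)
```

The constant 2 in the denominator keeps the score finite for a post that has just been published.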

The steps of automatic text feature extraction and web page fingerprint generation are:

Step 1: extract the keywords of the first sentence of each paragraph of the body text (with stop words removed) as the main feature of the article;

Step 2: extract the punctuation marks of each paragraph of the body text as the secondary feature;

Step 3: apply SimHash to the main feature and the secondary feature separately, then concatenate the two codes to obtain the fingerprint of the whole article;

Step 4: store the fingerprint in the cache database.
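A minimal sketch of Step 3 is given below, using an unweighted SimHash. The 64-bit width per feature group, the use of SHA-256 as the underlying hash, the representation of the punctuation feature as one string per paragraph, and the Hamming-distance comparison are assumptions; the text only states that the two SimHash codes are concatenated.

```python
import hashlib


def simhash(features, bits=64):
    """Unweighted SimHash: every feature votes +1 or -1 on each bit position.

    bits must be a multiple of 8 and at most 256 (the SHA-256 digest size).
    """
    weights = [0] * bits
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def page_fingerprint(main_features, punctuation_features, bits=64):
    """Concatenate the main-feature code (high bits) and secondary code (low bits)."""
    return (simhash(main_features, bits) << bits) | simhash(punctuation_features, bits)


def hamming(a, b):
    """Number of differing bits; small distances suggest near-duplicate pages."""
    return bin(a ^ b).count("1")


# Hypothetical inputs: first-sentence keywords and per-paragraph punctuation.
fp1 = page_fingerprint(["flood", "warning", "city"], [",。", ",,。"])
fp2 = page_fingerprint(["flood", "warning", "city"], [",。", ",,。"])
print(hamming(fp1, fp2))
```

Because similar inputs yield codes that differ in few bits, the duplicate check of the collection flow could compare fingerprints by Hamming distance instead of comparing full texts.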

The invention uses a distributed, multi-threaded design to increase crawling speed and improve the timeliness of news. URL deduplication and text-similarity-based duplicate detection save disk space and further increase crawling speed. The web page fingerprint algorithm improves both the speed and the accuracy of duplicate page detection.

Description of drawings

The invention is further described below with reference to the accompanying drawings:

Figure 1 is a diagram of the application framework of the invention;

Figure 2 is a diagram of the master-slave node architecture;

Figure 3 is a flow chart of data capture;

Figure 4 is a flow chart of data analysis.

Detailed description

As shown in Figures 1 to 4, the method of the invention comprises data collection, data preprocessing, and data analysis and early warning. The system runs on a distributed cluster consisting of one crawler server acting as the master node and multiple crawler clients acting as slave nodes; the master node is responsible for task allocation, the slave nodes are responsible for task execution, and the master and slave nodes communicate using encrypted heartbeat packets. Each slave node contains data collection, preprocessing, and analysis and early warning modules. The collection module crawls forums, news sites, Tieba (post bars), blogs and similar sources according to the user configuration and a knowledge base, automatically filters duplicate data, and builds topic databases. The data preprocessing module extracts body text using a hybrid of rule-based and automatic methods. The data analysis and early warning module applies machine learning to the cleaned text for clustering, sentiment analysis and hot-topic analysis, and issues early warnings based on the analysis results.

As shown in Figure 2, the system consists of one master node and multiple slave nodes. The master node is responsible for task allocation, the slave nodes are responsible for task execution, and the master and slave nodes communicate using encrypted heartbeat packets, in the following steps:

Step 1: the user starts a collection task;

Step 2: the master node saves the task information to the metadata repository;

Step 3: the master node initializes the task according to the user configuration;

Step 4: the master node allocates tasks according to indicators of the slave nodes such as CPU, memory and current number of tasks;

Step 5: a slave node receives the task;

Step 6: the slave node sends a message to the master node confirming that the task was received;

Step 7: the master node writes the task information to the metadata database;

Step 8: the slave node starts executing the task;

Step 9: if the master node fails to receive a slave node's heartbeat packet N times, it treats that slave node as down, records the event in the log system, and reassigns the task to other nodes.
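Step 4 names the indicators used for allocation (CPU, memory, current number of tasks) but not how they are combined. One possible reading is a weighted load score, with the task going to the node that has the lowest score. The weights, the `max_tasks` cap and the names `NodeStatus`, `load_score` and `pick_node` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """Load report from one slave node (field names are assumptions)."""
    name: str
    cpu: float      # CPU utilisation, 0.0 to 1.0
    memory: float   # memory utilisation, 0.0 to 1.0
    tasks: int      # number of tasks currently running


def load_score(node, max_tasks=10, weights=(0.4, 0.3, 0.3)):
    """Combine the three indicators into one score; lower means more spare capacity."""
    w_cpu, w_mem, w_task = weights
    task_load = min(node.tasks / max_tasks, 1.0)
    return w_cpu * node.cpu + w_mem * node.memory + w_task * task_load


def pick_node(nodes):
    """Return the slave node with the lowest load score."""
    if not nodes:
        raise ValueError("no slave nodes available")
    return min(nodes, key=load_score)


cluster = [
    NodeStatus("slave-1", cpu=0.9, memory=0.5, tasks=2),
    NodeStatus("slave-2", cpu=0.2, memory=0.3, tasks=5),
    NodeStatus("slave-3", cpu=0.2, memory=0.3, tasks=1),
]
print(pick_node(cluster).name)
```

The status values would in practice arrive with the heartbeat packets, so the master always allocates against recent load figures.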

As shown in Figure 3, the collection module crawls forums, news sites, Tieba (post bars), blogs and similar sources according to the user configuration and a knowledge base, filters duplicate data, and builds topic databases, in the following steps:

Step 1: obtain a URL to be collected;

Step 2: filter the URL through the data router;

Step 3: fetch the page data;

Step 4: extract the text and links from the fetched data, and add the extracted links to the set of URLs to be collected;

Step 5: automatically extract text features and generate a web page fingerprint;

Step 6: check whether an identical article already exists;

Step 7: if an identical article exists, abandon the crawl and return to Step 1; otherwise, perform word segmentation on the body text;

Step 8: extract N keywords using the TF-IDF algorithm;

Step 9: find the m articles with the highest degree of overlap with it;

Step 10: if the degree of overlap is greater than c, assign the article to the corresponding topic database;

Step 11: build an inverted index for use by other modules.
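The inverted index of Step 11 maps each term to the documents that contain it, so that other modules can look up articles by keyword without scanning the whole topic database. A minimal in-memory sketch follows; the function names and the AND-query helper are illustrative assumptions, and a production system would more likely use a dedicated search engine.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build term -> sorted list of doc ids from a mapping doc_id -> token list."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, *terms):
    """Return the ids of documents that contain every given term."""
    if not terms:
        return []
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))


docs = {
    1: ["flood", "city", "road"],
    2: ["tax", "policy", "city"],
    3: ["flood", "rain", "city"],
}
index = build_inverted_index(docs)
print(search(index, "flood", "city"))
```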

As shown in Figure 4, the data analysis module applies machine learning to the cleaned text for clustering, sentiment analysis and hot-topic analysis, and issues early warnings based on the analysis results, in the following steps:

Step 1: restructure the topic database and select representative data;

Step 2: perform sentiment analysis on each document and compute a score Tendency ∈ [-1, 1];

Step 3: record the analysis results in the early warning database;

Step 4: calculate the warning level, where degree_i denotes the heat of the i-th document and is computed as:

degree_i = (praise_i × 0.3 + comment_i × 0.7) / (hour_i + 2)

where praise_i is the number of likes, comment_i is the number of comments, and hour_i is the time elapsed from posting to the present;

Step 5: send the corresponding warning, for example by email or SMS, according to the warning policy and warning level.

As shown in Figure 1, the information obtained with the method of the invention can be displayed on a web front end.

Claims (7)

1. A real-time public opinion monitoring method for government affairs, characterized in that: the method comprises data collection, data preprocessing, and data analysis and early warning; the system runs on a distributed cluster consisting of one crawler server acting as the master node and multiple crawler clients acting as slave nodes, the master node being responsible for task allocation and the slave nodes for task execution, with the master and slave nodes communicating using encrypted heartbeat packets; each slave node contains data collection, preprocessing, and analysis and early warning modules; the collection module crawls forums, news sites, Tieba (post bars), blogs and similar sources according to the user configuration and a knowledge base, automatically filters duplicate data, and builds topic databases; the data preprocessing module extracts body text using a hybrid of rule-based and automatic methods; the data analysis and early warning module applies machine learning to the cleaned text for clustering, sentiment analysis and hot-topic analysis, and issues early warnings based on the analysis results.

2. The real-time public opinion monitoring method for government affairs according to claim 1, characterized in that the communication between the master and slave nodes comprises the following steps:
Step 1: the user starts a collection task;
Step 2: the master node saves the task information to the metadata repository;
Step 3: the master node initializes the task according to the user configuration;
Step 4: the master node allocates tasks according to indicators of the slave nodes such as CPU, memory and current number of tasks;
Step 5: a slave node receives the task;
Step 6: the slave node sends a message to the master node confirming that the task was received;
Step 7: the master node writes the task information to the metadata database;
Step 8: the slave node starts executing the task;
Step 9: if the master node fails to receive a slave node's heartbeat packet N times, it treats that slave node as down, records the event in the log system, and reassigns the task to other nodes.

3. The real-time public opinion monitoring method for government affairs according to claim 1, characterized in that the processing flow of the collection module is:
Step 1: obtain a URL to be collected;
Step 2: filter the URL through the data router;
Step 3: fetch the page data;
Step 4: extract the text and links from the fetched data, and add the extracted links to the set of URLs to be collected;
Step 5: automatically extract text features and generate a web page fingerprint;
Step 6: check whether an identical article already exists;
Step 7: if an identical article exists, abandon the crawl and return to Step 1; otherwise, perform word segmentation on the body text;
Step 8: extract N keywords using the TF-IDF algorithm;
Step 9: find the m articles with the highest degree of overlap with it;
Step 10: if the degree of overlap is greater than c, assign the article to the corresponding topic database;
Step 11: build an inverted index for use by other modules.

4. The real-time public opinion monitoring method for government affairs according to claim 2, characterized in that the processing flow of the collection module is:
Step 1: obtain a URL to be collected;
Step 2: filter the URL through the data router;
Step 3: fetch the page data;
Step 4: extract the text and links from the fetched data, and add the extracted links to the set of URLs to be collected;
Step 5: automatically extract text features and generate a web page fingerprint;
Step 6: check whether an identical article already exists;
Step 7: if an identical article exists, abandon the crawl and return to Step 1; otherwise, perform word segmentation on the body text;
Step 8: extract N keywords using the TF-IDF algorithm;
Step 9: find the m articles with the highest degree of overlap with it;
Step 10: if the degree of overlap is greater than c, assign the article to the corresponding topic database;
Step 11: build an inverted index for use by other modules.

5. The real-time public opinion monitoring method for government affairs according to any one of claims 1 to 4, characterized in that the processing flow of the data analysis and early warning module is:
Step 1: restructure the topic database and select representative data;
Step 2: perform sentiment analysis on each document and compute a score Tendency ∈ [-1, 1];
Step 3: record the analysis results in the early warning database;
Step 4: calculate the warning level, where degree_i denotes the heat of the i-th document and is computed as:
degree_i = (praise_i × 0.3 + comment_i × 0.7) / (hour_i + 2)
where praise_i is the number of likes, comment_i is the number of comments, and hour_i is the time elapsed from posting to the present;
Step 5: send the corresponding warning, for example by email or SMS, according to the warning policy and warning level.

6. The real-time public opinion monitoring method for government affairs according to claim 3 or 4, characterized in that the steps of automatic text feature extraction and web page fingerprint generation are:
Step 1: extract the keywords of the first sentence of each paragraph of the body text (with stop words removed) as the main feature of the article;
Step 2: extract the punctuation marks of each paragraph of the body text as the secondary feature;
Step 3: apply SimHash to the main feature and the secondary feature separately, then concatenate the two codes to obtain the fingerprint of the whole article;
Step 4: store the fingerprint in the cache database.

7. The real-time public opinion monitoring method for government affairs according to claim 5, characterized in that the steps of automatic text feature extraction and web page fingerprint generation are:
Step 1: extract the keywords of the first sentence of each paragraph of the body text (with stop words removed) as the main feature of the article;
Step 2: extract the punctuation marks of each paragraph of the body text as the secondary feature;
Step 3: apply SimHash to the main feature and the secondary feature separately, then concatenate the two codes to obtain the fingerprint of the whole article;
Step 4: store the fingerprint in the cache database.
CN201510746977.2A 2015-11-04 2015-11-04 A cloud platform-oriented government public opinion monitoring method Withdrawn CN105447081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510746977.2A CN105447081A (en) 2015-11-04 2015-11-04 A cloud platform-oriented government public opinion monitoring method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510746977.2A CN105447081A (en) 2015-11-04 2015-11-04 A cloud platform-oriented government public opinion monitoring method

Publications (1)

Publication Number Publication Date
CN105447081A true CN105447081A (en) 2016-03-30

Family

ID=55557259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510746977.2A Withdrawn CN105447081A (en) 2015-11-04 2015-11-04 A cloud platform-oriented government public opinion monitoring method

Country Status (1)

Country Link
CN (1) CN105447081A (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106302455A (en) * 2016-08-16 2017-01-04 成都鼎昊科技有限公司 A kind of network safety protection method
CN106934014A (en) * 2017-03-10 2017-07-07 山东省科学院情报研究所 A kind of network data excavation based on Hadoop and analysis platform and its method
CN107169143A (en) * 2017-06-15 2017-09-15 易联众信息技术股份有限公司 A kind of efficient magnanimity public sentiment data message trunking matching process
CN107580036A (en) * 2017-08-28 2018-01-12 成都融微软件服务有限公司 The method of the adaptive single-point acquiring of industry information service
CN107800789A (en) * 2017-10-24 2018-03-13 麦格创科技(深圳)有限公司 The distribution method and system of task manager in distributed reptile system
CN107818130A (en) * 2017-09-15 2018-03-20 深圳市电陶思创科技有限公司 The method for building up and system of a kind of search engine
CN108021582A (en) * 2016-11-04 2018-05-11 中国移动通信集团湖南有限公司 Internet public feelings monitoring method and device
CN109739849A (en) * 2019-01-02 2019-05-10 山东省科学院情报研究所 A kind of network sensitive information of data-driven excavates and early warning platform
CN110046132A (en) * 2019-04-15 2019-07-23 苏州浪潮智能科技有限公司 A kind of metadata request processing method, device, equipment and readable storage medium storing program for executing
CN110413863A (en) * 2019-08-01 2019-11-05 信雅达系统工程股份有限公司 A kind of public sentiment news duplicate removal and method for pushing based on deep learning
CN110533212A (en) * 2019-07-04 2019-12-03 西安理工大学 Urban waterlogging public sentiment monitoring and pre-alarming method based on big data
CN110781236A (en) * 2019-10-29 2020-02-11 山西云时代技术有限公司 Method for constructing government affair big data management system
CN111428176A (en) * 2020-03-04 2020-07-17 北京明略软件系统有限公司 User behavior acquisition method and device
CN112100474A (en) * 2020-11-02 2020-12-18 成都智元汇信息技术股份有限公司 Passenger service quality public opinion supervision system and method
CN113609424A (en) * 2021-06-22 2021-11-05 深圳市网联安瑞网络科技有限公司 Computing and early warning system and method for network public sentiment popularity
CN115936404A (en) * 2023-01-09 2023-04-07 中银金融科技有限公司 Data processing method and system
CN116862455A (en) * 2023-09-01 2023-10-10 中国标准化研究院 Multi-mode-based government service complaint early warning method and device
CN116861058A (en) * 2023-09-04 2023-10-10 浪潮软件股份有限公司 Public opinion monitoring system and method applied to government affair field

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101499098A (en) * 2009-03-04 2009-08-05 阿里巴巴集团控股有限公司 Web page assessed value confirming and employing method and system
CN102194001A (en) * 2011-05-17 2011-09-21 杭州电子科技大学 Internet public opinion crisis early-warning method
CN104899324A (en) * 2015-06-19 2015-09-09 成都国腾实业集团有限公司 Sample training system based on IDC (internet data center) harmful information monitoring system
CN104951539A (en) * 2015-06-19 2015-09-30 成都艾尔普科技有限责任公司 Internet data center harmful information monitoring system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101499098A (en) * 2009-03-04 2009-08-05 阿里巴巴集团控股有限公司 Web page assessed value confirming and employing method and system
US20100228718A1 (en) * 2009-03-04 2010-09-09 Alibaba Group Holding Limited Evaluation of web pages
CN102194001A (en) * 2011-05-17 2011-09-21 杭州电子科技大学 Internet public opinion crisis early-warning method
CN104899324A (en) * 2015-06-19 2015-09-09 成都国腾实业集团有限公司 Sample training system based on IDC (internet data center) harmful information monitoring system
CN104951539A (en) * 2015-06-19 2015-09-30 成都艾尔普科技有限责任公司 Internet data center harmful information monitoring system

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106302455A (en) * 2016-08-16 2017-01-04 成都鼎昊科技有限公司 A kind of network safety protection method
CN108021582A (en) * 2016-11-04 2018-05-11 中国移动通信集团湖南有限公司 Internet public feelings monitoring method and device
CN108021582B (en) * 2016-11-04 2020-12-04 中国移动通信集团湖南有限公司 Internet public opinion monitoring method and device
CN106934014A (en) * 2017-03-10 2017-07-07 山东省科学院情报研究所 Hadoop-based network data mining and analyzing platform and method thereof
CN106934014B (en) * 2017-03-10 2021-03-19 山东省科学院情报研究所 Hadoop-based network data mining and analyzing platform and method thereof
CN107169143B (en) * 2017-06-15 2020-06-16 易联众信息技术股份有限公司 Efficient mass public opinion data information cluster matching method
CN107169143A (en) * 2017-06-15 2017-09-15 易联众信息技术股份有限公司 Efficient mass public opinion data information cluster matching method
CN107580036A (en) * 2017-08-28 2018-01-12 成都融微软件服务有限公司 Adaptive single-point acquisition method for industry information services
CN107818130A (en) * 2017-09-15 2018-03-20 深圳市电陶思创科技有限公司 Method and system for building a search engine
CN107800789A (en) * 2017-10-24 2018-03-13 麦格创科技(深圳)有限公司 Distribution method and system for task managers in a distributed crawler system
CN109739849A (en) * 2019-01-02 2019-05-10 山东省科学院情报研究所 Data-driven network sensitive information mining and early warning platform
CN109739849B (en) * 2019-01-02 2021-06-29 山东省科学院情报研究所 Data-driven network sensitive information mining and early warning platform
CN110046132A (en) * 2019-04-15 2019-07-23 苏州浪潮智能科技有限公司 Metadata request processing method, apparatus, device and readable storage medium
CN110046132B (en) * 2019-04-15 2022-04-22 苏州浪潮智能科技有限公司 A metadata request processing method, apparatus, device and readable storage medium
CN110533212A (en) * 2019-07-04 2019-12-03 西安理工大学 Urban waterlogging public opinion monitoring and early warning method based on big data
CN110413863A (en) * 2019-08-01 2019-11-05 信雅达系统工程股份有限公司 Deep learning-based public opinion news deduplication and push method
CN110781236A (en) * 2019-10-29 2020-02-11 山西云时代技术有限公司 Method for constructing government affair big data management system
CN111428176A (en) * 2020-03-04 2020-07-17 北京明略软件系统有限公司 User behavior acquisition method and device
CN112100474A (en) * 2020-11-02 2020-12-18 成都智元汇信息技术股份有限公司 Passenger service quality public opinion supervision system and method
CN113609424A (en) * 2021-06-22 2021-11-05 深圳市网联安瑞网络科技有限公司 Calculation and early warning system and method for internet public opinion heat
CN113609424B (en) * 2021-06-22 2024-06-11 深圳市网联安瑞网络科技有限公司 Calculation and early warning system and method for internet public opinion heat
CN115936404A (en) * 2023-01-09 2023-04-07 中银金融科技有限公司 Data processing method and system
CN116862455A (en) * 2023-09-01 2023-10-10 中国标准化研究院 Multi-mode-based government service complaint early warning method and device
CN116861058A (en) * 2023-09-04 2023-10-10 浪潮软件股份有限公司 Public opinion monitoring system and method applied to government affair field
CN116861058B (en) * 2023-09-04 2024-04-12 浪潮软件股份有限公司 Public opinion monitoring system and method applied to government affairs

Similar Documents

Publication Publication Date Title
CN105447081A (en) A cloud platform-oriented government public opinion monitoring method
Papadopoulou et al. A corpus of debunked and verified user-generated videos
CN107341270B (en) A social platform-oriented user sentiment influence analysis method
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN104537097B (en) Microblogging public sentiment monitoring system
Kou et al. Social network search based on semantic analysis and learning
Chen et al. Adversarial-enhanced hybrid graph network for user identity linkage
Rafea et al. Topic detection approaches in identifying topics and events from Arabic corpora
CN104268230A (en) Method for detecting objective points of Chinese micro-blogs based on heterogeneous graph random walk
Bykau et al. Fine-grained controversy detection in Wikipedia
CN108305180A (en) A kind of friend recommendation method and device
CN103150335A (en) Co-clustering-based coal mine public sentiment monitoring system
Xu et al. Social media mining and social network analysis: Emerging research
CN108416034B (en) Information acquisition system based on financial heterogeneous big data and control method thereof
CN106156117B (en) Hidden community's core communication circle detection towards particular topic finds method and system
CN110889632A (en) Data monitoring and analyzing system of company image improving system
CN104268648A (en) User ranking system integrating multiple interactive information of users and user thematic information
Liu et al. Event detection and evolution based on knowledge base
Chen et al. Finding keywords in blogs: Efficient keyword extraction in blog mining via user behaviors
Wang et al. Topic discovery method based on topic model combined with hierarchical clustering
CN104199947A (en) Important person speech supervision and incidence relation excavating method
Elsharkawy et al. Towards feature selection for cascade growth prediction on twitter
CN108830735B (en) Online interpersonal relationship analysis method and system
CN106570167A (en) Knowledge-integrated subject model-based microblog topic detection method
Zhao et al. Collecting, managing and analyzing social networking data effectively

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20160330