CN105447081A

CN105447081A - A cloud platform-oriented government public opinion monitoring method

Info

Publication number: CN105447081A
Application number: CN201510746977.2A
Authority: CN
Inventors: 侯朋; 李勇波; 季统凯
Original assignee: G Cloud Technology Co Ltd
Current assignee: G Cloud Technology Co Ltd
Priority date: 2015-11-04
Filing date: 2015-11-04
Publication date: 2016-03-30

Abstract

The present invention relates to the field of cloud computing, and in particular to a government affairs public opinion monitoring method for a cloud platform. The method comprises data collection, data preprocessing, data analysis and early warning. The system runs on a distributed cluster composed of one crawler server acting as the master node and multiple crawler clients acting as slave nodes. The master node is responsible for task allocation and the slave nodes are responsible for task execution; master and slave nodes communicate using encrypted heartbeat packets. Each slave node includes data collection, preprocessing, and analysis and early warning modules. The collection module crawls forums, news sites, Tieba (post bars), blogs and other sources according to the user configuration and a knowledge base, automatically filters duplicate data, and builds a topic database. The data preprocessing module extracts body text using a hybrid of rule-based and automatic extraction. The data analysis and early warning module applies machine learning methods to the cleaned text for clustering, sentiment analysis and hot spot analysis, and issues early warnings based on the analysis results. The invention addresses users' need for online public opinion monitoring and can be used for government affairs public opinion monitoring.

Description

A cloud platform-oriented government public opinion monitoring method

Technical field

The invention relates to the technical field of cloud computing, and in particular to a government affairs public opinion monitoring method for a cloud platform.

Background

A distributed, real-time intelligent monitoring method based on a cloud database integrates Internet information collection technology with intelligent information processing technology. Through automatic crawling of massive Internet information, automatic classification and clustering, topic detection and topic focusing, it meets users' information needs such as online public opinion monitoring and news topic tracking, and produces analysis results such as briefings, reports and charts. These give customers an analytical basis for fully grasping trends in public sentiment and for guiding public opinion correctly.

Summary of the invention

The technical problem solved by the present invention is to provide a government affairs public opinion monitoring method for a cloud platform.

The technical solution by which the present invention solves the above technical problem is as follows:

The method comprises data collection, data preprocessing, data analysis and early warning. The system runs on a distributed cluster composed of one crawler server acting as the master node and multiple crawler clients acting as slave nodes. The master node is responsible for task allocation and the slave nodes are responsible for task execution; master and slave nodes communicate using encrypted heartbeat packets. Each slave node includes data collection, preprocessing, and analysis and early warning modules. The collection module crawls forums, news sites, Tieba (post bars), blogs and other sources according to the user configuration and a knowledge base, automatically filters duplicate data, and builds a topic database. The data preprocessing module extracts body text using a hybrid of rule-based and automatic extraction. The data analysis and early warning module applies machine learning methods to the cleaned text for clustering, sentiment analysis and hot spot analysis, and issues early warnings based on the analysis results.

The communication between the master and slave nodes comprises the following steps:

Step 1: the user starts a collection task;

Step 2: the master node saves the task information to the metadata repository;

Step 3: the master node initializes the task according to the user configuration;

Step 4: the master node allocates tasks according to each slave node's CPU usage, memory usage, current number of tasks and other indicators;

Step 5: the slave node receives the task;

Step 6: the slave node sends the master node a message confirming that the task was received;

Step 7: the master node writes the task information to the metadata database;

Step 8: the slave node starts executing the task;

Step 9: if the master node fails to receive a heartbeat packet from a slave node N times, the slave node is deemed to be down, the event is recorded in the log system, and its tasks are reassigned to other nodes.
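The allocation and failure-handling steps above (steps 4 and 9) can be sketched as follows. This is a minimal, single-process illustration and not the patented implementation: the `SlaveNode` and `Master` classes, the weights in `load_score`, and the `tick` method that stands for one heartbeat interval passing without a packet are all assumptions made for the example. Encryption of the heartbeat packets and the metadata store are left out.

```python
from dataclasses import dataclass, field


@dataclass
class SlaveNode:
    """Load and liveness state the master keeps for one crawler client."""
    name: str
    cpu: float      # CPU utilisation, 0.0 to 1.0
    memory: float   # memory utilisation, 0.0 to 1.0
    tasks: list = field(default_factory=list)
    missed_heartbeats: int = 0
    alive: bool = True


def load_score(node: SlaveNode) -> float:
    # Hypothetical weighting: the source names the indicators, not the weights.
    return 0.4 * node.cpu + 0.4 * node.memory + 0.2 * len(node.tasks)


class Master:
    def __init__(self, nodes, max_missed=3):
        self.nodes = {n.name: n for n in nodes}
        self.max_missed = max_missed  # the "N" of step 9
        self.log = []                 # stands in for the log system

    def assign(self, task):
        """Step 4: give the task to the least-loaded live slave node."""
        candidates = [n for n in self.nodes.values() if n.alive]
        if not candidates:
            raise RuntimeError("no live slave nodes")
        target = min(candidates, key=load_score)
        target.tasks.append(task)
        return target.name

    def heartbeat(self, name):
        """A heartbeat packet arrived: reset the miss counter."""
        self.nodes[name].missed_heartbeats = 0

    def tick(self, name):
        """One heartbeat interval elapsed with no packet from `name`."""
        node = self.nodes[name]
        if not node.alive:
            return
        node.missed_heartbeats += 1
        if node.missed_heartbeats >= self.max_missed:
            # Step 9: mark the node down, log it, reassign its tasks.
            node.alive = False
            self.log.append(f"{name} down")
            orphaned, node.tasks = node.tasks, []
            for task in orphaned:
                self.assign(task)
```

In a real deployment the miss counter would be driven by a timer on the master, and the load figures would be refreshed from the heartbeat payload; both are outside the scope of this sketch.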

The processing flow of the collection module is as follows:

Step 1: obtain a URL to be collected;

Step 2: filter the URL through the data router;

Step 3: crawl the page data;

Step 4: perform text extraction and link extraction on the crawled data, and add the extracted links to the set of URLs to be collected;

Step 5: extract text features automatically and generate the web page fingerprint;

Step 6: check whether an identical article already exists;

Step 7: if an identical article already exists, discard the page and return to step 1; otherwise, perform word segmentation on the body text;

Step 8: extract N keywords using the TF-IDF algorithm;

Step 9: find the m articles with the highest degree of overlap with the current article;

Step 10: if the degree of overlap is greater than c, assign the article to the corresponding topic database;

Step 11: build an inverted index for use by other modules.
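Steps 8 to 11 above can be illustrated with the short sketch below. It assumes the body text has already been segmented into tokens (step 7). Several choices are assumptions, because the source does not define them: the smoothed IDF, Jaccard similarity as the "degree of overlap", and a majority vote among the m closest articles to choose the topic. The function names and the shape of `topic_db` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_keywords(doc_tokens, corpus, n):
    """Step 8: return the n tokens of doc_tokens with the highest TF-IDF."""
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for other in corpus if term in other)
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1  # smoothed IDF
        scores[term] = (count / total) * idf
    # Sort by descending score, then alphabetically for a stable result.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:n])


def overlap(a, b):
    """Jaccard overlap of two keyword sets (an assumed overlap measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(keywords, topic_db, m, c):
    """Steps 9 and 10: vote among the m closest articles whose overlap exceeds c."""
    scored = sorted(
        ((overlap(keywords, art["keywords"]), art["topic"]) for art in topic_db),
        key=lambda pair: pair[0],
        reverse=True,
    )[:m]
    votes = Counter(topic for score, topic in scored if score > c)
    return votes.most_common(1)[0][0] if votes else None


def build_inverted_index(docs):
    """Step 11: map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index
```

For Chinese text, the tokens would come from a word segmenter run in step 7; the sketch is agnostic about which one is used.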

The processing flow of the data analysis and early warning module is as follows:

Step 1: restructure the topic database and select representative data;

Step 2: perform sentiment analysis on each document and compute a score Tendency ∈ [-1, 1];

Step 3: record the above analysis results in the early warning database;

Step 4: calculate the warning level, where degree_i denotes the popularity of the i-th document and is computed as:

degree_i = (praise_i × 0.3 + comment_i × 0.7) / (hour_i + 2)

where praise_i is the number of likes, comment_i is the number of comments, and hour_i is the time elapsed, in hours, between posting and the present;

Step 5: send the corresponding warning, for example by email or SMS, according to the early warning policy and the warning level.
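As a worked example of the popularity formula above, a post with 100 likes and 50 comments published 3 hours ago has degree = (100 × 0.3 + 50 × 0.7) / (3 + 2) = 65 / 5 = 13. The sketch below computes the same quantity and ranks documents by it. The function name and the dictionary fields are illustrative, and the mapping from popularity and sentiment to a warning level is not shown because the source does not reproduce it.

```python
def degree(praise: int, comment: int, hours: float) -> float:
    """Popularity of one document: weighted engagement, decayed by age."""
    return (praise * 0.3 + comment * 0.7) / (hours + 2)


def rank_by_degree(docs):
    """Return documents sorted from hottest to coldest.

    Each document is a dict with hypothetical keys
    'praise', 'comment' and 'hours'.
    """
    return sorted(
        docs,
        key=lambda d: degree(d["praise"], d["comment"], d["hours"]),
        reverse=True,
    )
```

Comments carry more than twice the weight of likes, and the `+ 2` in the denominator keeps very recent posts from receiving an unbounded score.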

The automatic text feature extraction that generates the web page fingerprint comprises the following steps:

Step 1: extract the keywords of the first sentence of each paragraph of the body text (with stop words removed) as the main feature of the article;

Step 2: extract the punctuation marks of each paragraph of the body text as the secondary feature;

Step 3: apply SimHash to the main feature and to the secondary feature separately, then concatenate the two feature codes to obtain the fingerprint of the whole article;

Step 4: store the fingerprint in the cache database.
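The fingerprinting steps above can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the 64-bit width, the use of MD5 as the per-feature hash, the tiny English stop-word list, splitting paragraphs on newlines, and the sentence-boundary regular expression are all choices made for the example. Storing the result in the cache database (step 4) is omitted.

```python
import hashlib
import re

STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}  # illustrative only
BITS = 64


def simhash(features, bits=BITS):
    """Classic SimHash: each feature hash casts a +1/-1 vote per bit."""
    votes = [0] * bits
    for feat in features:
        digest = hashlib.md5(feat.encode("utf-8")).digest()[: bits // 8]
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def fingerprint(text):
    """Concatenate SimHash(main feature) and SimHash(secondary feature)."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    main, secondary = [], []
    for p in paragraphs:
        # Step 1: keywords of the first sentence, stop words removed.
        first_sentence = re.split(r"[.!?。！？]", p, maxsplit=1)[0]
        words = re.findall(r"\w+", first_sentence.lower())
        main += [w for w in words if w not in STOP_WORDS]
        # Step 2: the paragraph's punctuation sequence as one feature.
        secondary.append("".join(re.findall(r"[^\w\s]", p)))
    # Step 3: high 64 bits from the main feature, low 64 from the secondary.
    return (simhash(main) << BITS) | simhash(secondary)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two pages whose paragraphs open with the same sentences and share the same punctuation pattern receive the same fingerprint even if later sentences differ. Comparing fingerprints by Hamming distance against a small threshold is a common way to catch near-duplicates, though the source does not state which comparison it uses.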

The present invention uses distributed, multi-threaded crawling to increase crawling speed and improve the timeliness of news. URL deduplication and a text similarity algorithm for detecting duplicate text save disk space and further increase crawling speed. The web page fingerprint algorithm improves both the speed and the accuracy of duplicate page detection.

Description of drawings

The present invention is further described below with reference to the accompanying drawings:

Fig. 1 is a diagram of the application framework of the present invention;

Fig. 2 is a diagram of the master-slave node architecture;

Fig. 3 is a flow chart of data crawling;

Fig. 4 is a flow chart of data analysis.

Detailed description

As shown in Figs. 1 to 4, the method of the present invention comprises data collection, data preprocessing, data analysis and early warning. The system runs on a distributed cluster composed of one crawler server acting as the master node and multiple crawler clients acting as slave nodes. The master node is responsible for task allocation and the slave nodes are responsible for task execution; master and slave nodes communicate using encrypted heartbeat packets. Each slave node includes data collection, preprocessing, and analysis and early warning modules. The collection module crawls forums, news sites, Tieba (post bars), blogs and other sources according to the user configuration and a knowledge base, automatically filters duplicate data, and builds a topic database. The data preprocessing module extracts body text using a hybrid of rule-based and automatic extraction. The data analysis and early warning module applies machine learning methods to the cleaned text for clustering, sentiment analysis and hot spot analysis, and issues early warnings based on the analysis results.

As shown in Fig. 2, the system is composed of one master node and multiple slave nodes. The master node is responsible for task allocation and the slave nodes are responsible for task execution; master and slave nodes communicate using encrypted heartbeat packets, including the following steps:

As shown in Fig. 3, the collection module crawls forums, news sites, Tieba (post bars), blogs and other sources according to the user configuration and a knowledge base, filters duplicate data, and builds a topic database, including the following flow:

Step 3: crawl the page data;

As shown in Fig. 4, the data analysis module applies machine learning methods to the cleaned text for clustering, sentiment analysis and hot spot analysis, and issues early warnings based on the analysis results, including the following steps:

As shown in Fig. 1, the information obtained by the method of the present invention can be displayed in a web front end.

Claims

1. A real-time public opinion monitoring method for government affairs, characterized in that: the method comprises data collection, data preprocessing, data analysis and early warning; the system runs on a distributed cluster composed of one crawler server acting as the master node and multiple crawler clients acting as slave nodes; the master node is responsible for task allocation and the slave nodes are responsible for task execution; the master and slave nodes communicate using encrypted heartbeat packets; each slave node includes data collection, preprocessing, and analysis and early warning modules; the collection module crawls forums, news sites, Tieba (post bars), blogs and other sources according to the user configuration and a knowledge base, automatically filters duplicate data, and builds a topic database; the data preprocessing module extracts body text using a hybrid of rule-based and automatic extraction; the data analysis and early warning module applies machine learning methods to the cleaned text for clustering, sentiment analysis and hot spot analysis, and issues early warnings based on the analysis results.

2. The real-time public opinion monitoring method for government affairs according to claim 1, characterized in that the communication between the master and slave nodes comprises the following steps:

Step 1: the user starts a collection task;

Step 2: the master node saves the task information to the metadata repository;

Step 3: the master node initializes the task according to the user configuration;

Step 4: the master node allocates tasks according to each slave node's CPU usage, memory usage, current number of tasks and other indicators;

Step 5: the slave node receives the task;

Step 6: the slave node sends the master node a message confirming that the task was received;

Step 7: the master node writes the task information to the metadata database;

Step 8: the slave node starts executing the task;

Step 9: if the master node fails to receive a heartbeat packet from a slave node N times, the slave node is deemed to be down, the event is recorded in the log system, and its tasks are reassigned to other nodes.

3. The real-time public opinion monitoring method for government affairs according to claim 1, characterized in that the processing flow of the collection module is as follows:

Step 1: obtain a URL to be collected;

Step 2: filter the URL through the data router;

Step 3: crawl the page data;

Step 4: perform text extraction and link extraction on the crawled data, and add the extracted links to the set of URLs to be collected;

Step 5: extract text features automatically and generate the web page fingerprint;

Step 6: check whether an identical article already exists;

Step 7: if an identical article already exists, discard the page and return to step 1; otherwise, perform word segmentation on the body text;

Step 8: extract N keywords using the TF-IDF algorithm;

Step 9: find the m articles with the highest degree of overlap with the current article;

Step 10: if the degree of overlap is greater than c, assign the article to the corresponding topic database;

Step 11: build an inverted index for use by other modules.

4. The real-time public opinion monitoring method for government affairs according to claim 2, characterized in that the processing flow of the collection module is as follows:

Step 1: obtain a URL to be collected;

Step 2: filter the URL through the data router;

Step 3: crawl the page data;

Step 6: check whether an identical article already exists;

Step 8: extract N keywords using the TF-IDF algorithm;

Step 9: find the m articles with the highest degree of overlap with the current article;

Step 11: build an inverted index for use by other modules.

5. The real-time public opinion monitoring method for government affairs according to any one of claims 1 to 4, characterized in that the processing flow of the data analysis and early warning module is as follows:

Step 1: restructure the topic database and select representative data;

Step 2: perform sentiment analysis on each document and compute a score Tendency ∈ [-1, 1];

Step 3: record the above analysis results in the early warning database;

Step 4: calculate the warning level, where degree_i denotes the popularity of the i-th document and is computed as:

degree_i = (praise_i × 0.3 + comment_i × 0.7) / (hour_i + 2)

where praise_i is the number of likes, comment_i is the number of comments, and hour_i is the time elapsed, in hours, between posting and the present;

Step 5: send the corresponding warning, for example by email or SMS, according to the early warning policy and the warning level.

6. The real-time public opinion monitoring method for government affairs according to claim 3 or 4, characterized in that the automatic text feature extraction that generates the web page fingerprint comprises the following steps:

Step 1: extract the keywords of the first sentence of each paragraph of the body text (with stop words removed) as the main feature of the article;

Step 2: extract the punctuation marks of each paragraph of the body text as the secondary feature;

Step 3: apply SimHash to the main feature and to the secondary feature separately, then concatenate the two feature codes to obtain the fingerprint of the whole article;

Step 4: store the fingerprint in the cache database.

7. The real-time public opinion monitoring method for government affairs according to claim 5, characterized in that the automatic text feature extraction that generates the web page fingerprint comprises the following steps:

Step 4: store the fingerprint in the cache database.