CN1822000A - A Method for Automatically Detecting News Events - Google Patents

Info

Publication number
CN1822000A
CN1822000A CN200610007219.XA CN200610007219A CN1822000A CN 1822000 A CN1822000 A CN 1822000A CN 200610007219 A CN200610007219 A CN 200610007219A CN 1822000 A CN1822000 A CN 1822000A
Authority
CN
China
Prior art keywords
news
event
events
report
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200610007219.XA
Other languages
Chinese (zh)
Other versions
CN100461177C (en)
Inventor
路斌
杨霙
杨建武
万小军
吴於茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Peking University
Peking University Founder Research and Development Center
Original Assignee
BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Peking University
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIDA FANGZHENG TECHN INST Co Ltd BEIJING, Peking University, Peking University Founder Group Co Ltd
Priority to CNB200610007219XA (patent CN100461177C)
Publication of CN1822000A
Application granted
Publication of CN100461177C
Anticipated expiration
Current legal status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a practical method for detecting news events. It introduces mechanisms for event ranking, event merging and adjustment, news report elimination, and news event description, which markedly improve the detection results for news events and make the method more practical. The method can be widely applied in intelligent information processing.

Description

A Method for Automatically Detecting News Events

Technical Field

The invention belongs to the field of intelligent information processing, and in particular relates to a method for automatically detecting news events.

Background Art

With the rapid development of the Internet, the volume of news information has grown explosively. How to obtain information about newly occurring hot news events in a timely manner from the constant stream of news reports, and how to keep tracking the news events one is interested in, has become a research hotspot in recent years. Topic detection and tracking technology is an attempt to solve this problem.

Research on Topic Detection and Tracking (TDT) began in 1996. James Allan et al., who initiated and took part in that research, defined the specific tasks and performance evaluation metrics of TDT in the "Topic Detection and Tracking (TDT) Pilot Study Final Report" and presented some experimental results from that time. The three main tasks of TDT are:

(1) News report segmentation: split the audio or text transcripts of continuous radio and television news programs into separate reports;

(2) Event detection: identify events unknown to the system, together with the reports related to them;

(3) Event tracking: monitor the stream of news reports to discover new reports related to a known event.

The report also notes that the current research focus in TDT is the detection and tracking of events. A topic is a broader concept than an event: one topic can contain multiple related events.

In essence, event detection clusters a stream of news reports by event: reports that discuss the same event must be grouped into one class (James Allan, 2002). Compared with ordinary text clustering, event detection is special in two respects. First, its input is a stream of news reports that arrive in chronological order and change dynamically over time, not a static, closed collection of texts. Second, it clusters by the events that reports discuss rather than by topic category; the information it relies on is finer-grained, so event detection should yield more classes. Even so, text clustering remains the foundation of event detection.

Event detection can be subdivided by detection scenario into retrospective detection and online detection. Retrospective detection aims to discover previously unidentified news topics in an existing collection of news reports; the system must output information about each news topic that shows how reports and topics are associated. Online detection focuses on promptly identifying new topics in a real-time stream of news reports, that is, identifying a news topic at the moment a report expressing that new topic appears.

Over the past few years, event detection researchers have tried many different text clustering methods, such as single-pass clustering, k-means clustering, hierarchical agglomerative clustering, and probabilistic models. The main existing event detection methods are introduced below:

(1) The CMU method

CMU researchers (Yiming Yang et al.) mainly use a single-pass clustering algorithm with a time window for event detection. They represent each report and each event as a vector in a vector space. The similarity between a report vector and an event vector is computed mainly as the cosine of the angle between the vectors, but it is adjusted for the time factor by means of a window, for which two strategies are available. The first strategy considers only events that occurred within the time window. The second holds that the similarity between the current report s and an event c should be reduced as the number of reports between them increases.
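The two window strategies can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the function name `adjust_for_time`, its parameters, and the linear decay form are hypothetical, since the text says only that the similarity should be reduced as more reports separate s and c, without giving a formula.

```python
def adjust_for_time(similarity, reports_between, window_size, decay=True):
    """Adjust a report-event similarity for the time factor.

    similarity      -- base similarity (for example a cosine value)
    reports_between -- number of reports seen between the event and the report
    window_size     -- window size, measured in reports (assumed unit)
    decay           -- False: strategy 1 (window only); True: strategy 2 (decay)
    """
    # Strategy 1: events outside the window are not considered at all.
    if reports_between >= window_size:
        return 0.0
    if not decay:
        return similarity
    # Strategy 2: reduce the similarity as more reports lie in between.
    # The linear form is an assumption made for this sketch.
    return similarity * (1.0 - reports_between / window_size)


if __name__ == "__main__":
    for gap in (0, 5, 10):
        print(gap, adjust_for_time(0.8, gap, window_size=10))
```

Measuring the window in reports rather than in clock time keeps the sketch aligned with the second strategy, which is stated in terms of the number of intervening reports.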

In addition, at SIGKDD 2002, Yiming Yang et al. proposed a topic-based event detection method in the paper "Topic-conditioned novelty detection": a supervised learning algorithm first assigns the documents of an online stream to predefined, relatively broad topic categories, and new event detection is then performed on the document stream using the features of each topic.

(2) The University of Massachusetts method

Researchers at the University of Massachusetts (James Allan et al.) represent news reports with a vector model, and the core algorithm is again single-pass clustering. When computing the similarity between reports and events they use a time-based threshold model: a linear function adjusts the clustering threshold so that the further a news report is from an event in time, the harder it is for the report to join that event. To determine the event closest to the current report, they add a nearest-neighbor comparison strategy to the original centroid comparison strategy.

In the centroid comparison strategy, two thresholds are set, θmatch and θcertain. If the similarity between the current report and the centroid of an event is higher than θmatch, the report is assigned to that event. Only when the similarity is higher than θcertain, however, is the current report used to adjust the centroid of the event, that is, the event's vector representation. The nearest-neighbor comparison strategy first finds the k existing reports most similar to the current report, then uses these k reports and a preset threshold to determine the event to which the current report belongs. If the report cannot be assigned to any known event, it is treated as the first report of a new event, and a new event is created for it.

In addition, James Allan et al. mention using the few most frequent words in an event as the event description.

(3) The IBM method

A relatively successful event detection system developed by IBM uses a two-level clustering strategy and compares the similarity of two reports with a symmetric Okapi formula. In its first pass the system provisionally assigns reports to micro-events (microclusters); in its second pass it takes these micro-events as input and forms larger classes, which are the final events (Dharanipragada et al., 2002). Both passes use a single-pass clustering algorithm; they differ only in the objects processed and the thresholds chosen.

In summary, the steps commonly used for event detection in the prior art are:

1) Read a report from a data source, including its content, time, and other related information. There may be multiple data sources, and reports may not be clearly delimited, so preprocessing such as report segmentation may be needed;

2) Using a centroid comparison or nearest-neighbor comparison strategy, compute the similarity between the report and events, or between the report and other reports, and determine the event closest to the current report;

3) If the report is assigned to an event, adjust that event; if the report cannot be assigned to any existing event, register it as a newly detected event;

4) Output the detected events, using either the few highest-weighted feature words of an event or the title of a representative report as the event description.

Because existing event detection techniques consider only the false detection rate and missed detection rate on a fixed, small data set, they have the following shortcomings:

(1) Event ranking

Attention is a scarce resource, and people often do not have time to look through a large number of news events. The hottest news events should therefore be ranked first so that the system better meets users' needs. The prior art does not consider this problem and simply outputs the detected events.

(2) Event similarity

News reports covering different aspects of the same news event may have low similarity to one another, so that the same news event is split into several small events in its early stage. As the situation develops, the similarity among these events may grow, which can confuse and inconvenience users when browsing. The prior art does not consider this problem either.

(3) Elimination of news reports

In a real application environment, event detection is a long-running, continuous process. As an event evolves, some news reports within it gradually become less relevant to it. In addition, a long-lived event may swell as reports accumulate over time, so that its content becomes too broad. The prior art handles the dynamic evolution of events by introducing time-window strategies and dynamically adjusting events, but it does not consider eliminating news reports.

(4) Event description

News events are currently described in one of two ways: by the most important feature words of the event, or by the title of one news report selected from the event. Because natural language processing technology is not yet mature, extracted feature words often fail to describe an event effectively; even the most important feature words of a news event, such as names of people, places, and organizations, or times, may not be extracted at all, for example "Eleventh Five-Year Plan" or "Shenzhou VI". If the title of one report is used as the description, then for a multifaceted event that report may cover only one aspect, and the description is not comprehensive enough.

Summary of the Invention

In view of the shortcomings of the prior art, the object of the invention is to exploit the characteristics of news events themselves and, by addressing event ranking, event merging and adjustment, news report elimination, and news event description, to achieve dynamic and efficient event detection on a continuous news stream.

To achieve this object, the invention adopts the following technical solution: a method for automatically detecting news events, comprising the following steps:

1) Read a report from a data source and preprocess it;

2) Compute the similarity between the report and the events already detected, or between the report and other reports; determine the event related to the current report and assign the report to it;

3) If the report is assigned to an existing event, adjust that event; if the report cannot be assigned to any existing event, register it as a newly detected event;

4) Compare the detected events pairwise, merge related events, and readjust the events and the similarities between reports and events;

5) Within each event, eliminate reports that do not satisfy the restriction conditions, and adjust the event;

6) Compare the current number of events with the size of the event window; if the number of events exceeds the event window size, rank and eliminate events; otherwise go to step 7;

7) Output the detection results.

Further, to obtain a better effect, in step 1), if the similarity between a new report and a previously processed news report is greater than a preset threshold θd (the duplication threshold), the new report is considered a duplicate and is removed by deduplication. The value of θd satisfies 0 < θd ≤ 1. Deduplication is performed on the content of the news reports using similarity computation methods from text retrieval and text mining.
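As an illustration of the deduplication check, the following Python sketch compares a new report against earlier ones. The choice of Jaccard similarity over word 3-shingles is an assumption made here (the text only calls for a similarity method from text retrieval or text mining), and the names `shingles`, `jaccard`, and `is_duplicate` are hypothetical. The default θd of 0.9 is the value used in the embodiment.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text.

    Splitting on word characters is an assumption that suits
    space-delimited text; Chinese text would first need word segmentation.
    """
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets, in the range [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_duplicate(new_text, seen_texts, theta_d=0.9):
    """True if new_text is more similar than theta_d to any earlier report."""
    new_shingles = shingles(new_text)
    return any(jaccard(new_shingles, shingles(old)) > theta_d
               for old in seen_texts)


if __name__ == "__main__":
    seen = ["The city council approved the new transit budget on Monday"]
    print(is_duplicate("the city council approved the new transit budget on monday", seen))
    print(is_duplicate("Local team wins the regional football final", seen))
```

A production system would more likely index shingle fingerprints than rescan every earlier report, but the threshold test itself stays the same.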

In step 1), the news reports are first classified into preset categories by an automatic classification method.

When news reports are classified automatically in step 1), source-based rule classification is combined with content-based automatic classification, and the content-based automatic classification uses text classification technology. The text classification technology is a support vector machine algorithm based on the vector space model.

Further, to obtain a better effect, the event related to the current report in step 2) is determined with a centroid comparison or nearest-neighbor comparison strategy. The similarity can be computed with existing text mining techniques: the document model may be a vector space model, a probabilistic model, or a language model; the similarity formula may be the cosine of the angle between vectors, the Hellinger distance, or the like; and the similarity computation takes into account the temporal features of both the report and the event.

When similarity is computed in step 2), a higher weight is given to the title of a report, or to reports of higher authority; the authority of a report is taken to be the authority of its news source.

Further, to obtain a better effect, the similarity between events in step 4) is measured with the cluster similarity computed in traditional clustering algorithms. If the similarity of two events is greater than the merging threshold θu, the two events are regarded as related and are merged; the value of θu satisfies 0 < θu ≤ 1. Other merging strategies may also be used; for example, if a number of feature words in the internal representations of two events are identical, the events are regarded as highly similar and are merged.

Further, to obtain a better effect, the restriction condition in step 5) may be a similarity threshold or a time limit, or an external restriction such as the attention a report receives or the number of user clicks.

Still further, the operations of step 4) and/or step 5) are performed each time a user-determined number of new reports has been processed, each time a user-determined period of running time has elapsed, or each time the number of detected events has increased by a user-determined amount.

Further, to obtain a better effect, the ranking of events in step 6) combines the temporal and quantitative characteristics of news events; for example, the number of reports added to an event within a recent time range (for example, 12 hours) is used as the event's score. Several different rankings can also be considered at the same time, for example over the last 12 hours, 1 day, 3 days, 7 days, and 30 days; an event is eliminated only when it falls outside the event window in every ranking. Multiple rankings thus give users information at different granularities.

When the ranking is computed in step 6), the multiple ranking results can be combined to output one ranking that meets the user's requirements, or several rankings can be output together; for example, a user can ask to see the hottest events of the past day and of the past week at the same time.

Further, to obtain a better effect, when the detection results are output in step 7), a description is computed for every current event. At the same time, event scores are computed from the temporal and quantitative characteristics and the events are ranked; the news events with higher scores are selected as important news events, and their descriptions and lists of news reports are output. The event description is generated as follows:

a) Select a user-determined number of the feature words with the highest weight within the event;

b) Following the news report selection strategy, select the title of the most representative news report in the event;

c) Combine a) and b) and output the description of the event.
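Steps a) to c) can be sketched in Python as follows. This is a minimal sketch under assumptions: the `Report` class, the `describe_event` function, the default θe of 0.5, and the fallback to all reports when none exceeds the threshold are hypothetical. The selection rule shown (the most recent report among those whose similarity to the event exceeds an event threshold θe) is the example the patent gives for the report selection strategy.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Report:
    title: str
    published: datetime
    similarity: float  # similarity to the event, assumed precomputed


def describe_event(term_weights, reports, num_terms=3, theta_e=0.5):
    """Build an event description from top feature words plus one title."""
    # a) the num_terms feature words with the highest weight in the event
    ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    keywords = [term for term, _ in ranked[:num_terms]]
    # b) among reports similar enough to the event, take the most recent one;
    #    falling back to all reports is an assumption of this sketch
    candidates = [r for r in reports if r.similarity > theta_e] or list(reports)
    title = max(candidates, key=lambda r: r.published).title if candidates else ""
    # c) combine both parts into one description
    return {"keywords": keywords, "title": title}


if __name__ == "__main__":
    reports = [
        Report("Old but relevant", datetime(2005, 8, 1), 0.9),
        Report("Latest relevant", datetime(2005, 8, 8), 0.6),
        Report("Latest off-topic", datetime(2005, 8, 9), 0.2),
    ]
    weights = {"typhoon": 0.9, "shanghai": 0.7, "rain": 0.4, "traffic": 0.1}
    print(describe_event(weights, reports, num_terms=2))
```

Returning the keywords and the title as separate fields leaves the final presentation (for example, title first and keywords as tags) to the caller.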

The selection strategy for the representative news report in step 7) is a threshold strategy that incorporates related information such as the authority of the news source, the click-through rate of the report, and the report time. The threshold strategy uses a preset event threshold θe, whose value satisfies 0 < θe ≤ 1; for example, among the news reports in the event whose similarity to the event is greater than the threshold, the title of the most recent report is selected. Alternatively, the most relevant news reports are output in a user-determined proportion.

The effect of the invention is as follows. Taking full account of the characteristics of news events and of how people perceive them, the invention provides practical solutions for event ranking, event merging and adjustment, news report elimination, and news event description in real applications. Experiments show that the method of the invention markedly improves the detection of news events and thereby greatly enhances its practicality.

The invention achieves these effects because it has the following features:

(1) For event ranking, it introduces a mechanism that computes an importance score for each event at a given moment. The mechanism considers both the temporal and the quantitative characteristics of news events and gives each event a reasonable score at that moment, which is used to rank the events.

(2) For event similarity, it introduces an event merging and adjustment mechanism to overcome the erroneous splitting of one news event into several small events. Each time a fixed number of news reports has been processed, the events are compared pairwise; if the comparison strategy judges two events to be highly similar, they are merged and adjusted.

(3) For news reports, it introduces a mechanism for eliminating news reports within an event, to keep the content of a news event from becoming too broad. Each time a fixed number of news reports has been processed, the news reports within each event are pruned.

(4) For event description, it proposes combining feature words with a news report title to overcome the shortcomings of each. First, the several feature words with the highest weight within the event are selected as one part of the event description; then, following the report selection strategy, the most representative news report in the event is selected and its title is used as another part of the description.

Brief Description of the Drawings

Figure 1 is a flowchart of the invention;

Figure 2 is a schematic diagram of the news events detected by an existing method for the period from July 22, 2005 to August 9, 2005;

Figure 3 is a schematic diagram of the news events detected by the method of the invention for the period from July 22, 2005 to August 9, 2005;

Figure 4 is a screenshot of the top news on Sina.com on August 9, 2005;

Figure 5 is a schematic diagram of the news events detected by an existing method for the period from July 22, 2005 to October 9, 2005;

Figure 6 is a schematic diagram of the news events detected by the method of the invention for the period from July 22, 2005 to October 9, 2005;

Figure 7 is a screenshot of the top news on Sina.com on October 9, 2005.

Detailed Description of the Embodiments

The invention is further described below with reference to the drawings and an embodiment:

As shown in Figure 1, a method for automatically detecting news events comprises the following steps:

1) Read a report from a data source. Multiple online news sources (for example Sina.com, Xinhuanet, and People.cn) are monitored continuously, news reports are crawled automatically from the web, and the time, title, body text, and other information of each news report are parsed out. If no time can be found in a report, the crawl time is used;

Because there is considerable duplication among the data sources, newly crawled news reports are deduplicated on their text content. If the degree of duplication between a new report and a previously processed news report is greater than the duplication threshold θd, the new report is considered a duplicate. In this embodiment θd is set to 0.9;

Because the scope of news reports is too broad, the reports are classified by combining source-based rule classification with content-based automatic classification. The categories are preset; for example, following the channels of Sina.com, they may be news, technology, finance, sports, and so on. Rule classification assigns categories by news source, author, and similar attributes; for example, content from Sina's "Domestic News" channel goes into the "Domestic News" category, and content from Xinhuanet's "Technology" channel goes into the "Technology" category. Content-based automatic classification uses the vector space model and a support vector machine algorithm to classify news reports automatically by their content and title. Steps 2) to 7) are then carried out within the category c to which the report belongs;

2) Using the centroid comparison strategy, compare the report with the news events already detected in its category c. Taking both temporal and content features into account, compute the similarity between the report and each event, record the maximum similarity Smax and the event Es with that maximum similarity, and thereby determine the event closest to the current report. An event is represented by the several feature words with the highest combined weight across all news reports in the event. The similarity between a news report and an event is based on the vector space model and computed as the cosine of the angle between the two vectors, with the title of the news report given a higher weight.

3) Based on the maximum similarity Smax and the most similar event Es obtained in step 2), handle the current report as follows:

a) If Smax is less than the novelty threshold θn (0.25 in this embodiment): create a new event in the category to which the report belongs;

b) If Smax is greater than θn but less than the clustering threshold θc (0.30 in this embodiment): take no action and return to step 1);

c) If Smax is greater than θc but less than the contribution threshold θt (0.35 in this embodiment): assign the report to event Es without adjusting Es;

d) If Smax is greater than θt: assign the report to event Es and adjust Es;

The values of Smax, θn, θc, and θt are all greater than 0 and less than or equal to 1.
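The four cases of step 3) amount to a banded decision on Smax, which the following Python sketch illustrates with the threshold values of this embodiment. The function names and action labels are hypothetical, and the handling of exact equality with a threshold (which the text leaves open) is an assumption: here a value equal to a threshold falls into the higher band.

```python
THETA_N = 0.25  # threshold theta_n: below it, the report starts a new event
THETA_C = 0.30  # clustering threshold theta_c
THETA_T = 0.35  # contribution threshold theta_t


def best_match(similarities):
    """Return (Es, Smax) from a mapping of event id to similarity.

    Returns (None, 0.0) when no event has been detected yet.
    """
    if not similarities:
        return None, 0.0
    event_id = max(similarities, key=similarities.get)
    return event_id, similarities[event_id]


def decide(s_max):
    """Map the maximum similarity Smax to one of the four actions of step 3)."""
    if s_max < THETA_N:
        return "new_event"           # case a): create a new event
    if s_max < THETA_C:
        return "skip"                # case b): take no action
    if s_max < THETA_T:
        return "assign"              # case c): join Es, Es is not adjusted
    return "assign_and_adjust"       # case d): join Es and adjust Es


if __name__ == "__main__":
    event, s_max = best_match({"e1": 0.12, "e2": 0.33})
    print(event, s_max, decide(s_max))
```

The gap between θn and θc acts as a dead band: a report that is neither clearly new nor clearly part of an existing event is left unassigned.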

4)当一个类处理用户确定的固定数量(本实施例中确定的数量为20条)的新增报道之后,对该类别内新闻事件两两比较;如果两个事件的相似度大于合并阈值θu(例如0.20),则将其合并。事件之间的相似度计算公式可以采用传统聚类算法中计算两个聚类相似度的方法,例如基于向量空间模型,综合考虑两个事件中所有新闻报道之间的两两相似度,采用如下公式:4) After a class processes a fixed number of new reports determined by the user (the number determined in this embodiment is 20), compare news events in the class; if the similarity of the two events is greater than the merging threshold θu (e.g. 0.20), then merge them. The formula for calculating the similarity between events can use the method of calculating the similarity of two clusters in traditional clustering algorithms, for example, based on the vector space model, comprehensively considering the pairwise similarity between all news reports in the two events, as follows formula:

$$\mathrm{Sim}(E_1, E_2) = \frac{\sum_{d_i \in E_1} \sum_{d_j \in E_2} \mathrm{sim}(d_i, d_j)}{|E_1| \cdot |E_2|}$$

where E1 and E2 are two detected news events, di and dj are news reports in E1 and E2 respectively, sim(di, dj) is the similarity between two news reports, and |E1| and |E2| are the numbers of news reports contained in the two events;
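The average pairwise similarity of step 4) can be sketched as follows. Representing each report as a sparse term-weight dictionary and using cosine for sim(di, dj) are assumptions made for illustration; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse term-weight dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def event_similarity(e1, e2, sim=cosine):
    """Sum sim(d_i, d_j) over all cross-event pairs, divided by |E1| * |E2|."""
    if not e1 or not e2:
        return 0.0
    total = sum(sim(d_i, d_j) for d_i in e1 for d_j in e2)
    return total / (len(e1) * len(e2))


def should_merge(e1, e2, theta_u=0.20):
    """Merge two events when their similarity exceeds the merge threshold."""
    return event_similarity(e1, e2) > theta_u
```

The cost is |E1| · |E2| report comparisons per event pair, which is one reason to run the merge pass only after a batch of new reports rather than after every report.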

5) After a category has processed a fixed, user-determined number of new reports (20 in this embodiment), prune the news reports within each event: recompute the similarity between each report and its event, and eliminate reports whose similarity is below the clustering threshold θc or that fail a constraint (for example, that the report be from the last 30 days); then recompute the event's internal representation and its weights;
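A minimal sketch of the pruning pass in step 5) follows. The `Report` record, the day-number timestamps, and the summed-weight centroid with a `top_k` cutoff are assumptions for illustration; the text specifies only the θc test, the example 30-day constraint, and that the event representation is recomputed afterwards.

```python
import math
from dataclasses import dataclass


@dataclass
class Report:
    title: str
    day: int        # publication day as an integer day number (assumed)
    vector: dict    # sparse term-weight vector


def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prune_event(reports, centroid, today, theta_c=0.30, max_age_days=30):
    """Keep reports that are similar enough to the event and recent enough."""
    return [
        r for r in reports
        if cosine(r.vector, centroid) >= theta_c and today - r.day <= max_age_days
    ]


def recompute_centroid(reports, top_k=10):
    """Rebuild the event representation from the top_k heaviest terms."""
    totals = {}
    for r in reports:
        for term, w in r.vector.items():
            totals[term] = totals.get(term, 0.0) + w
    top = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return dict(top)
```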

6) If the number of events in the current category exceeds the event window size, rank all news events in the category: compute a score for each event from its temporal and quantity characteristics, and sort by score. Several rankings are considered at once, for example over the last 12 hours, 1 day, 3 days, 7 days, and 30 days, and an event is eliminated only if it falls outside the event window in every one of these rankings. The multiple rankings thus give users reference information at different granularities. Eliminating news events that are not in the event window improves the system's processing efficiency;
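The multi-ranking elimination of step 6) can be sketched as below. The scoring function is an assumption: the text says only that scores combine temporal and quantity characteristics, so this sketch simply counts an event's reports inside each look-back period. Timestamps are hours on an arbitrary clock, and the names are hypothetical.

```python
WINDOWS_HOURS = (12, 24, 72, 168, 720)  # 12 h, 1 day, 3 days, 7 days, 30 days


def window_score(report_times, now, hours):
    """Assumed score: number of the event's reports in the last `hours` hours."""
    return sum(1 for t in report_times if 0 <= now - t <= hours)


def surviving_events(events, now, window_size, windows=WINDOWS_HOURS):
    """Return the ids of events inside the event window in at least one ranking.

    `events` maps event_id -> list of report timestamps (in hours).
    """
    keep = set()
    for hours in windows:
        ranked = sorted(
            events,
            key=lambda e: (-window_score(events[e], now, hours), e),
        )
        keep.update(ranked[:window_size])
    return keep
```

An event that dominated last week but has gone quiet still survives through the 7-day or 30-day ranking, which is what distinguishes this scheme from a single recency-ordered window.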

7) Output the detection results as the user requires: compute a description for every current event in the category; in addition, using each event's temporal characteristics and the number of news reports it contains, select across all categories the several highest-scoring news events as the hottest news events of each category, and output each event's description and the list of news reports it contains. The event description is generated as follows:

a) Read the several feature words with the highest weight within the event;

b) Among the event's news reports whose similarity to the event exceeds the event threshold θe (0.6 in this embodiment), select the title of the most recent report; the event threshold may alternatively be applied as a proportion (20%).

c) Combine a) and b) and output the description of the event.
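Steps a) to c) can be sketched as a single function. The tuple layout of `reports`, the returned dictionary, and the empty-title fallback when no report passes θe are assumptions; the text does not say what to output in that case.

```python
def describe_event(centroid, reports, k=3, theta_e=0.6):
    """Build an event description from keywords plus a representative title.

    `centroid` maps feature word -> weight.
    `reports` is a list of (title, day, similarity_to_event) tuples.
    """
    # a) the k feature words with the highest weight in the event
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    keywords = [term for term, _ in ranked[:k]]

    # b) the most recent report among those with similarity above theta_e
    candidates = [r for r in reports if r[2] > theta_e]
    title = max(candidates, key=lambda r: r[1])[0] if candidates else ""

    # c) combine both parts into one description
    return {"keywords": keywords, "title": title}
```

Pairing keywords with a real headline gives a description that is readable (the headline) while still summarising the whole cluster (the keywords).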

To verify the effectiveness of the invention, we tested it on a corpus of 100,000 news articles crawled between 2005-07-22 and 2005-10-09 from selected channels (news, technology, sports, etc.) of Sina, Xinhuanet, People's Daily Online, and other websites. The corpus was divided into three categories: news, technology, and sports. The evaluation metric is the detection rate of major news events; because the headline section of Sina's news channel is compiled by human editors, the headline section for the same period serves as the expert reference for comparison. We take the "news" category as an example; the experimental results are shown in Figures 2 to 7.

Figures 2 to 7 compare the top 10 major news events detected by the method of the invention and by the traditional method at the time detection stopped (the number of related reports detected is shown in parentheses) with the list of major news events in the headline section of Sina's news channel at 21:00 that day. The detection period for Figures 2 to 4 is July 22, 2005 to August 9, 2005, and for Figures 5 to 7 it is July 22, 2005 to October 9, 2005. The traditional method is the single-pass clustering algorithm used by Yiming Yang et al.: events are ranked in reverse order of detection (the most recently detected event is listed first), events are eliminated using an event window (any event ranked outside the window is eliminated), and events are described using the keyword description method proposed by James Allan et al.

As Figures 2 to 7 show, the proposed method outperforms the traditional method in the following respects:

1. Event ranking is more reasonable. As Figures 2 to 7 show, the top ten events of the proposed method achieve detection rates of 62.5% and 57%, respectively, against Sina's main topics of the day;

2. Fewer cases of a single event being wrongly split into several small events. Events 3 to 6 in Figure 2 all concern the commemoration of the 60th anniversary of the victory in the War of Resistance Against Japan; the traditional method splits them into multiple events, whereas the proposed method unifies them as the fourth event in Figure 3;

3. News event descriptions are more accurate and comprehensive. For the "Shenzhou 6" event, for example, the description of the third event in Figure 5 is more accurate and comprehensive than keywords alone or a representative headline alone.

In addition, because the proposed method introduces a mechanism for eliminating news reports within an event, the content of each news event is more focused.

The experiments show that the traditional method, which considers only the false-alarm and miss rates on a small fixed data set, has many shortcomings in real application environments. The proposed method takes full account of the characteristics of how news events occur and of how people perceive them, so that the detection of news events improves markedly and the method becomes considerably more practical.

In practice, content-based automatic classification may also use other text classification techniques, such as a KNN algorithm based on a language model; in step 2), a centroid comparison strategy may also be used when determining the event closest to the current report. The method of the invention is therefore not limited to the embodiments described in the detailed description; other embodiments derived by those skilled in the art from the technical solution of the invention likewise fall within the scope of its technical innovation.

Claims (15)

1. A method for automatically detecting news events, comprising the following steps:
1) read a report from the data source and preprocess it;
2) compute the similarity between the report and the detected events, or between the report and other reports, determine the events related to the current report, and assign the report to the related event;
3) if the report is assigned to an existing event, adjust that event; if the report cannot be assigned to any existing event, list it as a newly detected event;
4) compare the detected events pairwise, merge related events, and readjust the events and the similarities between reports and events;
5) eliminate the reports within each event that do not satisfy the constraints, and adjust the event;
6) compare the current number of events with the time window size; if the number of events is greater than the event window size, rank and eliminate events; otherwise go to step 7);
7) output the detection results.

2. The method for automatically detecting news events according to claim 1, characterized in that: in step 1), if the similarity between the new report and a previously processed news report is greater than a preset threshold θd (the duplication threshold), the report is considered a duplicate and deduplication is performed on it; θd takes values in the range 0 < θd ≤ 1; the deduplication is performed on the content of the news report using similarity computation methods from text retrieval and text mining.

3. The method for automatically detecting news events according to claim 1 or 2, characterized in that: in step 1), news reports are first classified into preset categories using an automatic classification method.

4. The method for automatically detecting news events according to claim 3, characterized in that: the automatic classification of news reports in step 1) combines source-based rule classification with content-based automatic classification, the content-based automatic classification using a text classification technique.

5. The method for automatically detecting news events according to claim 4, characterized in that: the text classification technique is a support vector machine algorithm based on the vector space model.

6. The method for automatically detecting news events according to claim 1, characterized in that: in step 2), a centroid comparison or nearest-neighbor comparison strategy is used to determine the events related to the current report; the similarity is computed using text mining techniques; the document model is based on the vector space model, a probabilistic model, or a language model; the similarity formula is the cosine of the included angle or the Hellinger distance formula; and the similarity computation also takes into account the temporal features of the report and of the event.

7. The method for automatically detecting news events according to claim 6, characterized in that: when computing similarity in step 2), a higher weight is given to the title of the report, or a higher weight is given to reports of higher authority, the authority of a report being the authority of its news source.

8. The method for automatically detecting news events according to claim 1, characterized in that: the similarity between events in step 4) is measured by the cluster similarity value computed in traditional clustering algorithms; if the similarity of two events is greater than the merge threshold θu, the two events are considered related and are merged, θu taking values in the range 0 < θu ≤ 1; alternatively, if several feature words in the internal representations of two events are identical, their similarity is considered high and the two events are merged.

9. The method for automatically detecting news events according to claim 1, characterized in that: the constraint in step 5) is a similarity threshold, a time limit, or an external limit.

10. The method for automatically detecting news events according to claim 8 or 9, characterized in that: the operations of steps 4) and/or 5) are performed after each user-determined number of new reports has been processed, or after each user-determined period of running time, or after each user-determined number of newly detected events.

11. The method for automatically detecting news events according to claim 1, characterized in that: in step 6), a score is computed for each news event from its temporal and quantity characteristics and the events are ranked; the system retains only a fixed number of news events, and lower-ranked news events are eliminated.

12. The method for automatically detecting news events according to claim 11, characterized in that: the event ranking computed in step 6) combines the temporal and quantity characteristics of the news events; several rankings over different time periods are considered simultaneously, and an event is eliminated only if it is outside the event window in every ranking.

13. The method for automatically detecting news events according to claim 11 or 12, characterized in that: when the event ranking is computed in step 6), the multiple ranking results of step 6) are combined to output one ranking that meets the user's requirements, or several rankings are output simultaneously.

14. The method for automatically detecting news events according to claim 13, characterized in that: when the detection results are output in step 7), an event description is computed for every current event; in addition, the events are ranked using their temporal and quantity characteristics, the news events with higher scores are selected as important news events, and the event descriptions and the lists of news reports they contain are output, the event description being generated as follows:
a) select the user-determined number of feature words with the highest weight within the event;
b) according to the news report selection strategy, select the title of the most representative news report in the event;
c) combine a) and b) and output the description of the event.

15. The method for automatically detecting news events according to claim 14, characterized in that: the news report selection strategy in step b) is a threshold strategy that combines the authority of the news source, the report's click-through rate, and the report time; the threshold strategy uses a preset event threshold θe, θe taking values in the range 0 < θe ≤ 1.
CNB200610007219XA 2006-02-14 2006-02-14 A Method for Automatically Detecting News Events Expired - Fee Related CN100461177C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB200610007219XA CN100461177C (en) 2006-02-14 2006-02-14 A Method for Automatically Detecting News Events

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB200610007219XA CN100461177C (en) 2006-02-14 2006-02-14 A Method for Automatically Detecting News Events

Publications (2)

Publication Number Publication Date
CN1822000A true CN1822000A (en) 2006-08-23
CN100461177C CN100461177C (en) 2009-02-11

Family

ID=36923366

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB200610007219XA Expired - Fee Related CN100461177C (en) 2006-02-14 2006-02-14 A Method for Automatically Detecting News Events

Country Status (1)

Country Link
CN (1) CN100461177C (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5442778A (en) * 1991-11-12 1995-08-15 Xerox Corporation Scatter-gather: a cluster-based method and apparatus for browsing large document collections
EP1324229A3 (en) * 2001-12-27 2006-02-01 Ncr International Inc. Using point-in-time views to provide varying levels of data freshness
CN1710563A (en) * 2005-07-18 2005-12-21 北大方正集团有限公司 A Method for Important News Event Detection and Summarization

Also Published As

Publication number Publication date
CN100461177C (en) 2009-02-11

Similar Documents

Publication Publication Date Title
CN1822000A (en) A Method for Automatically Detecting News Events
Petrovic et al. Streaming first story detection with application to twitter
Potthast et al. Overview of the 2nd international competition on plagiarism detection
CN108763402B (en) Class-centered vector text classification method based on dependency relationship, part of speech and semantic dictionary
Rossetto et al. Cineast: a multi-feature sketch-based video retrieval engine
Li et al. News text classification model based on topic model
CN108376131A (en) Keyword abstraction method based on seq2seq deep neural network models
Liu et al. Clustering-based Method for Positive and Unlabeled Text Categorization Enhanced by Improved TFIDF.
WO2017167067A1 (en) Method and device for webpage text classification, method and device for webpage text recognition
CN108932311B (en) Methods of emergency detection and prediction
CN114064885B (en) Unsupervised Chinese multi-document extraction type abstract method
CN104008090A (en) Multi-subject extraction method based on concept vector model
CN106257455B (en) A kind of Bootstrapping method extracting viewpoint evaluation object based on dependence template
CN102081655A (en) Information retrieval method based on Bayesian classification algorithm
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN107122382A (en) A kind of patent classification method based on specification
CN104346459B (en) A kind of text classification feature selection approach based on term frequency and chi
CN101004761A (en) Hierarchy clustering method of successive dichotomy for document in large scale
CN109885675A (en) Text subtopic discovery method based on improved LDA
CN112418269B (en) Method, system and medium for predicting critical time of social media network event propagation
CN110162632A (en) A method for discovering special news events
Gao et al. A maximal figure-of-merit (MFoM)-learning approach to robust classifier design for text categorization
CN116451675A (en) A detection and optimization method for similar duplicate records based on the density clustering algorithm DBSCAN algorithm
CN107562928B (en) A CCMI text feature selection method
Balaneshin-kordan et al. Sequential query expansion using concept graph

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220908

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: PEKING University FOUNDER R & D CENTER

Patentee after: Peking University

Address before: 100871, fangzheng building, 298 Fu Cheng Road, Beijing, Haidian District

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: PEKING University FOUNDER R & D CENTER

Patentee before: Peking University

TR01 Transfer of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090211

CF01 Termination of patent right due to non-payment of annual fee