CN110516215A - An automatic writing method of sports news - Google Patents

Info

Publication number
CN110516215A
Authority
CN
China
Prior art keywords
time
data
sports news
game
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910404548.5A
Other languages
Chinese (zh)
Inventor
吕学强
李宁
董志安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University
Priority to CN201910404548.5A priority Critical patent/CN110516215A/en
Publication of CN110516215A publication Critical patent/CN110516215A/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34: Browsing; Visualisation therefor
    • G06F16/345: Summarisation for human users

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an automatic writing method for sports news. First, a score difference-time function is constructed from live text commentary data and the data is modeled. Next, the data is merged according to the properties of the score difference-time function and live-text segment information is extracted. Then, based on the key-point information found in game-report data and live text, the important live segments are extracted to form a live-segment data set. The trigger conditions in that data set are extracted and compared with the trigger conditions of previously constructed templates, the best-matching template is selected, and the factual data of the game is filled into the template to generate report sentences and, finally, the report article. Articles generated by the method achieve very good results on the pass-as-human rate, the truthfulness rate, and the vividness rate. The method can assist people in writing sports news, saves considerable manpower and resources, greatly changes people's everyday writing practice, and meets the needs of practical application.

Description

An automatic writing method for sports news

Technical field

The invention belongs to the technical field of computer-based automatic writing, and in particular relates to an automatic writing method for sports news.

Background

Automatic writing in Chinese is an important branch of artificial intelligence. In intelligent writing, a computer fully automatically organizes, extracts, filters, screens, and assembles collected documents and writes a related article, realizing the idea of automatic machine writing. This opens up a new space for writing, greatly changes people's everyday writing practice, assists people in writing, and saves considerable manpower and resources. Existing automatic writing methods for sports news have several defects, including a high error rate in the generated text, poor timeliness, and strong subjectivity. A method that overcomes these defects is urgently needed.

Summary of the invention

In view of the problems in the prior art described above, the object of the invention is to provide an automatic writing method for sports news that avoids those technical defects.

To achieve this object, the invention provides the following technical solution:

The sports news articles produced by the method achieve very good results on the pass-as-human rate, the truthfulness rate, and the vividness rate. The method can assist people in writing sports news, saves considerable manpower and resources, greatly changes people's everyday writing practice, and meets the needs of practical application.

Description of the drawings

Figure 1 is a plot of the score difference-time function;

Figure 2 is a diagram of the data segmentation algorithm.

Detailed description

To make the objects, technical solutions, and advantages of the invention clearer, the invention is further described below with reference to the drawings and specific embodiments. The specific embodiments described here serve only to explain the invention and do not limit it. All other embodiments that a person of ordinary skill in the art obtains from these embodiments without creative effort fall within the scope of protection of the invention.

In the automatic writing method for sports news, a score difference-time function is first constructed from live text commentary data and the data is modeled. Next, the data is merged according to the properties of the score difference-time function and live-text segment information is extracted. Then, based on the key-point information found in game-report data and live text, the important live segments are extracted to form a live-segment data set. The trigger conditions in that data set are extracted and compared with the trigger conditions of previously constructed templates, the best-matching template is selected, and the factual data of the game is filled into the template to generate report sentences and, finally, the report article.

Analysis of news game reports shows that the points most often reported are the score, individual player performance, scoring runs, tense stretches, comebacks, lead changes, the final moments, and buzzer-beaters. The invention models the data on the changes in the two teams' scores and incorporates player performance, team performance, and the trend of the game into the model. The live data is segmented and extracted, the extracted key points of the game are matched against a previously constructed template library, and the extracted live data is then filled into the matching templates to complete the automatic writing.

The template library for automatic sports news writing is constructed as follows: first, within each pre-assigned category, the similarity between template sentences is computed to find equivalent templates; second, a CRF (conditional random field) is used to identify the trigger conditions of each template, yielding trigger condition-template pairs.

Writing-template computation includes a cosine-similarity-based computation, in which the similarity of two sentences is measured by cosine similarity: both sentences are segmented into words, all words are listed, word frequencies are counted, and a word-frequency vector is written for each sentence. The two word-frequency vectors can be pictured as two line segments in space that start from the origin and point in different directions; they form an angle, and the cosine of that angle is the similarity.
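
The word-frequency comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `cosine_similarity` and the example sentences are hypothetical, and whitespace splitting stands in for the Chinese word segmentation step, which the text does not specify.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the word-frequency vectors of two token lists."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(freq_a) | set(freq_b)
    # Counter returns 0 for missing words, so the dot product covers the joint vocabulary.
    dot = sum(freq_a[w] * freq_b[w] for w in vocab)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


# Hypothetical example sentences (whitespace split replaces real word segmentation).
s1 = "the home team extended the lead".split()
s2 = "the away team cut the lead".split()
similarity = cosine_similarity(s1, s2)
```

Sentences whose similarity exceeds some chosen threshold could then be treated as instances of the same template; the threshold itself is not given in the text.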

Writing-template computation also includes a Word2Vec-based computation: the Word2Vec tool is used to represent the words of a background corpus as vectors, which reduces text processing to vector operations in a vector space. Similarity in the vector space then stands for semantic similarity between texts, which makes it possible to expand to related words.

Construction of the writing-template trigger conditions based on CRF includes the following.

For a given game description sentence $\mathrm{Text}_i$, the score difference between the away team and the home team is $\mathrm{Diffsore}_i$, the writing template is $Y$, and the trigger condition is $X_i$:

$\mathrm{Diffsore}_i = \mathrm{Text}_i \cdot \mathrm{Score}_1 - \mathrm{Text}_i \cdot \mathrm{Score}_2$;

$Y = \mathrm{Diffsore}\left(\sum_{i=1} X_i\right)$;

The score difference of every text entry is calculated, and the entries are sorted by $\mathrm{Diffsore}$:

$\mathrm{List} = \mathrm{dis}(\mathrm{Diffsore})$;

$\mathrm{List}$ denotes the set of text entries ordered by score difference. Entries with the same score difference are merged to form a score-difference data set, and trigger conditions are extracted from the data in each set.
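
The sorting and merging step can be illustrated as follows. This is a sketch under stated assumptions: the record layout `(text, score_1, score_2)`, the function name `group_by_score_difference`, and the sample entries are all hypothetical, and the sketch covers only the grouping, not the subsequent trigger-condition extraction.

```python
from collections import defaultdict


def group_by_score_difference(records):
    """Group description sentences by score difference.

    records: iterable of (text, score_1, score_2) tuples (assumed layout).
    Returns a dict {difference: [text, ...]} ordered by ascending difference.
    """
    groups = defaultdict(list)
    for text, score_1, score_2 in records:
        groups[score_1 - score_2].append(text)
    return dict(sorted(groups.items()))


# Hypothetical sample entries.
records = [
    ("A hits a three-pointer", 3, 0),
    ("B answers with a layup", 3, 2),
    ("A scores inside", 5, 2),
    ("B makes two free throws", 5, 4),
]
grouped = group_by_score_difference(records)
```

Each value in `grouped` corresponds to one score-difference data set from which trigger conditions would then be extracted.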

Construction of the trigger conditions comprises role tagging and feature template selection. Role tagging: a trigger condition, denoted CS, is defined as the condition of the facts described for an NBA game within a given time period; a trigger word, denoted CSword, is a word used to describe a CS; each class of trigger condition contains many trigger words. The procedure is: first, each annotated sentence is segmented into words and tagged with parts of speech; second, the role data is annotated; finally, word, part of speech, and role are taken as features and the CRF is used to recognize trigger words. Feature template selection: word, part of speech, and role are chosen as features; B, I, E, and O are used as the tagging symbols for trigger words, where B marks the first character of a trigger word, I a middle character, E the last character, and O a non-trigger character; both single-feature templates and compound-feature templates are used to recognize trigger words.
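
To make the B/I/E/O scheme concrete, the sketch below produces character-level tags for a sentence given a list of known trigger words, which is the kind of labeled data a CRF would be trained on. The function name `tag_trigger_words` and the example sentences are hypothetical, and tagging a single-character trigger word as B is an assumption, since the text does not cover that case. No CRF training is shown.

```python
def tag_trigger_words(sentence, trigger_words):
    """Return [(character, tag), ...] using B/I/E for trigger words and O elsewhere."""
    tags = ["O"] * len(sentence)
    for word in trigger_words:
        if not word:
            continue  # ignore empty trigger words
        start = sentence.find(word)
        while start != -1:
            end = start + len(word) - 1
            tags[start] = "B"              # first character of the trigger word
            for k in range(start + 1, end):
                tags[k] = "I"              # middle characters
            if end > start:
                tags[end] = "E"            # last character (multi-character words only)
            start = sentence.find(word, end + 1)
    return list(zip(sentence, tags))


# Hypothetical examples: a two-character and a three-character trigger word.
tagged_short = tag_trigger_words("湖人反超比分", ["反超"])
tagged_long = tag_trigger_words("投中压哨球", ["压哨球"])
```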

The steps of data modeling based on the score difference-time function include the following.

Score and time are the crucial factors in a game, and the invention uses the score difference between the two teams to describe how the game actually unfolds.

For a given live data entry $T_i$, the home team's score is $\mathrm{Zsore}_i$, the away team's score is $\mathrm{Ksore}_i$, and the score difference between the two teams is $\mathrm{Diffsore}_i$:

$\mathrm{Diffsore}_i = \mathrm{Zsore}_i - \mathrm{Ksore}_i \quad (1.1)$

The difference function gives the trend of $\mathrm{Diffsore}_i$ over time. Connecting all the $\mathrm{Diffsore}_i$ points yields the score difference-time function:

$\mathrm{Diffsore} = F(\mathit{time}) \quad (1.2)$
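
A minimal sketch of formulas (1.1) and (1.2) follows. It assumes that "connecting the points" means linear interpolation between consecutive samples and that each live entry is available as a `(time_seconds, home_score, away_score)` tuple; the name `make_F` and the sample events are hypothetical.

```python
from bisect import bisect_right


def make_F(events):
    """Build F(time) from (time_seconds, home_score, away_score) tuples.

    Each sample contributes Diffsore = home - away (formula 1.1); F joins
    consecutive samples with straight lines and is constant outside the range.
    """
    if not events:
        raise ValueError("at least one event is required")
    points = sorted((t, home - away) for t, home, away in events)
    times = [t for t, _ in points]

    def F(time):
        if time <= times[0]:
            return float(points[0][1])
        if time >= times[-1]:
            return float(points[-1][1])
        k = bisect_right(times, time)          # times[k-1] <= time < times[k]
        (t0, d0), (t1, d1) = points[k - 1], points[k]
        return d0 + (d1 - d0) * (time - t0) / (t1 - t0)

    return F


# Hypothetical samples: tied at tip-off, home up 2 after a minute, away up 1 after two.
F = make_F([(0, 0, 0), (60, 2, 0), (120, 2, 3)])
```

Here `F(60)` is the sampled difference 2, while `F(90)` is the interpolated value halfway between 2 and -1.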

As Figure 1 shows, if $F(\mathit{time})$ is a linear function over the interval from $\mathit{time}_i$ to $\mathit{time}_j$, there are three main cases, and each trend represents a different state:

1. If $F(\mathit{time}_j) - F(\mathit{time}_i) > 0$, the home team leads during the interval.

2. If $F(\mathit{time}_j) - F(\mathit{time}_i) = 0$, the two teams are tied during the interval.

3. If $F(\mathit{time}_j) - F(\mathit{time}_i) < 0$, the home team trails during the interval.

If $F(\mathit{time})$ is a higher-degree function over the interval, the state is a combination of the cases above, as listed in Table 1.1.

Table 1.1 States of the score difference-time function

lead | draw | behind
lead-draw | draw-lead | behind-draw
lead-behind | draw-behind | behind-lead
lead-draw-lead | draw-lead-behind | behind-draw-behind
lead-draw-behind | draw-lead-draw | behind-draw-lead
lead-behind-lead | draw-behind-draw | behind-lead-behind
lead-behind-draw | draw-behind-lead | behind-lead-draw
... | ... | ...
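
The combined states in Table 1.1 can be derived mechanically from a sequence of score differences. The sketch below applies the sign rule exactly as stated above to each pair of consecutive samples and collapses repeated labels; the function name `state_pattern` and the sample values are hypothetical.

```python
def state_pattern(diffs):
    """Map consecutive Diffsore samples to a Table 1.1 style pattern string.

    For each pair of neighbouring samples the sign of F(time_j) - F(time_i)
    selects "lead" (> 0), "draw" (= 0) or "behind" (< 0); runs of the same
    label are merged into one.
    """
    labels = []
    for prev, curr in zip(diffs, diffs[1:]):
        change = curr - prev
        label = "lead" if change > 0 else "behind" if change < 0 else "draw"
        if not labels or labels[-1] != label:
            labels.append(label)
    return "-".join(labels)


# Hypothetical samples: the difference rises, holds, then drops.
pattern = state_pattern([0, 2, 4, 4, 1])
```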

The steps of data segmentation based on the score difference-time function include the following.

A report cannot cover every detail of a game, so it has to focus on the important ones. The live text therefore needs to be segmented to determine the key data segments of the game. The invention decides this from the properties of the score difference-time function.

If $F(\mathit{time}_j) - F(\mathit{time}_i) = 0$ and $\mathit{time}_j - \mathit{time}_i > 3$ minutes, the data is kept; otherwise it is discarded. Next, if $F(\mathit{time}_j) - F(\mathit{time}_i) \ge 0$, the leading-type data is merged into one segment; if $F(\mathit{time}_j) - F(\mathit{time}_i) \le 0$, the trailing-type data is merged into one segment.

The specific steps of the data segmentation algorithm are as follows:

Algorithm: Slice Algorithm

1. Input: live text Text = Text_1 Text_2 Text_3 ... Text_n

2. Preprocess: for each Text_i, compute the score difference of the live entry with formula (1.1), and determine the sign of F(time_j) - F(time_i)

3. Slice position poi = 1, slice set Slict = NULL, number of slices num = 1

4. for k = 1 to n do

5.     if F(Text_k.Time) and F(Text_k+1.Time) have opposite signs

6.         begin_num = Diffsore_poi

7.         end_num = Diffsore_k

8.         time_num1 = Text_poi.Time

9.         time_num2 = Text_k.Time

10.        slice_num = {begin_num, end_num, time_num1, time_num2}

11.        add slice_num to the set Slict

12.        poi ← k, num = num + 1

13.    end if

14. Output: slice set Slict
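
One possible Python reading of the Slice Algorithm is sketched below. Several points are assumptions rather than statements from the text: each entry is taken to be a `(time, diffsore)` pair, "opposite signs" is read strictly (the product of neighbouring differences is negative, so a difference of exactly 0 never triggers a cut), and the comparison stops at the second-to-last entry because the last one has no successor. As in the pseudocode, the next slice starts at the entry where the previous one ended, and the trailing open segment is not emitted.

```python
def slice_live_text(entries):
    """Cut chronological (time, diffsore) entries wherever the sign flips.

    Returns a list of dicts holding the begin/end score difference and the
    begin/end time of each slice, mirroring steps 6-10 of the pseudocode.
    """
    slices = []
    poi = 0  # index where the current slice starts (0-based)
    for k in range(len(entries) - 1):
        if entries[k][1] * entries[k + 1][1] < 0:  # strictly opposite signs
            slices.append({
                "begin": entries[poi][1],
                "end": entries[k][1],
                "time_start": entries[poi][0],
                "time_end": entries[k][0],
            })
            poi = k
    return slices


# Hypothetical samples: home leads, away takes over, home regains the lead.
entries = [(0, 2), (30, 4), (60, 1), (90, -1), (120, -3), (150, 2)]
slices = slice_live_text(entries)
```

With these samples the function returns two slices, one covering the home team's lead (0 s to 60 s) and one running from 60 s to 120 s.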

As Figure 2 shows, the data segmentation algorithm divides the data into separate segments, which simplifies the subsequent extraction and supports text generation.

Analysis of the key points of a game includes the following.

Determining the key points is essential for reporting a game. Analysis of game reports shows that the key parts of a reported game fall roughly into six types: first, turning points; second, tense stretches; third, player performance; fourth, team performance; fifth, the coach's game management; sixth, buzzer-beaters.

Definition 1.1, turning point: a process, within a given period, in which the score gap between the two teams gradually widens to a maximum or gradually narrows to a minimum. Such a move toward an extreme value often marks a climax of the game and deserves emphasis in the report.

Definition 1.2, tense stretch: a period in which the two teams alternate in the lead and the game is played in a tense atmosphere. It usually occurs in the second half of the fourth quarter and deserves emphasis in the report.

Definition 1.3, player performance: an outstanding or poor performance by a player within a given period, such as consecutive scoring, a crucial basket, a crucial assist, or consecutive missed shots. It is usually sustained or decisive within a unit of time. Star players are generally reported more often than other players. A bench player who helps the team take the lead is also reported.

Definition 1.4, team performance: an outstanding or poor performance by a team within a given period, such as a scoring run that produces a climax of the game, or a unit of time in which the team scores nothing or very little. Reports of team performance are usually negative, for example a team going several minutes without scoring.

Definition 1.5, coach's game management: the coach is the brain of the team and can often turn a game around. A timeout or substitution called by the coach frequently has an unexpected effect and is usually reported.

Definition 1.6, buzzer-beater: a shot taken by a player near the end of a quarter (usually within the last 10 s). It involves both luck and skill. A made shot lifts the team's morale and should be reported.
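
Two of these definitions lend themselves to simple checks, sketched below. The function names, the input layout, and the idea of counting lead changes as a signal for a tense stretch are illustrative assumptions; the text gives the 10 s window for buzzer-beaters but no numeric threshold for a tense stretch, so the sketch returns only the count.

```python
def is_buzzer_beater(seconds_left_in_quarter, shot_made, window=10):
    """True for a made shot taken within the last `window` seconds of a quarter."""
    return shot_made and 0 <= seconds_left_in_quarter <= window


def count_lead_changes(diffs):
    """Count how often the lead passes from one team to the other.

    diffs: chronological score differences (home minus away). Ties (0) are
    skipped, so a lead change that passes through a tie is counted once.
    """
    changes = 0
    last_leader = 0
    for d in diffs:
        leader = (d > 0) - (d < 0)  # +1 home leads, -1 away leads, 0 tie
        if leader != 0:
            if last_leader != 0 and leader != last_leader:
                changes += 1
            last_leader = leader
    return changes


# Hypothetical fourth-quarter score differences.
lead_changes = count_lead_changes([2, 1, -1, -3, 0, 2, -1])
```

A slice with many lead changes late in the fourth quarter would be a candidate tense stretch under Definition 1.2.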

The steps of automatically generating sports news from trigger conditions include the following.

Using the key points of game reports and the data segmentation algorithm based on the score difference-time function, the corresponding key data is located and extracted, and the reporting trigger conditions of the live text are constructed. The previously constructed template library is then searched for a matching template, the factual data is filled in, and sentences are regenerated, which completes the automatic writing of the NBA report.

Let $\mathrm{Key}$ denote the set of key points of the game and $\mathrm{key}_i$ one key point; let $\mathrm{Slict}$ denote the segmented live text data, $\mathrm{Sent}$ a writing template, $\mathrm{Condition}$ a trigger condition, and $\mathrm{Cword}$ a trigger word, with $\mathrm{Condition}_i = \{\mathrm{Cword}_1, \mathrm{Cword}_2, \ldots, \mathrm{Cword}_n\}$, $\mathrm{Sent}_i = \{\mathrm{condition}_1, \mathrm{condition}_2, \ldots, \mathrm{condition}_n\}$, and $\mathrm{Slict} = \{\mathrm{Slict}_1, \mathrm{Slict}_2, \mathrm{Slict}_3, \ldots, \mathrm{Slict}_n\}$. If $\mathrm{Slict}_i$ contains a key point, the data segment $\mathrm{Slict}_i$ is kept; otherwise it is discarded.

The segments $\mathrm{Slict}_i$ with $\mathrm{ReportSent}_i = 1$ are extracted again to form the report data set $\mathrm{Report}$. In other words, $\mathrm{Report} = \sum_{i=1} \mathrm{Slict}_i$ with $\mathrm{Slict}_i \in \mathrm{Key}$.

The trigger conditions $\mathrm{RCondition}$ are extracted from $\mathrm{Report}_i$ and looked up among the $\mathrm{SCondition}$ entries of the existing $\mathrm{Sent}$ template set. If every trigger condition extracted into $\mathrm{RCondition}_i$ can be found in $\mathrm{SCondition}_i$, the $\mathrm{SCondition}_i$ template is activated. Several $\mathrm{SCondition}_i$ may correspond to one $\mathrm{RCondition}_i$; in that case one template is chosen at random and filled with data to form a report sentence.
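
The activation rule amounts to a subset test followed by a random choice. The sketch below assumes a template is stored as a pair of a condition set and a sentence pattern with named slots; the function names, the template texts, and the fact dictionary are all hypothetical.

```python
import random


def active_templates(report_conditions, templates):
    """Templates whose condition set contains every extracted trigger condition."""
    wanted = set(report_conditions)
    return [t for t in templates if wanted <= set(t[0])]


def generate_sentence(report_conditions, facts, templates, rng=random):
    """Pick one activated template at random and fill in the factual data.

    Returns None when no template is activated.
    """
    candidates = active_templates(report_conditions, templates)
    if not candidates:
        return None
    _, pattern = rng.choice(candidates)
    return pattern.format(**facts)


# Hypothetical template library: (trigger conditions, sentence pattern).
templates = [
    ({"run", "lead"}, "{team} went on a {run} run to take a {margin}-point lead."),
    ({"buzzer"}, "{player} beat the buzzer at the end of the {quarter} quarter."),
]
facts = {"team": "Team A", "run": "10-2", "margin": 7}
sentence = generate_sentence({"run", "lead"}, facts, templates)
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible when several templates are activated.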

Experimental results and analysis:

The invention crawled 867 news articles from Sina to build the templates, and crawled all live text data for the 2017 season from Hupu, 1305 games in total.

Evaluation measures:

The invention uses three evaluation measures that are common in the prior art (see: Chen Yujing, Lü Xueqiang, Zhou Jianshe, Li Ning. Research on automatic writing of NBA game news [J]. Journal of Peking University (Natural Science Edition), 2017, (02): 1+6.201) and evaluates manually, with three experts scoring each article. The three measures are: measure 1, whether the article was written by a computer; measure 2, whether it is faithful to the actual game; measure 3, whether the language is vivid. Two scoring standards are defined. Under the strict standard, an article counts as not computer-written only if all three experts judge it not computer-written; the strict standard for the other two measures works the same way. Under the lenient standard, an article counts as not computer-written if a majority of the three experts (two or more) judge it not computer-written; the lenient standard for the other two measures works the same way. The notation is as follows:

$C_{strict}$, $R_{strict}$, and $L_{strict}$ denote the strict rates, and $C_{lenient}$, $R_{lenient}$, and $L_{lenient}$ the lenient rates. $C_i$ ($i = 1, 2, 3$) indicates that the $i$-th expert judges the article not computer-written, $R_i$ ($i = 1, 2, 3$) that the $i$-th expert judges it faithful to the game, and $L_i$ ($i = 1, 2, 3$) that the $i$-th expert judges the language vivid; $N$ is the total number of articles.
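
The two scoring standards can be computed as below. This sketch follows the verbal description only, since the formulas themselves are not reproduced in this text; the function name `strict_and_lenient_rates` and the sample votes are hypothetical, and the same function applies to any of the three measures.

```python
def strict_and_lenient_rates(votes):
    """Compute the strict and lenient rates for one measure.

    votes: one 3-tuple of booleans per article, one boolean per expert
    (True means the expert judged the article favourably on this measure).
    Strict counts articles where all three experts agree; lenient counts
    articles where at least two of the three do. Both are divided by N.
    """
    n = len(votes)
    if n == 0:
        raise ValueError("at least one article is required")
    strict = sum(all(v) for v in votes) / n
    lenient = sum(sum(v) >= 2 for v in votes) / n
    return strict, lenient


# Hypothetical votes for four articles.
votes = [
    (True, True, True),
    (True, True, False),
    (True, False, False),
    (True, True, True),
]
strict_rate, lenient_rate = strict_and_lenient_rates(votes)
```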

The invention randomly selected 100 live text commentaries for the experiment and compared the proposed method with a method common in the prior art (see: Chen Yujing, Lü Xueqiang, Zhou Jianshe, Li Ning. Research on automatic writing of NBA game news [J]. Journal of Peking University (Natural Science Edition), 2017, (02): 1+6.201). The results show that the proposed method outperforms the comparison method; details are given in Tables 1.2, 1.3, and 1.4:

Table 1.2 Pass-as-human experiment results

Table 1.3 Truthfulness experiment results

Table 1.4 Vividness experiment results

Tables 1.2, 1.3, and 1.4 show that the proposed method performs better than the comparison method. The reason is that the proposed scheme is better at constructing and selecting templates: the comparison method chooses from a fixed set of just over 50 templates, whereas the templates of the invention are dynamic and can be updated continuously from news game reports. The invention is therefore superior to the comparison method in passing as human-written, in truthfulness, and in vividness, and most notably in vividness.

For the method of the invention, the strict pass-as-human rate reaches 90% and the lenient pass-as-human rate 95%, which indicates that the computer's automatic writing can be taken for human writing. The strict truthfulness rate of 94% and the lenient truthfulness rate of 96% demonstrate the effectiveness of the method and show that it meets the requirement of faithfully reflecting the live coverage of NBA games. The strict vividness rate reaches 85% and the lenient vividness rate 93%, which shows that the method also achieves good results in vividness.

The embodiments described above express only implementations of the invention. Their description is relatively specific and detailed, but it should not be understood as limiting the scope of the patent. A person of ordinary skill in the art can make several variations and improvements without departing from the concept of the invention, and these all fall within the scope of protection of the invention. The scope of protection of the patent is therefore determined by the appended claims.

Claims (8)

1. An automatic writing method for sports news, characterized by comprising: data modeling based on a score difference-time function; data segmentation based on the score difference-time function; analysis of the key points of a game; and automatic generation of sports news based on trigger conditions.

2. The sports news writing method according to claim 1, characterized in that a score difference-time function is first constructed from live text commentary data and the data is modeled; next, the data is merged according to the properties of the score difference-time function and live-text segment information is extracted; then, based on the key-point information found in game-report data and live text, the important live segments are extracted to form a live-segment data set; the trigger conditions in the live-segment data set are extracted and compared with the trigger conditions of previously constructed templates, the best-matching template is selected, and the factual data of the game is filled into the template to generate report sentences and, finally, the report article.

3. The sports news writing method according to claim 1, characterized in that the step of data modeling based on the score difference-time function comprises:
for a given live data entry $T_i$, the home team's score is $\mathrm{Zsore}_i$, the away team's score is $\mathrm{Ksore}_i$, and the score difference between the two teams is
$\mathrm{Diffsore}_i = \mathrm{Zsore}_i - \mathrm{Ksore}_i$;
the difference function gives the trend of $\mathrm{Diffsore}_i$ over time, and connecting all the $\mathrm{Diffsore}_i$ points yields the score difference-time function:
$\mathrm{Diffsore} = F(\mathit{time})$;
if $F(\mathit{time})$ is a linear function over the interval from $\mathit{time}_i$ to $\mathit{time}_j$, there are three cases, each trend representing a different state:
if $F(\mathit{time}_j) - F(\mathit{time}_i) > 0$, the home team leads during the interval;
if $F(\mathit{time}_j) - F(\mathit{time}_i) = 0$, the two teams are tied during the interval;
if $F(\mathit{time}_j) - F(\mathit{time}_i) < 0$, the home team trails during the interval;
if $F(\mathit{time})$ is a higher-degree function over the interval, the state is a combination of the three cases above.

4. The sports news writing method according to claim 1, characterized in that the step of data segmentation based on the score difference-time function comprises deciding from the properties of the score difference-time function:
if $F(\mathit{time}_j) - F(\mathit{time}_i) = 0$ and $\mathit{time}_j - \mathit{time}_i > 3$ minutes, the data is kept, otherwise it is discarded;
if $F(\mathit{time}_j) - F(\mathit{time}_i) \ge 0$, the leading-type data is merged into one segment;
if $F(\mathit{time}_j) - F(\mathit{time}_i) \le 0$, the trailing-type data is merged into one segment.

5. The sports news writing method according to claim 1, characterized in that the analysis of the key points of a game comprises: the key parts of a reported game fall into six types: first, turning points; second, tense stretches; third, player performance; fourth, team performance; fifth, the coach's game management; sixth, buzzer-beaters.

6. The automatic writing method for sports news according to claims 1-5, characterized in that the six types are defined as follows:
a turning point is a process, within a given period, in which the score gap between the two teams gradually widens to a maximum or gradually narrows to a minimum; such a move toward an extreme value marks a climax of the game and is to be emphasized in the report;
a tense stretch is a period in which the two teams alternate in the lead and the game is played in a tense atmosphere; it occurs in the second half of the fourth quarter and is to be emphasized in the report;
player performance is an outstanding or poor performance by a player within a given period, sustained or decisive within a unit of time; star players are reported more often than other players; a bench player who helps the team take the lead is also to be reported;
team performance is an outstanding or poor performance by a team within a given period, or a unit of time in which the team scores nothing or very little; reports of team performance are mainly negative;
coach's game management: a timeout or substitution called by the coach is to be reported;
a buzzer-beater is a shot taken by a player near the end of a quarter, and is to be reported.

7. The sports news writing method according to claim 1, characterized in that the step of automatically generating sports news based on trigger conditions comprises: using the key points of game reports and the data segmentation algorithm based on the score difference-time function, locating and extracting the corresponding key data, constructing the reporting trigger conditions of the live text, searching the previously constructed template library for the matching template, filling in the factual data, and regenerating sentences, thereby completing the automatic writing of the sports news.

8. The automatic writing method for sports news according to claims 1-7, characterized in that the step of automatically generating sports news based on trigger conditions comprises:
letting $\mathrm{Key}$ denote the set of key points of the game and $\mathrm{key}_i$ one key point, $\mathrm{Slict}$ the segmented live text data, $\mathrm{Sent}$ a writing template, $\mathrm{Condition}$ a trigger condition, and $\mathrm{Cword}$ a trigger word, with $\mathrm{Condition}_i = \{\mathrm{Cword}_1, \mathrm{Cword}_2, \ldots, \mathrm{Cword}_n\}$, $\mathrm{Sent}_i = \{\mathrm{condition}_1, \mathrm{condition}_2, \ldots, \mathrm{condition}_n\}$, and $\mathrm{Slict} = \{\mathrm{Slict}_1, \mathrm{Slict}_2, \mathrm{Slict}_3, \ldots, \mathrm{Slict}_n\}$; if $\mathrm{Slict}_i$ contains a key point, the data segment $\mathrm{Slict}_i$ is kept, otherwise it is discarded;
extracting again the segments $\mathrm{Slict}_i$ with $\mathrm{ReportSent}_i = 1$ to form the report data set $\mathrm{Report}$, where $\mathrm{Report} = \sum_{i=1} \mathrm{Slict}_i$ and $\mathrm{Slict}_i \in \mathrm{Key}$;
extracting the trigger conditions $\mathrm{RCondition}$ from $\mathrm{Report}_i$ and looking them up among the $\mathrm{SCondition}$ entries of the existing $\mathrm{Sent}$ template set; if every trigger condition extracted into $\mathrm{RCondition}_i$ can be found in $\mathrm{SCondition}_i$, the $\mathrm{SCondition}_i$ template is activated; one $\mathrm{SCondition}_i$ template is chosen at random and filled with data to form a report sentence.
If every trigger condition extracted in RCondition i can be found in SCondition i , the SCondition i template will be activated; randomly Select a SCondition i template for data filling to form report sentences.
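The modeling and segmentation steps of claims 3 and 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the `Segment` class, the function names, and the input format (tuples of time in minutes, Zsore, Ksore) are assumptions made for illustration, and the merge rule is one possible reading of claim 4, in which adjacent intervals with the same sign of F(time)_j - F(time)_i are merged and zero-change segments of 3 minutes or less are dropped.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Segment:
    state: str    # "lead", "tie" or "trail", following the sign test of claim 3
    start: float  # segment start time, in minutes
    end: float    # segment end time, in minutes


def diff_series(events):
    """Map (time, Zsore, Ksore) records to (time, Diffsore) points."""
    return [(t, z - k) for t, z, k in events]


def classify(delta):
    """Label an interval by the sign of F(time)_j - F(time)_i."""
    if delta > 0:
        return "lead"
    if delta < 0:
        return "trail"
    return "tie"


def segment(events, min_tie_minutes=3.0):
    """Merge adjacent same-state intervals and drop short zero-change ones."""
    points = diff_series(events)
    # One labelled interval per pair of consecutive data points.
    intervals = [
        Segment(classify(d1 - d0), t0, t1)
        for (t0, d0), (t1, d1) in zip(points, points[1:])
    ]
    merged = []
    for state, group in groupby(intervals, key=lambda s: s.state):
        group = list(group)
        merged.append(Segment(state, group[0].start, group[-1].end))
    # Assumed reading of claim 4: a zero-change segment is kept only if it
    # lasts longer than min_tie_minutes.
    return [
        s for s in merged
        if s.state != "tie" or s.end - s.start > min_tie_minutes
    ]
```

Calling `segment` on a chronologically ordered list of score records returns the merged segments in time order; the 3-minute threshold is exposed as `min_tie_minutes` so that other cut-offs can be tried.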
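The template activation rule of claim 8 (a template is activated when every trigger condition extracted from a report segment is found among the template's conditions) amounts to a subset test. The following Python sketch shows one way to express it. The template library, the trigger words, the fact names, and the helper functions are all hypothetical; the extra check that a template's placeholders can be filled from the available facts is an added safeguard that the claim does not state.

```python
import random
from string import Formatter

# Hypothetical template library: "conditions" plays the role of SCondition,
# "text" is the sentence pattern to be filled with factual data.
TEMPLATES = [
    {"conditions": {"run", "lead"},
     "text": "{team} took the lead with a {points}-0 run."},
    {"conditions": {"run", "lead", "quarter"},
     "text": "In the {quarter} quarter, {team} scored {points} straight points to take the lead."},
    {"conditions": {"buzzer"},
     "text": "{player} hit a buzzer-beater to end the {quarter} quarter."},
]


def placeholders(text):
    """Return the set of field names used in a format string."""
    return {name for _, name, _, _ in Formatter().parse(text) if name}


def activated_templates(r_condition, templates):
    """Templates whose conditions contain every extracted trigger (RCondition)."""
    return [t for t in templates if set(r_condition) <= t["conditions"]]


def write_sentence(r_condition, facts, templates=TEMPLATES, rng=random):
    """Pick one activated template at random and fill it with the facts."""
    candidates = [
        t for t in activated_templates(r_condition, templates)
        # Added safeguard (assumption): skip templates that need missing facts.
        if placeholders(t["text"]) <= set(facts)
    ]
    if not candidates:
        return None  # no template is activated for this segment
    return rng.choice(candidates)["text"].format(**facts)
```

Passing a seeded `random.Random` instance as `rng` makes the random template choice reproducible, which is convenient when comparing generated reports.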
CN201910404548.5A 2019-05-15 2019-05-15 An automatic writing method of sports news Pending CN110516215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910404548.5A CN110516215A (en) 2019-05-15 2019-05-15 An automatic writing method of sports news

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910404548.5A CN110516215A (en) 2019-05-15 2019-05-15 An automatic writing method of sports news

Publications (1)

Publication Number Publication Date
CN110516215A true CN110516215A (en) 2019-11-29

Family

ID=68622470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910404548.5A Pending CN110516215A (en) 2019-05-15 2019-05-15 An automatic writing method of sports news

Country Status (1)

Country Link
CN (1) CN110516215A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765950A (en) * 2021-01-08 2021-05-07 首都师范大学 Template library generation method and system based on cosine similarity and storage medium
CN112765949A (en) * 2021-01-08 2021-05-07 首都师范大学 Method, system and storage medium for automatically generating event character live broadcast text
CN113411623A (en) * 2021-06-15 2021-09-17 首都师范大学 Automatic news generation method and system based on difference-time function algorithm and computer readable storage medium
CN113497949A (en) * 2021-06-15 2021-10-12 首都师范大学 Live broadcast method based on difference-time function algorithm, event live broadcast terminal, electronic equipment and computer readable storage medium
CN113641818A (en) * 2021-06-28 2021-11-12 中国消防救援学院 Method, apparatus and computer storage medium for classifying sentences based on Boolean weights
CN117633150A (en) * 2023-11-23 2024-03-01 北京奥邦菲特科技有限公司 Sports news construction method and system for sports match live text

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6976031B1 (en) * 1999-12-06 2005-12-13 Sportspilot, Inc. System and method for automatically generating a narrative report of an event, such as a sporting event
CN105912526A (en) * 2016-04-15 2016-08-31 北京大学 Sports game live broadcasting text based sports news automatic constructing method and device
CN106407343A (en) * 2016-09-06 2017-02-15 首都师范大学 Automatic generation method for NBA competition news

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
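Chen Yujing et al.: "Research on Automatic Writing of NBA Game News" (NBA赛事新闻的自动写作研究), Acta Scientiarum Naturalium Universitatis Pekinensis (《北京大学学报(自然科学版)》) *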

Similar Documents

Publication Publication Date Title
CN110516215A (en) An automatic writing method of sports news
CN105653706B (en) A kind of multilayer quotation based on literature content knowledge mapping recommends method
Bhattacharya et al. Overview of the FIRE 2019 AILA Track: Artificial Intelligence for Legal Assistance.
CN108491462B (en) Semantic query expansion method and device based on word2vec
CN107908650B (en) Knowledge train of thought method for auto constructing based on mass digital books
CN109657053B (en) Multi-text abstract generation method, device, server and storage medium
CN109165382B (en) Similar defect report recommendation method combining weighted word vector and potential semantic analysis
CN105718586A (en) Word division method and device
CN112000783B (en) Patent recommendation method, device, device and storage medium based on text similarity analysis
CN108268539A (en) Video matching system based on text analyzing
CN102081642A (en) Chinese label extraction method for clustering search results of search engine
CN104765769A (en) A Short Text Query Expansion and Retrieval Method Based on Word Vector
CN107943919B (en) A Query Expansion Method for Conversational Entity Search
CN110442726B (en) An online clustering method for short texts in social media based on entity constraints
CN115292469A (en) A Question Answering Method Combining Paragraph Search and Machine Reading Comprehension
CN113742446A (en) Knowledge graph question-answering method and system based on path sorting
CN109299357B (en) A method for topic classification of Lao texts
CN110516216A (en) A construction method of sports news automatic writing template library
CN106445921A (en) Chinese text term extracting method utilizing quadratic mutual information
CN103412878B (en) Document theme partitioning method based on domain knowledge map community structure
CN105224520A (en) A kind of Chinese patent documentation term automatic identifying method
CN114328823A (en) Database natural language query method and device, electronic device, storage medium
CN111291163B (en) Disease knowledge graph retrieval method based on symptom characteristics
CN117494724A (en) A semantic enhancement method by fusing medical terminology entity description information
CN109918579B (en) A location inference method for extracting location indicators based on semantic features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191129
