HK1249598B - Method, apparatus and system for monitoring internet media events based on industry knowledge mapping database - Google Patents
- Publication number: HK1249598B (HK18108044.9A)
- Authority: HK (Hong Kong)
- Prior art keywords: data, industry, entities, entity, event
Description
Technical Field
The present invention relates to the field of Internet media monitoring and, in particular, to a technique for constructing an industry knowledge graph database and a technique for monitoring Internet media events based on the constructed industry knowledge graph database.
Background Art
The rapid development of computer, communications, and network technologies has steadily improved the performance of terminal devices, including PCs, tablets, smartphones, and Internet-connected televisions. Accordingly, Internet media, and social media in particular, has gradually become one of the main channels through which the public obtains news and information, owing to its diversity, speed, interactivity, ease of reproduction, and multimedia nature.
However, while Internet media information is timely and can be accessed flexibly and conveniently, the openness of its sources and dissemination channels also creates a problem: without authorization or verification, sensitive information (for example, trade secrets) or even false information can be spread rapidly by large numbers of users on Internet media platforms and evolve into media events that harm the individuals, enterprises or institutions, industries, and even society concerned. It is therefore necessary to monitor media events in Internet media and, once a media event meeting certain conditions is detected, to take appropriate measures to reduce or eliminate its potential impact.
Existing Internet media monitoring technologies have the following shortcomings: 1) they rely on interest matching, in which users must define the content topics and related entities they care about, so monitoring can identify only events directly related to user-defined entities and cannot identify events involving entities that the user has not defined but that are indirectly related to the entities of interest; 2) the monitored objects are homogeneous, so monitoring covers only a single media category and data source (for example, a specific social media platform, news outlet, forum, or blog), a single data type (generally text), and a single language.
Summary of the Invention
One object of the present invention is to provide a technique for constructing an industry knowledge graph database, in which data relevant to a specific industry or field is extracted and stored in the knowledge graph database. The constructed industry knowledge graph database can be applied to Internet media monitoring to achieve automated, in-depth monitoring of relevant Internet media events.
Another object of the present invention is to provide a technique for monitoring Internet media events based on the constructed industry knowledge graph database, which can identify entities that are indirectly related to a specific media event and can monitor multiple types of Internet media data.
To achieve the above objects, the present invention provides the following technical solutions.
The present invention provides a method for constructing an industry knowledge graph database, comprising the following steps: acquiring industry data from a data source; processing the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships; and constructing the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the step of acquiring industry data is implemented by acquiring structured industry data from a third-party industry database, the structured industry data comprising multiple fields; the step of processing the industry data is implemented by performing data cleaning and extract-transform-load (ETL) processing on the structured industry data; and the step of constructing the industry knowledge graph database is implemented by generating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
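As a rough illustration of the cleaning and ETL step applied to structured records, the following minimal sketch trims whitespace, drops incomplete rows, removes duplicates, and emits (subject, relation, object) triples. The field names `company`, `executive`, and `title`, the sample data, and the function `etl` are assumptions made for illustration; the patent does not specify a schema or implementation.

```python
import csv
import io

# Hypothetical structured industry data with three fields (illustrative only).
RAW = """company,executive,title
 Company A ,Person X,CEO
Company A,Person X,CEO
Company B,,CFO
"""


def etl(raw_csv):
    """Clean the rows and transform them into (person, title, company) triples."""
    triples = set()  # a set removes duplicate records
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Cleaning: normalise whitespace in every field.
        row = {key: (value or "").strip() for key, value in row.items()}
        # Cleaning: skip rows with any missing field.
        if not all(row.values()):
            continue
        # Transform: one relationship triple per valid row.
        triples.add((row["executive"], row["title"], row["company"]))
    return sorted(triples)


triples = etl(RAW)
```

In this sketch the duplicate row collapses into one triple and the row with the empty `executive` field is discarded; loading the triples into a graph store would be the final step.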
Preferably, the step of acquiring industry data is implemented by using web crawler technology to acquire industry-related data from Internet data sources, the Internet data sources including unstructured or semi-structured data sources; the step of processing the industry data is implemented by applying information extraction techniques from natural language processing to perform entity recognition and relation extraction on the industry-related data, so as to extract the entities, entity attributes, and/or entity relationships; and the step of constructing the industry knowledge graph database is implemented by supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships. Further preferably, the above steps are performed periodically at a predetermined interval.
Preferably, the step of acquiring industry data is implemented by using an application programming interface (API) to query Internet data sources for industry-related data, the Internet data sources including open data sources; the step of processing the industry data is implemented by performing data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the step of constructing the industry knowledge graph database is implemented by supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships. Further preferably, the above steps are performed periodically at a predetermined interval.
Preferably, the step of acquiring industry data is implemented by using an application programming interface (API) or web crawler technology to acquire industry-related Internet media data from Internet data sources; the step of processing the industry data is implemented by performing event detection, event evaluation, and screening on the Internet media data to extract specific media events related to the industry, and identifying the corresponding directly related entities from the Internet media data; and the step of constructing the industry knowledge graph database is implemented by supplementing the industry knowledge graph database based on the specific media events and the corresponding directly related entities, wherein the specific media events are added to the industry knowledge graph database as abstract entities. Further preferably, in the step of processing the industry data, the directly related entities corresponding to the specific media events are identified in at least one of the following ways: identifying entities from text data by entity recognition in natural language processing; identifying entities from image or video data by image or video recognition; or identifying entities from audio or video data by speech recognition. Further preferably, the specific media events include negative events, emergencies, crisis events, mass incidents, public opinion events, or other events of industry significance. Further preferably, the above steps are performed continuously in real time.
Preferably, the step of constructing the industry knowledge graph database includes performing semantic disambiguation and entity linking on the extracted entities. Further preferably, the semantic disambiguation and entity linking are implemented in at least one of the following ways: based on entity knowledge, disambiguating and linking each extracted entity mention independently, one at a time; or, based on a topic coherence assumption, using the associations among candidate entities in the knowledge base to disambiguate and link the extracted entity mentions jointly.
The present invention also provides a method for monitoring specific industry-related media events based on the industry knowledge graph database constructed according to the present invention, comprising the following steps: acquiring Internet media data; performing event detection, event evaluation, and screening on the acquired Internet media data to obtain the specific industry-related media events; identifying the directly related entities corresponding to the specific media events; accessing the industry knowledge graph database based on the directly related entities to determine the indirectly related entities corresponding to the specific media events; and sending early warning messages to the directly related entities and/or the indirectly related entities.
Preferably, the event detection in the step of event detection, event evaluation, and screening includes the following steps: classifying the content of the acquired Internet media data by topic to obtain content on a specific topic; identifying the entities involved in the obtained content; performing sentiment analysis on the obtained content and the identified entities, and filtering the obtained content based on the results of the sentiment analysis; and performing event discovery on the filtered content to cluster media events and discover new media events. Further preferably, the event detection also includes the following step: analyzing the authenticity of each event based on the attributes of the media event, and ranking and/or filtering the media events according to the analysis results.
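The four detection stages described above (topic classification, entity identification, sentiment-based filtering, and clustering) can be sketched as a small pipeline. This is only a toy illustration: the keyword sets, the entity dictionary, the choice to keep negative content, and the choice to cluster posts by the set of entities they mention are all assumptions standing in for the trained classifiers and clustering algorithms a real system would use.

```python
from collections import defaultdict

# Hypothetical vocabularies standing in for trained models (assumptions).
TOPIC_KEYWORDS = {"recall", "lawsuit", "leak"}
KNOWN_ENTITIES = {"Company A", "Company B"}
NEGATIVE_WORDS = {"recall", "lawsuit", "leak", "fraud"}


def on_topic(text):
    """Stage 1: keep content that matches the monitored topic."""
    return bool(set(text.lower().split()) & TOPIC_KEYWORDS)


def find_entities(text):
    """Stage 2: identify known entities mentioned in the content."""
    return {name for name in KNOWN_ENTITIES if name in text}


def is_negative(text):
    """Stage 3: crude lexicon-based sentiment check."""
    return bool(set(text.lower().split()) & NEGATIVE_WORDS)


def detect_events(posts):
    """Stage 4: cluster the surviving posts by the entities they mention."""
    clusters = defaultdict(list)
    for text in posts:
        if not on_topic(text):
            continue
        entities = find_entities(text)
        if not entities or not is_negative(text):
            continue
        clusters[frozenset(entities)].append(text)
    return dict(clusters)


posts = [
    "Company A announces product recall",
    "Regulator confirms recall at Company A",
    "Company B opens a new office",
    "Weather is nice today",
]
events = detect_events(posts)
```

Here the two recall posts form a single candidate event keyed by `Company A`, while the off-topic posts are discarded at the first stage.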
Preferably, in the step of identifying the directly related entities corresponding to the specific media event, the directly related entities are identified in at least one of the following ways: identifying entities from text data by entity recognition in natural language processing; identifying entities from image or video data by image or video recognition; or identifying entities from audio or video data by speech recognition.
Preferably, the step of accessing the industry knowledge graph database is implemented by querying the industry knowledge graph database based on the directly related entities to determine the indirectly related entities.
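One plausible form of such a query is a bounded breadth-first traversal that starts from the directly related entities and collects everything reachable within a few relationship hops. The sketch below assumes an in-memory adjacency dictionary; the function name `indirectly_related`, the `max_hops` limit, and the sample graph are illustrative assumptions rather than details taken from the patent.

```python
from collections import deque


def indirectly_related(graph, direct, max_hops=2):
    """Return entities within max_hops of any direct entity, excluding the direct ones."""
    direct = set(direct)
    seen = set(direct)
    found = set()
    queue = deque((entity, 0) for entity in direct)
    while queue:
        entity, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(entity, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                found.add(neighbor)
                queue.append((neighbor, hops + 1))
    return found


# Hypothetical relationship graph: entity -> neighbouring entities.
graph = {
    "Company A": ["Person X", "Supplier B"],
    "Person X": ["Company A"],
    "Supplier B": ["Company A", "Company C"],
    "Company C": ["Supplier B", "Company D"],
    "Company D": ["Company C"],
}
related = indirectly_related(graph, ["Company A"])
```

With the default limit of two hops, an event naming only `Company A` also surfaces `Person X`, `Supplier B`, and `Company C`, while `Company D` (three hops away) is left out. In a graph database the same lookup would typically be expressed as a variable-length path query.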
Preferably, the step of accessing the industry knowledge graph database is implemented by applying data mining techniques to the industry knowledge graph database based on the directly related entities to determine the indirectly related entities.
The present invention also provides an apparatus for constructing an industry knowledge graph database, comprising: a data acquisition module for acquiring industry data from a data source; a data processing module for processing the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships; and a database construction module for constructing the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the data acquisition module acquires industry data by obtaining structured industry data from a third-party industry database, the structured industry data comprising multiple fields; the data processing module processes the data by performing data cleaning and extract-transform-load (ETL) processing on the structured industry data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the database construction module constructs the industry knowledge graph database by generating it based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the data acquisition module acquires industry data by using web crawler technology to obtain industry-related data from Internet data sources, the Internet data sources including unstructured or semi-structured data sources; the data processing module processes the data by applying information extraction techniques from natural language processing to perform entity recognition and relation extraction on the industry-related data, so as to extract the entities, entity attributes, and/or entity relationships; and the database construction module constructs the industry knowledge graph database by supplementing or updating it based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the data acquisition module acquires industry data by using an application programming interface (API) to query Internet data sources for industry-related data, the Internet data sources including open data sources; the data processing module processes the data by performing data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the database construction module constructs the industry knowledge graph database by supplementing or updating it based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the data acquisition module acquires industry data by using an application programming interface (API) or web crawler technology to acquire industry-related Internet media data from Internet data sources; the data processing module processes the data by performing event detection, event evaluation, and screening on the Internet media data to extract specific media events related to the industry, and identifying the corresponding directly related entities from the Internet media data; and the database construction module constructs the industry knowledge graph database by supplementing it based on the specific media events and the corresponding directly related entities, wherein the specific media events are added to the industry knowledge graph database as abstract entities.
Preferably, the database construction module further identifies the directly related entities corresponding to the specific media event in at least one of the following ways: identifying entities from text data by entity recognition in natural language processing; identifying entities from image or video data by image or video recognition; or identifying entities from audio or video data by speech recognition.
Preferably, the database construction module includes a module for performing semantic disambiguation and entity linking on the extracted entities. Further preferably, this module performs semantic disambiguation and entity linking in at least one of the following ways: based on entity knowledge, disambiguating and linking each extracted entity mention independently, one at a time; or, based on a topic coherence assumption, using the associations among candidate entities in the knowledge base to disambiguate and link the extracted entity mentions jointly.
Preferably, the specific media events include negative events, emergencies, crisis events, mass incidents, public opinion events, or other events of industry significance.
The present invention also provides a system for monitoring specific industry-related media events, comprising: a data acquisition unit for acquiring industry data from a data source; a data processing unit for processing the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships; a database construction unit for constructing the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships; a database storage unit for storing the constructed industry knowledge graph database; a media event monitoring unit for acquiring Internet media data, performing event detection, event evaluation, and screening on the acquired Internet media data to obtain the specific industry-related media events, and identifying the directly related entities corresponding to the specific media events; a database access unit for accessing the industry knowledge graph database based on the directly related entities to determine the indirectly related entities corresponding to the specific media events; and a message sending unit for sending early warning messages to the directly related entities and/or the indirectly related entities.
Preferably, the data acquisition unit includes a structured data acquisition unit for obtaining structured industry data from a third-party industry database, the structured industry data comprising multiple fields; the data processing unit includes a structured data processing unit for performing data cleaning and extract-transform-load (ETL) processing on the structured industry data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the database construction unit includes a database generation unit for generating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the data acquisition unit includes an industry-related data acquisition unit for using web crawler technology to obtain industry-related data from Internet data sources, the Internet data sources including unstructured or semi-structured data sources; the data processing unit includes an industry-related data processing unit for applying information extraction techniques from natural language processing to perform entity recognition and relation extraction on the industry-related data, so as to extract the entities, entity attributes, and/or entity relationships; and the database construction unit includes a database supplement/update unit for supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the data acquisition unit includes an industry-related data acquisition unit for using an application programming interface (API) to query Internet data sources for industry-related data, the Internet data sources including open data sources; the data processing unit includes an industry-related data processing unit for performing data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the database construction unit includes a database supplement/update unit for supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the data acquisition unit includes a media data acquisition unit for using an application programming interface (API) or web crawler technology to acquire industry-related Internet media data from Internet data sources; the data processing unit includes a media data processing unit for performing event detection, event evaluation, and screening on the Internet media data to extract specific media events related to the industry, and for identifying the corresponding directly related entities from the Internet media data; and the database construction unit includes a database supplement/update unit for supplementing the industry knowledge graph database based on the specific media events and the corresponding directly related entities, wherein the specific media events are added to the industry knowledge graph database as abstract entities.
Preferably, the database supplement/update unit is further configured to perform semantic disambiguation and entity linking on the extracted entities.
Preferably, the media event monitoring unit is further configured to: classify the content of the acquired Internet media data by topic to obtain content on a specific topic; identify the entities involved in the obtained content; perform sentiment analysis on the obtained content and the identified entities, and filter the obtained content based on the results of the sentiment analysis; and perform event discovery on the filtered content to cluster media events and discover new media events. Further preferably, the media event monitoring unit is further configured to analyze the authenticity of each event based on the attributes of the media event, and to rank and/or filter the media events according to the analysis results.
Preferably, the database access unit is further configured to query the industry knowledge graph database based on the directly related entities to determine the indirectly related entities.
Preferably, the database access unit is further configured to apply data mining techniques to the industry knowledge graph database based on the directly related entities to determine the indirectly related entities.
Preferably, the specific media events include negative events, emergencies, crisis events, mass incidents, public opinion events, or other events of industry significance.
Implementing the technical solutions provided by the present invention achieves the following technical effects: 1) for one or more target fields or industries, relevant Internet media events are monitored automatically and in depth, and entities indirectly related to a specific media event can be identified; 2) during monitoring, Internet media data from multiple data sources, of multiple data types, and in multiple languages is processed automatically.
Brief Description of the Drawings
FIG. 1 is an exemplary flowchart of a method for constructing an industry knowledge graph database provided by the present invention;
FIG. 2 shows exemplary structured industry data provided by the present invention;
FIG. 3 is an exemplary flowchart of a method for monitoring media events provided by the present invention;
FIG. 4 is an exemplary flowchart of another method for constructing an industry knowledge graph database provided by the present invention;
FIG. 5 is an exemplary flowchart of another method for constructing an industry knowledge graph database provided by the present invention;
FIG. 6 is an exemplary block diagram of a system for monitoring media events provided by the present invention.
Detailed Description
Specific implementations of the present invention are described below by way of embodiments, with reference to the accompanying drawings, so that those skilled in the art can understand its objects, technical solutions, and advantages. Those skilled in the art will appreciate that the embodiments are merely exemplary and that the inventive concept can be practiced without these specific details.
To achieve its objects, the present invention provides a technique for constructing an industry knowledge graph database and a technique for monitoring Internet media events based on the constructed industry knowledge graph database.
The present invention involves the application of knowledge graph database technology. A knowledge graph database is a special kind of database used for knowledge management that facilitates collecting, organizing, and extracting knowledge in a given field. A knowledge graph database defines entities, entity attributes, and entity relationships. Entities correspond to things in the real world (for example, a company A or a person X), and each entity can be identified by a globally unique ID. Entity attributes describe the intrinsic characteristics of an entity (for example, the Chinese and English names of company A or person X). Entity relationships connect entities and describe the links between them (for example, the employment relationship between person X and company A). By constructing a knowledge graph database, knowledge composed of entities, entity attributes, and entity relationships can be used more efficiently and more deeply to discover complex connections between things.
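The three building blocks described above can be modelled with two small record types. This is a minimal sketch for illustration only: the class names `Entity` and `Relation`, the attribute keys, and the `works_at` relation type are assumptions, and UUIDs are just one convenient way to obtain globally unique IDs.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Entity:
    """A real-world thing with attributes and a globally unique ID."""
    attributes: dict[str, str] = field(default_factory=dict)
    entity_id: str = field(default_factory=lambda: uuid4().hex)


@dataclass(frozen=True)
class Relation:
    """A directed, typed link between two entities, referenced by ID."""
    source_id: str
    relation_type: str
    target_id: str


# Company A carries both a Chinese and an English name as attributes.
company_a = Entity({"name_en": "Company A", "name_zh": "公司A"})
person_x = Entity({"name_en": "Person X"})

# The employment relationship between person X and company A.
employment = Relation(person_x.entity_id, "works_at", company_a.entity_id)
```

Referencing entities by ID rather than by name keeps relationships stable when an entity has several names or when two entities share one.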
As a database, a knowledge graph database can be stored in various forms. For example, it can use a traditional relational database that stores data as Semantic Web RDF (Resource Description Framework) triples, or it can use a newer non-relational database. Preferably, the knowledge graph database is stored in a graph database, such as Neo4j, OrientDB, Titan-BerkeleyDB, or HyperGraphDB.
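As a rough illustration of the relational option, the sketch below stores subject-predicate-object triples in a single table using Python's built-in `sqlite3` module. The table layout, identifiers, and predicate names are assumptions for illustration; a production RDF store or a graph database would look different.

```python
import sqlite3

# In-memory database holding one table of (subject, predicate, object) triples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triples ("
    " subject TEXT, predicate TEXT, object TEXT,"
    " PRIMARY KEY (subject, predicate, object))"
)
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("company:A", "name_en", "Company A"),           # entity attribute
        ("person:X", "name_en", "Person X"),             # entity attribute
        ("person:X", "holds_position_at", "company:A"),  # entity relationship
    ],
)

# Query: who holds a position at company A?
rows = conn.execute(
    "SELECT subject FROM triples WHERE predicate = ? AND object = ?",
    ("holds_position_at", "company:A"),
).fetchall()
```

Attributes and relationships share one table here; the only difference is whether the object column holds a literal value or another entity's ID.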
Depending on the scale and purpose of the knowledge graph database, the data used to build it can come from a variety of sources. For example, a data source can be an open encyclopedia (for example, Baidu Baike or Wikipedia), a structured database (for example, Wikidata, DBpedia, vertical websites, or professional databases for specific industries), or any relevant third-party semi-structured or unstructured data source (for example, professional websites and content published in Internet media, including news, company annual reports, and corporate announcements).
Those skilled in the art should understand that the knowledge graph database constructed in the present invention is oriented toward specific fields or industries during construction, but is not limited to a single industry. The constructed database integrates and connects, into one knowledge graph, the entities and events related to one or more industries, the attributes of those entities and events, and the relationships between entities, between entities and events, and between events.
FIG. 1 is an exemplary flowchart of a method for constructing an industry knowledge graph database provided by the present invention. The method may include steps S11 to S15.
In step S11, industry data is obtained from an industry data source, and entities and the corresponding entity attributes and entity relationships are extracted from the industry data to generate the industry knowledge graph database.
An industry data source is a source of basic data for one or more specific fields or industries that are the targets of monitoring. In one embodiment, the industry data source can be a structured industry database, so that the basic industry data obtained is of the highest possible quality. A structured database can be accessed through an application programming interface (API), and data can be obtained by querying it (for example, with query commands).
Through extract-transform-load (ETL) processing, the obtained industry data can be transformed, and entities, entity attributes, and entity relationships can then be extracted from the transformed data and loaded into the industry knowledge graph database proposed by the present invention. The specific steps of the ETL operation can be implemented with existing data integration techniques. For example, in an ontology-based data integration method, the mapping between the fields of the different databases and the various kinds of entity information is defined in advance, so that entities, entity attributes, and entity relationships are extracted according to the fields and their contents, which completes the construction of the basic industry knowledge graph database. In addition, because industry databases differ in structure and may contain noise, missing data, or erroneous data, the industry data may also need to be cleaned during processing. Data cleaning can be implemented with techniques known in the art, in combination with the ETL processing.
As an example, FIG. 2 shows exemplary structured industry data which, as described above, can be obtained from a structured industry database. In FIG. 2, Table 1 is an example of structured data on listed companies. It contains two data entries, Company A and Company B, and each entry contains multiple fields, such as the company's Chinese and English names, registered address, stock code, and chairman of the board. By performing ETL operations on this structured data, the entities (Company A, Company B, Person X, and Person Y), the entity attributes (the specific information about Company A and Company B), and the entity relationships (the employment relationships between Company A and Person X and between Company B and Person Y) can be extracted, thereby generating a knowledge graph database for the industry concerned.
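The field-mapping idea behind this example can be sketched as follows. The rows mimic the shape of Table 1, but every value, field name, ID prefix, and the relation label `chairman_of` are placeholders assumed for illustration, since the actual table contents are not reproduced in the text.

```python
# Hypothetical rows shaped like Table 1; all values are placeholders.
rows = [
    {"name_en": "Company A", "address": "Address 1", "stock_code": "0001", "chairman": "Person X"},
    {"name_en": "Company B", "address": "Address 2", "stock_code": "0002", "chairman": "Person Y"},
]

# Predefined mapping: which columns become attributes, which become relations.
ATTRIBUTE_FIELDS = ("name_en", "address", "stock_code")
RELATION_FIELDS = {"chairman": "chairman_of"}  # column -> relation type


def etl(rows):
    """Turn flat records into entities (with attributes) and relation triples."""
    entities, relations = {}, set()
    for row in rows:
        company_id = "company:" + row["name_en"]
        entities[company_id] = {f: row[f] for f in ATTRIBUTE_FIELDS if row.get(f)}
        for column, relation_type in RELATION_FIELDS.items():
            person_name = row.get(column)
            if not person_name:
                continue  # simple cleaning step: skip missing values
            person_id = "person:" + person_name
            entities.setdefault(person_id, {"name_en": person_name})
            relations.add((person_id, relation_type, company_id))
    return entities, relations


entities, relations = etl(rows)
```

Two input rows yield four entities (two companies, two persons) and two relationships, mirroring the extraction described for Table 1.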
In another embodiment, the industry data source can also be a semi-structured or unstructured data source on the Internet. In that case, industry data can be crawled from the data source with web crawler technology, and entities, entity attributes, and entity relationships can be extracted with information extraction operations based on natural language processing.
In step S12, data related to the industry is obtained from an Internet data source, and entities related to the industry and the corresponding entity attributes and entity relationships are extracted from that data.
In this step, data related to the specific field or industry mentioned above is first obtained from Internet data sources. An Internet data source can be structured, semi-structured, or unstructured, so industry-related data can be obtained in different ways depending on the structural characteristics of the source. Entities and the corresponding entity attributes and entity relationships are then extracted from the industry-related data.
For a structured Internet data source, the corresponding data content can be queried through an API to obtain entities, entity attributes, and entity relationships. For a semi-structured data source, the data content can be crawled and then analyzed with the information extraction operations of natural language processing, in order to extract industry-related entities, entity attributes, and entity relationships. A semi-structured data source is one that contains partly structured and partly unstructured data, so the corresponding parts can be processed with the methods for structured data and unstructured data, respectively. For example, HTML and XML files are the most common semi-structured data. When processing HTML and XML files, the tag-based structured information they contain can be used on the one hand, and information extraction combined with machine learning can be used on the other hand to extract the required information.
In one embodiment, the information extraction operation includes an entity recognition operation and a relation extraction operation.
The entity recognition operation can use existing natural language processing tools (for example, part-of-speech tagging or named entity recognition tools), or an entity recognition model can be trained on specific annotated data with machine learning methods. Note that some natural language processing tasks and tools are language-dependent (for example, Chinese data requires word segmentation, whereas English data does not). Machine learning methods represent data of different languages and formats numerically and then train models with general, language-independent algorithms (for example, conditional random fields and hidden Markov models).
The relation extraction operation can be implemented with a variety of existing statistical learning or machine learning methods. For example, a template learning method can be used: entities in the knowledge graph database that satisfy a given relationship serve as instances; the sentence patterns and contexts in which these existing instances occur are extracted from a large amount of text and counted to form relation extraction templates; and the resulting templates are then applied to text data to extract new instances. If an extracted instance does not yet exist in the knowledge graph database, it can be added to the database.
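A toy version of the template learning idea is sketched below. It treats the text between the two arguments of a known instance as the "template", which is a deliberate simplification; the seed pair, corpus sentences, and function names are assumptions made for illustration, and a real system would work on parsed, entity-tagged text rather than raw regular expressions.

```python
import re
from collections import Counter

seeds = {("Person X", "Company A")}  # known instances of one relation
corpus = [
    "Person X is the chairman of Company A.",
    "Person Z is the chairman of Company B.",
    "Company B opened a new office.",
]


def learn_templates(seeds, corpus):
    """Count the connecting text found between the two arguments of each seed."""
    counts = Counter()
    for subj, obj in seeds:
        for sentence in corpus:
            match = re.search(re.escape(subj) + r"(.+?)" + re.escape(obj), sentence)
            if match:
                counts[match.group(1)] += 1
    return counts


def apply_templates(templates, corpus):
    """Extract (subject, object) pairs around each learned connecting phrase."""
    found = set()
    for middle in templates:
        pattern = re.compile(r"^(.+?)" + re.escape(middle) + r"(.+?)\.?$")
        for sentence in corpus:
            match = pattern.match(sentence)
            if match:
                found.add((match.group(1), match.group(2)))
    return found


templates = learn_templates(seeds, corpus)
new_instances = apply_templates(templates, corpus) - seeds
```

The pairs left in `new_instances` are candidates that are not yet in the knowledge graph and could be added to it after validation.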
In step S13, the industry knowledge graph database is supplemented or updated based on the industry-related entities and the corresponding entity attributes and entity relationships.
After the industry-related entities and the corresponding entity attributes and entity relationships have been extracted, they can be associated and compared with the corresponding information in the knowledge graph database. New entities, entity attributes, and entity relationships can then be added to the database as needed, and existing entity attributes and entity relationships can be updated.
As described above, the industry knowledge graph database proposed by the present invention can use a traditional relational database, an RDF triple store, or a newer non-relational database (for example, a graph database). Accordingly, the specific operations for supplementing or updating the knowledge graph database can be implemented in a customized way with a database query language, such as SQL for relational databases, the RDF query language SPARQL, or Cypher for the Neo4j graph database.
The description continues with the example in FIG. 2. Suppose the structured data on listed-company executives in Table 2 is obtained from a structured Internet data source through API queries. The industry knowledge graph database can then be supplemented and updated as follows: 1) Person Z, the entity attributes of Person Z, and the employment relationship between Person Z and Company B are added to the knowledge graph database; 2) the entity attributes of Person X and Person Y are supplemented; and 3) the employment relationship between Person Y and Company B is updated (from "currently employed" to "formerly employed").
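The three kinds of change listed above can be sketched as a single "upsert" pass over the incoming records. The record layout, the `position_at` relation label, the status values, and the `title` attribute are all hypothetical stand-ins for the contents of Table 2, which are not reproduced in the text.

```python
entities = {
    "company:A": {}, "company:B": {},
    "person:X": {}, "person:Y": {},
}
# (person, relation type, company) -> status of the relationship
relations = {
    ("person:X", "position_at", "company:A"): "current",
    ("person:Y", "position_at", "company:B"): "current",
}

# Hypothetical records shaped like Table 2; attribute values are placeholders.
incoming = [
    {"person": "person:X", "company": "company:A", "status": "current", "attrs": {"title": "Chairman"}},
    {"person": "person:Y", "company": "company:B", "status": "former", "attrs": {"title": "Chairman"}},
    {"person": "person:Z", "company": "company:B", "status": "current", "attrs": {"title": "Chairman"}},
]


def upsert(entities, relations, records):
    """Add new entities, merge attributes, and update changed relationships."""
    added, updated = [], []
    for rec in records:
        if rec["person"] not in entities:
            added.append(rec["person"])                # new entity
        entities.setdefault(rec["person"], {}).update(rec["attrs"])
        key = (rec["person"], "position_at", rec["company"])
        if key in relations and relations[key] != rec["status"]:
            updated.append(key)                        # existing relation changed
        relations[key] = rec["status"]
    return added, updated


added, updated = upsert(entities, relations, incoming)
```

Returning the lists of added entities and updated relationships gives a simple change log that can be reviewed before or after the database is modified.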
In one embodiment, entity linking and semantic disambiguation operations are required when supplementing or updating the industry knowledge graph database.
Entity linking aims to map an entity mention that appears in the data content to the relevant entity concept in the knowledge graph database. For example, in the sentences "Jobs is one of the founders of Apple" and "Steve Jobs founded NeXT in the United States in 1985", the entity mentions "Jobs" and "Steve Jobs" should both map to the same person entity concept in the knowledge graph database, "Steve Jobs (ex-CEO of Apple)", so entity linking is needed to associate the two mentions with the same entity. Semantic disambiguation aims to resolve ambiguous entity mentions. For example, the mention "Apple" can correspond to several candidate entities, such as "apple (fruit)", "Apple Inc.", and "Apple (film)", whereas the "Apple" in the first sentence above should map to the company entity concept "Apple Inc." in the knowledge graph database rather than to "apple (fruit)" or "Apple (film)". Entity linking and semantic disambiguation are usually performed together. Because semantic disambiguation is the means of entity linking and entity linking is the goal of semantic disambiguation, the two terms are often used interchangeably.
Any existing entity linking and semantic disambiguation technique can be used in the present invention. For example, one class of methods disambiguates and links entity mentions one by one, independently, based on entity knowledge. Entity knowledge includes, but is not limited to, the prior probability of the entity, the distribution of the entity's names (full name, aliases, abbreviations, and so on), the entity's context (such as word co-occurrence information and word distributions), and the entity's category in the knowledge base (such as company, person, or location). Probability-based methods (such as linear regression or logistic regression) or machine learning methods (such as support vector machines or random forests) can be used to learn and train semantic disambiguation and entity linking models based on entity knowledge. Another class of methods relies on the assumption of topical coherence (that is, the entities in an article are usually related to the topic of the text, so these entities are also semantically related to one another). These methods use the associations, in a knowledge base (such as Wikipedia or the knowledge graph constructed by the present invention), between the candidate entities of all entity mentions in the text to disambiguate and link all mentions in an article jointly. Such methods typically use collective inference over a graph data structure: the candidate entities of all mentions in the article are assembled into a candidate entity graph according to their relationships in the knowledge base, and the density of the graph reflects the degree of semantic association between the candidate entity nodes. Entity linking then consists of propagating evidence (the possible degree of association between different entities) iteratively along the dependency structure of the candidate entity graph, so that the evidence is reinforced collectively, until convergence. The two classes of methods can also be combined flexibly to improve disambiguation and linking performance.
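The first class of methods (independent, mention-by-mention linking) can be sketched with a very small scoring function that adds a prior to the overlap between the mention's sentence and each candidate's typical context words. The priors, context word lists, and the additive score are assumptions made up for illustration; they stand in for the trained models mentioned above and are not the patent's method.

```python
import re

# Hypothetical candidate entities for the mention "Apple".
candidates = {
    "Apple Inc.": {"prior": 0.5, "context": {"founder", "jobs", "company", "iphone"}},
    "apple (fruit)": {"prior": 0.4, "context": {"fruit", "tree", "eat", "juice"}},
    "Apple (film)": {"prior": 0.1, "context": {"film", "director", "cinema"}},
}


def link(sentence, candidates):
    """Pick the candidate with the highest prior plus context-word overlap."""
    words = set(re.findall(r"\w+", sentence.lower()))

    def score(name):
        candidate = candidates[name]
        return candidate["prior"] + len(words & candidate["context"])

    return max(candidates, key=score)


company = link("Jobs is one of the founders of Apple", candidates)
fruit = link("She likes to eat an apple from the tree", candidates)
```

In the first sentence the word "jobs" tips the score toward the company; in the second, "eat" and "tree" favor the fruit. A collective method would additionally score how well the chosen candidates for all mentions in a document fit together.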
In step S14, Internet media data related to the industry is obtained from an Internet data source, and specific media events related to the industry, together with the corresponding directly related entities, are extracted from the Internet media data.
Internet media data can be obtained from Internet data sources in many ways. For example, some social media sites (such as Sina Weibo, Facebook, and Twitter) provide open APIs for accessing their data. Web crawling and content extraction techniques can also be used to collect data from news websites or industry media websites.
Various technical implementations already exist in the art for monitoring Internet media to obtain specific media events. For example, in one implementation, Internet media data is first examined to discover the content of media events in a specific field or industry of interest and the entities involved in those events. The newly discovered media events are then evaluated against different indicators (for example, negativity, significance, suddenness, speed and scope of dissemination, and credibility) to select the media events that meet the requirements.
Different processing techniques can be used to identify the directly related entities of a media event, depending on the type of Internet media data. For example, entity recognition based on natural language processing can identify entities in text data, image or video recognition can identify entities in image or video data, and speech recognition can identify entities in audio or video data. Those skilled in the art will understand that the present invention places no restriction on the media type or language of the Internet media data.
In step S15, the industry knowledge graph database is supplemented based on the specific media event and the corresponding directly related entities, the specific media event being added to the industry knowledge graph database as an abstract entity.
After a specific industry-related media event and its directly related entities have been obtained (for example, a corruption scandal involving the chairman of a listed company, together with the companies, people, and places involved), the event is added to the industry knowledge graph database as an abstract entity. At the same time, entity linking and semantic disambiguation are performed on the directly related entities involved in the event; that is, the corresponding entities in the industry knowledge graph database are found and associated with the abstract entity that represents the event. If an entity involved in the event does not exist in the industry knowledge graph database, it can be added in the manner described for step S13 above. Once the industry knowledge graph database has been supplemented, the other, indirectly related entities of the abstract entity representing the media event can be found, based on the relationships in the knowledge graph database between the event's directly related entities and other entities.
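A minimal sketch of this step follows: the event becomes a node of its own, is connected to its directly related entities, and the indirectly related entities are then read off the graph. The edge triples, the `involves` relation label, and the one-hop definition of "indirectly related" are assumptions chosen for illustration.

```python
# Existing relationships in the (hypothetical) industry knowledge graph.
edges = {
    ("person:X", "chairman_of", "company:A"),
    ("company:A", "supplies", "company:C"),
}


def add_event(edges, event_id, direct_entities):
    """Add the event as an abstract entity linked to its directly related entities."""
    for entity_id in direct_entities:
        edges.add((event_id, "involves", entity_id))


def indirect_entities(edges, event_id):
    """Entities one hop away from the directly related entities of the event."""
    direct = {t for s, r, t in edges if s == event_id and r == "involves"}
    related = set()
    for s, _, t in edges:
        if event_id in (s, t):
            continue  # skip the event's own edges
        if s in direct:
            related.add(t)
        if t in direct:
            related.add(s)
    return related - direct


add_event(edges, "event:scandal", {"person:X", "company:A"})
indirect = indirect_entities(edges, "event:scandal")
```

Here Company C is not mentioned in the event itself, but it is reachable through its supply relationship with Company A, so it surfaces as an indirectly related entity.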
Once the industry knowledge graph database has been constructed in this way, Internet media events can be monitored automatically and in depth on the basis of the information it contains. Preferably, after the initial construction of the industry knowledge graph database, the database is also kept up to date to maintain the completeness and validity of its information. For example, steps S12 and S13 can be executed periodically at a predetermined interval, and steps S14 and S15 can be executed continuously in real time.
In addition, those skilled in the art will understand that the content of the various data involved in the present invention, such as industry data, industry-related data, and Internet media data, can be in multiple languages and of multiple types (for example, text, images, video, and speech). The present invention places no restriction on this.
FIG. 3 is an exemplary flowchart of a method for monitoring media events provided by the present invention. The method can monitor specific industry-related media events based on the industry knowledge graph database constructed in the present invention, and may include steps S31 to S35.
In step S31, Internet media data is acquired.
As described above, Internet media data can be obtained from Internet data sources in many ways. For example, some social media sites (such as Sina Weibo, Facebook, and Twitter) provide open APIs for accessing their data. Web crawling and content extraction techniques can also be used to collect data from news websites or industry media websites.
In step S32, event detection, event evaluation, and screening are performed on the acquired Internet media data to obtain the specific media events related to the industry.
As described above, various technical implementations already exist in the art for monitoring Internet media to obtain specific media events. For example, in one implementation, Internet media data is first examined to discover the content of media events in a specific field or industry of interest and the entities involved in those events. The newly discovered media events are then evaluated against different indicators (for example, negativity, significance, suddenness, speed and scope of dissemination, and credibility) to select the media events that meet the requirements.
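One simple way to picture the evaluation and screening stage is a weighted score over the indicators, followed by a threshold. The text does not specify how the indicators are combined, so the weights, the 0 to 1 indicator scale, the threshold, and the sample events below are all assumptions for illustration only.

```python
# Hypothetical weights for indicators normalized to the range 0..1.
WEIGHTS = {
    "negativity": 0.3,
    "significance": 0.3,
    "suddenness": 0.1,
    "spread": 0.2,
    "credibility": 0.1,
}


def event_score(metrics):
    """Weighted sum of the indicators; missing indicators count as 0."""
    return sum(weight * metrics.get(name, 0.0) for name, weight in WEIGHTS.items())


def select_events(events, threshold=0.6):
    """Keep the IDs of events whose score reaches the threshold."""
    return [e["id"] for e in events if event_score(e["metrics"]) >= threshold]


events = [
    {"id": "e1", "metrics": {name: 0.9 for name in WEIGHTS}},
    {"id": "e2", "metrics": {name: 0.2 for name in WEIGHTS}},
]
selected = select_events(events)
```

In practice the weights and threshold would be tuned to the monitoring requirements of each user or industry, or replaced by a trained ranking model.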
Specifically, in one embodiment, the technical steps involved in event detection may include topic classification, entity recognition, sentiment analysis, and event discovery.
In the topic classification step, the content of the acquired Internet media data is classified by topic to obtain content on specific topics. The purpose of topic classification is to select, from the acquired content, the texts that belong to a topic of interest or to a category relevant to customer needs. Topic classification is a text mining technique: a classification model is typically trained on annotated data with machine learning or deep learning methods and then applied to a text to determine its topic category. Any existing classification model (for example, naive Bayes, decision trees, support vector machines, or artificial neural networks) can be used in the present invention.
In the entity recognition step, the entities involved are identified in the obtained content. The purpose of entity recognition is to find the entities mentioned in an article for further analysis. For example, entity recognition can include extracting entities from text with the information extraction techniques of natural language processing, identifying entities in images (including video) with image recognition, and identifying entities in speech with speech recognition. The entities identified from text, images, and speech can also be merged.
In the sentiment analysis step, sentiment analysis is performed on the obtained content and the identified entities, and the content is filtered based on the results. Sentiment analysis determines the sentiment polarity expressed by the content as a whole and toward individual entities, in order to find the content that meets the monitoring conditions. Existing techniques generally implement sentiment analysis as text classification (for example, classifying sentiment as positive, neutral, or negative) or as regression (for example, expressing sentiment as a score between -5 and +5). To determine the sentiment toward a particular entity, the entity's context in the text can be used, or a dependency parser can be used to find the parts of the text related to that entity, so that sentiment analysis can be performed for that entity.
In the event discovery step, event discovery is performed on the filtered content to cluster media events and discover new ones. The purpose of event discovery is to extract event information (for example, the time and place of an event) from different texts, cluster and merge the related information into abstract "events", identify newly emerging events by comparison with existing events, and cluster events according to the similarity or relevance of their content.
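The clustering part of event discovery can be sketched with a single-pass procedure: each incoming text joins the most similar existing cluster if the similarity is high enough, and otherwise starts a new cluster (a newly discovered event). The use of Jaccard similarity over word sets, the threshold of 0.5, and the sample texts are assumptions chosen for brevity; a real system would also compare extracted times, places, and entities.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts, threshold=0.5):
    """Greedy single-pass clustering; returns lists of text indices per event."""
    clusters = []  # each cluster: {"tokens": set of words, "docs": [indices]}
    for index, text in enumerate(texts):
        tokens = set(text.lower().split())
        best = max(clusters, key=lambda c: jaccard(tokens, c["tokens"]), default=None)
        if best is not None and jaccard(tokens, best["tokens"]) >= threshold:
            best["docs"].append(index)     # same event as an existing cluster
            best["tokens"] |= tokens
        else:
            clusters.append({"tokens": tokens, "docs": [index]})  # new event
    return [c["docs"] for c in clusters]


texts = [
    "company a chairman corruption probe",
    "chairman of company a faces corruption probe",
    "company b launches new product",
]
event_clusters = cluster(texts)
```

The first two texts share most of their words and are merged into one event, while the third starts a separate one.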
In one embodiment, optionally, the authenticity of an event can also be analyzed during event detection based on the attributes of the media event (for example, the time and place of the event, and the publisher of the media event and the publisher's attributes), and media events can be ranked and/or filtered according to the results of the analysis.
Those skilled in the art will understand that the implementations listed for the operations in the above steps are merely exemplary, and that other existing approaches in the art can also implement these operations. The present invention places no restriction on the specific way in which the above operations are implemented.
In step S33, the directly related entities corresponding to the specific media event are identified.
In one embodiment, the directly related entities of each media event can be obtained through the entity recognition and event discovery operations of event detection. At the same time, as described above, entity linking and semantic disambiguation can be used to associate each directly related entity with the corresponding entity concept in the industry knowledge graph database, or to add it to the industry knowledge graph database.
In step S34, the industry knowledge graph database is accessed, based on the directly related entities, to determine the indirectly related entities corresponding to the specific media event.
In one embodiment, the industry knowledge graph database can be queried directly, using various preset conditions, for other entities that are indirectly related to the event through its directly related entities. For example, the preset conditions can be: 1) entities connected to a directly related entity within N hops (N can be 1, 2, 3, and so on); 2) other entities whose degree of association with a directly related entity satisfies some condition (such as exceeding a specified threshold); 3) entities that have a specific relationship with a directly related entity (for example, a supply relationship or an investment relationship); and 4) entities with a specific attribute (for example, belonging to a specified industry, being located in a certain place, or holding a certain position). These preset conditions can be used individually or in any combination.
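Conditions 1) and 3) above can be sketched as a breadth-first search that stops after N hops and optionally follows only certain relation types. The edge list, relation labels, and function names are assumptions for illustration; in a graph database the same query would typically be written in its query language.

```python
from collections import deque

# Hypothetical relationships in the industry knowledge graph.
edges = [
    ("person:X", "chairman_of", "company:A"),
    ("company:A", "supplies", "company:C"),
    ("company:C", "invests_in", "company:D"),
    ("company:E", "invests_in", "company:A"),
]


def build_adjacency(edges):
    """Undirected adjacency list: node -> [(relation type, neighbor), ...]."""
    adjacency = {}
    for source, relation, target in edges:
        adjacency.setdefault(source, []).append((relation, target))
        adjacency.setdefault(target, []).append((relation, source))
    return adjacency


def within_n_hops(edges, direct, max_hops, relation_types=None):
    """Map each indirectly related entity to its hop distance (<= max_hops)."""
    adjacency = build_adjacency(edges)
    distance = {entity: 0 for entity in direct}
    queue = deque(direct)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for relation, neighbor in adjacency.get(node, []):
            if relation_types is not None and relation not in relation_types:
                continue  # condition 3): only follow selected relation types
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {n: d for n, d in distance.items() if n not in direct}


two_hops = within_n_hops(edges, {"company:A"}, max_hops=2)
suppliers_only = within_n_hops(edges, {"company:A"}, 2, relation_types={"supplies"})
```

Attribute-based conditions such as 4) could be applied as a final filter over the returned entities.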
In another embodiment, data mining methods can be used to mine the indirectly related entities of an event from the industry knowledge graph database, using multiple conditions. For example, a specific implementation can use link prediction for graph data; that is, the problem of detecting the indirectly related entities of an event is formulated as the technical problem of "predicting whether an edge exists between the node representing the event in the industry knowledge graph database and entity nodes other than the directly related entity nodes". The conditions that can be used for link prediction include, but are not limited to, the characteristics of the event itself (for example, its type, time and place attributes, and negativity), the relationships between the event and historical events (including the type and strength of the relationship), the relationships between the event's directly related entities and other entities (including the type and strength of the relationship), entity types and attributes, and any other knowledge that can be mined from the knowledge graph database. This enables a comprehensive judgment of the indirectly related entities of a specific media event.
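As a small illustration of the link prediction formulation, the sketch below scores candidate edges between an event node and entities that are not yet connected to it, using the Adamic-Adar index, a standard neighborhood-based heuristic. This is a swapped-in textbook technique that uses only graph structure, not the feature-rich prediction described above; the graph, node IDs, and function names are assumptions for illustration.

```python
import math

# Hypothetical undirected edges, including the event node and its direct entities.
edges = [
    ("event:E1", "person:X"),
    ("event:E1", "company:A"),
    ("person:X", "company:A"),
    ("company:A", "company:C"),
    ("person:X", "company:C"),
    ("company:C", "company:D"),
]


def neighbor_sets(edges):
    """Map each node to the set of its neighbors."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def adamic_adar(neighbors, u, v):
    """Sum of 1/log(degree) over the common neighbors of u and v."""
    return sum(
        1.0 / math.log(len(neighbors[w]))
        for w in neighbors[u] & neighbors[v]
        if len(neighbors[w]) > 1
    )


def predict_links(edges, event, top_k=3):
    """Rank entities not yet linked to the event by their Adamic-Adar score."""
    neighbors = neighbor_sets(edges)
    candidates = [n for n in neighbors if n != event and n not in neighbors[event]]
    scored = [(adamic_adar(neighbors, event, n), n) for n in candidates]
    return sorted((item for item in scored if item[0] > 0), reverse=True)[:top_k]


ranked = predict_links(edges, "event:E1")
predicted = [node for _, node in ranked]
```

Company C shares two neighbors with the event node and is ranked as a likely indirectly related entity, whereas Company D shares none and is not returned.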
在步骤S35中，向所述直接相关实体和/或所述非直接相关实体发送预警消息。In step S35, a warning message is sent to the directly related entity and/or the non-directly related entity.
在识别出与特定媒体事件对应的直接和非直接相关实体后,可以利用多种途径(例如,电子邮件、手机短信、实时聊天工具、社交网络平台等)向对应的实体用户发送预警消息。预警消息可以包含对事件本身的文字描述、图片、传播相关统计信息、事件评估指标以及相关实体可能如何受到该事件影响的途径等等。After identifying directly and indirectly related entities corresponding to a specific media event, warning messages can be sent to the corresponding entity users through various channels (e.g., email, mobile text messages, real-time chat tools, social networking platforms, etc.). Warning messages can include a text description of the event itself, images, relevant dissemination statistics, event evaluation indicators, and how related entities may be affected by the event.
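One possible shape for such a warning message is sketched below. The class name, field names, and rendering format are hypothetical; the description only lists the kinds of information a message may contain and leaves the delivery channel open.

```python
from dataclasses import dataclass, field


@dataclass
class WarningMessage:
    """Illustrative container for the contents of a warning message."""
    event_description: str
    recipient: str
    relation_to_event: str  # e.g. "directly related" or "non-directly related"
    spread_stats: dict = field(default_factory=dict)   # dissemination statistics
    impact_path: list = field(default_factory=list)    # how the recipient is affected

    def render(self):
        """Render plain text suitable for e-mail, SMS, or chat delivery."""
        lines = [
            f"To: {self.recipient}",
            f"Event: {self.event_description}",
            f"Relation to event: {self.relation_to_event}",
        ]
        for name, value in sorted(self.spread_stats.items()):
            lines.append(f"{name}: {value}")
        if self.impact_path:
            lines.append("Possible impact path: " + " -> ".join(self.impact_path))
        return "\n".join(lines)
```

Keeping the message as structured data and rendering it late allows the same content to be sent through several channels.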
本领域技术人员可以理解,本发明中所述的特定媒体事件可以是符合用户所设定条件并且可以从互联网媒体中获得的各种类型的事件,例如,负面事件、突发事件、危机事件、群体性事件或舆情事件等。本发明并不对此做出任何限制。Those skilled in the art will appreciate that the specific media events described in the present invention may be various types of events that meet user-defined conditions and can be obtained from Internet media, such as negative events, emergencies, crisis events, mass incidents, or public opinion events. The present invention does not impose any limitation thereto.
作为一个优选的实施例,图4示出了本发明提供的另一种构建行业知识图谱数据库的方法的示例性流程图。该方法可以包括步骤S41、S421/S422以及S43-S45。As a preferred embodiment, Figure 4 shows an exemplary flow chart of another method for constructing an industry knowledge graph database provided by the present invention. The method may include steps S41, S421/S422, and S43-S45.
在步骤S41中,从行业数据源获得行业数据,并从所述行业数据中提取实体以及对应的实体属性和实体关系,以生成行业知识图谱数据库。In step S41, industry data is obtained from an industry data source, and entities and corresponding entity attributes and entity relationships are extracted from the industry data to generate an industry knowledge graph database.
在步骤S421中,基于结构化数据源,利用应用程序接口以查询方式获得与所述行业相关的实体、实体属性和实体关系。在一个实施例中,所述结构化数据源可以如维基数据、DBPedia这样的结构化开放数据平台,并且可以通过API从中获得与行业相关的数据。In step S421, entities, entity attributes, and entity relationships related to the industry are obtained by querying the structured data source using an application programming interface. In one embodiment, the structured data source can be a structured open data platform such as Wikidata or DBPedia, and industry-related data can be obtained from it through an API.
在步骤S422中,基于半结构化或非结构化数据源,利用自然语言处理技术对数据进行实体识别和关系抽取,以提取与所述行业相关的实体、实体属性和实体关系。在一个实施例中,所述半结构化或非结构化数据源可以诸如维基百科、百度百科这样的开放数据平台,也可以是任何相关的第三方数据源(例如,专业网站、在互联网媒体中发布的内容等),并且可以通过网络爬虫或内容抽取技术获得与行业相关的数据。In step S422, entity recognition and relationship extraction are performed on the data using natural language processing techniques based on the semi-structured or unstructured data source to extract entities, entity attributes, and entity relationships related to the industry. In one embodiment, the semi-structured or unstructured data source can be an open data platform such as Wikipedia or Baidu Encyclopedia, or any relevant third-party data source (e.g., professional websites, content published in internet media, etc.). Industry-related data can be obtained through web crawling or content extraction techniques.
优选地，可以以预定的周期定期执行步骤S421和/或S422、S43。Preferably, steps S421 and/or S422, together with step S43, may be performed periodically at a predetermined interval.
在步骤S43中,基于所述与行业相关的实体以及对应的实体属性和实体关系,对行业知识图谱数据库进行补充或更新。In step S43, the industry knowledge graph database is supplemented or updated based on the industry-related entities and the corresponding entity attributes and entity relationships.
在步骤S44中,从互联网数据源获得互联网媒体数据,并从所述互联网媒体数据中提取与所述行业相关的特定媒体事件以及对应的直接相关实体。In step S44, Internet media data is obtained from an Internet data source, and specific media events related to the industry and corresponding directly related entities are extracted from the Internet media data.
在步骤S45中,基于所述特定媒体事件以及对应的直接相关实体,对行业知识图谱数据库进行补充,其中,所述特定媒体事件作为抽象实体被补充到所述行业知识图谱数据库中。In step S45, the industry knowledge graph database is supplemented based on the specific media event and the corresponding directly related entities, wherein the specific media event is supplemented into the industry knowledge graph database as an abstract entity.
优选地，可以以实时不间断的方式执行步骤S44和S45。Preferably, steps S44 and S45 can be performed continuously in real time.
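Steps S44 and S45 treat a media event as an abstract entity that becomes a node of the graph in its own right. A minimal sketch of that idea follows, assuming an in-memory store; the class `KnowledgeGraph`, the relation label `directly_involves`, and the identifiers used are all hypothetical.

```python
class KnowledgeGraph:
    """Tiny in-memory stand-in for the industry knowledge graph database."""

    def __init__(self):
        self.nodes = {}     # node id -> attribute dict
        self.edges = set()  # (source id, relation label, target id)

    def upsert_entity(self, node_id, **attrs):
        """Create the node if missing, then merge in the given attributes."""
        self.nodes.setdefault(node_id, {}).update(attrs)

    def add_relation(self, src, rel, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.add((src, rel, dst))

    def add_event(self, event_id, direct_entities, **attrs):
        """Add a media event as an abstract entity linked to its direct entities."""
        self.upsert_entity(event_id, kind="event", **attrs)
        for entity in direct_entities:
            self.upsert_entity(entity)  # placeholder node if not yet known
            self.add_relation(event_id, "directly_involves", entity)
```

Because the event is an ordinary node, the same traversal or mining that relates companies to one another can later relate an event to entities it does not mention.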
图5是本发明提供的另一种构建行业知识图谱数据库的方法的示例性流程图。该方法可以包括步骤S51-S53：Figure 5 is an exemplary flow chart of another method for constructing an industry knowledge graph database provided by the present invention. The method may include steps S51-S53:
在步骤S51中,从数据源获取行业数据;In step S51, industry data is obtained from a data source;
在步骤S52中,对所述行业数据进行数据处理,以提取与所述行业相关的实体以及对应的实体属性和/或实体关系;In step S52, the industry data is processed to extract entities related to the industry and corresponding entity attributes and/or entity relationships;
在步骤S53中,基于所提取的实体、实体属性和/或实体关系构建所述行业知识图谱数据库。In step S53, the industry knowledge graph database is constructed based on the extracted entities, entity attributes and/or entity relationships.
如上文所述,行业知识图谱数据库的数据来源可以是多种多样的,包括但不限于开放式的百科类数据源、结构化的数据库以及任何相关的第三方半结构化或非结构化互联网数据源。同时,如上文所述,行业知识图谱数据库的数据来源还可以是互联网媒体数据源。As mentioned above, the data sources for industry knowledge graph databases can be diverse, including but not limited to open encyclopedia data sources, structured databases, and any relevant third-party semi-structured or unstructured internet data sources. Furthermore, as mentioned above, the data sources for industry knowledge graph databases can also be internet media data sources.
在一个实施例中,所述数据源可以是结构化的行业数据库,并且所述方法可以通过以下具体方式实现:在步骤S51(1)中,从第三方行业数据库获取包括多个字段的结构化行业数据;在步骤S52(1)中,在提取与所述行业相关的实体以及对应的实体属性和/或实体关系之前,对所述结构化行业数据进行数据清洗以及抽取-转换-加载(ETL)处理;在步骤S53(1)中,基于所提取的实体、实体属性和/或实体关系生成所述行业知识图谱数据库。In one embodiment, the data source may be a structured industry database, and the method may be implemented in the following specific manner: in step S51(1), structured industry data including multiple fields is obtained from a third-party industry database; in step S52(1), before extracting entities related to the industry and corresponding entity attributes and/or entity relationships, the structured industry data is cleansed and subjected to extraction-transformation-loading (ETL) processing; in step S53(1), the industry knowledge graph database is generated based on the extracted entities, entity attributes and/or entity relationships.
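For the structured-database embodiment, steps S52(1) and S53(1) can be sketched as a cleaning pass followed by a mapping from fields to entities, attributes, and relations. The field names (`company`, `city`, `supplier_of`) and the sample rows are assumptions chosen for illustration; real industry databases will have their own schemas.

```python
# Hypothetical rows as they might arrive from a third-party industry database.
RAW_ROWS = [
    {"company": " Acme Foods ", "city": "Shenzhen", "supplier_of": "Best Mart"},
    {"company": "Best Mart", "city": "", "supplier_of": ""},
    {"company": "", "city": "Hong Kong", "supplier_of": "Acme Foods"},
]


def clean(rows):
    """Data cleaning: trim whitespace and drop rows lacking the key field."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if row["company"]:
            cleaned.append(row)
    return cleaned


def transform(rows):
    """Map each field to an entity, an entity attribute, or an entity relation."""
    entities, relations = {}, []
    for row in rows:
        attrs = entities.setdefault(row["company"], {})
        if row["city"]:
            attrs["city"] = row["city"]  # field -> entity attribute
        if row["supplier_of"]:
            entities.setdefault(row["supplier_of"], {})
            relations.append((row["company"], "supplier_of", row["supplier_of"]))
    return entities, relations


ENTITIES, RELATIONS = transform(clean(RAW_ROWS))
```

The load stage is omitted here; it would write `ENTITIES` and `RELATIONS` into whichever graph store holds the industry knowledge graph database.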
在另一个实施例中,所述数据源可以是非结构化或半结构化的互联网数据源,并且所述方法可以通过以下具体方式实现:在步骤S51(2)中,利用网络爬虫技术,从互联网数据源获取与行业相关的数据,所述互联网数据源包括非结构化或半结构化数据源;在步骤S52(2)中,利用自然语言处理中的信息抽取技术,对所述行业相关的数据进行实体识别和关系抽取,以提取所述实体、实体属性和/或实体关系;在步骤S53(2)中,基于所提取的实体、实体属性和/或实体关系对所述行业知识图谱数据库进行补充或更新。In another embodiment, the data source may be an unstructured or semi-structured Internet data source, and the method may be implemented in the following specific manner: in step S51(2), using web crawler technology, industry-related data is obtained from an Internet data source, and the Internet data source includes an unstructured or semi-structured data source; in step S52(2), using information extraction technology in natural language processing, entity recognition and relationship extraction are performed on the industry-related data to extract the entities, entity attributes and/or entity relationships; in step S53(2), the industry knowledge graph database is supplemented or updated based on the extracted entities, entity attributes and/or entity relationships.
此外,所述步骤S51(2)-S53(2)可以是以预定的周期定期执行的。In addition, the steps S51(2)-S53(2) may be performed regularly at a predetermined cycle.
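Entity recognition and relation extraction in step S52(2) would normally rely on trained natural language processing models. The pattern-based extractor below is a deliberately simplified stand-in that only shows the form of the output, namely (entity, relation, entity) triples; the trigger phrases and relation labels are hypothetical.

```python
import re

# Capitalised word sequences stand in for a real named-entity recogniser.
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"
# Hypothetical trigger phrases mapped to relation labels.
TRIGGERS = {"supplies": "supplier_of", "invests in": "investor_in"}

PATTERN = re.compile(
    rf"(?P<head>{NAME}) (?P<trigger>{'|'.join(TRIGGERS)}) (?P<tail>{NAME})"
)


def extract_triples(text):
    """Return (head entity, relation label, tail entity) triples found in text."""
    return [
        (m.group("head"), TRIGGERS[m.group("trigger")], m.group("tail"))
        for m in PATTERN.finditer(text)
    ]
```

For instance, `extract_triples("Acme Foods supplies Best Mart.")` yields one `supplier_of` triple, which step S53(2) could then merge into the graph.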
在另一个实施例中,所述数据源可以是开放式的互联网数据源,并且所述方法可以通过以下具体方式实现:在步骤S51(3)中,利用应用程序接口(API)以查询方式从互联网数据源获取与行业相关的数据;在步骤S52(3)中,在提取与所述行业相关的实体以及对应的实体属性和/或实体关系之前,对所述与行业相关的数据进行数据清洗以及抽取-转换-加载(ETL)处理;在步骤S53(3)中,基于所提取的实体、实体属性和/或实体关系对所述行业知识图谱数据库进行补充或更新。In another embodiment, the data source may be an open Internet data source, and the method may be implemented in the following specific manner: in step S51(3), industry-related data is obtained from the Internet data source in a query manner using an application programming interface (API); in step S52(3), before extracting entities related to the industry and corresponding entity attributes and/or entity relationships, the industry-related data is cleansed and subjected to extraction-transformation-loading (ETL) processing; in step S53(3), the industry knowledge graph database is supplemented or updated based on the extracted entities, entity attributes and/or entity relationships.
此外,所述步骤S51(3)-S53(3)可以是以预定的周期定期执行的。In addition, the steps S51(3)-S53(3) may be performed regularly at a predetermined cycle.
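As an illustration of the query-style acquisition in step S51(3), the sketch below builds a request for the Wikidata Query Service, Wikidata being one of the open platforms named earlier in the description. It only constructs the request URL and sends nothing. The property ID `P452` is used on the assumption that it denotes "industry" and should be verified before use; the item ID passed by the caller, the variable names, and the result limit are placeholders.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_industry_query(industry_qid, industry_property="P452", limit=100):
    """Build a SPARQL query for items linked to the given industry item.

    industry_qid is a caller-supplied Wikidata item ID (placeholder here);
    industry_property is assumed to be the 'industry' property.
    """
    return (
        "SELECT ?company ?companyLabel WHERE { "
        f"?company wdt:{industry_property} wd:{industry_qid} . "
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } '
        f"}} LIMIT {limit}"
    )


def build_request_url(query):
    """Encode the query as a GET request asking for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```

The returned rows would then pass through the cleaning and ETL processing of step S52(3) before being merged into the graph, and the whole request could be reissued at the predetermined interval.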
在另一个实施例中,所述数据源可以是互联网媒体数据源,并且所述方法可以通过以下具体方式实现:在步骤S51(4)中,利用应用程序接口(API)或网络爬虫技术,从互联网数据源获取互联网媒体数据;在步骤S52(4)中,对所述互联网媒体数据进行事件检测、事件评价和筛选,以提取与所述行业相关的特定媒体事件,并从所述互联网媒体数据中识别对应的直接相关实体;在步骤S53(4)中,基于所述特定媒体事件以及对应的直接相关实体,对所述行业知识图谱数据库进行补充,其中,所述特定媒体事件作为抽象实体被补充到所述行业知识图谱数据库中。In another embodiment, the data source may be an Internet media data source, and the method may be implemented in the following specific manner: in step S51(4), Internet media data is obtained from an Internet data source using an application programming interface (API) or web crawler technology; in step S52(4), event detection, event evaluation, and screening are performed on the Internet media data to extract specific media events related to the industry, and corresponding directly related entities are identified from the Internet media data; in step S53(4), the industry knowledge graph database is supplemented based on the specific media events and the corresponding directly related entities, wherein the specific media events are supplemented into the industry knowledge graph database as abstract entities.
举例而言,在步骤S52(4)中可以通过以下方式中的至少一种识别与特定媒体事件对应的直接相关实体:基于自然语言处理中的实体识别从文本数据中识别实体;基于图像或视频识别处理从图像或视频数据中识别实体;或者,基于语音识别处理从音频或视频数据中识别实体。For example, in step S52(4), directly related entities corresponding to the specific media event can be identified by at least one of the following methods: identifying entities from text data based on entity recognition in natural language processing; identifying entities from image or video data based on image or video recognition processing; or identifying entities from audio or video data based on speech recognition processing.
举例而言,所述特定媒体事件可以包括负面事件、突发事件、危机事件、群体性事件、舆情事件或其它具有行业意义的事件。For example, the specific media events may include negative events, emergencies, crisis events, mass events, public opinion events, or other events with industry significance.
此外,所述步骤S51(4)-S53(4)可以是实时不间断执行的。In addition, steps S51(4)-S53(4) can be performed in real time and continuously.
在另一个实施例中,上述步骤S53(2)、S53(3)、S53(4)中对所述行业知识图谱数据库进行补充或更新的步骤可以包括:对所提取的实体进行语义消歧和实体链接。举例而言,可以通过以下方式中的至少一种进行所述语义消歧和实体链接:基于实体知识,对每个所提取的实体指代逐一独立地进行语义消歧和实体链接;基于主题一致性假设,利用候选实体在知识库中的关联,对所提取的实体指代一致性地进行语义消歧和实体链接。In another embodiment, the step of supplementing or updating the industry knowledge graph database in the above steps S53(2), S53(3), and S53(4) may include: performing semantic disambiguation and entity linking on the extracted entities. For example, the semantic disambiguation and entity linking may be performed in at least one of the following ways: based on entity knowledge, performing semantic disambiguation and entity linking on each extracted entity reference independently; based on the topic consistency assumption, utilizing the association of candidate entities in the knowledge base, performing semantic disambiguation and entity linking on the extracted entity references consistently.
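The two disambiguation strategies can be contrasted on a toy example. In the sketch below, independent linking picks the candidate with the highest prior for each mention, while collective linking also rewards candidate pairs that are associated in the knowledge base, following the topic consistency assumption. The candidate lists, prior values, and association set are invented for illustration, and the exhaustive search is practical only for very small inputs.

```python
from itertools import combinations, product

# Hypothetical knowledge-base content.
CANDIDATES = {
    "Apple": ["Apple_Inc", "Apple_fruit"],
    "Jobs": ["Steve_Jobs", "Jobs_film"],
}
PRIOR = {"Apple_Inc": 0.4, "Apple_fruit": 0.6, "Steve_Jobs": 0.7, "Jobs_film": 0.3}
RELATED = {frozenset({"Apple_Inc", "Steve_Jobs"})}


def link_independently(mentions):
    """Resolve each mention on its own, using only per-entity knowledge."""
    return {m: max(CANDIDATES[m], key=PRIOR.get) for m in mentions}


def link_collectively(mentions, coherence_weight=1.0):
    """Resolve all mentions jointly, rewarding mutually related candidates."""
    best, best_score = None, float("-inf")
    for assignment in product(*(CANDIDATES[m] for m in mentions)):
        score = sum(PRIOR[c] for c in assignment)
        score += coherence_weight * sum(
            frozenset(pair) in RELATED for pair in combinations(assignment, 2)
        )
        if score > best_score:
            best, best_score = assignment, score
    return dict(zip(mentions, best))
```

On the mentions `["Apple", "Jobs"]`, the independent strategy links "Apple" to the fruit because of its higher prior, whereas the collective strategy links it to the company because that choice is coherent with the person.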
以上以实施例的方式描述了本发明提供的一种构建行业知识图谱数据库的方法。本领域技术人员可以理解,这些实施例的各种组合也包括在这种构建行业知识图谱数据库的方法的构思之内。The above describes a method for constructing an industry knowledge graph database provided by the present invention in the form of embodiments. Those skilled in the art will understand that various combinations of these embodiments are also included in the concept of this method for constructing an industry knowledge graph database.
图6是本发明提供的一种对媒体事件进行监测的系统的示例性框图。该系统包括数据获取单元、数据处理单元、数据库构建单元、数据库存储单元、媒体事件监测单元、数据库访问单元以及消息发送单元。Figure 6 is an exemplary block diagram of a system for monitoring media events provided by the present invention. The system includes a data acquisition unit, a data processing unit, a database construction unit, a database storage unit, a media event monitoring unit, a database access unit, and a message sending unit.
数据获取单元，用于从数据源获得行业数据；A data acquisition unit, configured to obtain industry data from a data source;
数据处理单元，用于对所述行业数据进行数据处理，以提取与所述行业相关的实体以及对应的实体属性和/或实体关系；A data processing unit, configured to process the industry data to extract entities related to the industry and corresponding entity attributes and/or entity relationships;
数据库构建单元，用于基于所提取的实体、实体属性和/或实体关系构建所述行业知识图谱数据库；A database construction unit, configured to construct the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships;
数据库存储单元，用于存储所构建的行业知识图谱数据库；A database storage unit, configured to store the constructed industry knowledge graph database;
媒体事件监测单元，用于获取互联网媒体数据，基于所获取的互联网媒体数据进行事件检测、事件评价和筛选以获得所述与行业相关的特定媒体事件，并且识别与所述特定媒体事件对应的直接相关实体；A media event monitoring unit, configured to acquire Internet media data, perform event detection, event evaluation, and screening based on the acquired Internet media data to obtain the specific media events related to the industry, and identify directly related entities corresponding to the specific media events;
数据库访问单元，用于基于所述直接相关实体，访问所述行业知识图谱数据库，以确定与所述特定媒体事件对应的非直接相关实体；A database access unit, configured to access the industry knowledge graph database based on the directly related entities to determine the non-directly related entities corresponding to the specific media event;
消息发送单元，用于向所述直接相关实体和/或所述非直接相关实体发送预警消息。A message sending unit, configured to send a warning message to the directly related entity and/or the non-directly related entity.
在一个实施例中,所述数据获取单元包括:结构化数据获取单元,用于从第三方行业数据库获得结构化数据,所述结构化数据包括多个字段;所述数据处理单元包括:结构化数据处理单元,用于对所述结构化数据进行数据清洗以及抽取-转换-加载(ETL)处理;所述数据库构建单元包括:数据库生成单元,用于基于所提取的实体、实体属性和/或实体关系生成所述行业知识图谱数据库。In one embodiment, the data acquisition unit includes: a structured data acquisition unit for obtaining structured data from a third-party industry database, wherein the structured data includes multiple fields; the data processing unit includes: a structured data processing unit for performing data cleaning and extract-transform-load (ETL) processing on the structured data; the database construction unit includes: a database generation unit for generating the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships.
在另一个实施例中,所述数据获取单元包括:行业相关数据获取单元,用于利用网络爬虫技术,从互联网数据源获得与行业相关的数据,所述互联网数据源包括非结构化或半结构化数据源;所述数据处理单元包括:行业相关数据处理单元,用于利用自然语言处理中的信息抽取技术,对所述行业相关的数据进行实体识别和关系抽取,以提取所述实体、实体属性和/或实体关系;所述数据库构建单元包括:数据库补充/更新单元,用于基于所提取的实体、实体属性和/或实体关系对所述行业知识图谱数据库进行补充或更新。In another embodiment, the data acquisition unit includes: an industry-related data acquisition unit, which is used to obtain industry-related data from Internet data sources using web crawler technology, and the Internet data sources include unstructured or semi-structured data sources; the data processing unit includes: an industry-related data processing unit, which is used to use information extraction technology in natural language processing to perform entity recognition and relationship extraction on the industry-related data to extract the entities, entity attributes and/or entity relationships; the database construction unit includes: a database supplement/update unit, which is used to supplement or update the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships.
在另一个实施例中,所述数据获取单元包括:行业相关数据获取单元,用于利用应用程序接口(API)以查询方式从互联网数据源获取与行业相关的数据,所述互联网数据源包括开放式数据源;所述数据处理单元包括:行业相关数据处理单元,用于在提取与所述行业相关的实体以及对应的实体属性和/或实体关系之前,对所述与行业相关的数据进行数据清洗以及抽取-转换-加载(ETL)处理;所述数据库构建单元包括:数据库补充/更新单元,用于基于所提取的实体、实体属性和/或实体关系对所述行业知识图谱数据库进行补充或更新。In another embodiment, the data acquisition unit includes: an industry-related data acquisition unit, which is used to obtain industry-related data from an Internet data source in a query manner using an application programming interface (API), and the Internet data source includes an open data source; the data processing unit includes: an industry-related data processing unit, which is used to perform data cleaning and extraction-transformation-loading (ETL) processing on the industry-related data before extracting entities related to the industry and corresponding entity attributes and/or entity relationships; the database construction unit includes: a database supplement/update unit, which is used to supplement or update the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships.
在另一个实施例中,所述数据获取单元包括:媒体数据获取单元,用于利用应用程序接口(API)或网络爬虫技术,从互联网数据源获取与行业相关的互联网媒体数据;所述数据处理单元包括:媒体数据处理单元,用于对所述互联网媒体数据进行事件检测、事件评价和筛选,以提取与所述行业相关的特定媒体事件,并从所述互联网媒体数据中识别对应的直接相关实体;所述数据库构建单元包括:数据库补充/更新单元,用于基于所述特定媒体事件以及对应的直接相关实体,对所述行业知识图谱数据库进行补充,其中,所述特定媒体事件作为抽象实体被补充到所述行业知识图谱数据库中。In another embodiment, the data acquisition unit includes: a media data acquisition unit, which is used to obtain industry-related Internet media data from Internet data sources using an application programming interface (API) or web crawler technology; the data processing unit includes: a media data processing unit, which is used to perform event detection, event evaluation and screening on the Internet media data to extract specific media events related to the industry and identify corresponding directly related entities from the Internet media data; the database construction unit includes: a database supplement/update unit, which is used to supplement the industry knowledge graph database based on the specific media events and the corresponding directly related entities, wherein the specific media events are supplemented into the industry knowledge graph database as abstract entities.
在一个实施例中,所述数据库补充/更新单元进一步用于:对所提取的实体进行语义消歧和实体链接。In one embodiment, the database supplement/update unit is further configured to perform semantic disambiguation and entity linking on the extracted entities.
在一个实施例中,所述媒体事件监测单元进一步用于:对所获取的互联网媒体数据中的内容进行话题分类,以获得针对特定话题的内容;从所获得的内容中识别涉及的实体;对所获得的内容和所识别的实体进行情感分析,并且基于情感分析的结果对所获得的内容进行过滤;基于过滤后的内容进行事件发现,以对媒体事件进行聚类并发现新的媒体事件。在另一个实施例中,所述媒体事件监测单元进一步用于:基于媒体事件的属性对事件的真实性进行分析,并根据分析结果对媒体事件进行排序和/或过滤。In one embodiment, the media event monitoring unit is further configured to: perform topic classification on the content in the acquired Internet media data to obtain content targeting a specific topic; identify entities involved in the acquired content; perform sentiment analysis on the acquired content and the identified entities, and filter the acquired content based on the results of the sentiment analysis; and perform event discovery based on the filtered content to cluster media events and discover new media events. In another embodiment, the media event monitoring unit is further configured to: analyze the authenticity of the events based on their attributes, and sort and/or filter the media events based on the analysis results.
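The first monitoring embodiment above (topic classification, entity identification, sentiment-based filtering, then event discovery by clustering) can be sketched as a small pipeline. Every component below is a keyword-based toy that stands in for a trained model, and the keyword lists, entity names, and grouping rule (posts that mention the same set of entities form one event) are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical lexicons standing in for trained classifiers.
TOPIC_KEYWORDS = {"food_safety": {"recall", "contaminated", "expired"}}
KNOWN_ENTITIES = {"Acme Foods", "Best Mart"}
NEGATIVE_WORDS = {"recall", "contaminated", "expired", "scandal"}


def words(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def classify_topic(text):
    """Topic classification: return the first topic whose keywords appear."""
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words(text) & keywords:
            return topic
    return None


def find_entities(text):
    """Entity identification by dictionary lookup."""
    return frozenset(e for e in KNOWN_ENTITIES if e in text)


def is_negative(text):
    """Sentiment analysis reduced to a negative-word check."""
    return bool(words(text) & NEGATIVE_WORDS)


def discover_events(posts, topic):
    """Filter posts, then cluster the survivors by the entities they mention."""
    clusters = defaultdict(list)
    for post in posts:
        if classify_topic(post) != topic:
            continue
        entities = find_entities(post)
        if not entities or not is_negative(post):
            continue
        clusters[entities].append(post)
    return dict(clusters)
```

Each resulting cluster is a candidate media event whose key is its set of directly related entities; the authenticity analysis of the second embodiment could then rank or discard clusters.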
在一个实施例中，所述数据库访问单元进一步用于：基于所述直接相关实体，在所述行业知识图谱数据库中查询，以确定所述非直接相关实体。在另一个实施例中，所述数据库访问单元进一步用于：基于所述直接相关实体，在所述行业知识图谱数据库中使用数据挖掘技术，以确定所述非直接相关实体。In one embodiment, the database access unit is further configured to: query the industry knowledge graph database based on the directly related entities to determine the non-directly related entities. In another embodiment, the database access unit is further configured to: apply data mining techniques to the industry knowledge graph database based on the directly related entities to determine the non-directly related entities.
以上以实施例的方式描述了本发明提供的一种对媒体事件进行监测的系统。本领域技术人员可以理解,上文结合附图1、3-5所描述的各种方法中的操作步骤可以应用在所述系统的组成单元中,因此这里不再赘述。The above describes a system for monitoring media events provided by the present invention by way of example. Those skilled in the art will appreciate that the operating steps in the various methods described above in conjunction with Figures 1 and 3-5 can be applied to the components of the system, and therefore will not be described in detail here.
本领域技术人员还应当理解,结合本发明公开的各个实施例所描述的各种示例性的方法步骤和单元均可以实现成电子硬件、计算机软件或二者的组合。为了清楚地表示硬件和软件的可交换性,上文中各种示例性的步骤和单元均围绕其功能进行了总体描述。至于这种功能是实现成硬件还是实现成软件,则取决于特定的应用和对整个系统所施加的设计约束条件。本领域技术人员可以针对每个特定应用,以变通的方式实现所描述的功能,但是,这种实现决策不应解释为引起与本公开内容的范围的偏离。Those skilled in the art will also appreciate that the various exemplary method steps and units described in conjunction with the various embodiments disclosed herein can all be implemented as electronic hardware, computer software, or a combination thereof. In order to clearly illustrate the interchangeability of hardware and software, the various exemplary steps and units hereinabove have been generally described around their functions. Whether such functions are implemented as hardware or software depends on the specific application and the design constraints imposed on the entire system. Those skilled in the art may implement the described functions in an adaptable manner for each specific application, but such implementation decisions should not be interpreted as causing a departure from the scope of this disclosure.
本发明说明书中使用的“示例/示例性”表示用作例子、例证或说明。说明书中被描述为“示例性”的任何技术方案不应被解释为比其它技术方案更优选或更具优势。The term "example/exemplary" as used in this specification means serving as an example, instance, or illustration. Any technical solution described as "exemplary" in the specification should not be construed as preferred or advantageous over other technical solutions.
本发明提供了对所公开的技术内容的以上描述,以使本领域技术人员能够实现或使用本发明。对于本领域技术人员而言,对这些技术内容的很多修改和变形都是显而易见的,并且本发明所定义的总体原理也可以在不脱离本发明的精神或范围的基础上适用于其它实施例。因此,本发明并不限于上文所示的具体实施方式,而是应与符合本发明公开的发明构思的最广范围相一致。The present invention provides the above description of the disclosed technical content to enable those skilled in the art to implement or use the present invention. Many modifications and variations of these technical contents are obvious to those skilled in the art, and the overall principles defined in the present invention can also be applied to other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the specific embodiments shown above, but should be consistent with the widest range of inventive concepts disclosed in the present invention.
Claims (38)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
HK18108044.9A HK1249598B (en) | 2018-06-22 | Method, apparatus and system for monitoring internet media events based on industry knowledge mapping database |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
HK18108044.9A HK1249598B (en) | 2018-06-22 | Method, apparatus and system for monitoring internet media events based on industry knowledge mapping database |
Publications (2)
Publication Number | Publication Date |
---|---|
HK1249598A1 | 2018-11-02 |
HK1249598B | 2022-05-13 |