CN107944032A - Method and apparatus for generating information - Google Patents
Method and apparatus for generating information
- Publication number
- CN107944032A (application CN201711326807.4A)
- Authority
- CN
- China
- Prior art keywords
- article
- excavated
- theme
- mined
- candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Health & Medical Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
Technical Field
The embodiments of the present application relate to the field of computer technology, specifically to the field of Internet technology, and in particular to a method and an apparatus for generating information.
Background
At present, the Internet is an important way for people to obtain information. To accurately recommend articles that interest a user, it is necessary to accurately understand the topic of an article and to calculate the relevance between the article and that topic. One existing approach generates the topic of an article by keyword extraction: for example, the full text of the article is first segmented into words to obtain a word set; the word set is then filtered, word frequencies are calculated, and so on, and the keywords in the resulting word set are taken as the result of topic mining. The accuracy of this approach is easily affected by factors such as word segmentation and aliases. Another existing approach generates the topic of an article by topic classification: for example, word-vector features are extracted from the sentences of the article, and the article is classified to obtain its topic. This approach is easily limited by the candidate topic set: if the candidate topic set used for classification is small and the candidate topics are broad, the scope of topic mining is limited and the article cannot be described comprehensively and accurately.
Summary of the Invention
The embodiments of the present application propose a method and an apparatus for generating information.
In a first aspect, an embodiment of the present application provides a method for generating information, including: acquiring an article to be mined; mining at least two types of topics of the article to be mined using at least two topic mining approaches, and determining the relevance between each mined topic and the article to be mined; and determining, based on the mined topics and the determined relevance, the topics of the article to be mined and the relevance between the article to be mined and those topics.
In some embodiments, mining at least two types of topics of the article to be mined using at least two topic mining approaches, and determining the relevance between each mined topic and the article to be mined, includes: performing named entity recognition on the article to be mined, and determining, based on the named entity recognition result, whether the article to be mined includes at least one first-type article topic; and, in response to determining that the article to be mined includes at least one first-type article topic, determining a first relevance between the article to be mined and each of the at least one first-type article topic.
In some embodiments, performing named entity recognition on the article to be mined, and determining, based on the named entity recognition result, whether the article to be mined includes at least one first-type article topic, includes: performing named entity recognition on the article to be mined and determining whether the article contains at least one named entity; in response to determining that the article contains at least one named entity, matching each of the at least one named entity against the candidate topics in a pre-established candidate topic set, and determining, according to the matching result, whether the article includes at least one candidate topic, where the candidate topic set is constructed based on a knowledge graph; and, in response to determining that the article includes at least one candidate topic, for each of the at least one candidate topic, counting the frequency with which the candidate topic appears in the article and, if that frequency exceeds a preset first threshold, determining the candidate topic to be a first-type article topic of the article to be mined.
In some embodiments, determining, in response to determining that the article to be mined includes at least one first-type article topic, the first relevance between the article to be mined and each of the at least one first-type article topic includes: for each of the at least one first-type article topic, counting the frequency with which the first-type article topic appears in the article to be mined, and determining the first relevance between the article and that first-type article topic according to the counted frequency.
In some embodiments, counting the frequency with which the candidate topic appears in the article to be mined includes: determining, according to the knowledge graph, whether the article contains an alias of the candidate topic; in response to determining that the article contains an alias of the candidate topic, counting a first frequency with which the alias appears in the article; counting a second frequency with which the candidate topic itself appears in the article; and calculating the sum of the first frequency and the second frequency, and taking the result as the frequency with which the candidate topic appears in the article.
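The alias-aware counting described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the alias table stands in for what would be looked up in the knowledge graph, plain substring counting stands in for whatever matching an actual system uses, and the names `topic_frequency` and `topics_above_threshold` are hypothetical.

```python
from typing import Dict, List


def topic_frequency(article: str, topic: str, aliases: List[str]) -> int:
    """Return the alias count (first frequency) plus the topic count (second frequency)."""
    # Assumption: plain substring counting; an alias that contains the
    # topic name itself would be double counted by this simple approach.
    first_frequency = sum(article.count(alias) for alias in aliases)
    second_frequency = article.count(topic)
    return first_frequency + second_frequency


def topics_above_threshold(
    article: str, alias_table: Dict[str, List[str]], first_threshold: int
) -> Dict[str, int]:
    """Keep the candidate topics whose total frequency exceeds the first threshold."""
    selected = {}
    for topic, aliases in alias_table.items():
        frequency = topic_frequency(article, topic, aliases)
        if frequency > first_threshold:
            selected[topic] = frequency
    return selected


# Hypothetical alias table; in the described method it comes from the knowledge graph.
ARTICLE = "Li Bai was a poet. Li Taibai travelled widely. Li Bai wrote often."
ALIASES = {"Li Bai": ["Li Taibai"], "Du Fu": ["Zimei"]}
selected_topics = topics_above_threshold(ARTICLE, ALIASES, first_threshold=2)
```

Counting aliases together with the canonical name is what lets a topic pass the first threshold even when the article mostly refers to it by another name.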
In some embodiments, mining at least two types of topics of the article to be mined using at least two topic mining approaches, and determining the relevance between each mined topic and the article to be mined, includes: determining whether the source confidence of the source information of the article to be mined exceeds a preset confidence threshold, where the source confidence is obtained from a preset table relating source information to source confidence, the table storing each piece of source information together with its source confidence; and, in response to determining that the source confidence of the source information of the article exceeds the preset confidence threshold, taking the source information of the article as a second-type article topic, and taking the source confidence of that source information as a second relevance between the article and the second-type article topic.
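As a rough sketch of the source-confidence check, the relation table can be modelled as a dictionary. The table contents, the threshold value, and the function name `second_type_topic` are assumptions made for illustration only.

```python
from typing import Dict, Optional, Tuple

# Hypothetical source-to-confidence relation table.
SOURCE_CONFIDENCE: Dict[str, float] = {
    "Example Finance Daily": 0.9,
    "Anonymous Blog": 0.3,
}


def second_type_topic(
    source: str, table: Dict[str, float], confidence_threshold: float
) -> Optional[Tuple[str, float]]:
    """Return (topic, second relevance) if the source is trusted enough, else None."""
    confidence = table.get(source)
    # Assumption: a source missing from the table yields no second-type topic.
    if confidence is not None and confidence > confidence_threshold:
        # The source itself becomes the topic; its confidence is the relevance.
        return source, confidence
    return None
```

For example, with a threshold of 0.5, `second_type_topic("Example Finance Daily", SOURCE_CONFIDENCE, 0.5)` yields the source and its confidence, while the low-confidence blog yields `None`.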
In some embodiments, mining at least two types of topics of the article to be mined using at least two topic mining approaches, and determining the relevance between each mined topic and the article to be mined, includes: performing word segmentation on the article to be mined to obtain at least one segmented word; feeding the at least one segmented word into a pre-established topic classification model to obtain the probability that the article belongs to each third-type candidate article topic in a preset set of third-type candidate article topics; and determining, based on those probabilities, the third-type article topic of the article and a third relevance between the article and the determined third-type article topic.
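The text does not fix how the probabilities are turned into third-type topics. One plausible reading, sketched below purely as an assumption, is to keep every candidate topic whose probability reaches a cut-off and to use that probability as the third relevance. The raw scores stand in for the output of the topic classification model, and `softmax`, `third_type_topics`, and `min_probability` are hypothetical names.

```python
import math
from typing import Dict


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert raw per-topic scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {topic: math.exp(score - peak) for topic, score in scores.items()}
    total = sum(exps.values())
    return {topic: value / total for topic, value in exps.items()}


def third_type_topics(
    scores: Dict[str, float], min_probability: float = 0.5
) -> Dict[str, float]:
    """Map each selected topic to its probability, used here as the third relevance."""
    probabilities = softmax(scores)
    return {
        topic: probability
        for topic, probability in probabilities.items()
        if probability >= min_probability
    }


# Hypothetical model scores for three candidate topics.
EXAMPLE_SCORES = {"sports": 2.0, "finance": 0.0, "technology": 0.0}
example_topics = third_type_topics(EXAMPLE_SCORES)
```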
In some embodiments, the topic classification model is a deep neural network, and the method further includes a step of establishing the deep neural network, including: performing word segmentation on a sample article to obtain at least one sample segmented word; filtering the at least one sample segmented word to obtain a sample word set of the sample article; and training an initial deep neural network with the sample word set as input and the preset topic of the sample article as output, to obtain the deep neural network.
In some embodiments, determining, based on the mined topics and the determined relevance, the topics of the article to be mined and the relevance between the article to be mined and those topics includes: when the mined topics include at least two types of topics, for each of the at least two types of topics, normalizing the relevance between the article to be mined and the topics of that type, and weighting the normalized relevance.
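The normalize-then-weight step can be illustrated with a short sketch. The text does not specify the normalization formula or the weights, so dividing by the per-type maximum and the weight values below are assumptions, as is the function name `fuse_relevance`.

```python
from typing import Dict, Tuple


def fuse_relevance(
    relevance_by_type: Dict[str, Dict[str, float]],
    type_weights: Dict[str, float],
) -> Dict[Tuple[str, str], float]:
    """Normalize relevance within each topic type, then apply that type's weight."""
    fused = {}
    for topic_type, scores in relevance_by_type.items():
        # Assumption: normalize by the largest score of the same type so that
        # frequencies and probabilities end up on a comparable 0..1 scale.
        peak = max(scores.values(), default=0.0)
        for topic, score in scores.items():
            normalized = score / peak if peak > 0 else 0.0
            fused[(topic_type, topic)] = type_weights[topic_type] * normalized
    return fused


# Hypothetical mined topics: entity frequencies and one source confidence.
MINED = {
    "entity": {"Li Bai": 4.0, "Chang'an": 2.0},
    "source": {"Example Finance Daily": 0.9},
}
WEIGHTS = {"entity": 0.6, "source": 0.4}
fused_relevance = fuse_relevance(MINED, WEIGHTS)
```

Normalizing first matters because the raw relevance values come from different scales (counts for entity topics, confidences or probabilities for the others); weighting afterwards expresses how much each mining approach is trusted.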
In some embodiments, the method further includes: pushing the article to be mined in response to determining that a target keyword matches a topic of the article to be mined.
In a second aspect, an embodiment of the present application provides an apparatus for generating information, including: an acquisition unit configured to acquire an article to be mined; a mining unit configured to mine at least two types of topics of the article to be mined using at least two topic mining approaches, and to determine the relevance between each mined topic and the article to be mined; and a determination unit configured to determine, based on the mined topics and the determined relevance, the topics of the article to be mined and the relevance between the article to be mined and those topics.
In some embodiments, the mining unit includes: a recognition subunit configured to perform named entity recognition on the article to be mined and to determine, based on the named entity recognition result, whether the article includes at least one first-type article topic; and a first determination subunit configured to determine, in response to determining that the article includes at least one first-type article topic, a first relevance between the article and each of the at least one first-type article topic.
In some embodiments, the recognition subunit includes: a recognition and determination unit configured to perform named entity recognition on the article to be mined and to determine whether the article contains at least one named entity; a matching and determination unit configured to, in response to determining that the article contains at least one named entity, match each of the at least one named entity against the candidate topics in a pre-established candidate topic set and determine, according to the matching result, whether the article includes at least one candidate topic, where the candidate topic set is constructed based on a knowledge graph; and a counting and determination unit configured to, in response to determining that the article includes at least one candidate topic, for each of the at least one candidate topic, count the frequency with which the candidate topic appears in the article and, if that frequency exceeds a preset first threshold, determine the candidate topic to be a first-type article topic of the article.
In some embodiments, the first determination subunit is further configured to: for each of the at least one first-type article topic, count the frequency with which the first-type article topic appears in the article to be mined, and determine the first relevance between the article and that first-type article topic according to the counted frequency.
In some embodiments, the counting and determination unit is further configured to: determine, according to the knowledge graph, whether the article to be mined contains an alias of the candidate topic; in response to determining that the article contains an alias of the candidate topic, count a first frequency with which the alias appears in the article; count a second frequency with which the candidate topic itself appears in the article; and calculate the sum of the first frequency and the second frequency, and take the result as the frequency with which the candidate topic appears in the article.
In some embodiments, the mining unit is further configured to: determine whether the source confidence of the source information of the article to be mined exceeds a preset confidence threshold, where the source confidence is obtained from a preset table relating source information to source confidence, the table storing each piece of source information together with its source confidence; and, in response to determining that the source confidence of the source information of the article exceeds the preset confidence threshold, take the source information of the article as a second-type article topic, and take the source confidence of that source information as a second relevance between the article and the second-type article topic.
In some embodiments, the mining unit is further configured to: perform word segmentation on the article to be mined to obtain at least one segmented word; feed the at least one segmented word into a pre-established topic classification model to obtain the probability that the article belongs to each third-type candidate article topic in a preset set of third-type candidate article topics; and determine, based on those probabilities, the third-type article topic of the article and a third relevance between the article and the determined third-type article topic.
In some embodiments, the topic classification model is a deep neural network, and the apparatus further includes a training unit configured to: perform word segmentation on a sample article to obtain at least one sample segmented word; filter the at least one sample segmented word to obtain a sample word set of the sample article; and train an initial deep neural network with the sample word set as input and the preset topic of the sample article as output, to obtain the deep neural network.
In some embodiments, the determination unit is further configured to: when the mined topics include at least two types of topics, for each of the at least two types of topics, normalize the relevance between the article to be mined and the topics of that type, and weight the normalized relevance.
In some embodiments, the apparatus further includes: a push unit configured to push the article to be mined in response to determining that a target keyword matches a topic of the article to be mined.
In a third aspect, an embodiment of the present application provides a server, including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method described in any implementation of the first aspect.
The method and apparatus for generating information provided by the embodiments of the present application first acquire an article to be mined, then mine at least two types of topics of the article using at least two topic mining approaches and determine the relevance between each mined topic and the article, and finally determine, based on the mined topics and the determined relevance, the topics of the article and the relevance between the article and those topics. The topics of the article are thereby mined along different dimensions, yielding more comprehensive and accurate topics.
Brief Description of the Drawings
Other features, objects, and advantages of the present application will become more apparent upon reading the detailed description of non-limiting embodiments given with reference to the following drawings:
FIG. 1 is a diagram of an exemplary system architecture to which the present application can be applied;
FIG. 2 is a flowchart of one embodiment of the method for generating information according to the present application;
FIG. 3 is a schematic diagram of part of the structure of the knowledge graph used in the present application;
FIG. 4 is a schematic diagram of an application scenario of the method for generating information according to the present application;
FIG. 5 is a flowchart of another embodiment of the method for generating information according to the present application;
FIG. 6 is a schematic structural diagram of one embodiment of the apparatus for generating information according to the present application;
FIG. 7 is a schematic structural diagram of a computer system suitable for implementing the server of the embodiments of the present application.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention.
It should be noted that, where no conflict arises, the embodiments of the present application and the features of those embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
FIG. 1 shows an exemplary system architecture 100 to which embodiments of the method for generating information or the apparatus for generating information of the present application can be applied.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, for example to receive or send messages. Various client applications may be installed on the terminal devices 101, 102, 103, such as web browsers, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102, 103 may be any of various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, and desktop computers.
The server 105 may be a server that provides various services, for example a back-end server that performs topic mining on articles to be mined. The back-end server may perform topic mining on articles acquired from the Internet to obtain the topics of each article and the relevance between the article and the mined topics.
It should be noted that the method for generating information provided by the embodiments of the present application is generally executed by the server 105; accordingly, the apparatus for generating information is generally disposed in the server 105.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as the implementation requires.
With continued reference to FIG. 2, a flow 200 of one embodiment of the method for generating information according to the present application is shown. The method for generating information includes the following steps:
Step 201: acquire an article to be mined.
In this embodiment, the electronic device on which the method for generating information runs (for example, the server 105 shown in FIG. 1) may acquire the article to be mined from the Internet in various ways (for example, by crawling with a web crawler). Here, the article to be mined may be an article in text form, for example a text article that includes a title, body content, and other parts.
Step 202: mine at least two types of topics of the article to be mined using at least two topic mining approaches, and determine the relevance between each mined topic and the article to be mined.
In this embodiment, the electronic device may mine at least two types of topics of the article to be mined using at least two topic mining approaches, and determine the relevance between each mined topic and the article. Here, the electronic device may mine the topics of the article in a variety of ways; for example, it may generate the topic of the article by extracting keywords from the article, or it may extract the keywords in the title of the article as the topic of the article.
In some optional implementations of this embodiment, step 202 may specifically include the following. First, the electronic device may perform named entity recognition on the article to be mined and determine, based on the recognition result, whether the article includes at least one first-type article topic. As an example, the electronic device may perform named entity recognition on the article to be mined and judge from the result whether the article includes one or more named entities (for example, person names, organization names, place names, book titles, and so on); if it does, these named entities may be taken as the first-type article topics of the article. Afterwards, in response to determining that the article to be mined includes at least one first-type article topic, the electronic device may determine a first degree of association between the article and each of the at least one first-type article topic. As an example, for each first-type article topic, the electronic device may determine the first degree of association according to the position at which that topic first appears in the article; for example, it may be set that the earlier the position of the first appearance, the higher the first degree of association.
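The position-based example above can be illustrated with a minimal sketch. The function name `first_position_relevance` and the linear mapping from position to score are assumptions made for illustration; the description only requires that an earlier first appearance yields a higher first degree of association.

```python
def first_position_relevance(article: str, topic: str) -> float:
    """Score in [0, 1]: the earlier the topic first appears, the higher the score.

    The linear formula is an illustrative assumption, not part of the description.
    """
    if not article:
        return 0.0
    index = article.find(topic)  # character offset of the first occurrence, -1 if absent
    if index < 0:
        return 0.0  # the topic never appears in the article
    return 1.0 - index / len(article)


if __name__ == "__main__":
    text = "Alice met Bob"
    print(first_position_relevance(text, "Alice"))
    print(first_position_relevance(text, "Bob"))
```

Any monotonically decreasing function of the first position would serve equally well here.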
In some optional implementations, performing named entity recognition on the article to be mined and determining, based on the recognition result, whether the article includes at least one first-type article topic may specifically include the following. First, the electronic device may perform named entity recognition on the article to be mined and determine whether the article contains at least one named entity. Then, in response to determining that the article contains at least one named entity, the electronic device may match each of the named entities against the candidate topics in a pre-established candidate topic set and determine, according to the matching result, whether the article includes at least one candidate topic. For example, for each named entity, the electronic device may compare the named entity with each candidate topic in the candidate topic set; if the named entity is identical to a candidate topic in the set, it may be determined that the article includes that candidate topic. Here, the candidate topic set may be constructed based on a knowledge graph. FIG. 3 shows a schematic structural diagram of a part of a knowledge graph, in which persons, works, places, numerical values, heights, and the like serve as nodes. Through the knowledge graph, the electronic device can learn more about a named entity (for example, its aliases, its hypernym and hyponym relations, and the entities or concepts related to it), so a more comprehensive and accurate candidate topic set can be obtained. Finally, in response to determining that the article includes at least one candidate topic, for each such candidate topic the electronic device may count the frequency, that is, the number of times, with which the candidate topic appears in the article; if that frequency exceeds a preset first threshold, the candidate topic may be determined to be a first-type article topic of the article to be mined. Here, the first threshold may be set manually according to actual needs.
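The matching and thresholding steps above can be sketched as follows. The sketch assumes that named entity recognition has already produced a list of entity strings (no recognizer is implemented here), and that the candidate topic set is a plain Python set standing in for one derived from a knowledge graph; the function and parameter names are hypothetical.

```python
def first_type_topics(article, named_entities, candidate_topics, first_threshold):
    """Return candidate topics that occur in the article more than first_threshold times.

    named_entities is assumed to be the output of some upstream NER step.
    """
    topics = []
    for entity in dict.fromkeys(named_entities):  # drop duplicates, keep order
        if entity not in candidate_topics:
            continue  # the entity is not an exact match for any candidate topic
        if article.count(entity) > first_threshold:
            topics.append(entity)
    return topics


if __name__ == "__main__":
    article = "Paris is lovely. Paris hosts many visitors. Lyon is smaller."
    entities = ["Paris", "Lyon", "Paris"]
    candidates = {"Paris", "Lyon", "Rome"}
    print(first_type_topics(article, entities, candidates, first_threshold=1))
```

Exact string comparison is used because the description states that the named entity must be the same as the candidate topic.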
In some optional implementations, determining, in response to determining that the article to be mined includes at least one first-type article topic, the first degree of association between the article and each of the at least one first-type article topic may specifically include: for each first-type article topic, the electronic device may count the frequency with which that topic appears in the article to be mined and determine the first degree of association between the article and that topic according to the counted frequency. For example, the number of times a first-type article topic appears in the article may be used directly, or indirectly (for example, as the result of multiplying the number of occurrences by a preset weight), as the first degree of association between that topic and the article.
Optionally, counting the frequency with which the candidate topic appears in the article to be mined may specifically include the following. First, the electronic device may determine, according to the knowledge graph, whether the article contains an alias of the candidate topic; for example, the knowledge graph may indicate that "北大" is an alias of "北京大学" (Peking University). Then, in response to determining that the article contains an alias of the candidate topic, the electronic device may count a first frequency with which the alias appears in the article, count a second frequency with which the candidate topic itself appears in the article, and finally calculate the sum of the first frequency and the second frequency, taking the result as the frequency with which the candidate topic appears in the article to be mined.
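The alias-aware counting, together with the frequency-based first degree of association, can be sketched as below. The `ALIASES` dictionary is a hypothetical stand-in for an alias lookup in the knowledge graph, and the `weight` parameter corresponds to the optional preset weight; both names are assumptions. Note that simple substring counting would count an occurrence twice if an alias were itself a substring of the topic name, which is not the case in this example.

```python
# Hypothetical alias table standing in for a knowledge-graph lookup.
ALIASES = {"北京大学": ["北大"]}


def topic_frequency(article, topic, aliases):
    """Sum of alias occurrences (first frequency) and topic occurrences (second frequency)."""
    first_frequency = sum(article.count(alias) for alias in aliases)
    second_frequency = article.count(topic)
    return first_frequency + second_frequency


def first_relevance(frequency, weight=1.0):
    """Use the frequency directly (weight 1.0) or scaled by a preset weight."""
    return frequency * weight


if __name__ == "__main__":
    article = "北京大学今日开学。北大的学生很多,北大欢迎你。"
    frequency = topic_frequency(article, "北京大学", ALIASES.get("北京大学", []))
    print(frequency, first_relevance(frequency, weight=0.5))
```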
In some optional implementations of this embodiment, step 202 may further specifically include the following. First, the electronic device may determine whether the source confidence of the source information of the article to be mined exceeds a preset confidence threshold. The source confidence may be obtained from a preset relation table of source information and source confidence, which stores source information in correspondence with source confidences. As an example, the electronic device may first obtain the source information of the article to be mined, such as the author of the article, or the media account or website that published it. It may then match the source information of the article against the source information in the relation table, thereby obtaining the source confidence of the article's source information. Here, the source confidences in the relation table may be determined in various ways. For example, when the source information is a media account used to publish information, the source confidence of the media account may be set according to factors such as the topic concentration, the quality, and the originality of the information it has published in the past: it may be set that the more concentrated the topics of the previously published information, the higher the source confidence, and that the higher the quality of the previously published information (which may be determined, for example, from the numbers of reads, clicks, and subscriptions), the higher the source confidence. Then, in response to determining that the source confidence exceeds the preset confidence threshold, the electronic device may take the source information of the article to be mined as a second-type article topic and take the source confidence as a second degree of association between the article and the second-type article topic.
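The table lookup and threshold check above can be sketched as follows. The table contents, the source names, and the function name `second_type_topic` are all hypothetical; how the confidences in the table are assigned is outside the scope of the sketch.

```python
# Hypothetical relation table: source information -> source confidence.
SOURCE_CONFIDENCE = {
    "media-account-a": 0.9,
    "website-b": 0.4,
}


def second_type_topic(source, table, confidence_threshold):
    """Return (topic, second degree of association), or None if the source does not qualify."""
    confidence = table.get(source)  # None when the source is not in the table
    if confidence is not None and confidence > confidence_threshold:
        return source, confidence
    return None


if __name__ == "__main__":
    print(second_type_topic("media-account-a", SOURCE_CONFIDENCE, 0.8))
    print(second_type_topic("website-b", SOURCE_CONFIDENCE, 0.8))
```

Returning `None` for unknown sources is a design choice of this sketch; the description does not say how sources missing from the table are handled.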
In some optional implementations of this embodiment, step 202 may further specifically include the following. First, the electronic device may perform word segmentation on the article to be mined to obtain at least one word segment. Next, the at least one word segment may be fed into a pre-established topic classification model to obtain the probability that the article belongs to each third-type candidate article topic in a preset set of third-type candidate article topics. Here, the set of third-type candidate article topics may be established based on the above knowledge network; for example, the electronic device may extract abstract concepts from the knowledge network, such as "constellation", "language", "automobile", and "variety show", as third-type candidate article topics. It should be noted that the topic classification model may be used to characterize the correspondence between word segment sets and topic classification results. As an example, the topic classification model may be a correspondence table, drawn up in advance by technicians on the basis of statistics over a large number of word segment sets and topic classification results, that stores the correspondences between multiple word segment sets and topic classification results. Finally, the third-type article topics of the article to be mined, and a third degree of association between the article and each determined third-type article topic, are determined based on the probabilities that the article belongs to the third-type candidate article topics. As an example, the electronic device may judge whether the probability that the article belongs to a given third-type candidate article topic exceeds a preset probability threshold; if so, that candidate topic may be determined to be a third-type article topic of the article. In addition, the electronic device may take the probability that the article belongs to that third-type candidate article topic as the third degree of association between the article and that third-type article topic.
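The final selection step can be sketched as a simple filter over the probabilities returned by the topic classification model. The classifier itself is not implemented; the `probabilities` dictionary is assumed to be its output, and the function name is hypothetical.

```python
def third_type_topics(probabilities, probability_threshold):
    """Keep candidate topics whose probability exceeds the threshold.

    The kept probability doubles as the third degree of association.
    """
    return {
        topic: probability
        for topic, probability in probabilities.items()
        if probability > probability_threshold
    }


if __name__ == "__main__":
    # Assumed classifier output for one article.
    probabilities = {"automobile": 0.75, "variety show": 0.2, "language": 0.05}
    print(third_type_topics(probabilities, probability_threshold=0.5))
```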
In some optional implementations, the topic classification model may be a deep neural network, and the method for generating information may further include a step of establishing the deep neural network, which may specifically include the following. First, the electronic device, or another electronic device used to train the deep neural network, may perform word segmentation on a sample article to obtain at least one sample word segment. Next, the at least one sample word segment may be filtered to obtain a sample word segment set of the sample article; for example, punctuation marks, stop words, and the like may be removed. Finally, an initial deep neural network may be trained with the sample word segment set as input and the preset topic of the sample article as output, yielding the deep neural network. Here, the initial deep neural network may be obtained in various ways; for example, it may be obtained by taking an existing deep neural network and randomly generating its network parameters. As an example, the initial deep neural network may include a convolutional neural network and a fully connected layer, and the training process may proceed as follows. First, the sample word segment set may be input into the convolutional neural network to obtain a feature vector of the sample word segment set. Then, the feature vector may be input into the fully connected layer to obtain the predicted probability that the sample article belongs to each third-type candidate article topic in the set of third-type candidate article topics. These predicted probabilities are compared with the preset topic of the sample article (here, it may be assumed that the probability that the sample article belongs to its preset topic is 100% and the probability that it belongs to any other topic is 0) to obtain the prediction accuracy of the initial deep neural network. If the prediction accuracy is greater than a preset accuracy threshold, the initial deep neural network may be taken as the trained deep neural network. If the prediction accuracy is less than the preset accuracy threshold, the network parameters of the initial deep neural network are adjusted. It should be noted that the training process described above merely illustrates how the parameters of the deep neural network are adjusted. The initial deep neural network may be regarded as the network before parameter adjustment and the deep neural network as the network after parameter adjustment; the adjustment is not limited to a single pass and may be repeated many times according to the degree of optimization of the network and actual needs.
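The adjust-until-accurate loop described above can be illustrated with a deliberately simplified sketch. It does not implement the convolutional network and fully connected layer of the example: a single linear softmax layer over word-count features is swapped in so that the sketch stays self-contained, and the parameters start at zero rather than at random values so that the run is deterministic. All function names, the learning rate, and the toy data are assumptions.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def featurize(word_segments, vocabulary):
    """Word-count feature vector of a filtered sample word segment set."""
    return [float(word_segments.count(word)) for word in vocabulary]


def predict(weights, features):
    """Predicted probability of each candidate topic."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(scores)


def accuracy(weights, samples):
    """Fraction of samples whose most probable topic equals the preset topic."""
    hits = 0
    for features, label in samples:
        probabilities = predict(weights, features)
        if probabilities.index(max(probabilities)) == label:
            hits += 1
    return hits / len(samples)


def train(samples, n_topics, accuracy_threshold, learning_rate=0.5, max_rounds=200):
    n_features = len(samples[0][0])
    weights = [[0.0] * n_features for _ in range(n_topics)]  # "initial" network
    for _ in range(max_rounds):
        if accuracy(weights, samples) > accuracy_threshold:
            break  # accurate enough: treat the current network as trained
        for features, label in samples:  # otherwise adjust the parameters
            probabilities = predict(weights, features)
            for k in range(n_topics):
                target = 1.0 if k == label else 0.0  # preset topic has probability 100%
                for j in range(n_features):
                    weights[k][j] -= learning_rate * (probabilities[k] - target) * features[j]
    return weights


if __name__ == "__main__":
    vocabulary = ["engine", "wheel", "singer", "stage"]
    raw = [(["engine", "wheel"], 0), (["wheel", "engine", "engine"], 0),
           (["singer", "stage"], 1), (["stage", "singer", "singer"], 1)]
    samples = [(featurize(tokens, vocabulary), label) for tokens, label in raw]
    weights = train(samples, n_topics=2, accuracy_threshold=0.9)
    print(accuracy(weights, samples))
```

The accuracy is measured here on the training samples themselves for brevity; a held-out set would normally be preferable.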
Step 203, based on the mined topics and the determined degrees of association, determine the topics of the article to be mined and the degree of association between the article to be mined and each topic.
In this embodiment, the electronic device may determine, based on the mined topics and the determined degrees of association, the topics of the article to be mined and the degree of association between the article and each topic. For example, the electronic device may take all or some of the mined topics as the topics of the article to be mined. As another example, for a certain type of topic, the electronic device may raise the degree of association between topics of that type and the article to be mined in various ways (for example, by adding a preset value to the degree of association).
In some optional implementations of this embodiment, step 203 may specifically include: when the mined topics include at least two types of topics, for each of the at least two types, the electronic device may normalize the degrees of association between the article to be mined and the topics of that type, and weight the normalized degrees of association. Degrees of association determined by different topic mining methods generally fall in different numerical intervals. For example, the degrees of association determined by a first method may fall in the interval [0, 100], while those determined by a second method may fall in the interval [0, 1]. Therefore, to make comparison easier, the determined degrees of association may be normalized so that the degrees of association produced by different topic mining methods fall in the same numerical interval. In addition, the normalized degrees of association may be weighted; for example, for a topic mining method that performs well, weighting (for example, multiplying by a certain value) may be used to raise the degrees of association between the article to be mined and the topics mined by that method.
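The normalization and weighting can be sketched as follows. Min-max scaling over a known interval per mining method is one possible normalization and is an assumption of this sketch, as are the method names, the weights, and the choice to keep the highest score when two methods yield the same topic.

```python
def normalize(score, low, high):
    """Map a score from the interval [low, high] onto [0, 1]."""
    if high == low:
        raise ValueError("interval must have non-zero width")
    return (score - low) / (high - low)


def combine(mined, intervals, weights):
    """Normalize and weight per-method scores; keep the highest score per topic.

    mined:     {method: {topic: raw degree of association}}
    intervals: {method: (low, high)} numerical interval of each method
    weights:   {method: multiplicative weight applied after normalization}
    """
    combined = {}
    for method, topics in mined.items():
        low, high = intervals[method]
        for topic, score in topics.items():
            value = weights[method] * normalize(score, low, high)
            combined[topic] = max(value, combined.get(topic, value))
    return combined


if __name__ == "__main__":
    mined = {"method_one": {"actor A": 50.0}, "method_two": {"quarrel": 0.5}}
    intervals = {"method_one": (0.0, 100.0), "method_two": (0.0, 1.0)}
    weights = {"method_one": 1.0, "method_two": 2.0}
    print(combine(mined, intervals, weights))
```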
Continuing to refer to FIG. 4, FIG. 4 is a schematic diagram of an application scenario of the method for generating information according to this embodiment. In the application scenario of FIG. 4, the server 401 obtains, through website A, an article about a quarrel between actor A and actor B as the article to be mined. The server 401 may then use at least two topic mining methods to mine at least two types of topics from the article, for example, topics of the named-entity class such as "actor A" and "actor B", and topics of the abstract-concept class such as "quarrel" and "hype", and determine the degree of association between each mined topic and the article. Finally, the server 401 may determine, based on the mined topics and the determined degrees of association, the topics of the article and the degree of association between the article and each topic; for example, the server may take all or some of the mined topics as the topics of the article to be mined.
The method provided by the above embodiment of the present application mines at least two types of topics from the article to be mined using at least two topic mining methods, and determines the degree of association between the mined topics and the article, so that the topics of the article are mined from different dimensions and more comprehensive and accurate topics are obtained.
With further reference to FIG. 5, a flow 500 of another embodiment of the method for generating information is shown. The flow 500 of the method for generating information includes the following steps:
Step 501, obtain an article to be mined.
In this embodiment, the electronic device on which the method for generating information runs (for example, the server 105 shown in FIG. 1) may obtain the article to be mined from the Internet in various ways.
Step 502, use at least two topic mining methods to mine at least two types of topics from the article to be mined, and determine the degree of association between each mined topic and the article to be mined.
In this embodiment, based on the article to be mined obtained in step 501, the electronic device may use at least two topic mining methods to mine at least two types of topics from the article, and determine the degree of association between each mined topic and the article.
Step 503, based on the mined topics and the determined degrees of association, determine the topics of the article to be mined and the degree of association between the article to be mined and each topic.
In this embodiment, the electronic device may determine, based on the mined topics and the determined degrees of association, the topics of the article to be mined and the degree of association between the article and each topic.
Step 504, in response to determining that a target keyword matches a topic of the article to be mined, push the article to be mined.
In this embodiment, the electronic device may match a target keyword against the topics of the article to be mined determined in step 503. For example, the target keyword may be compared with those topics; if the target keyword is identical to one of the topics and the degree of association between that topic and the article is greater than a preset push threshold, it may be determined that the target keyword matches a topic of the article to be mined. In response to determining that the target keyword matches a topic of the article, the electronic device may push the article; for example, it may push the article to the terminal that sent the target keyword. Here, the target keyword may be extracted from search information sent by a user through a terminal; for example, if the user sends the search information "celebrity Zhang XX" through the terminal, the electronic device may extract "Zhang XX" as the target keyword. Alternatively, the target keyword may be a keyword obtained from a user profile; for example, if the profile of a user shows that the user is a fan of "celebrity Li XX", the electronic device may use "Li XX" as the target keyword.
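The matching condition for pushing can be sketched as follows. The article's topics are assumed to be available as a dictionary from topic to degree of association (the output of step 503); the function name and the example values are hypothetical, and the actual delivery to a terminal is not modeled.

```python
def should_push(target_keyword, article_topics, push_threshold):
    """True if the keyword equals a topic whose degree of association exceeds the threshold."""
    association = article_topics.get(target_keyword)  # None if no identical topic exists
    return association is not None and association > push_threshold


if __name__ == "__main__":
    article_topics = {"Zhang XX": 0.8, "quarrel": 0.3}
    print(should_push("Zhang XX", article_topics, push_threshold=0.5))
    print(should_push("quarrel", article_topics, push_threshold=0.5))
```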
As can be seen from FIG. 5, compared with the embodiment corresponding to FIG. 2, the flow 500 of the method for generating information in this embodiment highlights the step of pushing the article to be mined. The solution described in this embodiment can therefore make effective use of the determined topics of the article to be mined and the degree of association between the article and each topic, thereby achieving more accurate information pushing.
With further reference to FIG. 6, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for generating information. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus may be applied in various electronic devices.
As shown in FIG. 6, the apparatus 600 for generating information of this embodiment includes an obtaining unit 601, a mining unit 602, and a determining unit 603. The obtaining unit 601 is configured to obtain an article to be mined; the mining unit 602 is configured to use at least two topic mining methods to mine at least two types of topics from the article to be mined, and to determine the degree of association between each mined topic and the article; the determining unit 603 is configured to determine, based on the mined topics and the determined degrees of association, the topics of the article to be mined and the degree of association between the article and each topic.
In this embodiment, for the specific processing of the obtaining unit 601, the mining unit 602, and the determining unit 603 of the apparatus 600 for generating information, and the technical effects they bring about, reference may be made to the descriptions of step 201, step 202, and step 203 in the embodiment corresponding to FIG. 2, respectively; they are not repeated here.
在本实施例的一些可选的实现方式中,上述挖掘单元602包括:识别子单元(图中未示出),用于对上述待挖掘文章进行命名实体识别,基于命名实体识别结果确定上述待挖掘文章是否包括至少一个第一类型文章主题;第一确定子单元(图中未示出),用于响应于确定上述待挖掘文章包括至少一个第一类型文章主题,确定上述待挖掘文章与上述至少一个第一类型文章主题中各个第一类型文章主题的第一关联度。In some optional implementations of this embodiment, the above-mentioned mining unit 602 includes: an identification subunit (not shown in the figure), configured to perform named entity recognition on the above-mentioned articles to be mined, and determine the above-mentioned articles to be mined based on the named entity recognition results Whether the excavated article includes at least one first type of article subject; the first determination subunit (not shown in the figure) is used for determining that the above-mentioned article to be mined is related to the above-mentioned The first degree of relevance of each first-type article topic in at least one first-type article topic.
在本实施例的一些可选的实现方式中,上述识别子单元包括:识别和确定单元(图中未示出),用于对上述待挖掘文章进行命名实体识别,确定上述待挖掘文章中是否包含至少一个命名实体;匹配和确定单元(图中未示出),用于响应于确定上述待挖掘文章中包含至少一个命名实体,将上述至少一个命名实体中的各个命名实体与预先建立的候选主题集合中的候选主题进行匹配,根据匹配结果确定上述待挖掘文章中是否包括至少一个候选主题,其中,上述候选主题集合是基于知识图谱构建的;统计和确定单元(图中未示出),用于响应于确定上述待挖掘文章中包括至少一个候选主题,对于上述至少一个候选主题中的每一个候选主题,统计该候选主题在上述待挖掘文章中出现的频次,如果该候选主题在上述待挖掘文章中出现的频次超过预先设定的第一阈值,则确定该候选主题为上述待挖掘文章的第一类型文章主题。In some optional implementations of this embodiment, the recognition subunit includes: a recognition and determination unit (not shown in the figure), configured to perform named entity recognition on the article to be mined, and determine whether the article to be mined is Contains at least one named entity; a matching and determining unit (not shown in the figure), configured to match each named entity in the at least one named entity with a pre-established candidate in response to determining that the article to be mined contains at least one named entity The candidate topics in the topic collection are matched, and it is determined whether at least one candidate topic is included in the above-mentioned article to be mined according to the matching result, wherein, the above-mentioned candidate topic collection is constructed based on a knowledge map; a statistical and determination unit (not shown in the figure), In response to determining that the above-mentioned article to be mined includes at least one candidate topic, for each candidate topic in the above-mentioned at least one candidate topic, count the frequency of occurrence of the candidate topic in the above-mentioned article to be mined, if the candidate topic is in the above-mentioned to-be-mined If the frequency of occurrences in the mined articles exceeds the preset first threshold, the candidate topic is determined to be the first type of article topic of the above-mentioned articles to be mined.
在本实施例的一些可选的实现方式中,上述第一确定子单元进一步用于:对于上述至少一个第一类型文章主题中的每一个第一类型文章主题,统计该第一类型文章主题在上述待挖掘文章中出现的频次,根据统计得到的频次确定上述待挖掘文章与该第一类型文章主题的第一关联度。In some optional implementations of this embodiment, the above-mentioned first determining subunit is further configured to: for each first-type article topic in the above-mentioned at least one first-type article topic, count the first-type article topics in The frequency of appearance of the article to be mined is determined according to the frequency obtained by statistics, and the first correlation degree between the article to be mined and the subject of the first type of article is determined.
在本实施例的一些可选的实现方式中,上述统计和确定单元进一步用于:根据上述知识图谱确定上述待挖掘文章中是否包含该候选主题的别名;响应于确定上述待挖掘文章中包含该候选主题的别名,统计该候选主题的别名在上述待挖掘文章中出现的第一频次;统计该候选主题在上述待挖掘文章中出现的第二频次;计算上述第一频次和上述第二频次之和,将计算结果作为该候选主题在上述待挖掘文章中出现的频次。In some optional implementations of this embodiment, the statistics and determination unit is further configured to: determine whether the alias of the candidate topic is included in the article to be mined according to the knowledge map; The alias of the candidate topic, counting the first frequency of the alias of the candidate topic appearing in the above-mentioned article to be mined; counting the second frequency of the candidate topic appearing in the above-mentioned article to be mined; calculating the difference between the above-mentioned first frequency and the above-mentioned second frequency and, take the calculation result as the frequency of occurrence of the candidate topic in the above articles to be mined.
在本实施例的一些可选的实现方式中,上述挖掘单元602进一步用于:确定上述待挖掘文章的来源信息的来源置信度是否超过预先设定的置信度阈值,其中,上述待挖掘文章的来源信息的来源置信度是从预先设定的来源信息与来源置信度关系表中获取的,上述来源信息与来源置信度关系表对应存储有来源信息和来源置信度;响应于确定上述待挖掘文章的来源信息的来源置信度超过预先设定的置信度阈值,将上述待挖掘文章的来源信息作为第二类型文章主题,并将上述待挖掘文章的来源信息的来源置信度作为上述待挖掘文章与上述第二类型文章主题的第二关联度。In some optional implementations of this embodiment, the mining unit 602 is further configured to: determine whether the source confidence of the source information of the article to be mined exceeds a preset confidence threshold, wherein the source information of the article to be mined The source confidence of the source information is obtained from the preset source information and source confidence relationship table, and the above source information and source confidence relationship table correspondingly stores the source information and source confidence; in response to determining the above-mentioned article to be mined If the source confidence of the source information exceeds the preset confidence threshold, the source information of the above-mentioned article to be mined is taken as the second type of article topic, and the source confidence of the source information of the above-mentioned article to be mined is used as the source information of the above-mentioned article to be mined and A second degree of relevance to the subject of the second type of article above.
In some optional implementations of this embodiment, the mining unit 602 is further configured to: perform word segmentation on the article to be mined to obtain at least one segmented word; feed the at least one segmented word into a pre-established topic classification model to obtain the probability that the article to be mined belongs to each third-type candidate article topic in a preset set of third-type candidate article topics; and, based on those probabilities, determine the third-type article topic of the article to be mined and the third relevance between the article to be mined and the determined third-type article topic.
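The text does not state how topics are chosen from the per-topic probabilities. One plausible rule, shown here purely as an assumption, is to keep every candidate topic whose probability reaches a cut-off and to use that probability as the third relevance. The cut-off value and the function name are hypothetical, and the classifier output is represented as a plain dictionary.

```python
def pick_topics(
    probabilities: dict[str, float], min_probability: float = 0.5
) -> list[tuple[str, float]]:
    """Select topics from classifier output, most probable first.

    Assumed rule: keep topics at or above `min_probability` and treat
    the probability as the relevance of the article to the topic.
    """
    selected = [(topic, p) for topic, p in probabilities.items() if p >= min_probability]
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Hypothetical output of a topic classification model for one article.
model_output = {"sports": 0.7, "finance": 0.2, "technology": 0.55}
topics = pick_topics(model_output)
```

With the values above, "sports" and "technology" are kept and "finance" is dropped.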
In some optional implementations of this embodiment, the topic classification model is a deep neural network, and the apparatus further includes a training unit (not shown in the figure). The training unit is configured to: perform word segmentation on a sample article to obtain at least one sample segmented word; filter the at least one sample segmented word to obtain a sample word set for the sample article; and train an initial deep neural network, using the sample word set as input and the preset topic of the sample article as output, to obtain the deep neural network.
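The data-preparation half of this training procedure (segmentation followed by filtering) can be sketched as below. The whitespace tokenizer and the stop-word list are stand-ins: Chinese text would need a real word segmenter, and the filtering criterion is not specified in the text. The network training step itself is omitted.

```python
# Hypothetical stop-word list used as the filtering criterion.
STOP_WORDS = {"the", "a", "of", "and", "in"}


def build_training_sample(article: str, topic: str) -> tuple[list[str], str]:
    """Turn a labelled sample article into an (input words, output topic) pair."""
    # Segmentation: lowercase and split on whitespace (stand-in for a real segmenter).
    tokens = article.lower().split()
    # Filtering: drop words that carry little topical signal.
    filtered = [token for token in tokens if token not in STOP_WORDS]
    return filtered, topic


sample = build_training_sample("The price of gold and silver", "finance")
```

The resulting pair, `(["price", "gold", "silver"], "finance")`, is the kind of input and output the training unit would pass to the initial network.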
In some optional implementations of this embodiment, the determination unit 603 is further configured to: when the mined topics include at least two types of topics, for each of the at least two types, normalize the relevance between the article to be mined and the topics of that type, and then weight the normalized relevance.
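Relevance values from different mining methods live on different scales (occurrence counts, confidences, probabilities), which is why they are normalized per type before weighting. The sketch below assumes max-scaling within each type and a fixed weight per type; both choices, and all names and numbers, are illustrative assumptions rather than details from the text.

```python
def normalize_and_weight(
    relevance_by_type: dict[str, dict[str, float]],
    weights: dict[str, float],
) -> dict[tuple[str, str], float]:
    """Scale each type's relevance into [0, 1], then apply that type's weight."""
    result: dict[tuple[str, str], float] = {}
    for topic_type, scores in relevance_by_type.items():
        # Assumed normalization: divide by the largest score within the type.
        peak = max(scores.values(), default=0.0)
        for topic, score in scores.items():
            normalized = score / peak if peak > 0 else 0.0
            result[(topic_type, topic)] = normalized * weights[topic_type]
    return result


relevance = {
    "entity": {"IBM": 4.0, "mainframe": 2.0},  # e.g. occurrence counts
    "source": {"Example Daily": 0.9},          # e.g. source confidence
}
type_weights = {"entity": 0.5, "source": 0.25}  # hypothetical weights
weighted = normalize_and_weight(relevance, type_weights)
```

After this step every relevance is comparable across types: "IBM" scores 0.5, while "mainframe" and "Example Daily" both score 0.25.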
In some optional implementations of this embodiment, the apparatus 600 further includes a push unit (not shown in the figure), configured to push the article to be mined in response to determining that a target keyword matches a topic of the article to be mined.
Referring now to FIG. 7, a schematic structural diagram is shown of a computer system 700 suitable for implementing the server of the embodiments of the present application. The server shown in FIG. 7 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in FIG. 7, the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 706 into a random access memory (RAM) 703. The RAM 703 also stores the various programs and data required for the operation of the system 700. The CPU 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: a storage section 706 including a hard disk or the like; and a communication section 707 including a network interface card such as a LAN (Local Area Network) card or a modem. The communication section 707 performs communication processing via a network such as the Internet. A drive 708 is also connected to the I/O interface 705 as needed. A removable medium 709, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 708 as needed, so that a computer program read from it can be installed into the storage section 706 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 707, and/or installed from the removable medium 709. When the computer program is executed by the central processing unit (CPU) 701, the functions defined in the method of the present application are performed.
It should be noted that the computer-readable medium described in this application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this application, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should further be noted that each block of the block diagrams and/or flowcharts, and any combination of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor; for example, a processor may be described as including an acquisition unit, a mining unit, and a determination unit. The names of these units do not, in some cases, limit the units themselves; for example, the acquisition unit may also be described as "a unit for acquiring an article to be mined".
In another aspect, the present application further provides a computer-readable medium. The computer-readable medium may be included in the apparatus described in the above embodiments, or it may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire an article to be mined; mine at least two types of topics of the article to be mined using at least two topic mining methods, and determine the relevance between each mined topic and the article to be mined; and, based on the mined topics and the determined relevance, determine the topics of the article to be mined and the relevance between the article to be mined and those topics.
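The three steps carried by the program (acquire, mine with at least two methods, determine the final topics and relevance) can be sketched as a small pipeline. Each mining method is modelled as a hypothetical function returning a topic-to-relevance mapping, and overlapping topics are merged by keeping the highest relevance; that merge rule is an assumption, since the text leaves it open.

```python
from typing import Callable

# A miner maps article text to {topic: relevance}; the signature is hypothetical.
Miner = Callable[[str], dict[str, float]]


def determine_topics(article: str, miners: list[Miner]) -> dict[str, float]:
    """Run at least two topic miners and merge their results."""
    if len(miners) < 2:
        raise ValueError("at least two topic mining methods are required")
    merged: dict[str, float] = {}
    for miner in miners:
        for topic, relevance in miner(article).items():
            # Assumed merge rule: keep the highest relevance seen for a topic.
            merged[topic] = max(merged.get(topic, 0.0), relevance)
    return merged


# Two toy miners standing in for, e.g., entity-based and classifier-based mining.
toy_miners: list[Miner] = [
    lambda text: {"IBM": 0.5},
    lambda text: {"IBM": 0.75, "finance": 0.25},
]
final_topics = determine_topics("any article text", toy_miners)
```

In practice the relevance values would be normalized and weighted per type before merging, so that scores from different miners are comparable.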
The above description covers only preferred embodiments of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in this application is not limited to technical solutions formed by the specific combination of the above technical features. It also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features of similar function disclosed in this application.
Claims (22)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711326807.4A CN107944032B (en) | 2017-12-13 | 2017-12-13 | Method and apparatus for generating information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711326807.4A CN107944032B (en) | 2017-12-13 | 2017-12-13 | Method and apparatus for generating information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107944032A (en) | 2018-04-20 |
CN107944032B (en) | 2021-12-31 |
Family
ID=61944045
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711326807.4A Active CN107944032B (en) | 2017-12-13 | 2017-12-13 | Method and apparatus for generating information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107944032B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109241296A (en) * | 2018-09-14 | 2019-01-18 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating information |
CN109388745A (en) * | 2018-06-15 | 2019-02-26 | 云天弈(北京)信息技术有限公司 | A kind of automatic authoring system of batch article |
CN110245334A (en) * | 2019-06-25 | 2019-09-17 | 北京百度网讯科技有限公司 | Method and apparatus for outputting information |
CN110263342A (en) * | 2019-06-20 | 2019-09-20 | 北京百度网讯科技有限公司 | Method for digging and device, the electronic equipment of the hyponymy of entity |
CN110543574A (en) * | 2019-08-30 | 2019-12-06 | 北京百度网讯科技有限公司 | A method, device, equipment and medium for constructing a knowledge graph |
CN112446206A (en) * | 2019-08-16 | 2021-03-05 | 阿里巴巴集团控股有限公司 | Menu title generation method and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102207948A (en) * | 2010-07-13 | 2011-10-05 | 天津海量信息技术有限公司 | Method for generating incident statement sentence material base |
US20150026255A1 (en) * | 2013-07-17 | 2015-01-22 | Yahoo! Inc. | Determination of general and topical news and geographical scope of news content |
CN104951542A (en) * | 2015-06-19 | 2015-09-30 | 百度在线网络技术(北京)有限公司 | Method and device for recognizing class of social contact short texts and method and device for training classification models |
CN106355628A (en) * | 2015-07-16 | 2017-01-25 | 中国石油化工股份有限公司 | Image-text knowledge point marking method and device and image-text mark correcting method and system |
CN107168992A (en) * | 2017-03-29 | 2017-09-15 | 北京百度网讯科技有限公司 | Article sorting technique and device, equipment and computer-readable recording medium based on artificial intelligence |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109388745A (en) * | 2018-06-15 | 2019-02-26 | 云天弈(北京)信息技术有限公司 | A kind of automatic authoring system of batch article |
CN109241296A (en) * | 2018-09-14 | 2019-01-18 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating information |
CN110263342A (en) * | 2019-06-20 | 2019-09-20 | 北京百度网讯科技有限公司 | Method for digging and device, the electronic equipment of the hyponymy of entity |
CN110245334A (en) * | 2019-06-25 | 2019-09-17 | 北京百度网讯科技有限公司 | Method and apparatus for outputting information |
CN110245334B (en) * | 2019-06-25 | 2023-06-16 | 北京百度网讯科技有限公司 | Method and device for outputting information |
CN112446206A (en) * | 2019-08-16 | 2021-03-05 | 阿里巴巴集团控股有限公司 | Menu title generation method and device |
CN110543574A (en) * | 2019-08-30 | 2019-12-06 | 北京百度网讯科技有限公司 | A method, device, equipment and medium for constructing a knowledge graph |
CN110543574B (en) * | 2019-08-30 | 2022-05-17 | 北京百度网讯科技有限公司 | Knowledge graph construction method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN107944032B (en) | 2021-12-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||