CN115238163A - Information pushing method and device based on document data, storage medium and terminal

Info

Publication number: CN115238163A
Application number: CN202110444526.9A
Authority: CN
Inventors: 江明; 李永智; 谷俊
Original assignee: Shanghai Biguan Data Technology Co ltd; Shanghai Education Talent Exchange Service Center
Current assignee: Shanghai Biguan Data Technology Co ltd; Shanghai Education Talent Exchange Service Center
Priority date: 2021-04-23
Filing date: 2021-04-23
Publication date: 2022-10-25

Abstract

A document data-based information push method and device, a storage medium, and a terminal. The method comprises: crawling document data published by multiple users, the document data including document topics and citation data; if the authors of different document data share the same name, identifying the same-name authors at least according to the topic similarity of the document data; extracting, from the crawled document data, the document data published by at least some of the multiple users and the corresponding citation data, where the published document data includes the number of published documents and the citation data includes the number of citations; calculating evaluation results for the at least some users at least according to their published document data and citation data; and pushing information according to each user's evaluation result, where the pushed information includes the user and/or the user's document data. The technical solution of the present invention enables information push based on document data that is truthful and accurate.

Description

Document data-based information push method and device, storage medium and terminal

Technical Field

The present invention relates to the technical field of data processing, and in particular to a document data-based information push method and device, a storage medium, and a terminal.

Background

The H-index is a hybrid quantitative indicator that can be used to assess both the quantity and the level of a researcher's academic output. A user has an H-index of H if, among the Np papers the user has published, H papers have each been cited at least H times, and each of the remaining Np-H papers has been cited no more than H times.

However, in the prior art the H-index is insensitive to a talent's publication data and highly cited data, so talent evaluation and talent recommendation based on the H-index cannot reflect the real situation.
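As an illustration of the definition and of the insensitivity described above, the following minimal Python sketch computes the H-index from a list of per-paper citation counts. The function name `h_index` and the two sample citation lists are assumptions made for illustration; they do not appear in the original text.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break         # counts only decrease from here on
    return h


# Two hypothetical researchers with five papers each.
steady = [10, 8, 5, 4, 3]
highly_cited = [1000, 900, 5, 4, 3]
print(h_index(steady), h_index(highly_cited))
```

Both sample records yield the same H-index even though the second contains two very highly cited papers, which is the kind of insensitivity the background section refers to.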

Summary of the Invention

The technical problem solved by the present invention is how to realize document data-based information push truthfully and accurately.

To solve the above technical problem, an embodiment of the present invention provides an information push method based on a multi-layer citation network. The method includes: crawling document data published by multiple users, the document data including document topics and citation data; if the authors of different document data share the same name, identifying the same-name authors at least according to the topic similarity of the document data; extracting, from the crawled document data, the document data published by at least some of the multiple users and the corresponding citation data, where the published document data includes the number of published documents and the citation data includes the number of citations; calculating evaluation results for the at least some users at least according to their published document data and the citation data; and pushing information according to each user's evaluation result, where the pushed information includes the user and/or the user's document data.

Optionally, calculating the evaluation results of the at least some users at least according to their published document data and the citation data includes: calculating a total number of citations from the citation data; determining the number of zero-cited documents in the published document data, and calculating a weighted sum of the number of zero-cited documents and the number of remaining documents; and calculating the evaluation results of the at least some users according to the total number of citations and the weighted sum.

Optionally, calculating the evaluation results of the at least some users at least according to their published document data and the citation data includes: calculating a total number of citations from the citation data; determining the number of zero-cited documents in the published document data, and calculating a weighted sum of the number of zero-cited documents and the number of remaining documents; calculating the H-index of the at least some users; and combining the H-index with the total number of citations and the weighted sum to obtain the evaluation results of the at least some users.

Optionally, determining, according to the crawled document data, the document data published by the at least some users and the citation data includes: establishing a publication database from the crawled documents, the publication database including the document data published by each user; establishing a citation database from the crawled documents, the citation database including the document data cited by the documents each user has published; and determining, according to the publication database and the citation database, the number of documents published by the at least some users and the number of citations.

Optionally, after crawling the document data published by multiple users, the method includes: standardizing the authors and publishing institutions in the document data according to a preset format; and/or, if the authors of different document data are same-name authors, identifying the same-name authors at least according to the topic similarity of the document data.

Optionally, identifying the authors at least according to the topic similarity of the document data includes: calculating the topic similarity between the document data; if the topic similarity is less than a first preset threshold, determining the proportion of the documents published by the same-name authors in which the same institution name appears within the same time period; if that proportion is greater than a second preset threshold, determining the proportion of co-authors of the documents published by the same-name authors within the same time period and the same institution; and if that proportion is greater than a third preset threshold, determining that the same-name authors are the same author, and otherwise determining that they are different authors.

Optionally, the document data includes title, year, source, institution, keywords, and abstract; the document types include papers, patents, books, and conference reports.

To solve the above technical problem, an embodiment of the present invention further discloses a document data-based information push device. The device includes: a document data crawling module, configured to crawl document data published by multiple users, the document data including document topics and citation data; a same-name identification module, configured to identify same-name authors at least according to the topic similarity of the document data if the authors of different document data share the same name; a document data determination module, configured to determine, according to the crawled document data, the document data published by at least some of the users and the citation data, where the published document data includes the number of published documents and the citation data includes the number of citations; an evaluation result calculation module, configured to calculate the evaluation results of the at least some users at least according to their published document data and the citation data; and a push module, configured to push information according to each user's evaluation result, where the pushed information includes the user and/or the user's document data.

An embodiment of the present invention further discloses a storage medium storing a computer program which, when run by a processor, performs the steps of the document data-based information push method.

An embodiment of the present invention further discloses a terminal including a memory and a processor. The memory stores a computer program that can run on the processor, and the processor performs the steps of the document data-based information push method when running the computer program.

Compared with the prior art, the technical solutions of the embodiments of the present invention have the following beneficial effects:

In the technical solution of the present invention, the evaluation results of at least some users are calculated from the document data those users have published and from the citation data, where the published document data includes the number of published documents and the citation data includes the number of citations, and the results are used for information push. Compared with the H-index of the prior art, the calculated evaluation result reflects both the user's publication volume and citation frequency, so that it captures the user's publication data and highly cited data. In addition, distinguishing between same-name authors improves the accuracy of the document data and reflects the user's real situation more faithfully, which in turn improves the accuracy of information push (such as talent recommendation).

Brief Description of the Drawings

FIG. 1 is a flowchart of a document data-based information push method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a specific implementation of step S103 shown in FIG. 1;

FIG. 3 is a schematic structural diagram of a document data-based information push device according to an embodiment of the present invention.

Detailed Description

As described in the Background section, the H-index of the prior art is insensitive to a talent's publication data and highly cited data, so talent evaluation and talent recommendation based on the H-index cannot reflect the real situation.

To make the above objects, features, and advantages of the present invention easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

FIG. 1 is a flowchart of a document data-based information push method according to an embodiment of the present invention.

The document data-based information push method of this embodiment may be performed by a computing device. The computing device may be any appropriate terminal, such as a mobile phone, a computer, an Internet of Things device, or a server, but is not limited to these.

A document in the embodiments of the present invention may be a paper, or any other document that has references, such as a patent, a book, or a conference report; the embodiments of the present invention place no limitation on this.

Specifically, the document data-based information push method may include the following steps:

Step S101: crawl document data published by multiple users, the document data including document topics and citation data;

Step S102: if the authors of different document data share the same name, identify the same-name authors at least according to the topic similarity of the document data;

Step S103: extract, from the crawled document data, the document data published by at least some of the multiple users and the citation data, where the published document data includes the number of published documents and the citation data includes the number of citations;

Step S104: calculate the evaluation results of the at least some users at least according to their published document data and the citation data;

Step S105: push information according to each user's evaluation result, where the pushed information includes the user and/or the user's document data.

Note that the step numbers in this embodiment do not limit the order in which the steps are executed.

The evaluation result in this embodiment can be used to assess the quantity and level of a user's academic output. It may be called the H++ index or given any other workable name; the embodiments of the present invention place no limitation on this.

In a specific implementation of step S101, the document data of multiple users may be obtained by crawling, for example by using crawler technology to collect document data from multiple document source databases. In one specific application scenario, the multiple users may belong to the same discipline. By calculating the evaluation results of these users in the subsequent steps, users with a higher academic level within the discipline can be identified for talent recommendation.

Specifically, the crawled document data may include title, year, source, institution, keywords, and abstract, or other public information; the embodiments of the present invention place no limitation on this.

After the document data has been crawled, operations such as deduplication and standardization may also be performed on it. For specific implementations of crawler technology and data cleaning, reference may be made to the prior art; the embodiments of the present invention place no limitation on this.

Authors of different documents may have identical names without being the same person, so authors with the same name need to be distinguished. This avoids statistical errors that would arise from treating them as one person and safeguards the accuracy of subsequent calculations. In a specific implementation of step S102, whether same-name authors are the same person may be determined at least according to the topic similarity of the documents. Topic similarity may be calculated from the titles, keywords, and abstracts of the documents. The calculation may use natural language processing tools and algorithms, such as the Institute of Computing Technology, Chinese Lexical Analysis System (ICTCLAS), jieba word segmentation, and the term frequency-inverse document frequency (TF-IDF) algorithm; the embodiments of the present invention place no limitation on this.

If the topic similarity of two documents reaches a preset threshold, for example 60%, the two documents can be determined to belong to the same field, and the same-name authors can be verified as the same person. If the topic similarity is below the preset threshold, the same-name authors are considered not to be the same person.
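The threshold test can be sketched as follows. This is a minimal illustration under several assumptions that are not in the original text: whitespace tokenization stands in for a Chinese segmenter such as jieba, a smoothed TF-IDF weighting is used, the 60% threshold is interpreted as a cosine similarity of 0.6, and the function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build one sparse TF-IDF vector (term -> weight) per text."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    # Document frequency: number of texts in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            # Smoothed IDF (an assumed variant) keeps every weight positive.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def same_field(u, v, threshold=0.6):
    """True if the similarity reaches the preset threshold."""
    return cosine(u, v) >= threshold


# Hypothetical title/keyword/abstract text for three documents.
texts = [
    "citation network analysis for talent evaluation",
    "talent evaluation with citation network analysis",
    "protein folding molecular dynamics simulation",
]
vecs = tfidf_vectors(texts)
print(same_field(vecs[0], vecs[1]), same_field(vecs[0], vecs[2]))
```

In practice the text for each document would be the concatenation of its title, keywords, and abstract after segmentation.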

After the data cleaning of step S102, the crawled document data is more accurate, which lays the foundation for accurate calculation of the evaluation results.

In a specific implementation of steps S103 and S104, the published document data can be determined for at least some users, such as the number, titles, years, sources, institutions, keywords, and abstracts of the documents they have published, together with the citation data of those documents, such as the number of citations and the titles, years, sources, institutions, keywords, and abstracts of the citing documents. When calculating the evaluation results of the at least some users, at least the number of published documents and the number of citations can be used. In this way, the user's publication volume and citation frequency are both reflected in the user's evaluation result.

Specifically, the evaluation result may take a numerical form, such as a score, or the form of a grade. The evaluation result is positively correlated with both the number of published documents and the number of citations: the more documents a user has published, the higher the user's evaluation result; the more citations the user has, the higher the user's evaluation result.

The evaluation result of the embodiments of the present invention reflects both the user's publication volume and the citation volume. A user who has published few documents but is highly cited can obtain a high evaluation result under the embodiments of the present invention. This addresses the unscientific evaluation, in the prior art, of talents working on niche research topics, as well as the problems caused by citation lag.

Then, in a specific implementation of step S105, information is pushed according to the users' evaluation results, which can improve the accuracy of talent recommendation.

In a specific embodiment of the present invention, the calculated evaluation results of the users may be stored to form a talent database. The talent database includes the identifier of each user and the corresponding evaluation result. Further, the data in the talent database may be updated periodically.
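A minimal sketch of how stored evaluation results might drive the push step, assuming the talent database is a simple in-memory mapping from user identifier to a numeric evaluation result. The function name `recommend`, the parameter `k`, and the sample identifiers and scores are hypothetical.

```python
def recommend(talent_db, k=3):
    """Return the identifiers of the k users with the highest evaluation results."""
    ranked = sorted(talent_db.items(), key=lambda item: item[1], reverse=True)
    return [user_id for user_id, _score in ranked[:k]]


# Hypothetical talent database: user identifier -> evaluation result.
talent_db = {"user_a": 7.4, "user_b": 12.1, "user_c": 3.0}
print(recommend(talent_db, k=2))
```

A periodic update would simply recompute the scores and overwrite the entries before the next call to `recommend`.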

In a non-limiting embodiment of the present invention, step S104 shown in FIG. 1 may include the following steps: calculating a total number of citations from the citation data; determining the number of zero-cited documents in the published document data, and calculating a weighted sum of the number of zero-cited documents and the number of remaining documents; and calculating the evaluation results of the at least some users according to the total number of citations and the weighted sum.

In this embodiment, the documents published by a user may include zero-cited documents, that is, documents that no other document cites. Zero-cited documents have relatively little bearing on the user's evaluation result, so their number is weighted together with the number of remaining documents when calculating the evaluation result, which reduces the influence of zero-cited documents on the result. The weighted sum reflects the number of documents the user has published.

In addition, the total number of citations can be calculated: for all documents published by the user being evaluated, the citation counts are summed to obtain the total. The user's evaluation result can then be calculated from the total number of citations and the weighted sum.

Any workable mathematical operation may be applied to the total number of citations and the weighted sum to obtain the user's evaluation result. For example, the total number of citations and the weighted sum may be added or multiplied directly. Alternatively, because the total number of citations tends to be large, its square root, cube root, or logarithm may be taken first to keep it from dominating the evaluation result, and the transformed value may then be added to or multiplied by the weighted sum. The embodiments of the present invention place no limitation on this.
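One of the combinations mentioned above (square root of the total number of citations, added to the weighted sum) can be sketched as follows. The weights 0.2 and 1.0, the function name `evaluation_score`, and the sample data are assumptions chosen for illustration; the original text does not fix any of them.

```python
import math


def evaluation_score(citations_per_paper, zero_weight=0.2, cited_weight=1.0):
    """Combine a weighted publication count with a damped total citation count."""
    total_citations = sum(citations_per_paper)
    zero_cited = sum(1 for count in citations_per_paper if count == 0)
    remaining = len(citations_per_paper) - zero_cited
    # Zero-cited papers contribute less than cited ones (assumed weights).
    weighted_sum = zero_weight * zero_cited + cited_weight * remaining
    # The square root keeps a large citation total from dominating the score.
    return weighted_sum + math.sqrt(total_citations)


# Hypothetical user: two cited papers (9 and 16 citations) and two zero-cited papers.
print(evaluation_score([9, 16, 0, 0]))
```

With these assumed weights the score rises with both publication volume and citation volume, consistent with the positive correlation the embodiment requires.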

In another non-limiting embodiment of the present invention, step S104 shown in FIG. 1 may include the following steps: calculating a total number of citations from the citation data; determining the number of zero-cited documents in the published document data, and calculating a weighted sum of the number of zero-cited documents and the number of remaining documents; calculating the H-index of the at least some users; and combining the H-index with the total number of citations and the weighted sum to obtain the evaluation results of the at least some users.

Unlike the preceding embodiment, this embodiment calculates the H-index of the at least some users in addition to the total number of citations and the weighted sum, and calculates their evaluation results from the H-index, the total number of citations, and the weighted sum.

Any workable mathematical operation may be applied to the H-index, the total number of citations, and the weighted sum to obtain the user's evaluation result. For specific implementations of the H-index calculation, reference may be made to the prior art; details are not repeated here.
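A minimal sketch of one possible combination, assuming the three quantities are simply added after damping the citation total with a square root. The function name `h_plus_plus`, the weights, and the sample citation lists are hypothetical, and the additive form is only one of the many operations the text allows.

```python
import math


def h_index(citations):
    """Count the ranks at which the sorted citation count still reaches the rank."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def h_plus_plus(citations, zero_weight=0.2, cited_weight=1.0):
    """Assumed combination: H-index + weighted publication count + sqrt(total citations)."""
    zero_cited = sum(1 for count in citations if count == 0)
    remaining = len(citations) - zero_cited
    weighted_sum = zero_weight * zero_cited + cited_weight * remaining
    return h_index(citations) + weighted_sum + math.sqrt(sum(citations))


# Two hypothetical users with the same H-index but very different citation totals.
steady = [10, 8, 5, 4, 3]
highly_cited = [1000, 900, 5, 4, 3]
print(h_index(steady), h_index(highly_cited))
print(h_plus_plus(steady) < h_plus_plus(highly_cited))
```

The two users are indistinguishable by H-index alone, while the combined score separates them because it also depends on the total number of citations.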

In a specific embodiment of the present invention, referring to FIG. 2, step S103 shown in FIG. 1 may include the following steps:

Step S201: establish a publication database from the crawled documents, the publication database including the document data published by each user;

Step S202: establish a citation database from the crawled documents, the citation database including the document data cited by the documents each user has published;

Step S203: determine, according to the publication database and the citation database, the number of documents published by the at least some users and the number of citations.

In this embodiment, the crawled documents may form a talent document database, for example a talent paper database. On this basis, the data in the talent document database can be split into a publication database and a citation database. The publication database mainly stores all document data published by each talent (including title, publication year, source journal, affiliated institution, keywords, abstract, and similar information). The citation database mainly stores data on the other documents cited in each talent's publications, which yields a record of whom each talent cites; read in reverse, it allows the citing persons and the number of citations to be retrieved for a specified talent.

Specifically, in subsequent steps the document data published by the at least some users may be determined from the data in the publication database, and the citation data of the at least some users may be determined from the data in the citation database.
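The split into two databases and the reverse lookup can be sketched as follows. The record layout (`id`, `authors`, `references`), the function names, and the sample records are assumptions for illustration; the original text does not specify a schema.

```python
from collections import defaultdict


def build_databases(records):
    """Split crawled records into a publication database and a citation database."""
    publications = defaultdict(list)  # author -> ids of documents the author published
    citations = {}                    # document id -> ids of the documents it cites
    for record in records:
        for author in record["authors"]:
            publications[author].append(record["id"])
        citations[record["id"]] = list(record["references"])
    return dict(publications), citations


def citation_counts(author, publications, citations):
    """Read the citation database in reverse: how often is each of the author's documents cited?"""
    cited_by = defaultdict(int)
    for references in citations.values():
        for cited_id in references:
            cited_by[cited_id] += 1
    return {doc_id: cited_by[doc_id] for doc_id in publications.get(author, [])}


# Hypothetical crawled records.
records = [
    {"id": "d1", "authors": ["A"], "references": []},
    {"id": "d2", "authors": ["B"], "references": ["d1"]},
    {"id": "d3", "authors": ["A", "B"], "references": ["d1", "d2"]},
]
publications, citations = build_databases(records)
print(len(publications["A"]), citation_counts("A", publications, citations))
```

The number of published documents comes directly from the publication database, and the per-document citation counts feed the total number of citations and the zero-cited count used in step S104.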

In a non-limiting embodiment of the present invention, the following step may be included after step S101 shown in FIG. 1: standardizing the authors and publishing institutions in the document data according to a preset format.

The embodiments of the present invention may clean the crawled data. The authors and publishing institutions in the document data may be standardized according to a preset format. Examples include normalizing the different abbreviated forms of an author's name found in documents, standardizing the different descriptions of the same author, distinguishing and handling different papers by same-name authors, and standardizing the different ways institutions are described in papers.

Furthermore, different papers by same-name authors may be distinguished through the following steps: calculating the topic similarity between the document data; if the topic similarity is less than a first preset threshold, determining the proportion of the documents published by the same-name authors in which the same institution name appears within the same time period; if that proportion is greater than a second preset threshold, determining the proportion of co-authors of the documents published by the same-name authors within the same time period and the same institution; and if that proportion is greater than a third preset threshold, determining that the same-name authors are the same author, and otherwise determining that they are different authors.
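The three-threshold cascade can be sketched as a single decision function. Several points here are assumptions rather than statements from the original text: the three inputs are taken as already computed ratios in the range 0 to 1, a similarity at or above the first threshold is treated as "same author", a pair that fails the institution test is treated as "different authors", and the third threshold value of 0.5 is hypothetical (the text gives example values only for the first two).

```python
def same_author(topic_similarity, same_institution_ratio, coauthor_ratio,
                first_threshold=0.6, second_threshold=0.5, third_threshold=0.5):
    """Decide whether two same-name author records refer to one person."""
    if topic_similarity >= first_threshold:
        # Same research field: treated here as the same person (assumed outcome).
        return True
    if same_institution_ratio <= second_threshold:
        # Different field and mostly different institutions (assumed outcome).
        return False
    # Different field but same institution in the same period:
    # the co-author ratio decides between "two people" and "changed direction".
    return coauthor_ratio > third_threshold


print(same_author(0.3, 0.7, 0.6))  # different field, same institution, shared co-authors
print(same_author(0.3, 0.7, 0.1))  # different field, same institution, few shared co-authors
```

Keeping the thresholds as parameters makes it straightforward to tune them per discipline, since collaboration patterns differ between fields.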

In a specific implementation, author names are written in different ways, for example surname first, given name first, or abbreviated. In this embodiment, therefore, author names may first undergo a preliminary identification before same-name authors are deduplicated. First, the author information is extracted from all documents and all authors are sorted; authors with exactly identical names in the sorted sequence are merged in a first pass to form a preliminary author set. The remaining authors are then arranged in alphabetical order of author name and compared pairwise. If the two strings match completely over the sequence of the shorter string, the two names are considered possibly to belong to the same person. To confirm this further, the surname and given name are extracted from the two original names and compared under the different possible orderings of surname and given name (using the same comparison as above, but without the alphabetical reordering). If, after all orderings have been traversed, one ordering produces a shortest match (for example, Thomas Huang and T. Huang), the two names are confirmed as suspected to be the same person, and a second merge is performed to add them to the preliminary author set.
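A simplified sketch of the second-pass comparison, under an interpretation that is an assumption rather than the exact procedure of the text: two names are flagged as possibly the same person if, under some ordering of surname and given name, each part of one name either equals the corresponding part of the other or is its single-letter initial. The function names and the sample names other than the Thomas Huang example are hypothetical.

```python
from itertools import permutations


def name_parts(name):
    """Lower-case the name and split it into parts, ignoring commas and periods."""
    cleaned = name.replace(",", " ").replace(".", " ").lower()
    return cleaned.split()


def part_matches(a, b):
    """Parts match if they are equal, or if the shorter one is a one-letter initial of the other."""
    short, long_ = sorted((a, b), key=len)
    if len(short) == 1:
        return long_.startswith(short)
    return short == long_


def maybe_same_name(name_a, name_b):
    """Try every ordering of the second name's parts against the first name's parts."""
    parts_a, parts_b = name_parts(name_a), name_parts(name_b)
    if len(parts_a) != len(parts_b):
        return False
    return any(
        all(part_matches(x, y) for x, y in zip(parts_a, ordering))
        for ordering in permutations(parts_b)
    )


print(maybe_same_name("Thomas Huang", "T. Huang"))
print(maybe_same_name("Huang, Thomas", "Thomas Huang"))
print(maybe_same_name("Thomas Huang", "Tom Huang"))
```

This check is deliberately permissive and only produces candidates; the pairs it flags still need the topic, institution, and co-author checks before they are treated as one author.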

Within the preliminary author set, same-name authors are then disambiguated. Specifically, all articles of a same-name author are traversed and compared pairwise for similarity to confirm whether they belong to the same field. The comparison can use titles, keywords and abstracts, computed for example with ICTCLAS or jieba word segmentation and the TF-IDF algorithm. If the topic similarity exceeds the first preset threshold (for example, 60%), the two articles can be determined to be in the same field, which preliminarily verifies that the same-name authors are the same person. If the topic similarity is below the first preset threshold, the same-name authors may not be the same person. Further processing is required in either case.
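One way to compute such a topic similarity is TF-IDF weighting followed by cosine similarity, sketched below in plain Python. The text names TF-IDF but does not specify the similarity measure, so the use of cosine similarity, the smoothed IDF formula and the names `tfidf_vectors`, `cosine` and `FIRST_THRESHOLD` are assumptions of this sketch. Documents are assumed to be already tokenized (for Chinese text, a segmenter such as jieba would produce the tokens).

```python
import math
from collections import Counter

FIRST_THRESHOLD = 0.6  # example value from the description (60%)


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict term -> weight) per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Smoothed IDF so that terms present in every document keep a non-zero weight.
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def same_field(vec_a, vec_b, threshold=FIRST_THRESHOLD):
    """Preliminary check: do two articles appear to belong to the same field?"""
    return cosine(vec_a, vec_b) > threshold
```

Each document here would be the concatenated tokens of an article's title, keywords and abstract. A result above the threshold is only a preliminary indication; the institution and co-author checks still follow.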

Next, the authors' institutions are identified. Same-name handling is performed according to the author institution given in each article. If the preliminary verification indicates the same person, the period the author spent at each institution is identified from the publication years of the articles. If the preliminary verification indicates different persons, the institutions must be examined further by determining the proportion of same-name institutions appearing within the same time period (for example, 5 years). If that proportion exceeds the second preset threshold (for example, 0.5), two situations are possible: either two people with the same name but different specialties work at the same institution, or the articles belong to one person whose research direction has changed. The co-authors therefore need to be examined further.

If the proportion of identical co-authors within the same time period and the same institution exceeds the third preset threshold (for example, 30%), the articles are considered to belong to the same person whose research direction has changed; if the proportion is below the third preset threshold, the two are different authors at the same institution. After this processing, same-name authors of different document data can be distinguished according to the foregoing rules, so that authors can be classified and their institutions identified.
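The three-threshold cascade can be summarized in a short sketch. The description does not define exactly how the two proportions are computed, so the pairwise definitions below are assumptions, as are the names `Paper`, `institution_ratio`, `coauthor_ratio` and `is_same_author`. The case where the institution proportion does not exceed the second threshold is not spelled out in the text; this sketch treats it as "different authors".

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Paper:
    year: int
    institution: str
    coauthors: frozenset


def _pairs_in_window(group_a, group_b, window):
    """Cross pairs of papers whose publication years lie within `window` years."""
    return [(a, b) for a, b in product(group_a, group_b) if abs(a.year - b.year) <= window]


def institution_ratio(group_a, group_b, window=5):
    """Assumed definition: share of in-window pairs that name the same institution."""
    pairs = _pairs_in_window(group_a, group_b, window)
    if not pairs:
        return 0.0
    return sum(a.institution == b.institution for a, b in pairs) / len(pairs)


def coauthor_ratio(group_a, group_b, window=5):
    """Assumed definition: share of in-window, same-institution pairs sharing a co-author."""
    pairs = [(a, b) for a, b in _pairs_in_window(group_a, group_b, window)
             if a.institution == b.institution]
    if not pairs:
        return 0.0
    return sum(bool(a.coauthors & b.coauthors) for a, b in pairs) / len(pairs)


def is_same_author(topic_similarity, group_a, group_b, t1=0.6, t2=0.5, t3=0.3):
    """Apply the cascade with the example thresholds from the description."""
    if topic_similarity > t1:
        return True  # same field: preliminarily the same person
    if institution_ratio(group_a, group_b) <= t2:
        return False  # assumption: too little institutional overlap means different authors
    # Same institution but different topics: decide by shared co-authors.
    return coauthor_ratio(group_a, group_b) > t3
```

Here `group_a` and `group_b` stand for the two sets of papers attributed to the same name. Low topic similarity combined with a shared institution and shared co-authors is read as one person who changed research direction.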

An embodiment of the present invention further discloses an information pushing device 30 based on document data. The information pushing device 30 based on document data may include:

a document data crawling module 301, configured to crawl document data published by multiple users, the document data including document topics and cited document data;

a same-name identification module 302, configured to identify same-name authors at least according to the topic similarity of the document data if the authors of different document data have the same name;

a document data determination module 303, configured to determine, according to the multiple crawled document data, the document data published by at least some of the users and the cited document data, the published document data including the number of published documents and the cited document data including the number of cited documents;

an evaluation result calculation module 304, configured to calculate the evaluation results of the at least some users at least according to the document data published by the at least some users and the cited document data;

a pushing module 305, configured to push information according to the evaluation result of each user, the information including the user and/or the document data of the user.

In this embodiment of the present invention, the evaluation results of at least some users are calculated from the document data published by those users and the cited document data, where the published document data includes the number of published documents and the cited document data includes the number of cited documents, and the results are used for information pushing. Compared with the H index of the prior art, the calculated evaluation result reflects both the user's publication volume and citation frequency, so that it captures the user's publication data as well as highly cited data. In addition, distinguishing same-name authors improves the accuracy of the document data and reflects the user's real situation more faithfully, which in turn improves the accuracy of information pushing (such as talent recommendation).

For more details on the working principle and working mode of the information pushing device 30 based on document data, reference may be made to the related descriptions of FIG. 1 to FIG. 2, which are not repeated here.

An embodiment of the present invention further discloses a storage medium. The storage medium is a computer-readable storage medium on which a computer program is stored, and the computer program, when run, can execute the steps of the method shown in FIG. 1 or FIG. 2. The storage medium may include a ROM, a RAM, a magnetic disk, an optical disc, or the like. The storage medium may also include a non-volatile memory or a non-transitory memory.

An embodiment of the present invention further discloses a terminal. The terminal may include a memory and a processor, the memory storing a computer program that can run on the processor. When running the computer program, the processor can execute the steps of the method shown in FIG. 1 or FIG. 2. The terminal includes, but is not limited to, terminal devices such as mobile phones, computers and tablet computers.

The terminal in the embodiments of the present application may refer to various forms of terminal, access terminal, subscriber unit, subscriber station, mobile station (MS for short), remote station, remote terminal, mobile device, user terminal, terminal equipment, wireless communication device, user agent or user apparatus. The terminal device may also be a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA), a handheld device with wireless communication capability, a computing device or other processing device connected to a wireless modem, an in-vehicle device, a wearable device, a terminal device in a future 5G network, or a terminal device in a future evolved Public Land Mobile Network (PLMN), which is not limited in the embodiments of the present application.

It should be understood that the term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the character "/" herein indicates an "or" relationship between the objects before and after it.

"Multiple" in the embodiments of the present application means two or more.

Descriptions such as "first" and "second" in the embodiments of the present application serve only to illustrate and distinguish the described objects. They imply no order, do not specifically limit the number of devices in the embodiments of the present application, and do not constitute any limitation on the embodiments of the present application.

"Connection" in the embodiments of the present application refers to any form of connection, direct or indirect, that enables communication between devices, which is not limited in the embodiments of the present application.

It should be understood that, in the embodiments of the present application, the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.

It should also be understood that the memory in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM) or flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM) and direct rambus RAM (DR RAM).

The above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired or wireless means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that contains one or more sets of available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk or magnetic tape), an optical medium (for example, a DVD) or a semiconductor medium. The semiconductor medium may be a solid-state drive.

In the several embodiments provided in the present application, it should be understood that the disclosed method, device and system may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may be physically separate, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

The integrated unit implemented in the form of software functional units may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Although the present invention is disclosed above, the present invention is not limited thereto. Any person skilled in the art may make various changes and modifications without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall therefore be the scope defined by the claims.

Claims

1. An information pushing method based on literature data is characterized by comprising the following steps:

crawling literature data published by a plurality of users, wherein the literature data comprises literature subjects and cited literature data;

if the authors of different literature data are homonymous, identifying the homonymous authors at least according to the topic similarity of the literature data;

extracting published document data and cited document data of at least some users in the plurality of users according to the crawled document data, wherein the published document data comprises the published document number, and the cited document data comprises the cited document number;

calculating the evaluation result of the at least part of users according to the literature data published by the at least part of users and the cited literature data;

and pushing information according to the evaluation result of each user, wherein the pushed information comprises the user and/or the literature data of the user.

2. The information pushing method based on literature data according to claim 1, wherein the calculating the evaluation result of the at least some users according to at least the literature data published by the at least some users and the cited literature data comprises:

calculating the total amount of cited documents by using the cited document data;

determining a zero cited document quantity in the published document data and calculating a weighted sum of the zero cited document quantity and a remaining document quantity;

and calculating the evaluation result of at least part of users according to the total amount of the cited documents and the weighted sum.

3. The information pushing method based on literature data according to claim 1, wherein the calculating the evaluation result of the at least some users according to at least the literature data published by the at least some users and the cited literature data comprises:

calculating a total number of cited documents using the cited document data;

calculating an H index of the at least some users;

and combining the H index with the total amount of the cited documents and the weighted sum to obtain the evaluation result of at least part of users.

4. The method according to claim 1, wherein the determining the published literature data and the cited literature data of at least some users according to the crawled literature data comprises:

establishing a publication database according to a plurality of crawled documents, wherein the publication database comprises document data published by each user;

establishing a citation database according to a plurality of crawled documents, wherein the citation database comprises document data cited by the document data published by each user;

and determining the quantity of the document data published by at least part of users and the quantity of cited documents according to the publication database and the citation database.

5. The method according to claim 1, wherein crawling published literature data of multiple users comprises:

and standardizing the authors and the issuing institutions in the literature data according to a preset format.

6. The information pushing method based on literature data according to claim 1, wherein said identifying authors of the same name at least according to topic similarity of literature data comprises:

calculating topic similarity between the literature data;

if the topic similarity is smaller than a first preset threshold value, determining the proportion of same-name institutions appearing, in the same time period, in the documents published by the homonymous authors;

if the proportion is larger than a second preset threshold value, determining the proportion of collaborators shared, in the same time period and the same institution, by the documents published by the homonymous authors;

and if the proportion is larger than a third preset threshold value, determining that the homonymous authors are the same author, and otherwise determining that the homonymous authors are different authors.

7. The information pushing method based on literature data according to any one of claims 1 to 6, wherein the literature data comprises title, year, source, organization, keyword and abstract; the types of documents include papers, patents, books, and conference reports.

8. An information pushing apparatus based on document data, comprising:

the document data crawling module is used for crawling document data published by a plurality of users, and the document data comprises document subjects and cited document data;

the homonymy identification module is used for identifying homonymy authors according to the topic similarity of the literature data at least if the authors of different literature data have the same name;

the literature data determining module is used for determining published literature data and cited literature data of at least part of users according to a plurality of crawled literature data, wherein the published literature data comprises published literature quantity, and the cited literature data comprises cited literature quantity;

the evaluation result calculation module is used for calculating the evaluation results of the at least part of users at least according to the literature data published by the at least part of users and the cited literature data;

and the pushing module is used for pushing information according to the evaluation result of each user, and the pushed information comprises the user and/or the literature data of the user.

9. A storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to perform the steps of the document data based information pushing method according to any one of claims 1 to 7.

10. A terminal comprising a memory and a processor, wherein the memory stores a computer program operable on the processor, and the processor executes the computer program to perform the steps of the information pushing method based on literature data according to any one of claims 1 to 7.