CN111930792A - Data resource labeling method and device, storage medium and electronic equipment
- Publication number: CN111930792A (application CN202010580828.4A)
- Authority: CN (China)
- Prior art keywords: target, knowledge point, word, vocabulary, sentence
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Mathematical Physics (AREA)
- Fuzzy Systems (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Animal Behavior & Ethology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the present application disclose a data resource labeling method and device, a storage medium, and an electronic device, belonging to the field of computer technology. The method includes: a server preprocesses an original data resource to obtain text data; computes the similarity between the text data and each of a plurality of target knowledge points to obtain similarity values; generates a basic knowledge point label set for the original data resource from the result of comparing the similarity values with a similarity threshold; and generates a comprehensive knowledge point label set for the original data resource from the feature information of the original data resource and the basic knowledge point label set. The original data resource can thus be labeled quickly and accurately with its relevant knowledge point labels, which improves both labeling efficiency and labeling accuracy.
Description
Technical Field
The present application relates to the field of computer technology, and in particular to a data resource labeling method and device, a storage medium, and an electronic device.
Background
With the development of the Internet, data plays an increasingly important role across Internet-related industries: retail, transportation, social networking, search, education, healthcare, and others all involve large-scale data analysis. Take online education as an example. Staff usually need to analyze users' teaching data to understand how users are teaching and learning, so that better services can be provided later, and analyzing a user's learning data requires the knowledge point labels associated with the data resources the user has studied. Similar scenarios are common in other fields. In the related art, however, the knowledge point labels associated with data resources usually have to be annotated manually in advance. This is inefficient and is affected by the annotator's subjective judgment, so data resources cannot be labeled with knowledge point labels accurately.
Summary of the Invention
Embodiments of the present application provide a data resource labeling method and device, a storage medium, and an electronic device, which can solve the problem in the related art that labeling data resources with knowledge point labels is inaccurate and inefficient.
The technical solution is as follows:
In a first aspect, an embodiment of the present application provides a data resource labeling method, the method comprising:
preprocessing an original data resource to obtain text data;
computing the similarity between the text data and each of a plurality of target knowledge points to obtain similarity values, wherein each of the plurality of target knowledge points is associated with one basic knowledge point label;
generating a basic knowledge point label set for the original data resource according to the result of comparing the similarity values with a similarity threshold, wherein the basic knowledge point labels in the set are those associated with target knowledge points whose similarity value is greater than the similarity threshold;
generating a comprehensive knowledge point label set for the original data resource according to feature information of the original data resource and the basic knowledge point label set.
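The thresholding rule in the third step can be illustrated with a minimal sketch. The function name `basic_label_set`, the dictionary layout, and the example values are assumptions made for illustration and do not come from the application; only the rule itself (keep a label when its similarity value is strictly greater than the threshold) is taken from the text.

```python
from typing import Dict, Set


def basic_label_set(similarities: Dict[str, float],
                    labels: Dict[str, str],
                    threshold: float) -> Set[str]:
    """Collect the labels of target knowledge points whose similarity
    value is strictly greater than the threshold."""
    return {labels[point] for point, sim in similarities.items() if sim > threshold}


# Hypothetical similarity values for three target knowledge points.
sims = {"apple": 0.92, "run": 0.40, "present tense": 0.75}
point_labels = {
    "apple": "vocab:apple",
    "run": "verb:run",
    "present tense": "grammar:present-tense",
}
result = basic_label_set(sims, point_labels, threshold=0.7)
```

A value exactly equal to the threshold is not kept, because the text requires the similarity value to be greater than the threshold.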
In a second aspect, an embodiment of the present application provides a data resource labeling device, the device comprising:
a preprocessing module, configured to preprocess an original data resource to obtain text data;
a calculation module, configured to compute the similarity between the text data and each of a plurality of target knowledge points to obtain similarity values, wherein each of the plurality of target knowledge points is associated with one basic knowledge point label;
a first processing module, configured to generate a basic knowledge point label set for the original data resource according to the result of comparing the similarity values with a similarity threshold, wherein the basic knowledge point labels in the set are those associated with target knowledge points whose similarity value is greater than the similarity threshold;
a second processing module, configured to generate a comprehensive knowledge point label set for the original data resource according to feature information of the original data resource and the basic knowledge point label set.
In a third aspect, an embodiment of the present application provides a computer storage medium storing a plurality of instructions, the instructions being suitable for being loaded by a processor to execute the above method steps.
In a fourth aspect, an embodiment of the present application provides an electronic device, which may include a processor and a memory, wherein the memory stores a computer program suitable for being loaded by the processor to execute the above method steps.
The beneficial effects of the technical solutions provided by some embodiments of the present application include at least the following:
When the solution of the embodiments of the present application is executed, the server preprocesses the original data resource to obtain text data, computes the similarity between the text data and each of a plurality of target knowledge points to obtain similarity values, generates a basic knowledge point label set for the original data resource from the result of comparing the similarity values with the similarity threshold, and generates a comprehensive knowledge point label set for the original data resource from the feature information of the original data resource and the basic knowledge point label set. The original data resource can thus be labeled quickly and accurately with its relevant knowledge point labels, which improves both labeling efficiency and labeling accuracy.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a system architecture diagram provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a data resource labeling method provided by an embodiment of the present application;
FIG. 3 is another schematic flowchart of a data resource labeling method provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of the similarity calculation in the data resource labeling method provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a device provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a device provided by an embodiment of the present application.
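Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the drawings.
FIG. 1 is a schematic diagram of an exemplary system architecture 100 to which the data resource labeling method or the data resource labeling device of the embodiments of the present application can be applied.
As shown in FIG. 1, the system architecture 100 may include one or more of terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. Various communication client applications may be installed on the terminal devices 101, 102, 103, for example video recording applications, video playback applications, voice interaction applications, search applications, instant messaging tools, email clients, and social platform software. The network 104 may include various connection types, such as wired or wireless communication links or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, for example to receive or send messages. The terminal devices 101, 102, 103 may be various electronic devices with a display screen, including but not limited to smartphones, tablet computers, portable computers, and desktop computers. The network 104 may include various types of wired or wireless communication links: wired links include optical fiber, twisted pair, or coaxial cable, and wireless links include Bluetooth, Wireless Fidelity (Wi-Fi), or microwave communication links. The terminal devices 101, 102, 103 may be hardware or software. When they are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not specifically limited here. When they are hardware, a display device and a camera may also be installed on them. The display device may be any device capable of display, and the camera is used to capture video streams. For example, the display device may be a cathode ray tube (CRT) display, a light-emitting diode (LED) display, an electronic ink screen, a liquid crystal display (LCD), or a plasma display panel (PDP). A user can use the display device on the terminal devices 101, 102, 103 to view displayed text, pictures, video, and other information.
It should be noted that the data resource labeling method provided by the embodiments of the present application is usually executed by the server 105, and accordingly the data resource labeling device is usually provided in the server 105. The server 105 may be a server that provides various services, and may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.
The server 105 in the present application may be a terminal device that provides various services. For example, the server preprocesses the original data resource to obtain text data, computes the similarity between the text data and each of a plurality of target knowledge points to obtain similarity values, generates a basic knowledge point label set for the original data resource from the result of comparing the similarity values with the similarity threshold, and generates a comprehensive knowledge point label set for the original data resource from the feature information of the original data resource and the basic knowledge point label set.
It should also be noted that the data resource labeling method provided by the embodiments of the present application may be executed by one or more of the terminal devices 101, 102, 103 and/or by the server 105. Accordingly, the data resource labeling device provided by the embodiments of the present application is generally provided in the corresponding terminal device and/or in the server 105, but the present application is not limited to this.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, depending on implementation needs.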
The data resource labeling method provided by the embodiments of the present application is described in detail below with reference to FIG. 2 to FIG. 4. For ease of description, the embodiments use the online education industry as an example, but a person skilled in the art will understand that the present application is not limited to online education: the data resource labeling method described here can be applied effectively across the various industries of the Internet.
Refer to FIG. 2, which is a schematic flowchart of a data resource labeling method provided by an embodiment of the present application. As shown in FIG. 2, the method of this embodiment may include the following steps:
S201: preprocess the original data resource to obtain text data.
The original data resource is a data resource in text, audio, video, or a similar form, and may include exercises, picture books, learning audio, learning videos, and the like. The original data resource carries its corresponding learning level and subject. Text data is the data obtained by uniformly converting original data resources of text, audio, video, and other types into a text type.
Generally, when the original data resource is of an audio or video type, it can be converted into text data of a preset text type by ASR (Automatic Speech Recognition) technology. ASR, as used here, converts audio into text based on a list of keywords: the audio content (or the audio track of a video) is converted into speech features via its spectrum, the speech features are matched against the entries in the keyword list, and the best match is taken as the recognition result. When the original data resource is of a text type other than the preset text type, it needs to be converted into the preset text type. Common text types include .txt, .doc, .hlp, .wps, .rtf, .htm, and .pdf; the preset text type can be set to a different text type according to actual needs.
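The dispatch described for step S201 can be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: the function `preprocess`, the `kind` strings, and the choice of a whitespace-normalized UTF-8 string as the preset text type are all hypothetical, and the ASR engine is passed in as a placeholder callable because the text does not name one.

```python
from typing import Callable


def preprocess(resource: bytes, kind: str,
               transcribe: Callable[[bytes], str]) -> str:
    """Convert an original data resource into plain text.

    `transcribe` stands in for an ASR engine supplied by the caller.
    """
    if kind in ("audio", "video"):
        # Audio, or the audio track of a video, goes through ASR.
        return transcribe(resource)
    if kind == "text":
        # Assumed preset text type: UTF-8 text with normalized whitespace.
        return " ".join(resource.decode("utf-8").split())
    raise ValueError(f"unsupported resource type: {kind}")


def fake_asr(audio: bytes) -> str:
    """Stub used only for this example; a real system would call an ASR service."""
    return "the cat sat"


text_from_audio = preprocess(b"\x00\x01", "audio", fake_asr)
text_from_file = preprocess("The  cat\nsat".encode("utf-8"), "text", fake_asr)
```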
S202: compute the similarity between the text data and each of a plurality of target knowledge points to obtain similarity values.
The target knowledge points include at least one of target content vocabulary, target high-frequency vocabulary, target verb vocabulary, target mathematical vocabulary, target phonetic symbols, target sentence patterns, and target grammar. They are knowledge points that can be obtained from a knowledge graph; different learning levels correspond to different target knowledge points, and each of the plurality of target knowledge points is associated with one basic knowledge point label. A similarity value expresses how similar the two compared quantities are; generally, the larger the value, the more similar they are.
Generally, before the similarity between the text data and the target knowledge points is computed, the course information corresponding to the original data resource is obtained, and the preset knowledge graph is queried with that course information to obtain the corresponding target knowledge points. The knowledge graph contains target knowledge points for different learning levels and different grade bands. Computing the similarity between the text data and each target knowledge point means computing the similarity between the basic knowledge points in the text data and the target knowledge points; based on these similarity values, the basic knowledge point labels corresponding to the text data, and hence to the original data resource, can be determined. The basic knowledge points in the text data may include one or more of reference content vocabulary, reference high-frequency vocabulary, reference verb vocabulary, reference mathematical vocabulary, reference phonetic symbols, reference sentence patterns, and reference grammar.
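The knowledge graph lookup can be pictured with a small in-memory sketch. The application does not describe the graph's storage or schema, so the `TargetKnowledgePoint` class, the `(level, subject)` key standing in for course information, and all entries below are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TargetKnowledgePoint:
    kind: str      # e.g. "content_vocabulary", "phonetic_symbol", "grammar"
    content: str   # the word, symbol, pattern, or rule itself
    label: str     # the one basic knowledge point label associated with it


# Hypothetical knowledge graph keyed by course information.
KNOWLEDGE_GRAPH: Dict[Tuple[str, str], List[TargetKnowledgePoint]] = {
    ("level-1", "english"): [
        TargetKnowledgePoint("content_vocabulary", "cat", "vocab:cat"),
        TargetKnowledgePoint("phonetic_symbol", "æ", "phonetic:æ"),
    ],
    ("level-2", "english"): [
        TargetKnowledgePoint("grammar", "past tense", "grammar:past-tense"),
    ],
}


def target_knowledge_points(level: str, subject: str) -> List[TargetKnowledgePoint]:
    """Return the target knowledge points for a course, or an empty list."""
    return KNOWLEDGE_GRAPH.get((level, subject), [])


level_1_points = target_knowledge_points("level-1", "english")
```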
S203: generate the basic knowledge point label set of the original data resource according to the result of comparing the similarity values with the similarity threshold.
The similarity threshold is the lower bound that a similarity value must exceed to satisfy the condition, that is, the critical similarity value. The basic knowledge point label set is the set of basic knowledge point labels corresponding to the original data resource. It may include the knowledge point labels associated with the target content vocabulary, target high-frequency vocabulary, target verb vocabulary, target mathematical vocabulary, target phonetic symbols, target sentence patterns, and target grammar. The basic knowledge point labels in the set are those associated with target knowledge points whose similarity value is greater than the similarity threshold.
Generally, the similarity value, similarity threshold, and basic knowledge point label differ according to the type of basic knowledge point and the method used to compute the similarity value. When the target knowledge point is target content vocabulary, generating the basic knowledge point label set of the original data resource according to the result of comparing the similarity value with the similarity threshold includes: splitting the text data into sentences to obtain a sentence set; chunking the sentence set to obtain a set of word chunks; computing an importance weight for each word chunk in the set with the TF-IDF keyword extraction algorithm; taking the word chunks whose importance weight is greater than a first preset weight as reference content vocabulary; computing the similarity value between the reference content vocabulary and the target content vocabulary; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target content vocabulary and adding it to the basic knowledge point label set.
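The content vocabulary branch can be sketched as below. Several choices are assumptions rather than details from the application: each sentence is treated as one document for the IDF term, chunks are single lowercase words, and string similarity is measured with `difflib.SequenceMatcher` because the text does not say which similarity measure is used. The names `tfidf_weights` and `content_labels` are hypothetical.

```python
import math
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, List, Set


def tfidf_weights(sentences: List[List[str]]) -> Dict[str, float]:
    """Weight each chunk by term frequency over the whole text times
    log(number of sentences / number of sentences containing the chunk)."""
    all_chunks = [chunk for sentence in sentences for chunk in sentence]
    tf = Counter(all_chunks)
    df = Counter(chunk for sentence in sentences for chunk in set(sentence))
    n, total = len(sentences), len(all_chunks)
    return {c: (tf[c] / total) * math.log(n / df[c]) for c in tf}


def content_labels(sentences: List[List[str]], targets: Dict[str, str],
                   first_weight: float, sim_threshold: float) -> Set[str]:
    """Label the text with every target content word that is similar
    enough to at least one reference content word."""
    weights = tfidf_weights(sentences)
    reference = [c for c, w in weights.items() if w > first_weight]
    labels = set()
    for target, label in targets.items():
        if any(SequenceMatcher(None, ref, target).ratio() > sim_threshold
               for ref in reference):
            labels.add(label)
    return labels


sentences = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "slept"]]
weights = tfidf_weights(sentences)
labels = content_labels(sentences, {"cat": "vocab:cat", "bird": "vocab:bird"},
                        first_weight=0.05, sim_threshold=0.8)
```

In this toy input "the" occurs in every sentence, so its weight is zero and it is never selected as reference content vocabulary.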
When the target knowledge point is target high-frequency vocabulary, generating the basic knowledge point label set of the original data resource according to the result of comparing the similarity value with the similarity threshold includes: splitting the text data into sentences to obtain a sentence set; chunking the sentence set to obtain a set of word chunks; computing an importance weight for each word chunk in the set with the TF-IDF keyword extraction algorithm; taking the word chunks whose importance weight is less than or equal to a second preset weight as reference high-frequency vocabulary; computing the similarity value between the reference high-frequency vocabulary and the target high-frequency vocabulary; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target high-frequency vocabulary and adding it to the basic knowledge point label set.
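The two TF-IDF branches differ only in which side of a preset weight a chunk falls on, which the following sketch makes explicit. The function name `split_by_weight` and the example weights are hypothetical, and the application does not state how the first and second preset weights relate to each other; here they are simply treated as two independent parameters.

```python
from typing import Dict, Set, Tuple


def split_by_weight(weights: Dict[str, float], first_weight: float,
                    second_weight: float) -> Tuple[Set[str], Set[str]]:
    """Return (reference content vocabulary, reference high-frequency vocabulary)."""
    # Content vocabulary: importance weight strictly above the first preset weight.
    content = {c for c, w in weights.items() if w > first_weight}
    # High-frequency vocabulary: importance weight at or below the second preset weight.
    high_frequency = {c for c, w in weights.items() if w <= second_weight}
    return content, high_frequency


# Hypothetical importance weights for three word chunks.
example_weights = {"the": 0.0, "cat": 0.09, "sat": 0.12}
content_vocab, high_freq_vocab = split_by_weight(example_weights, 0.05, 0.01)
```

A low TF-IDF weight marks a chunk that appears throughout the text, which is why it is treated as high-frequency vocabulary in this reading.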
When the target knowledge point is target verb vocabulary, generating the basic knowledge point label set of the original data resource according to the result of comparing the similarity value with the similarity threshold includes: splitting the text data into sentences to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; tagging each word in the word set with its part of speech to obtain a part-of-speech tag set; taking the words tagged as verbs as reference verb vocabulary; computing the similarity value between the reference verb vocabulary and the target verb vocabulary; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target verb vocabulary and adding it to the basic knowledge point label set.
When the target knowledge point is target mathematical vocabulary, generating the basic knowledge point label set of the original data resource according to the result of comparing the similarity value with the similarity threshold includes: splitting the text data into sentences to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; tagging each word in the word set with its part of speech to obtain a part-of-speech tag set; taking the words tagged as numerals as reference mathematical vocabulary; computing the similarity value between the reference mathematical vocabulary and the target mathematical vocabulary; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target mathematical vocabulary and adding it to the basic knowledge point label set.
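The verb and mathematical vocabulary branches both filter a tagged word set by part of speech. The sketch below assumes the tagging has already been done by some tagger and receives `(word, tag)` pairs; the tag names `VERB` and `NUM` and the function `reference_vocabulary` are assumptions, since the application does not name a tagger or tag set.

```python
from typing import List, Tuple


def reference_vocabulary(tagged: List[Tuple[str, str]], wanted_tag: str) -> List[str]:
    """Return the distinct lowercase words carrying the wanted tag, in order."""
    selected: List[str] = []
    for word, tag in tagged:
        lowered = word.lower()
        if tag == wanted_tag and lowered not in selected:
            selected.append(lowered)
    return selected


# Hypothetical tagger output for "She counts three apples and eats two".
tagged_words = [("She", "PRON"), ("counts", "VERB"), ("three", "NUM"),
                ("apples", "NOUN"), ("and", "CCONJ"), ("eats", "VERB"),
                ("two", "NUM")]
reference_verbs = reference_vocabulary(tagged_words, "VERB")
reference_math = reference_vocabulary(tagged_words, "NUM")
```

The resulting lists would then be compared with the target verb or mathematical vocabulary in the same way as the content vocabulary.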
When the target knowledge point is a target phonetic symbol, generating the basic knowledge point label set of the original data resource according to the result of comparing the similarity value with the similarity threshold includes: splitting the text data into sentences to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; analyzing each word in the word set and annotating it with its phonetic transcription to obtain a phonetic set; computing the similarity value between the word transcriptions in the phonetic set and the target phonetic symbol; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target phonetic symbol and adding it to the basic knowledge point label set.
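A minimal sketch of the phonetic branch follows. The application does not define the similarity measure between word transcriptions and a target phonetic symbol, so this example substitutes a simple stand-in: the share of transcribed words whose transcription contains the symbol. The pronunciation table `PRONUNCIATIONS`, the function `phonetic_labels`, and the label strings are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical pronunciation dictionary (word -> IPA transcription).
PRONUNCIATIONS: Dict[str, str] = {"cat": "kæt", "hat": "hæt", "dog": "dɒɡ"}


def phonetic_labels(words: List[str], targets: Dict[str, str],
                    threshold: float) -> Set[str]:
    """Label the text with each target symbol that occurs in a large
    enough share of the transcribed words (assumed similarity measure)."""
    transcriptions = [PRONUNCIATIONS[w] for w in words if w in PRONUNCIATIONS]
    labels: Set[str] = set()
    if not transcriptions:
        return labels
    for symbol, label in targets.items():
        share = sum(symbol in t for t in transcriptions) / len(transcriptions)
        if share > threshold:
            labels.add(label)
    return labels


symbol_targets = {"æ": "phonetic:æ", "ɒ": "phonetic:ɒ"}
found = phonetic_labels(["cat", "hat", "dog"], symbol_targets, threshold=0.5)
```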
When the target knowledge point is a target sentence pattern, generating the basic knowledge point label set of the original data resource according to the result of comparing the similarity value with the similarity threshold includes: splitting the text data into sentences to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; tagging each word in the word set with its part of speech to obtain a part-of-speech tag set; performing dependency parsing on the sentences in the sentence set to obtain dependency syntax trees; computing the similarity values between the words in the word set, the parts of speech in the part-of-speech tag set, and the dependency syntax trees on the one hand and the words, parts of speech, and syntax tree of the target sentence pattern on the other; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target sentence pattern and adding it to the basic knowledge point label set.
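One way to picture the three-part comparison for sentence patterns is sketched below. It is an illustration under assumptions, not the application's method: the input is taken to be already segmented, tagged, and parsed; word overlap and dependency-edge overlap are measured with Jaccard similarity; tag sequences are compared with `difflib.SequenceMatcher`; and the three values are averaged into one score. The text only says that similarity is computed for words, parts of speech, and syntax trees, without giving measures or a combination rule.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable


def jaccard(a: Iterable, b: Iterable) -> float:
    """Size of the intersection divided by size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def pattern_similarity(sentence: Dict[str, list], pattern: Dict[str, list]) -> float:
    """Average of word, part-of-speech, and dependency-edge similarity."""
    word_sim = jaccard(sentence["words"], pattern["words"])
    pos_sim = SequenceMatcher(None, sentence["pos"], pattern["pos"]).ratio()
    tree_sim = jaccard(sentence["edges"], pattern["edges"])
    return (word_sim + pos_sim + tree_sim) / 3


# Hypothetical parser output; edges are (head tag, relation, dependent tag).
edges = [("VERB", "expl", "PRON"), ("VERB", "nsubj", "NOUN"), ("NOUN", "det", "DET")]
sentence = {"words": ["there", "is", "a", "cat"],
            "pos": ["PRON", "VERB", "DET", "NOUN"], "edges": edges}
pattern = {"words": ["there", "is", "a"],
           "pos": ["PRON", "VERB", "DET", "NOUN"], "edges": edges}
score = pattern_similarity(sentence, pattern)
```

Keeping the three component values separate and thresholding each one would be an equally plausible reading of the text.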
When the target knowledge point is target grammar, generating the basic knowledge point label set of the original data resource according to the result of comparing the similarity value with the similarity threshold includes: splitting the text data into sentences to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; tagging each word in the word set with its part of speech to obtain a part-of-speech tag set; computing, from the word set and the part-of-speech tag set, the similarity value between the grammar contained in the sentence set and the target grammar; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target grammar and adding it to the basic knowledge point label set.
S204: generate the comprehensive knowledge point label set of the original data resource according to the feature information of the original data resource and the basic knowledge point label set.
The feature information is the type of the original data resource, which may include audio, video, text, picture book, and other types. Different types train different learning abilities (for example listening, speaking, reading, and writing). A comprehensive knowledge point label in the comprehensive knowledge point label set is a knowledge point label, generated from the feature information of the original data resource and the basic knowledge point label set, that reflects which abilities of the user (student) the original data resource exercises. One original data resource may have several comprehensive knowledge point labels. For example, if the original data resource contains audio and the basic knowledge point label set contains a knowledge point label associated with a target phonetic symbol, analysis yields a listening label as the comprehensive knowledge point label of that original data resource.
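Step S204 can be read as a rule lookup that combines resource type with the categories present in the basic label set. In the sketch below only the first rule (audio plus a phonetic label gives a listening label) comes from the text; the second rule, the `category:value` label format, and the names `RULES` and `comprehensive_labels` are assumptions added for illustration.

```python
from typing import Dict, Set, Tuple

# (resource type, basic label category) -> comprehensive label.
RULES: Dict[Tuple[str, str], str] = {
    ("audio", "phonetic"): "listening",   # example given in the text
    ("text", "vocab"): "reading",         # hypothetical additional rule
}


def comprehensive_labels(resource_types: Set[str], basic_labels: Set[str]) -> Set[str]:
    """Derive comprehensive labels from resource types and basic label categories."""
    categories = {label.split(":", 1)[0] for label in basic_labels}
    return {
        result
        for (resource_type, category), result in RULES.items()
        if resource_type in resource_types and category in categories
    }


audio_result = comprehensive_labels({"audio"}, {"phonetic:æ", "vocab:cat"})
mixed_result = comprehensive_labels({"audio", "text"}, {"phonetic:æ", "vocab:cat"})
```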
Generally, the comprehensive knowledge point labels of an original data resource are related to the characteristics of the resource itself. Both the comprehensive knowledge point labels in the comprehensive knowledge point label set and the basic knowledge point labels in the basic knowledge point label set are associated with the original data resource; once both sets have been generated, the original data resource has been labeled with the relevant knowledge point labels (both basic and comprehensive).
When the solution of this embodiment is executed, the server preprocesses the original data resource to obtain text data, calculates the similarity between the text data and each of a plurality of target knowledge points to obtain similarity values, generates the basic knowledge point label set of the original data resource according to the comparison of the similarity values with the similarity thresholds, and generates the comprehensive knowledge point label set of the original data resource according to the feature information of the original data resource and the basic knowledge point label set. The original data resource can thus be labeled quickly and accurately with its related knowledge point labels, which improves both labeling efficiency and labeling accuracy.
As described above, the embodiments of the present application are described mainly with the online education industry as an example, but those skilled in the art will understand that the method is not limited to online education. The method described in this application is equally applicable to user label processing in industries such as retail, transportation, social networking, search, education and healthcare.
Referring to FIG. 3, which is a schematic flowchart of a data resource labeling method provided by an embodiment of the present application, the data resource labeling method may include the following steps:
S301: Preprocess the original data resource to obtain text data.
Here, the original data resource refers to a data resource in the form of text, audio, video or the like, and may include teaching resources such as exercises, picture books, learning audio and learning videos; the original data resource carries its corresponding learning level and subject. Text data refers to the data obtained by uniformly converting original data resources of text, audio, video and other types into a text type. The basic knowledge points in the text data may include one or more of reference content vocabulary, reference high-frequency vocabulary, reference verb vocabulary, reference mathematical vocabulary, reference phonetic symbols, reference sentence patterns and reference grammar.
Generally, when the original data resource is of audio or video type, it can be converted into text data of a preset text type by ASR (Automatic Speech Recognition) technology. ASR is a technology that converts audio into text based on a keyword list: the audio content (or the audio track of a video) is converted into speech features via its spectrum, the speech features are matched against the entries in the keyword list, and the best match is taken as the recognition result. When the original data resource is of a text type other than the preset text type, it must be converted into the preset text type. Common text types include .txt, .doc, .hlp, .wps, .rtf, .htm and .pdf; the preset text type can be set as required.
S302: Query a preset knowledge graph for the attribute information corresponding to the original data resource to obtain a plurality of target knowledge points corresponding to the attribute information.
Here, the original data resource is a teaching resource and the attribute information is course information, that is, course-related information such as the learning level and subject corresponding to the original data resource. The target knowledge points include at least one of target content vocabulary, target high-frequency vocabulary, target verb vocabulary, target mathematical vocabulary, target phonetic symbols, target sentence patterns and target grammar; they are knowledge points that can be obtained from the knowledge graph. Different learning levels correspond to different target knowledge points, and each target knowledge point is associated with one basic knowledge point label. The knowledge graph contains target knowledge points for different learning levels, grade bands and subjects. A knowledge graph, also known as knowledge domain visualization or a knowledge domain map, is a family of graphical representations showing how knowledge develops and how it is structured; it uses visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw and display knowledge and the relationships within it.
Generally, before the similarity between the text data and each of the target knowledge points is calculated, the course information corresponding to the original data resource must be obtained and used to query the preset knowledge graph for the corresponding target knowledge points; the knowledge graph contains target knowledge points for different learning levels, grade bands and subjects. Calculating the similarity between the text data and each target knowledge point means calculating the similarity value between the basic knowledge points in the text data and the target knowledge points; from that value the basic knowledge point labels of the text data, and hence of the original data resource, can be determined. Different original data resources yield different course information, and the course information is used to retrieve from the preset knowledge graph the target knowledge points corresponding to the resource. The resource does not necessarily contain every target knowledge point retrieved; it may contain only one or several of them. The similarity values between the basic knowledge points in the resource and the target knowledge points in the graph are therefore analyzed to determine which target knowledge points the resource contains, and in turn which basic knowledge point labels can be associated with it.
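As a rough illustration of step S302, the knowledge graph can be approximated by a mapping from course information to target knowledge points. This is only a sketch under stated assumptions: the structure `KNOWLEDGE_GRAPH`, its keys, the sample entries and the function `target_knowledge_points` are all hypothetical, and a real knowledge graph would normally live in a graph database.

```python
# Hypothetical in-memory stand-in for the preset knowledge graph.
# Keys are (learning level, subject); values are target knowledge points,
# each carrying the basic knowledge point label it is associated with.
KNOWLEDGE_GRAPH = {
    ("level-1", "english"): [
        {"kind": "content_vocabulary", "value": "apple", "label": "KP-apple"},
        {"kind": "phonetic_symbol", "value": "ae", "label": "KP-ae"},
    ],
}


def target_knowledge_points(level, subject, graph=KNOWLEDGE_GRAPH):
    """Return the target knowledge points for the given course information."""
    return graph.get((level, subject), [])
```

Only the knowledge points returned for the resource's own level and subject are then compared with the text data, which keeps the number of similarity calculations small.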
S303: Perform sentence segmentation on the text data to obtain a sentence set, then perform chunking on the sentence set to obtain a word block set and word segmentation on the sentence set to obtain a word set.
Here, the sentence set is the set of sentences obtained by performing sentence segmentation on the text data; the sentences in the set are one or more complete sentences obtained by splitting the text data according to its content, line breaks, punctuation marks and so on. The word block set is the set of phrases (word blocks) obtained by dividing each sentence in the sentence set into phrases. The word set is the set of words obtained by dividing each sentence in the sentence set into words.
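The three outputs of step S303 can be sketched with regular expressions for English text. This is a minimal sketch, not the segmentation used in the application: the function names are hypothetical, and `chunk` simply splits on commas and semicolons as a stand-in for a real phrase chunker.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace, and on line breaks."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [part.strip() for part in parts if part.strip()]


def tokenize(sentence):
    """Return the words and numbers in a sentence, dropping punctuation."""
    return re.findall(r"[A-Za-z']+|\d+", sentence)


def chunk(sentence):
    """Stand-in for a phrase chunker: split a sentence on commas and semicolons."""
    pieces = re.split(r"[,;]", sentence.rstrip(".!?"))
    return [piece.strip() for piece in pieces if piece.strip()]
```

Applying `chunk` and `tokenize` to every element returned by `split_sentences` gives per-sentence word block sets and word sets of the kind written {B11, B12, ...} and {T11, T12, ...} in the examples of this description.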
S304: Calculate the importance weight of each word block in the word block set using the TF-IDF keyword extraction algorithm.
Here, the importance weight is the weight of importance obtained for each word block by analyzing the word blocks with the TF-IDF keyword extraction algorithm, that is, the TF-IDF value. A weight is the frequency attached to each number in a weighted average, and is also called a weighting factor.
Generally, TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and data mining. It is a statistical method for evaluating how important a word is to one document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, and decreases in inverse proportion to how frequently it appears across the corpus. TF stands for term frequency, here the number of times the term occurs divided by the total number of sentences in the question-and-answer database; IDF is the weight, namely the inverse document frequency of the word block. The main idea of IDF is that the fewer documents contain term t (that is, the smaller n is), the larger the IDF and the better t distinguishes between categories. If m documents of a class C contain term t and k documents of the other classes contain t, then the total number of documents containing t is n = m + k; when m is large, n is also large, the IDF given by the IDF formula is small, and term t is taken to distinguish categories poorly.
The calculation in the TF-IDF keyword extraction algorithm mainly includes the following. Calculating the term frequency: the "term frequency" of the word blocks is normalized so that texts of different lengths can be compared. Calculating the inverse document frequency: a corpus is used to model the environment in which the language is used; the more common a word is, the larger the denominator and the smaller the inverse document frequency (the closer to 0), and 1 is added to the denominator so that it is never 0 (the case where no document contains the word). Calculating the TF-IDF value: the TF-IDF value is proportional to the number of occurrences of a word in the text and inversely proportional to its occurrences in the language as a whole. Once the TF-IDF value of every word in the text has been calculated, the words can be sorted in descending order of TF-IDF value.
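The calculation described above can be sketched as follows. This is a minimal sketch under stated assumptions: term frequency is normalized by the number of word blocks in the document, the inverse document frequency is taken as log(N / (1 + df)) with the natural logarithm, and the function name `tf_idf` and its argument layout are hypothetical.

```python
import math
from collections import Counter


def tf_idf(document, corpus):
    """Rank the word blocks of one document by TF-IDF, highest first.

    document: list of word blocks for the text being labeled.
    corpus: list of documents, each a list of word blocks.
    """
    counts = Counter(document)
    total = len(document)
    scores = {}
    for term, count in counts.items():
        tf = count / total  # normalized term frequency
        df = sum(1 for doc in corpus if term in doc)  # documents containing the term
        idf = math.log(len(corpus) / (1 + df))  # +1 keeps the denominator above zero
        scores[term] = tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With this smoothing, a word block that occurs in every document of the corpus gets a slightly negative weight, which places it at the bottom of the descending order.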
S305: Take the word blocks whose importance weight is greater than a first preset weight as reference content vocabulary, and take the word blocks whose importance weight is less than or equal to a second preset weight as reference high-frequency vocabulary.
Here, the first preset weight is the basis for selecting reference content vocabulary; both the first preset weight and the second preset weight can be set as required. Reference content vocabulary refers to word blocks that carry a definite meaning. The second preset weight is the basis for selecting reference high-frequency vocabulary; usually the first preset weight is smaller than the second preset weight.
Generally, after each word block in the word block set has been scored for importance with the TF-IDF keyword extraction algorithm, the word blocks can be sorted in descending order of importance weight; the first 1/3 of the sorted word blocks are taken as reference content vocabulary and the last 1/3 as reference high-frequency vocabulary.
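The one-third split described above can be sketched as below. This is a minimal sketch: the function name `split_by_rank` is hypothetical, and rounding the size of each third down is an assumption, since the description does not say how a count not divisible by three is handled.

```python
def split_by_rank(ranked):
    """Split ranked word blocks into content and high-frequency vocabulary.

    ranked: list of (word_block, weight) pairs sorted by weight, descending.
    Returns (reference content vocabulary, reference high-frequency vocabulary).
    """
    third = len(ranked) // 3  # assumption: round down
    content = [block for block, _ in ranked[:third]]
    high_frequency = [block for block, _ in ranked[len(ranked) - third:]]
    return content, high_frequency
```

The middle third is assigned to neither group, so only the most distinctive and the most common word blocks go on to the similarity calculation.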
S306: Calculate the similarity value between the reference content vocabulary and the target content vocabulary, and the similarity value between the reference high-frequency vocabulary and the target high-frequency vocabulary.
Here, a similarity value expresses how similar the two compared quantities are; generally, the larger the similarity value, the more similar the two quantities. In this step the similarity values are those between the reference content vocabulary and the target content vocabulary and between the reference high-frequency vocabulary and the target high-frequency vocabulary. The target content vocabulary and the target high-frequency vocabulary are both target knowledge points in the knowledge graph corresponding to the course information of the original data resource; the reference content vocabulary and the reference high-frequency vocabulary are basic knowledge points in the text data of the original data resource.
S307: When the similarity value is greater than the similarity threshold, obtain the basic knowledge point label associated with the target content vocabulary and the basic knowledge point label associated with the target high-frequency vocabulary.
Here, the similarity threshold is the lower limit that a similarity value must exceed to satisfy the condition, that is, the critical similarity value. Different kinds of basic knowledge point may have different similarity thresholds; the threshold for reference content vocabulary differs from that for reference high-frequency vocabulary. Basic knowledge point labels are associated with target knowledge points in the knowledge graph: if the basic knowledge points of the original data resource include one whose similarity value with a target knowledge point in the knowledge graph exceeds the similarity threshold, that basic knowledge point can be associated with the basic knowledge point label of the corresponding target knowledge point. Labels are data describing the characteristics of a data resource, and different data resources have different label data. Labels effectively express the knowledge points, learning content or learning abilities that a data resource involves, and labeling data resources with different labels makes it possible to filter and analyze them.
For example: the sentence set {S1, S2, …, Sn} obtained by segmentation is chunked sentence by sentence to obtain the word block set of each sentence, {B11, B12, …, B1m1}, {B21, B22, …}, …. Because reference content vocabulary mostly consists of vocabulary on the learning topic, and in order to speed up the calculation, the TF-IDF keyword extraction algorithm is used to score the importance of each word block set {B11, B12, …, B1m1}, {B21, B22, …}, …, giving the importance weight of every word block in each set. After the importance weights are sorted in descending order, the first 1/3 of the word blocks, that is, the word blocks whose importance weight is greater than the first preset weight, are taken as reference content vocabulary and compared in turn with the target content vocabulary. With a reference content word denoted Bi and a target content word denoted Kj, the edit distance similarity sim_raw1 of Bi and Kj, the edit distance similarity after lemmatization sim_lemma1 and the semantic similarity sim_sem1 are calculated, and the total similarity value is computed from them as voc_score1 = vocα·sim_raw1 + vocβ·sim_lemma1 + vocγ·sim_sem1. If the total similarity value is higher than the first preset similarity threshold voc_score_threshold1, the reference content word Bi is labeled with the knowledge point label associated with the target content word Kj.
For example: the sentence set {S1, S2, …, Sn} obtained by segmentation is chunked sentence by sentence to obtain the word block set of each sentence, {B11, B12, …, B1m1}, {B21, B22, …}, …. The TF-IDF keyword extraction algorithm is used to score the importance of each word block set {B11, B12, …, B1m1}, {B21, B22, …}, …, giving the importance weight of every word block in each set. After the importance weights are sorted in descending order, the last 1/3 of the word blocks are taken as reference high-frequency vocabulary and compared in turn with the target high-frequency vocabulary. With a reference high-frequency word denoted Ba and a target high-frequency word denoted Kb, the edit distance similarity sim_raw2 of Ba and Kb, the edit distance similarity after lemmatization sim_lemma2 and the semantic similarity sim_sem2 are calculated, and the total similarity value is computed from them as voc_score2 = vocα·sim_raw2 + vocβ·sim_lemma2 + vocγ·sim_sem2. If the total similarity value is higher than the second preset similarity threshold voc_score_threshold2, the reference high-frequency word Ba is labeled with the knowledge point label associated with the target high-frequency word Kb.
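The weighted similarity score used in these examples can be sketched as follows. This is a sketch under several assumptions: edit distance similarity is taken as 1 minus the Levenshtein distance divided by the longer length, the default weights `alpha`, `beta` and `gamma` (standing for vocα, vocβ and vocγ) are invented values, and the `lemmatize` and `semantic_similarity` callables are hypothetical placeholders that the caller must supply.

```python
def levenshtein(a, b):
    """Return the Levenshtein edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Map edit distance to a similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def voc_score(ref, target, lemmatize, semantic_similarity,
              alpha=0.3, beta=0.3, gamma=0.4):
    """Weighted sum of raw, lemmatized and semantic similarity (weights assumed)."""
    sim_raw = edit_similarity(ref, target)
    sim_lemma = edit_similarity(lemmatize(ref), lemmatize(target))
    sim_sem = semantic_similarity(ref, target)
    return alpha * sim_raw + beta * sim_lemma + gamma * sim_sem
```

A reference word would then receive the label of a target word when `voc_score` exceeds the preset threshold for that kind of vocabulary.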
S308: Add the basic knowledge point label associated with the target content vocabulary and the basic knowledge point label associated with the target high-frequency vocabulary to the basic knowledge point label set.
Here, the basic knowledge point label set is the set of basic knowledge point labels corresponding to the original data resource. The labels it contains are the basic knowledge point labels associated with those target knowledge points whose similarity value exceeds the similarity threshold; different kinds of basic knowledge point have different similarity thresholds. The number of basic knowledge point labels in the set depends on the content of the original data resource. The labels in the set are associated with the original data resource; in other words, a label being in the set means that the original data resource has been labeled with that basic knowledge point label.
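Building the label set with a separate threshold per kind of knowledge point can be sketched as below. This is a minimal sketch: the function name `collect_labels`, the tuple layout and the kind names are hypothetical, and the threshold values used by a caller would be chosen as required.

```python
def collect_labels(scored_pairs, thresholds):
    """Collect the labels whose similarity exceeds the threshold for their kind.

    scored_pairs: iterable of (kind, label, similarity) tuples, where kind is
    for example "content" or "high_frequency" (names assumed here).
    thresholds: mapping from kind to its similarity threshold.
    """
    labels = set()
    for kind, label, similarity in scored_pairs:
        if similarity > thresholds[kind]:  # strictly greater, as in the description
            labels.add(label)
    return labels
```

Using a set means that a label reached through several matching words is recorded only once for the resource.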
S309: Perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set.
Here, the part-of-speech tagging set is the set of parts of speech corresponding to the words in the word set; each word in the word set maps one-to-one to a part of speech in the part-of-speech tagging set. The parts of speech in the set may include nouns, verbs, adjectives, numerals, quantifiers, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections, onomatopoeia and so on; the specific parts of speech depend on the learning content in the original data resource.
Generally, part-of-speech tagging has to determine the most appropriate part of speech for each word from the sentence context. Some words have more than one part of speech; for example, the same word may serve as both a noun and a verb. Such words are called multi-category words, and they occur with high probability among common words. This situation can be handled with probabilistic methods; for example, an HMM (Hidden Markov Model) can be used to tag such words. Words can also be tagged with transformation-based or classification-based methods.
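How an HMM resolves a word that can be either a noun or a verb can be illustrated with Viterbi decoding over a toy model. This is a sketch only: the two-tag tag set, all probabilities in `START_P`, `TRANS_P` and `EMIT_P`, the floor probability for unseen words and the function name `viterbi` are assumptions made for illustration, and the input is assumed to be non-empty.

```python
# Toy HMM with invented probabilities; "book" can be emitted by either tag.
TAGS = ["NOUN", "VERB"]
START_P = {"NOUN": 0.7, "VERB": 0.3}
TRANS_P = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
EMIT_P = {"NOUN": {"dogs": 0.5, "book": 0.3, "rooms": 0.2},
          "VERB": {"book": 0.6, "run": 0.4}}
UNSEEN = 1e-6  # assumed floor probability for words a tag never emitted


def viterbi(words, tags=TAGS, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[i][tag] = (probability of the best path ending in tag, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], UNSEEN), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, UNSEEN), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]
```

In this toy model "book" is tagged as a noun when it stands alone but as a verb between two nouns, which is the context dependence that the description attributes to probabilistic tagging.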
S310: Take the words whose part of speech is verb as reference verb vocabulary, and take the words whose part of speech is numeral as reference mathematical vocabulary.
Generally, part-of-speech analysis of the words in the word set yields the part of speech of each word contained in the original data resource, so the verbs and the numerals can be filtered out separately by part of speech. Words tagged as verbs are taken as reference verb vocabulary knowledge points, and words tagged as numerals are taken as reference mathematical vocabulary knowledge points.
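The filtering in step S310 can be sketched as follows. This is a minimal sketch: the function name `select_reference_words` is hypothetical, and the tag names "VERB" and "NUM" are assumed, since the description does not name a tag set.

```python
def select_reference_words(tagged_words):
    """Split tagged words into reference verb and reference mathematical vocabulary.

    tagged_words: list of (word, tag) pairs from part-of-speech tagging.
    The tag names "VERB" and "NUM" are assumptions for this sketch.
    """
    verbs = [word for word, tag in tagged_words if tag == "VERB"]
    numerals = [word for word, tag in tagged_words if tag == "NUM"]
    return verbs, numerals
```

Words with any other tag are ignored in this step; they may still contribute to the content and high-frequency vocabulary selected from the word blocks.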
S311: Calculate the similarity value between the reference verb vocabulary and the target verb vocabulary, and the similarity value between the reference mathematical vocabulary and the target mathematical vocabulary.
Here, the reference verb vocabulary and the reference mathematical vocabulary are both basic knowledge points contained in the original data resource, and the target verb vocabulary and the target mathematical vocabulary are both target knowledge points in the knowledge graph corresponding to the course information of the original data resource. A similarity value expresses how similar the two compared quantities are; generally, the larger the similarity value, the more similar the two quantities. In this step the similarity values are those between the reference verb vocabulary and the target verb vocabulary and between the reference mathematical vocabulary and the target mathematical vocabulary.
S312: When the similarity value is greater than the similarity threshold, obtain the basic knowledge point label associated with the target verb vocabulary and the basic knowledge point label associated with the target mathematical vocabulary.
Here, the similarity threshold is the lower limit that a similarity value must exceed to satisfy the condition, that is, the critical similarity value. Different kinds of basic knowledge point may have different similarity thresholds; the threshold for reference verb vocabulary differs from that for reference mathematical vocabulary.
S313: Add the basic knowledge point label associated with the target verb vocabulary and the basic knowledge point label associated with the target mathematical vocabulary to the basic knowledge point label set.
Here, basic knowledge point labels are associated with target knowledge points in the knowledge graph: if the basic knowledge points of the original data resource include one whose similarity value with a target knowledge point in the knowledge graph exceeds the similarity threshold, that basic knowledge point can be associated with the basic knowledge point label of the corresponding target knowledge point. Labels are data describing the characteristics of a data resource, and different data resources have different label data. Labels effectively express the knowledge points, learning content or learning abilities that a data resource involves, and labeling data resources with different labels makes it possible to filter and analyze them.
For example: the sentence set {S1, S2, …, Sn} obtained by segmentation is split into words sentence by sentence to obtain the word set of each sentence, {{T11, T12, …, T1o1}, {T21, T22, …, T2o2}, …}. Each word in the word sets is tagged with its part of speech to obtain the part-of-speech tagging set; the words tagged as verbs are taken as reference verb vocabulary and the words tagged as numerals as reference mathematical vocabulary. The reference verb vocabulary is compared in turn with the target verb vocabulary, with a reference verb denoted Vi and a target verb denoted Uj; the numeral vocabulary is compared in turn with the target numeral vocabulary, with a numeral denoted Va and a target numeral denoted Ub. The edit distance similarity sim_raw3 of Vi and Uj, the edit distance similarity after lemmatization sim_lemma3 and the semantic similarity sim_sem3 are calculated, and the total similarity value is computed from them as voc_score3 = vocα·sim_raw3 + vocβ·sim_lemma3 + vocγ·sim_sem3. If the total similarity value is higher than the third preset similarity threshold voc_score_threshold3, the reference verb Vi is labeled with the knowledge point label associated with the target verb Uj. Likewise, the edit distance similarity sim_raw4 of Va and Ub, the edit distance similarity after lemmatization sim_lemma4 and the semantic similarity sim_sem4 are calculated, and the total similarity value is computed as voc_score4 = vocα·sim_raw4 + vocβ·sim_lemma4 + vocγ·sim_sem4. If the total similarity value is higher than the fourth preset similarity threshold voc_score_threshold4, the numeral Va is labeled with the knowledge point label associated with the target numeral Ub.
S314, analyze each word in the word set and annotate each word with its phonetic symbol to obtain a phonetic symbol set.
Wherein, the phonetic symbol set contains the phonetic symbol of each word in the word set; the words in the word set and the phonetic symbols in the phonetic symbol set are in one-to-one correspondence.
S315, calculate the similarity value between each word phonetic symbol in the phonetic symbol set and the target phonetic symbol.
Wherein, the word phonetic symbols in the phonetic symbol set are basic knowledge points contained in the original data resource, and the target phonetic symbol is a target knowledge point in the knowledge graph that corresponds to the course information of the original data resource. A similarity value measures how similar two compared quantities are; generally, the larger the value, the more similar the two quantities. Here the similarity value may be that between a word phonetic symbol and the target phonetic symbol.
S316, when the similarity value is greater than the similarity threshold, acquire the basic knowledge point label associated with the target phonetic symbol.
Wherein, the similarity threshold is the lower limit that a similarity value must exceed to satisfy the condition, that is, the critical similarity value. Different basic knowledge points may have different similarity thresholds; the threshold for word phonetic symbols can be set as required and may differ from the similarity thresholds described above.
S317, add the basic knowledge point label associated with the target phonetic symbol to the basic knowledge point label set.
Wherein, a basic knowledge point label is associated with a target knowledge point in the knowledge graph. If a basic knowledge point of the original data resource has a similarity value with a target knowledge point in the knowledge graph that is greater than the similarity threshold, that basic knowledge point can be associated with the basic knowledge point label of the corresponding target knowledge point. A label is data that describes the characteristics of a data resource, and different data resources have different label data. Labels can effectively represent the knowledge points, learning content, or learning abilities involved in a data resource, and labeling data resources with different labels enables the data resources to be screened and analyzed.
For example: the sentence set {S1, S2 … Sn} obtained by sentence segmentation is segmented into words sentence by sentence to obtain the word set of each sentence {{T11, T12 … T1o1}, {T21, T22 … T2o2} …}. A dictionary tool converts the words in the word set into the corresponding phonetic symbol set {{P11, P12 … P1o1}, {P21, P22 … P2o2} …}. Each word phonetic symbol Pi in the phonetic symbol set is compared in turn with the target phonetic symbol Kj. The comparison includes: the containment similarity sim_in, which indicates whether the pronunciation combination is contained in the source word Ti corresponding to the word phonetic symbol Pi; the edit-distance similarity sim_edit between Pi and Kj; and the longest-common-substring similarity sim_lcs between Pi and Kj. From these scores the total similarity value is calculated as phon_score = phonα·sim_in + phonβ·sim_edit + phonγ·sim_lcs. If the total similarity value is higher than the preset similarity threshold phon_score_threshold, the word phonetic symbol Pi is labeled with the knowledge point label associated with the target phonetic symbol Kj.
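The phonetic score can be sketched as below. This is a minimal illustration, not the patented implementation. It assumes that each target phonetic knowledge point carries a letter combination (checked against the source word for sim_in) and a phoneme sequence, that phonemes are represented as lists of string tokens, and that sim_lcs is normalized by the shorter sequence. The weights (0.4, 0.2, 0.4) and the phoneme notation are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over sequences of phoneme tokens."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def longest_common_substring(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest contiguous run shared by both sequences."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def phon_score(word: str, phones: Sequence[str],
               target_letters: str, target_phones: Sequence[str],
               alpha: float = 0.4, beta: float = 0.2, gamma: float = 0.4) -> float:
    """phon_score = alpha*sim_in + beta*sim_edit + gamma*sim_lcs."""
    # sim_in: is the target letter combination contained in the source word?
    sim_in = 1.0 if target_letters in word else 0.0

    longest = max(len(phones), len(target_phones))
    sim_edit = 1.0 if longest == 0 else 1.0 - edit_distance(phones, target_phones) / longest

    shortest = min(len(phones), len(target_phones))
    sim_lcs = 0.0 if shortest == 0 else longest_common_substring(phones, target_phones) / shortest

    return alpha * sim_in + beta * sim_edit + gamma * sim_lcs
```

For instance, comparing the word "teacher" (phonemes `["T", "IY", "CH", "ER"]`) against a target with letters "ea" and phonemes `["IY"]` gives a high score because both the letter combination and the phoneme are contained in the word, even though the edit-distance term alone is low.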
S318, perform dependency parsing on the sentences in the sentence set to obtain dependency syntax trees.
Wherein, a dependency syntax tree is a relation tree that describes the dependency relations between the words in the data resource. It expresses the syntactic collocation relations between words, and these collocation relations are tied to semantics. The basic task of dependency parsing is to determine the syntactic structure of a sentence or the dependency relations between the words in the sentence. It mainly covers two aspects: determining the grammar system of the language, that is, giving a formal definition of the grammatical structure of legal sentences in the language; and syntactic analysis technology, that is, automatically deriving the syntactic structure of a sentence according to a given grammar system and analyzing the syntactic units contained in the sentence and the relations between them.
S319, calculate the similarity values between the words in the word set, the parts of speech in the part-of-speech tagging set, and the dependency syntax tree on the one hand, and the words, parts of speech, and syntax tree of the target sentence pattern on the other.
Wherein, a similarity value measures how similar two compared quantities are; generally, the larger the value, the more similar the two quantities. Here the similarity values may be those between the words in the word set of the original data resource, the parts of speech in the part-of-speech tagging set, and the dependency syntax tree, and respectively the words, parts of speech, and syntax tree of the target sentence pattern.
S320, when the similarity value is greater than the similarity threshold, acquire the basic knowledge point label associated with the target sentence pattern.
Wherein, the similarity threshold is the lower limit that a similarity value must exceed to satisfy the condition, that is, the critical similarity value. Different basic knowledge points may have different similarity thresholds; the thresholds for the words in the word set, the parts of speech in the part-of-speech tagging set, and the dependency syntax tree can be set as required and may differ from the similarity thresholds described above.
S321, add the basic knowledge point label associated with the target sentence pattern to the basic knowledge point label set.
Wherein, a basic knowledge point label is associated with a target knowledge point in the knowledge graph. If a basic knowledge point of the original data resource has a similarity value with a target knowledge point in the knowledge graph that is greater than the similarity threshold, that basic knowledge point can be associated with the basic knowledge point label of the corresponding target knowledge point. A label is data that describes the characteristics of a data resource, and different data resources have different label data. Labels can effectively represent the knowledge points, learning content, or learning abilities involved in a data resource, and labeling data resources with different labels enables the data resources to be screened and analyzed.
For example: the sentence set {S1, S2 … Sn} obtained by sentence segmentation is segmented into words sentence by sentence to obtain the word set of each sentence {{T11, T12 … T1o1}, {T21, T22 … T2o2} …}. Part-of-speech tagging and dependency parsing are performed on the sentence set {S1, S2 … Sn} to obtain, respectively, the part-of-speech tagging set {{Pos11, Pos12 … Pos1o1}, {Pos21, Pos22 … Pos2o2} …} and the dependency syntax trees {Tree1, Tree2 … Treen}. The following are then calculated in turn: the Jaccard similarity sim_token_jaccard between the word set {Ti1, Ti2 … Tim}, Ti1 = {T11, T12 … T1o1}, and the example sentence set {KTj1, KTj2 … KTjn} of the target sentence pattern knowledge point; the edit-distance similarity sim_pos_edit and the longest-common-substring similarity sim_pos_lcs between the part-of-speech tagging set {Posi1, Posi2 … Posio1} and the part-of-speech tagging set {KPosj1, KPosj2 … KPosjn} of the example sentences of the target sentence pattern knowledge point; and the tree similarity sim_tree between Treei and the tree KTreej of the example sentence of the target sentence pattern knowledge point. From these similarity values the total similarity value is calculated as sent_score = sentα·sim_token_jaccard + sentβ·sim_pos_edit + sentγ·sim_pos_lcs + sentθ·sim_tree. If the total similarity value is higher than the preset similarity threshold sent_score_threshold, the sentence Si is labeled with the basic knowledge point label associated with the sentence pattern Kj.
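A sketch of the four-term sentence pattern score follows. The text does not say how the tree similarity sim_tree is computed, so the code below swaps in a simple stand-in: the Jaccard overlap of (head part of speech, relation, dependent part of speech) triples. The dictionary layout of a parsed sentence, the equal weights, and normalizing the longest common substring by the longer sequence are also assumptions made only for illustration.

```python
from difflib import SequenceMatcher


def jaccard(a, b) -> float:
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def edit_distance(a, b) -> int:
    """Levenshtein distance over sequences (here: part-of-speech tags)."""
    prev = list(range(len(b) + 1))
    for i, xa in enumerate(a, start=1):
        curr = [i]
        for j, xb in enumerate(b, start=1):
            cost = 0 if xa == xb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a, b) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def lcs_similarity(a, b) -> float:
    """Longest common contiguous run, normalized by the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size / longest


def sent_score(sentence: dict, example: dict,
               weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """sent_score = a*sim_token_jaccard + b*sim_pos_edit + c*sim_pos_lcs + d*sim_tree."""
    alpha, beta, gamma, theta = weights
    sim_token_jaccard = jaccard(sentence["tokens"], example["tokens"])
    sim_pos_edit = edit_similarity(sentence["pos"], example["pos"])
    sim_pos_lcs = lcs_similarity(sentence["pos"], example["pos"])
    # Stand-in tree similarity: overlap of dependency triples (an assumption).
    sim_tree = jaccard(sentence["edges"], example["edges"])
    return (alpha * sim_token_jaccard + beta * sim_pos_edit
            + gamma * sim_pos_lcs + theta * sim_tree)


# Hypothetical parsed sentences; a real system would obtain these from a parser.
sentence = {
    "tokens": ["i", "like", "apples"],
    "pos": ["PRON", "VERB", "NOUN"],
    "edges": [("VERB", "nsubj", "PRON"), ("VERB", "obj", "NOUN")],
}
example = {
    "tokens": ["i", "like", "bananas"],
    "pos": ["PRON", "VERB", "NOUN"],
    "edges": [("VERB", "nsubj", "PRON"), ("VERB", "obj", "NOUN")],
}
```

Here the two sentences share their part-of-speech sequence and dependency structure but differ in one word, so three of the four terms are 1.0 and the token overlap lowers the total only slightly.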
S322, calculate, based on the word set and the part-of-speech tagging set, the similarity value between the grammar contained in the sentence set and the target grammar.
Wherein, the grammar contained in the sentence set is a basic knowledge point of the original data resource, and the target grammar is a target knowledge point in the knowledge graph that corresponds to the course information of the original data resource. A similarity value measures how similar two compared quantities are; generally, the larger the value, the more similar the two quantities. Here the similarity value may be that between the grammar in the word set of the original data resource and the target grammar.
S323, when the similarity value is greater than the similarity threshold, acquire the basic knowledge point label associated with the target grammar.
Wherein, the similarity threshold is the lower limit that a similarity value must exceed to satisfy the condition, that is, the critical similarity value. Different basic knowledge points may have different similarity thresholds; the threshold for the grammar in the word set can be set as required and may differ from the similarity thresholds described above.
S324, add the basic knowledge point label associated with the target grammar to the basic knowledge point label set.
Wherein, a basic knowledge point label is associated with a target knowledge point in the knowledge graph. If a basic knowledge point of the original data resource has a similarity value with a target knowledge point in the knowledge graph that is greater than the similarity threshold, that basic knowledge point can be associated with the basic knowledge point label of the corresponding target knowledge point. A label is data that describes the characteristics of a data resource, and different data resources have different label data. Labels can effectively represent the knowledge points, learning content, or learning abilities involved in a data resource, and labeling data resources with different labels enables the data resources to be screened and analyzed.
For example: the sentence set {S1, S2 … Sn} obtained by sentence segmentation is segmented into words sentence by sentence to obtain the word set of each sentence {{T11, T12 … T1o1}, {T21, T22 … T2o2} …}. Part-of-speech tagging is performed on the sentence set {S1, S2 … Sn} to obtain the part-of-speech tagging set {{Pos11, Pos12 … Pos1o1}, {Pos21, Pos22 … Pos2o2} …}. The following are then calculated: the containment similarity sim_in, which indicates whether the grammar fragment is contained in the word set {Ti1, Ti2 … Tim}, Ti1 = {T11, T12 … T1o1}, Ti2 = {T21, T22 … T2o2} …, or in the part-of-speech tagging set {KPosj1, KPosj2 … KPosjn}, KPosj1 = {Pos11, Pos12 … Pos1o1}, KPosj2 = {Pos21, Pos22 … Pos2o2} …; the position similarity sim_position of the grammar fragment in the words Ti1 = {T11, T12 … T1o1} and in the target words {KTj1, KTj2 … KTjn}; and the part-of-speech similarity sim_pos of the grammar fragment. From these similarities the total similarity value is calculated as gram_score = gramα·sim_in + gramβ·sim_position + gramγ·sim_pos. If the total similarity value is higher than the preset similarity threshold gram_score_threshold, the sentence Si is labeled with the basic knowledge point label associated with the grammar of the target sentence Kj.
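The grammar score is described only loosely, so the sketch below is one possible reading rather than the method itself. It assumes that a grammar fragment is a contiguous token sequence, that sim_in is 1.0 when the fragment occurs in the sentence, that sim_position compares the fragment's relative position in the sentence and in the target example, and that sim_pos is the share of matching part-of-speech tags over the fragment span. The function names and the weights (0.5, 0.2, 0.3) are hypothetical.

```python
from typing import Sequence


def find_fragment(seq: Sequence[str], frag: Sequence[str]) -> int:
    """Index of the first contiguous occurrence of frag in seq, or -1."""
    n, m = len(seq), len(frag)
    for i in range(n - m + 1):
        if list(seq[i:i + m]) == list(frag):
            return i
    return -1


def gram_score(tokens: Sequence[str], pos: Sequence[str],
               target_tokens: Sequence[str], target_pos: Sequence[str],
               fragment: Sequence[str],
               alpha: float = 0.5, beta: float = 0.2, gamma: float = 0.3) -> float:
    """gram_score = alpha*sim_in + beta*sim_position + gamma*sim_pos."""
    if not fragment:
        raise ValueError("fragment must contain at least one token")
    m = len(fragment)
    i = find_fragment(tokens, fragment)
    j = find_fragment(target_tokens, fragment)

    sim_in = 1.0 if i >= 0 else 0.0
    if i < 0 or j < 0:
        # Without the fragment on both sides, position and POS cannot be compared.
        sim_position = sim_pos = 0.0
    else:
        # Relative start position of the fragment within each sentence.
        rel_i = i / max(len(tokens) - 1, 1)
        rel_j = j / max(len(target_tokens) - 1, 1)
        sim_position = 1.0 - abs(rel_i - rel_j)
        # Share of matching part-of-speech tags over the fragment span.
        matches = sum(a == b for a, b in zip(pos[i:i + m], target_pos[j:j + m]))
        sim_pos = matches / m
    return alpha * sim_in + beta * sim_position + gamma * sim_pos
```

Under this reading, a sentence such as "she is going to swim" scores highly against a target example "i am going to read books" for the fragment "going to", because the fragment is present, sits at a similar relative position, and carries the same part-of-speech tags.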
For example: the basic knowledge point labels of the original data resource and their counts can be calculated and recorded as Tag(Text) = {voc: num1, verb: num2, math: num3, hfw: num4, phon: num5, sent: num6, gram: num7 | numi >= 0}. The overall process of calculating the similarity between the basic knowledge points of the original data resource and the target knowledge points in the knowledge graph that correspond to the course information of the original data resource is shown in FIG. 4.
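As a small illustration of the Tag(Text) record, the counts per category can be accumulated with a counter. The `(category, label)` pair representation and the sample label names are assumptions; only the seven category keys come from the text.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Category keys as they appear in Tag(Text).
CATEGORIES = ("voc", "verb", "math", "hfw", "phon", "sent", "gram")


def tag_counts(labels: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count matched basic knowledge point labels per category.

    `labels` holds (category, label_name) pairs; categories with no
    match are reported as 0 so that every count is >= 0.
    """
    counts = Counter(category for category, _ in labels)
    return {category: counts.get(category, 0) for category in CATEGORIES}


# Hypothetical matches collected from the preceding steps.
matched = [("voc", "animal"), ("voc", "fruit"), ("phon", "long-e"), ("sent", "svo")]
print(tag_counts(matched))
```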
S325, generate a comprehensive knowledge point label set of the original data resource according to the feature information of the original data resource and the basic knowledge point label set.
Wherein, the feature information refers to the type of the original data resource. For example, the type may be audio, video, text, picture book, and so on, and different types train different learning abilities (such as listening, speaking, reading, and writing). A comprehensive knowledge point label in the comprehensive knowledge point label set is a knowledge point label, generated from the feature information of the original data resource and the basic knowledge point label set, that reflects which ability of the user (student) the original data resource exercises. One original data resource may have multiple comprehensive knowledge point labels. For example, if the original data resource contains audio and the basic knowledge point label set contains a knowledge point label associated with a target phonetic symbol, analysis yields a listening label as the comprehensive knowledge point label of the original data resource.
Generally, the comprehensive knowledge point labels of the original data resource depend on the characteristics of the original data resource itself. Both the comprehensive knowledge point labels in the comprehensive knowledge point label set and the basic knowledge point labels in the basic knowledge point label set are associated with the original data resource. Once the basic knowledge point label set and the comprehensive knowledge point label set have been generated, the original data resource has been labeled with the relevant knowledge point labels (including basic knowledge point labels and comprehensive knowledge point labels).
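The step from resource type and basic labels to comprehensive labels can be sketched as a small rule table. Only the first rule (audio plus a phonetic label yields a listening label) comes from the text; the second rule and all identifiers are hypothetical and are included only to show how further rules would be added.

```python
from typing import Dict, List, Set, Tuple

# (required resource type, required basic label category, comprehensive label)
RULES: List[Tuple[str, str, str]] = [
    ("audio", "phon", "listening"),  # example given in the text
    ("text", "sent", "reading"),     # hypothetical additional rule
]


def comprehensive_labels(resource_types: Set[str],
                         basic_counts: Dict[str, int]) -> Set[str]:
    """Derive comprehensive labels from resource types and basic label counts."""
    labels = set()
    for resource_type, category, label in RULES:
        if resource_type in resource_types and basic_counts.get(category, 0) > 0:
            labels.add(label)
    return labels
```

A resource may match several rules at once, which is consistent with the statement that one original data resource can carry multiple comprehensive knowledge point labels.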
When the solution of the embodiments of the present application is executed, the server preprocesses the original data resource to obtain text data, queries the preset knowledge graph for the attribute information corresponding to the original data resource, and obtains multiple target knowledge points corresponding to the attribute information. The server performs sentence segmentation on the text data to obtain a sentence set and performs chunking on the sentence set to obtain a word block set. Based on the TF-IDF keyword extraction algorithm, the server calculates the importance weight of each word block in the word block set, takes word blocks whose importance weight is greater than a first preset weight as the reference content vocabulary, and takes word blocks whose importance weight is less than or equal to a second preset weight as the reference high-frequency vocabulary. The server calculates the similarity values between the reference content vocabulary and the target content vocabulary and between the reference high-frequency vocabulary and the target high-frequency vocabulary; when a similarity value is greater than the similarity threshold, it acquires the basic knowledge point label associated with the target content vocabulary or the target high-frequency vocabulary and adds that label to the basic knowledge point label set.
Based on the word set, the server tags each word in the word set with its part of speech to obtain a part-of-speech tagging set, takes words tagged as verbs as the reference verb vocabulary, and takes words tagged as numerals as the reference mathematical vocabulary. The server calculates the similarity values between the reference verb vocabulary and the target verb vocabulary and between the reference mathematical vocabulary and the target mathematical vocabulary; when a similarity value is greater than the similarity threshold, it acquires the basic knowledge point label associated with the target verb vocabulary or the target mathematical vocabulary and adds that label to the basic knowledge point label set.
The server analyzes each word in the word set and annotates each word with its phonetic symbol to obtain a phonetic symbol set, and calculates the similarity value between each word phonetic symbol in the phonetic symbol set and the target phonetic symbol; when the similarity value is greater than the similarity threshold, it acquires the basic knowledge point label associated with the target phonetic symbol and adds that label to the basic knowledge point label set. The server performs dependency parsing on the sentences in the sentence set to obtain dependency syntax trees, and calculates the similarity values between the words in the word set, the parts of speech in the part-of-speech tagging set, and the dependency syntax tree, and respectively the words, parts of speech, and syntax tree of the target sentence pattern; when the similarity value is greater than the similarity threshold, it acquires the basic knowledge point label associated with the target sentence pattern and adds that label to the basic knowledge point label set. Based on the word set and the part-of-speech tagging set, the server calculates the similarity value between the grammar contained in the sentence set and the target grammar; when the similarity value is greater than the similarity threshold, it acquires the basic knowledge point label associated with the target grammar and adds that label to the basic knowledge point label set. Finally, the server generates the comprehensive knowledge point label set of the original data resource according to the feature information of the original data resource and the basic knowledge point label set.
In this way, the original data resource can be labeled quickly and accurately with the knowledge point labels related to it, which improves both labeling efficiency and labeling accuracy.
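The TF-IDF split into reference content vocabulary and reference high-frequency vocabulary mentioned in this summary can be sketched as follows. The sketch treats each sentence as a document and each token as a word block, uses the plain form tf × log(N / df), and picks thresholds of 0.1 and 0.01; all of these are assumptions, since the text fixes neither the TF-IDF variant nor the two preset weights.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple


def tfidf_weights(sentences: List[List[str]]) -> Dict[str, float]:
    """Importance weight per token: term frequency times inverse sentence frequency."""
    n = len(sentences)
    term_freq = Counter(token for sentence in sentences for token in sentence)
    total = sum(term_freq.values())
    # Number of sentences in which each token occurs at least once.
    doc_freq = Counter(token for sentence in sentences for token in set(sentence))
    return {token: (count / total) * math.log(n / doc_freq[token])
            for token, count in term_freq.items()}


def split_vocabulary(weights: Dict[str, float],
                     first_weight: float,
                     second_weight: float) -> Tuple[Set[str], Set[str]]:
    """Weights above first_weight become content words; weights at or
    below second_weight become high-frequency words."""
    content = {t for t, w in weights.items() if w > first_weight}
    high_frequency = {t for t, w in weights.items() if w <= second_weight}
    return content, high_frequency


sentences = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "bird", "sang"]]
weights = tfidf_weights(sentences)
content, high_frequency = split_vocabulary(weights, first_weight=0.1, second_weight=0.01)
```

A token such as "the" that appears in every sentence receives an inverse sentence frequency of log(1) = 0 and therefore lands in the high-frequency set, while tokens confined to one sentence receive higher weights and land in the content set.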
As described above, the embodiments take the online education industry as the main example, but those skilled in the art will understand that the method is not limited to the online education industry. The method described in this application is also applicable to user label processing in industries such as retail, transportation, social networking, search, education, and medical care.
The following are apparatus embodiments of the present application, which can be used to execute the method embodiments of the present application. For details not disclosed in the apparatus embodiments of the present application, please refer to the method embodiments of the present application.
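Please refer to FIG. 5, which shows a schematic structural diagram of a data resource labeling apparatus provided by an exemplary embodiment of the present application, hereinafter referred to as apparatus 5. Apparatus 5 may be implemented as all or part of a terminal by software, hardware, or a combination of both. Apparatus 5 includes a preprocessing module 501, a calculation module 502, a first processing module 503, and a second processing module 504.
The preprocessing module 501 is configured to preprocess the original data resource to obtain text data;
the calculation module 502 is configured to calculate a similarity value between the text data and each of multiple target knowledge points; wherein each of the multiple target knowledge points is associated with one basic knowledge point label;
the first processing module 503 is configured to generate the basic knowledge point label set of the original data resource according to the result of comparing the similarity values with the similarity threshold; wherein the basic knowledge point labels included in the basic knowledge point label set are the basic knowledge point labels associated with the target knowledge points whose similarity value is greater than the similarity threshold;
the second processing module 504 is configured to generate the comprehensive knowledge point label set of the original data resource according to the feature information of the original data resource and the basic knowledge point label set.
Optionally, the apparatus 5 further includes:
a query unit, configured to query the preset knowledge graph for the attribute information corresponding to the original data resource, and to obtain the multiple target knowledge points corresponding to the attribute information.
Optionally, the original data resource in the apparatus 5 is a teaching resource, and the attribute information is course information.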
Optionally, the first processing module 503 includes:
a first processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and to perform, on the sentence set, chunking to obtain a word block set and word segmentation to obtain a word set;
a first calculation unit, configured to analyze the word block set and the word set to obtain a reference vocabulary set, and to calculate the similarity value between each reference vocabulary in the reference vocabulary set and its corresponding target knowledge point; wherein the reference vocabulary set includes reference content vocabulary, reference high-frequency vocabulary, reference verb vocabulary, and reference mathematical vocabulary, the target knowledge point corresponding to the reference content vocabulary is the target content vocabulary, the target knowledge point corresponding to the reference high-frequency vocabulary is the target high-frequency vocabulary, the target knowledge point corresponding to the reference verb vocabulary is the target verb vocabulary, and the target knowledge point corresponding to the reference mathematical vocabulary is the target mathematical vocabulary;
a first adding unit, configured to, when the similarity value for a corresponding target knowledge point is greater than the corresponding similarity threshold, add the basic knowledge point label of that target knowledge point to the basic knowledge point label set; or
a second processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and to perform word segmentation on the sentence set to obtain a word set;
a second calculation unit, configured to analyze the word set to obtain a phonetic symbol set, a part-of-speech tagging set, and dependency syntax trees, and to calculate the similarity values between the word phonetic symbols in the phonetic symbol set, the sentence patterns in the sentence set, and the grammar in the sentence set and their respective target knowledge points; wherein the target knowledge point corresponding to a word phonetic symbol is the target phonetic symbol, the target knowledge point corresponding to a sentence pattern is the target sentence pattern, and the target knowledge point corresponding to the grammar is the target grammar;
a second adding unit, configured to, when the similarity value for a corresponding target knowledge point is greater than the corresponding similarity threshold, add the basic knowledge point label of that target knowledge point to the basic knowledge point label set.
Optionally, the first processing module 503 includes:
a third processing unit, configured to perform sentence segmentation on the learning text data to obtain a sentence set, and to perform chunking on the sentence set to obtain a word block set;
a third calculation unit, configured to calculate the importance weight of each word block in the word block set based on the TF-IDF keyword extraction algorithm;
a first selection unit, configured to take the word blocks whose importance weight is greater than the first preset weight as the reference content vocabulary;
a fourth calculation unit, configured to calculate the similarity value between the reference content vocabulary and the target content vocabulary;
a first acquiring unit, configured to acquire the basic knowledge point label associated with the target content vocabulary when the similarity value is greater than the similarity threshold;
a first appending unit, configured to add the basic knowledge point label associated with the target content vocabulary to the basic knowledge point label set.
Optionally, the first processing module 503 includes:
a fourth processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and to perform chunking on the sentence set to obtain a word block set;
a fourth calculation unit, configured to calculate the importance weight of each word block in the word block set based on the TF-IDF keyword extraction algorithm;
a second selection unit, configured to take the word blocks whose importance weight is less than or equal to the second preset weight as the reference high-frequency vocabulary;
a fifth calculation unit, configured to calculate the similarity value between the reference high-frequency vocabulary and the target high-frequency vocabulary;
a second acquiring unit, configured to acquire the basic knowledge point label associated with the target high-frequency vocabulary when the similarity value is greater than the similarity threshold;
a second appending unit, configured to add the basic knowledge point label associated with the target high-frequency vocabulary to the basic knowledge point label set.
Optionally, the first processing module 503 includes:
a fifth processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;
a first tagging unit, configured to perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
a third selection unit, configured to use the words whose part of speech is verb as the reference verb vocabulary;
a sixth calculation unit, configured to calculate the similarity value between the reference verb vocabulary and the target verb vocabulary;
a third obtaining unit, configured to obtain the basic knowledge point label associated with the target verb vocabulary when the similarity value is greater than the similarity threshold;
a third adding unit, configured to add the basic knowledge point label associated with the target verb vocabulary to the basic knowledge point label set.
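As an illustration of the verb embodiment above, the sketch below filters tagged words by part of speech and compares each verb with the target verb vocabulary. Everything concrete here is an assumption made for the example: the toy lexicon `TOY_POS` stands in for a real part-of-speech tagger, the character-level `difflib.SequenceMatcher` ratio stands in for whatever similarity measure an implementation would actually use, and the label strings are invented.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon; a real system would call a part-of-speech tagger.
TOY_POS = {
    "she": "PRON",
    "runs": "VERB",
    "fast": "ADV",
    "and": "CONJ",
    "jumps": "VERB",
}


def tag(words):
    """Part-of-speech tagging: pair each word with its tag ('X' if unknown)."""
    return [(w, TOY_POS.get(w.lower(), "X")) for w in words]


def reference_verbs(tagged):
    """Select the words tagged as verbs as the reference verb vocabulary."""
    return [w for w, pos in tagged if pos == "VERB"]


def similarity(a, b):
    """Character-level similarity in [0, 1] (an assumed stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def verb_labels(words, target_verbs, threshold):
    """Collect the labels of target verbs that some reference verb resembles."""
    labels = set()
    for verb in reference_verbs(tag(words)):
        for target, label in target_verbs.items():
            if similarity(verb, target) > threshold:
                labels.add(label)
    return labels


# Hypothetical target verb vocabulary mapped to basic knowledge point labels.
targets = {"run": "verb:run", "swim": "verb:swim"}
labels = verb_labels(["She", "runs", "fast", "and", "jumps"], targets, threshold=0.8)
```

The numeral embodiment follows the same shape with the filter changed from verbs to numerals.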
Optionally, the first processing module 503 includes:
a sixth processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;
a second tagging unit, configured to perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
a fourth selection unit, configured to use the words whose part of speech is numeral as the reference mathematical vocabulary;
a seventh calculation unit, configured to calculate the similarity value between the reference mathematical vocabulary and the target mathematical vocabulary;
a fourth obtaining unit, configured to obtain the basic knowledge point label associated with the target mathematical vocabulary when the similarity value is greater than the similarity threshold;
a fourth adding unit, configured to add the basic knowledge point label associated with the target mathematical vocabulary to the basic knowledge point label set.
Optionally, the first processing module 503 includes:
a seventh processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;
a third tagging unit, configured to analyze each word in the word set and annotate it with phonetic symbols to obtain a phonetic symbol set;
an eighth calculation unit, configured to calculate the similarity value between the phonetic symbols of the words in the phonetic symbol set and the target phonetic symbol;
a fifth selection unit, configured to obtain the basic knowledge point label associated with the target phonetic symbol when the similarity value is greater than the similarity threshold;
a fifth adding unit, configured to add the basic knowledge point label associated with the target phonetic symbol to the basic knowledge point label set.
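One way to picture the phonetic comparison is to treat each transcription as a sequence of phoneme symbols and derive a similarity from the edit distance between sequences. The sketch below does this under stated assumptions: the pronunciation table `TOY_PRONUNCIATIONS`, the normalisation `1 - distance / longest`, and the label strings are all hypothetical, and the text does not say which phonetic similarity measure is actually used.

```python
# Hypothetical pronunciation table; a real system would use a pronouncing dictionary.
TOY_PRONUNCIATIONS = {
    "ship": ["SH", "IH", "P"],
    "sheep": ["SH", "IY", "P"],
    "cat": ["K", "AE", "T"],
}


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def phonetic_similarity(a, b):
    """Similarity in [0, 1]: 1 minus the length-normalised edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def phonetic_labels(words, targets, threshold):
    """Collect labels of target transcriptions that some word's transcription resembles."""
    labels = set()
    for word in words:
        phonemes = TOY_PRONUNCIATIONS.get(word.lower())
        if phonemes is None:
            continue  # no transcription available for this word
        for target_phonemes, label in targets:
            if phonetic_similarity(phonemes, target_phonemes) > threshold:
                labels.add(label)
    return labels


# Hypothetical target phonetic symbol with its basic knowledge point label.
targets = [(["SH", "IY", "P"], "phonics:sh-iy-p")]
strict = phonetic_labels(["ship", "cat"], targets, threshold=0.9)
loose = phonetic_labels(["ship", "cat"], targets, threshold=0.6)
```

With the stricter threshold, "ship" does not match the target because one of its three phonemes differs; lowering the threshold admits it, which shows how the similarity threshold controls the labeling.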
Optionally, the first processing module 503 includes:
an eighth processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;
a fourth tagging unit, configured to perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
an analysis unit, configured to perform dependency syntax analysis on the sentences in the sentence set to obtain a dependency syntax tree;
a ninth calculation unit, configured to calculate similarity values by comparing the words in the word set, the parts of speech in the part-of-speech tagging set, and the dependency syntax tree with the words, parts of speech, and syntax tree of the target sentence pattern, respectively;
a fifth obtaining unit, configured to obtain the basic knowledge point label associated with the target sentence pattern when the similarity value is greater than the similarity threshold;
a sixth adding unit, configured to add the basic knowledge point label associated with the target sentence pattern to the basic knowledge point label set.
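The three-way comparison against a target sentence pattern could look like the following sketch. Several choices here are assumptions for illustration only: a sentence is represented as a dictionary with `words`, `pos`, and `deps` keys; the dependency tree is flattened into (head part of speech, relation, dependent part of speech) triples; and the three similarity values are averaged into one score, although the text only states that they are computed respectively, not how they are combined.

```python
from difflib import SequenceMatcher


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty collections count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pattern_similarity(sentence, target):
    """Average of word, part-of-speech, and dependency-tree similarity (assumed rule)."""
    word_sim = jaccard(sentence["words"], target["words"])
    pos_sim = SequenceMatcher(None, sentence["pos"], target["pos"]).ratio()
    tree_sim = jaccard(sentence["deps"], target["deps"])
    return (word_sim + pos_sim + tree_sim) / 3


# Dependency triples shared by both examples: (head POS, relation, dependent POS).
THERE_BE_DEPS = {
    ("VERB", "expl", "EX"),
    ("VERB", "nsubj", "NOUN"),
    ("NOUN", "det", "DET"),
}

sentence = {
    "words": ["there", "is", "a", "cat"],
    "pos": ["EX", "VERB", "DET", "NOUN"],
    "deps": THERE_BE_DEPS,
}
target_pattern = {
    "words": ["there", "is", "a", "dog"],
    "pos": ["EX", "VERB", "DET", "NOUN"],
    "deps": THERE_BE_DEPS,
}

score = pattern_similarity(sentence, target_pattern)
matches = score > 0.8  # hypothetical similarity threshold
```

Here the part-of-speech sequence and the tree structure match exactly while only three of the five distinct words overlap, so the combined score stays above the threshold despite the different noun.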
Optionally, the first processing module 503 includes:
a ninth processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;
a fifth tagging unit, configured to perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
a tenth calculation unit, configured to calculate, based on the word set and the part-of-speech tagging set, the similarity value between the grammar contained in the sentence set and the target grammar;
a sixth obtaining unit, configured to obtain the basic knowledge point label associated with the target grammar when the similarity value is greater than the similarity threshold;
a seventh adding unit, configured to add the basic knowledge point label associated with the target grammar to the basic knowledge point label set.
It should be noted that when the apparatus 5 provided in the foregoing embodiment executes the data resource labeling method, the division into the functional modules described above is used only as an example. In practical applications, the functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus provided in the foregoing embodiment and the data resource labeling method embodiments belong to the same concept; for details of the implementation process, see the method embodiments, which are not repeated here.
As described above, the embodiments take the online education industry as the main example, but those skilled in the art will understand that the method is not limited to online education. For example, the method described in this application is also applicable to user label processing in industries such as retail, transportation, social networking, search, education, and medical care.
FIG. 6 is a schematic structural diagram of an apparatus for labeling data resources provided by an embodiment of the present application, hereinafter referred to as apparatus 6. Apparatus 6 may be integrated into the aforementioned server or terminal device. As shown in FIG. 6, the apparatus includes a memory 602, a processor 601, an input device 603, an output device 604, and a communication interface.
The memory 602 may be an independent physical unit and may be connected to the processor 601, the input device 603, and the output device 604 through a bus. The memory 602, the processor 601, the input device 603, and the output device 604 may also be integrated together, implemented in hardware, and so on.
The memory 602 is configured to store a program that implements the above method embodiments or the modules of the apparatus embodiments; the processor 601 invokes the program to perform the operations of the above method embodiments.
The input device 603 includes, but is not limited to, a keyboard, a mouse, a touch panel, a camera, and a microphone; the output device 604 includes, but is not limited to, a display screen.
The communication interface is used to send and receive various types of messages, and includes, but is not limited to, a wireless interface or a wired interface.
Optionally, when part or all of the data resource labeling method in the foregoing embodiments is implemented by software, the apparatus may include only a processor. In that case the memory for storing the program is located outside the apparatus, and the processor is connected to the memory through a circuit or wire to read and execute the program stored in the memory.
The processor may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
The processor may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory may include volatile memory, such as random-access memory (RAM); the memory may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory may also include a combination of the above types of memory.
The processor 601 invokes the program code in the memory 602 to perform the following steps:
preprocessing an original data resource to obtain text data;
calculating the similarity between the text data and each of a plurality of target knowledge points to obtain similarity values, where each of the plurality of target knowledge points is associated with one basic knowledge point label;
generating a basic knowledge point label set for the original data resource according to the result of comparing the similarity values with a similarity threshold, where the basic knowledge point labels included in the set are those associated with the target knowledge points whose similarity value is greater than the similarity threshold;
generating a comprehensive knowledge point label set for the original data resource according to feature information of the original data resource and the basic knowledge point label set.
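The four steps above can be sketched end to end as follows. This is an illustrative outline under several assumptions, not the claimed implementation: the original data resource is taken to be plain text already, similarity is a character-level `difflib.SequenceMatcher` ratio against the best-matching word, and the comprehensive label set is derived through a hypothetical rule table (`rules`) that combines required basic labels with one feature of the resource. The text does not specify how the feature information and the basic labels are actually combined.

```python
from difflib import SequenceMatcher


def preprocess(raw_resource):
    """Step 1 (assumed): normalise case and whitespace of a text resource."""
    return " ".join(raw_resource.lower().split())


def similarity(text, knowledge_point):
    """Step 2 (assumed measure): best character-level ratio over the words of the text."""
    return max(
        (SequenceMatcher(None, word, knowledge_point).ratio() for word in text.split()),
        default=0.0,
    )


def basic_label_set(text, target_knowledge_points, threshold):
    """Step 3: keep labels of target knowledge points whose similarity exceeds the threshold."""
    return {
        label
        for point, label in target_knowledge_points.items()
        if similarity(text, point) > threshold
    }


def comprehensive_label_set(basic_labels, features, rules):
    """Step 4 (hypothetical rule table): a rule fires when its required basic labels
    are all present and the named feature has the expected value."""
    return {
        rule["label"]
        for rule in rules
        if rule["requires"] <= basic_labels
        and features.get(rule["feature"]) == rule["value"]
    }


# Hypothetical inputs.
targets = {"plus": "math:addition", "minus": "math:subtraction", "count": "verb:count"}
rules = [{
    "requires": {"math:addition", "verb:count"},
    "feature": "grade",
    "value": "1",
    "label": "grade 1: counting and addition",
}]

text = preprocess("  Count the apples:\n three PLUS two ")
basic = basic_label_set(text, targets, threshold=0.9)
comprehensive = comprehensive_label_set(basic, {"grade": "1"}, rules)
```

Keeping the basic and comprehensive stages separate mirrors the text: the basic label set depends only on similarity to target knowledge points, while the comprehensive set additionally draws on feature information of the resource.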
In one or more embodiments, the processor 601 is further configured to:
query a preset knowledge graph for the attribute information corresponding to the original data resource, and obtain the plurality of target knowledge points corresponding to the attribute information.
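A minimal sketch of this lookup, assuming the preset knowledge graph is stored as (subject, predicate, object) triples. The predicate names `has_attribute` and `covers`, the identifiers, and the function names are hypothetical and chosen only for the example.

```python
# Hypothetical knowledge graph as (subject, predicate, object) triples.
TRIPLES = [
    ("lesson-42", "has_attribute", "grade-1-math"),
    ("grade-1-math", "covers", "addition"),
    ("grade-1-math", "covers", "counting"),
    ("grade-2-english", "covers", "past tense"),
]


def objects(triples, subject, predicate):
    """Return all objects linked to `subject` by `predicate`, in stored order."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def target_knowledge_points(triples, resource_id):
    """Resource -> attribute information -> target knowledge points."""
    points = []
    for attribute in objects(triples, resource_id, "has_attribute"):
        for point in objects(triples, attribute, "covers"):
            if point not in points:  # keep order, drop duplicates
                points.append(point)
    return points


points = target_knowledge_points(TRIPLES, "lesson-42")
```

Restricting the candidates this way means the later similarity calculations run only against knowledge points relevant to the attributes of the resource.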
In one or more embodiments, the processor 601 is further configured to:
perform sentence segmentation on the text data to obtain a sentence set, perform chunking on the sentence set to obtain a word block set, and perform word segmentation on the sentence set to obtain a word set;
analyze the word block set and the word set to obtain a reference vocabulary set, and calculate the similarity value between each reference vocabulary item in the reference vocabulary set and its corresponding target knowledge point, where the reference vocabulary set includes reference content vocabulary, reference high-frequency vocabulary, reference verb vocabulary, and reference mathematical vocabulary; the target knowledge point corresponding to the reference content vocabulary is the target content vocabulary, the target knowledge point corresponding to the reference high-frequency vocabulary is the target high-frequency vocabulary, the target knowledge point corresponding to the reference verb vocabulary is the target verb vocabulary, and the target knowledge point corresponding to the reference mathematical vocabulary is the target mathematical vocabulary;
when the similarity value for a corresponding target knowledge point is greater than its corresponding similarity threshold, add the basic knowledge point label corresponding to that target knowledge point to the basic knowledge point label set; or
perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;
analyze the word set to obtain a phonetic symbol set, a part-of-speech tagging set, and a dependency syntax tree, and calculate the similarity values between the word phonetic symbols in the phonetic symbol set, the sentence patterns in the sentence set, and the grammar in the sentence set and their respective target knowledge points, where the target knowledge point corresponding to the word phonetic symbols is the target phonetic symbol, the target knowledge point corresponding to the sentence patterns is the target sentence pattern, and the target knowledge point corresponding to the grammar is the target grammar;
when the similarity value for a corresponding target knowledge point is greater than its corresponding similarity threshold, add the basic knowledge point label corresponding to that target knowledge point to the basic knowledge point label set.
In one or more embodiments, the processor 601 is further configured to:
perform sentence segmentation on the text data to obtain a sentence set, and perform chunking on the sentence set to obtain a word block set;
calculate an importance weight for each word block in the word block set based on the TF-IDF keyword extraction algorithm;
use the word blocks whose importance weight is greater than a first preset weight as the reference content vocabulary;
calculate the similarity value between the reference content vocabulary and the target content vocabulary;
when the similarity value is greater than the similarity threshold, obtain the basic knowledge point label associated with the target content vocabulary;
add the basic knowledge point label associated with the target content vocabulary to the basic knowledge point label set.
In one or more embodiments, the processor 601 is further configured to:
perform sentence segmentation on the text data to obtain a sentence set, and perform chunking on the sentence set to obtain a word block set;
calculate an importance weight for each word block in the word block set based on the TF-IDF keyword extraction algorithm;
use the word blocks whose importance weight is less than or equal to a second preset weight as the reference high-frequency vocabulary;
calculate the similarity value between the reference high-frequency vocabulary and the target high-frequency vocabulary;
when the similarity value is greater than the similarity threshold, obtain the basic knowledge point label associated with the target high-frequency vocabulary;
add the basic knowledge point label associated with the target high-frequency vocabulary to the basic knowledge point label set.
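The content-vocabulary and high-frequency-vocabulary embodiments apply two different cut-offs to the same TF-IDF weights. The sketch below shows the split on precomputed weights; the weight values, the two preset weights, and the function name `partition_by_weight` are hypothetical.

```python
def partition_by_weight(weights, first_preset_weight, second_preset_weight):
    """Split word blocks by importance weight.

    weight >  first_preset_weight  -> reference content vocabulary
    weight <= second_preset_weight -> reference high-frequency vocabulary
    """
    content = sorted(b for b, w in weights.items() if w > first_preset_weight)
    high_frequency = sorted(b for b, w in weights.items() if w <= second_preset_weight)
    return content, high_frequency


# Hypothetical precomputed TF-IDF weights.
weights = {"photosynthesis": 0.42, "plant": 0.20, "of": 0.02, "the": 0.01}
content, high_frequency = partition_by_weight(
    weights, first_preset_weight=0.3, second_preset_weight=0.05
)
```

In this example "plant" lands in neither group because its weight lies between the two preset weights. If the second preset weight were set at or above the first, a word block could fall into both groups, so the relative order of the two values is a design choice.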
In one or more embodiments, the processor 601 is further configured to:
perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;
perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
use the words whose part of speech is verb as the reference verb vocabulary;
calculate the similarity value between the reference verb vocabulary and the target verb vocabulary;
when the similarity value is greater than the similarity threshold, obtain the basic knowledge point label associated with the target verb vocabulary;
add the basic knowledge point label associated with the target verb vocabulary to the basic knowledge point label set.
In one or more embodiments, the processor 601 is further configured to:
perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;
perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
use the words whose part of speech is numeral as the reference mathematical vocabulary;
calculate the similarity value between the reference mathematical vocabulary and the target mathematical vocabulary;
when the similarity value is greater than the similarity threshold, obtain the basic knowledge point label associated with the target mathematical vocabulary;
add the basic knowledge point label associated with the target mathematical vocabulary to the basic knowledge point label set.
In one or more embodiments, the processor 601 is further configured to:
perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;
analyze each word in the word set and annotate it with phonetic symbols to obtain a phonetic symbol set;
calculate the similarity value between the phonetic symbols of the words in the phonetic symbol set and the target phonetic symbol;
when the similarity value is greater than the similarity threshold, obtain the basic knowledge point label associated with the target phonetic symbol;
add the basic knowledge point label associated with the target phonetic symbol to the basic knowledge point label set.
In one or more embodiments, the processor 601 is further configured to:
perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;
perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
perform dependency syntax analysis on the sentences in the sentence set to obtain a dependency syntax tree;
calculate similarity values by comparing the words in the word set, the parts of speech in the part-of-speech tagging set, and the dependency syntax tree with the words, parts of speech, and syntax tree of the target sentence pattern, respectively;
when the similarity value is greater than the similarity threshold, obtain the basic knowledge point label associated with the target sentence pattern;
add the basic knowledge point label associated with the target sentence pattern to the basic knowledge point label set.
In one or more embodiments, the processor 601 is further configured to:
perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;
perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
calculate, based on the word set and the part-of-speech tagging set, the similarity value between the grammar contained in the sentence set and the target grammar;
when the similarity value is greater than the similarity threshold, obtain the basic knowledge point label associated with the target grammar;
add the basic knowledge point label associated with the target grammar to the basic knowledge point label set.
It should be noted that when the apparatus 6 provided in the foregoing embodiment executes the data resource labeling method, the division into the functional modules described above is used only as an example. In practical applications, the functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus provided in the foregoing embodiment and the data resource labeling method embodiments belong to the same concept; for details of the implementation process, see the method embodiments, which are not repeated here.
As described above, the embodiments take the online education industry as the main example, but those skilled in the art will understand that the method is not limited to online education. For example, the method described in this application is also applicable to user label processing in industries such as retail, transportation, social networking, search, education, and medical care.
The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
An embodiment of the present application further provides a computer storage medium. The computer storage medium may store a plurality of instructions suitable for being loaded by a processor to execute the method steps of the embodiments shown in FIG. 2 to FIG. 3. For the specific execution process, refer to the description of the embodiments shown in FIG. 2 to FIG. 3, which is not repeated here.
Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Claims (14)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010580828.4A CN111930792B (en) | 2020-06-23 | 2020-06-23 | Labeling method and device for data resources, storage medium and electronic equipment |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010580828.4A CN111930792B (en) | 2020-06-23 | 2020-06-23 | Labeling method and device for data resources, storage medium and electronic equipment |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111930792A true CN111930792A (en) | 2020-11-13 |
| CN111930792B CN111930792B (en) | 2024-04-12 |
Family
ID=73316724
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010580828.4A Active CN111930792B (en) | 2020-06-23 | 2020-06-23 | Labeling method and device for data resources, storage medium and electronic equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111930792B (en) |
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112836013A (en) * | 2021-01-29 | 2021-05-25 | 北京大米科技有限公司 | Data labeling method and device, readable storage medium and electronic equipment |
| CN113536777A (en) * | 2021-07-30 | 2021-10-22 | 深圳豹耳科技有限公司 | Extraction method, device and equipment of news keywords and storage medium |
| CN113569007A (en) * | 2021-06-18 | 2021-10-29 | 武汉理工数字传播工程有限公司 | Method, device and storage medium for processing knowledge service resources |
| CN114036907A (en) * | 2021-11-18 | 2022-02-11 | 国网江苏省电力有限公司电力科学研究院 | Text data amplification method based on domain features |
| CN114443903A (en) * | 2021-12-28 | 2022-05-06 | 新瑞鹏宠物医疗集团有限公司 | Method for extracting video label and related product |
| CN114492419A (en) * | 2022-04-01 | 2022-05-13 | 杭州费尔斯通科技有限公司 | Text labeling method, system and device based on newly added key words in labeling |
| CN114676774A (en) * | 2022-03-25 | 2022-06-28 | 北京百度网讯科技有限公司 | Data processing method, apparatus, equipment and storage medium |
| CN116029284A (en) * | 2023-03-27 | 2023-04-28 | 上海蜜度信息技术有限公司 | Chinese substring extraction method, system, storage medium and electronic equipment |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104090955A (en) * | 2014-07-07 | 2014-10-08 | 科大讯飞股份有限公司 | Automatic audio/video label labeling method and system |
| CN105956144A (en) * | 2016-05-13 | 2016-09-21 | 安徽教育网络出版有限公司 | Method for quantitatively calculating association degree among multi-tab learning resources |
| US20170103074A1 (en) * | 2015-10-09 | 2017-04-13 | Fujitsu Limited | Generating descriptive topic labels |
| CN110162591A (en) * | 2019-05-22 | 2019-08-23 | 南京邮电大学 | A kind of entity alignment schemes and system towards digital education resource |
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104090955A (en) * | 2014-07-07 | 2014-10-08 | 科大讯飞股份有限公司 | Automatic audio/video label labeling method and system |
| US20170103074A1 (en) * | 2015-10-09 | 2017-04-13 | Fujitsu Limited | Generating descriptive topic labels |
| CN105956144A (en) * | 2016-05-13 | 2016-09-21 | 安徽教育网络出版有限公司 | Method for quantitatively calculating association degree among multi-tab learning resources |
| CN110162591A (en) * | 2019-05-22 | 2019-08-23 | 南京邮电大学 | A kind of entity alignment schemes and system towards digital education resource |
Non-Patent Citations (1)
| Title |
|---|
| 郭崇慧; 吕征达: "一种基于集成学习的试题多知识点标注方法" [A multi-knowledge-point labeling method for test questions based on ensemble learning], 运筹与管理, no. 02, 25 February 2020 (2020-02-25) * |
Cited By (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112836013A (en) * | 2021-01-29 | 2021-05-25 | 北京大米科技有限公司 | Data labeling method and device, readable storage medium and electronic equipment |
| CN112836013B (en) * | 2021-01-29 | 2024-08-02 | 北京大米科技有限公司 | Data labeling method and device, readable storage medium and electronic equipment |
| CN113569007A (en) * | 2021-06-18 | 2021-10-29 | 武汉理工数字传播工程有限公司 | Method, device and storage medium for processing knowledge service resources |
| CN113569007B (en) * | 2021-06-18 | 2024-06-21 | 武汉理工数字传播工程有限公司 | Method, device and storage medium for processing knowledge service resources |
| CN113536777A (en) * | 2021-07-30 | 2021-10-22 | 深圳豹耳科技有限公司 | Extraction method, device and equipment of news keywords and storage medium |
| CN114036907A (en) * | 2021-11-18 | 2022-02-11 | 国网江苏省电力有限公司电力科学研究院 | Text data amplification method based on domain features |
| CN114443903A (en) * | 2021-12-28 | 2022-05-06 | 新瑞鹏宠物医疗集团有限公司 | Method for extracting video label and related product |
| CN114676774A (en) * | 2022-03-25 | 2022-06-28 | 北京百度网讯科技有限公司 | Data processing method, apparatus, equipment and storage medium |
| CN114492419A (en) * | 2022-04-01 | 2022-05-13 | 杭州费尔斯通科技有限公司 | Text labeling method, system and device based on newly added key words in labeling |
| CN114492419B (en) * | 2022-04-01 | 2022-08-23 | 杭州费尔斯通科技有限公司 | Text labeling method, system and device based on newly added key words in labeling |
| CN116029284A (en) * | 2023-03-27 | 2023-04-28 | 上海蜜度信息技术有限公司 | Chinese substring extraction method, system, storage medium and electronic equipment |
| WO2024198343A1 (en) * | 2023-03-27 | 2024-10-03 | 上海蜜度科技股份有限公司 | Chinese substring extraction method and system, and storage medium and electronic device |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111930792B (en) | 2024-04-12 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111930792B (en) | | Labeling method and device for data resources, storage medium and electronic equipment |
| CN112560912B (en) | | Classification model training methods, devices, electronic equipment and storage media |
| CN107066449B (en) | | Information pushing method and device |
| CN109657054B (en) | | Abstract generation method, device, server and storage medium |
| US9792278B2 (en) | | Method for identifying verifiable statements in text |
| CN112395385B (en) | | Text generation method, device, computer equipment and medium based on artificial intelligence |
| WO2021121198A1 (en) | | Semantic similarity-based entity relation extraction method and apparatus, device and medium |
| CN114547303B (en) | | Text multi-feature classification method and device based on Bert-LSTM |
| CN112686022A (en) | | Method and device for detecting illegal corpus, computer equipment and storage medium |
| CN108170749A (en) | | Dialogue method, device and computer-readable medium based on artificial intelligence |
| CN112926308B (en) | | Methods, devices, equipment, storage media and program products for matching text |
| WO2021218027A1 (en) | | Method and apparatus for extracting terminology in intelligent interview, device, and medium |
| CN115359799B (en) | | Speech recognition method, training method, device, electronic device and storage medium |
| US20210004602A1 (en) | | Method and apparatus for determining (raw) video materials for news |
| CN108121699A (en) | | For the method and apparatus of output information |
| CN111813993A (en) | | Video content expanding method and device, terminal equipment and storage medium |
| CN114817478A (en) | | Text-based question and answer method and device, computer equipment and storage medium |
| CN114491034A (en) | | Text classification method and intelligent device |
| CN116402166B (en) | | Training method and device of prediction model, electronic equipment and storage medium |
| CN115712705A (en) | | Information matching method and device |
| CN111555960A (en) | | Method for generating information |
| CN116361638A (en) | | Question answer search method, device and storage medium |
| CN113722496B (en) | | Triple extraction method and device, readable storage medium and electronic equipment |
| CN113360602B (en) | | Method, apparatus, device and storage medium for outputting information |
| CN114880520B (en) | | Video title generation method, device, electronic device and medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | TR01 | Transfer of patent right | Effective date of registration: 20250711; Address after: No. 902, 9th Floor, Unit 2, Building 1, No. 333 Jiqing 3rd Road, Chengdu High-tech Zone, Chengdu Free Trade Zone, Sichuan Province 610000; Patentee after: Chengdu Yudi Technology Co.,Ltd.; Country or region after: China; Address before: Building 6, Huitong Times Square, 1 Yaojiayuan South Road, Chaoyang District, Beijing 100025; Patentee before: BEIJING DA MI TECHNOLOGY Co.,Ltd.; Country or region before: China |