CN102486767B - Method and device for labeling content - Google Patents
- Publication number
- CN102486767B (CN201010578057.1A)
- Authority
- CN
- China
- Prior art keywords
- rule
- content
- data
- matching
- metadata
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a content labeling method comprising: obtaining a content fragment of a content document; creating a rule template that contains a linearly ordered set of rules [Rf, Rt] running from Rf to Rt; and matching the rules [Rf, Rt] against the content data of [Sf, St] to identify matching data items and labeling each matched data item with the metadata tag of the rule it matched, thereby obtaining a mapping list M, which is the structured content data. Here Sf is the start of the content fragment, St is its end, Rf is the first rule of the rule template and Rt is the last. The invention also provides a content labeling device. The invention improves the efficiency of content labeling.
Description
Technical Field
The present invention relates to the field of digital typesetting, and in particular to a content labeling method and device.
Background
Computer software applications help users create all kinds of content documents. In recent years, structured data formats, including markup languages (such as XML) and markup standards required by other standards committees, have been used to label these content documents or content fragments and so describe the application structure of the content. Further managing, processing and reusing content on the basis of that application structure has become an urgent need for many users.
Content documents in some business domains contain large numbers of regular content fragments, for example collections of papers, collections of test questions, and character (word) dictionaries. Fig. 1 shows one entry of a dictionary. A dictionary contains many similar entries, and their regularity lies in the fact that every entry includes a headword, a phonetic transcription, definitions and so on.
To convert the dictionary of Fig. 1 into structured data, the headword, phonetic transcription, definitions and so on of every entry must be labeled as metadata; that is, metadata information must be attached to the regular content fragments of the electronic book. The prior art labels content manually, so the operation is very tedious.
Summary of the Invention
The present invention aims to provide a content labeling method and device for content documents, in order to solve the problem that manual content labeling is tedious.
An embodiment of the invention provides a content labeling method comprising: obtaining a content fragment of a content document; creating a rule template that contains a linearly ordered set of rules [Rf, Rt] running from Rf to Rt; and matching the rules [Rf, Rt] against the content data of [Sf, St] to identify matching data items and labeling each matched data item with the metadata tag of the rule it matched, thereby obtaining a mapping list M, which is the structured content data. Here Sf is the start of the content fragment, St is its end, Rf is the first rule of the rule template and Rt is the last. The rules comprise condition matching rules, repeat matching rules and template reference rules, and every rule has the following attributes: a metadata tag, a minimum number of occurrences and a maximum number of occurrences.
An embodiment of the invention also provides a content labeling device for content documents, comprising: an acquisition module for obtaining a content fragment of a content document; a creation module for creating a rule template that contains a linearly ordered set of rules [Rf, Rt] running from Rf to Rt; and a matching module for matching the rules [Rf, Rt] against the content data of [Sf, St] to identify matching data items and labeling each matched data item with the metadata tag of the rule it matched, thereby obtaining a mapping list M, which is the structured content data. Here Sf is the start of the content fragment, St is its end, Rf is the first rule of the rule template and Rt is the last. The rules comprise condition matching rules, repeat matching rules and template reference rules, and every rule has the following attributes: a metadata tag, a minimum number of occurrences and a maximum number of occurrences.
Because the content labeling method and device of the embodiments match content fragments automatically by means of rules, they overcome the tediousness of manual content labeling and improve labeling efficiency.
Brief Description of the Drawings
The drawings described here provide a further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 shows one entry of a dictionary;
Fig. 2 shows a flowchart of a content labeling method according to one embodiment of the invention;
Fig. 3 shows a flowchart of matching the rules [Rf, Rt] against the content data of [Sf, St] according to a preferred embodiment of the invention;
Fig. 4 shows a schematic diagram of an entry rule template according to a preferred embodiment of the invention;
Fig. 5 shows a schematic diagram of the entry metadata obtained by matching the entry of Fig. 1 against the entry rule template of Fig. 4;
Fig. 6 shows another entry of the dictionary;
Fig. 7 shows a schematic diagram of the entry metadata obtained by matching the entry of Fig. 6 against the entry rule template of Fig. 4;
Fig. 8 shows a schematic diagram of an entry after content labeling according to a preferred embodiment of the invention;
Fig. 9 shows a schematic diagram of an entry rule template according to a preferred embodiment of the invention;
Fig. 10 shows a schematic diagram of the entry metadata obtained by matching the entry of Fig. 8 against the entry rule template of Fig. 9;
Fig. 11 shows a schematic diagram of a content labeling device according to one embodiment of the invention.
Detailed Description
The invention is described in detail below with reference to the drawings and in combination with the embodiments.
Fig. 2 shows a flowchart of a content labeling method according to one embodiment of the invention, comprising:
Step S10: obtain a content fragment;
Step S20: match the rules [Rf, Rt] against the content data of [Sf, St], and label each matched data item with the metadata tag of the rule it matched, to obtain a mapping list M, where Sf is the start of the content fragment, St is the end of the content fragment, Rf is the first rule of the rule template, Rt is the last rule of the rule template, and the rule template contains a linearly ordered set of rules running from Rf to Rt.
The prior art labels content manually, so the operation is very tedious. In this embodiment, rules are built in advance and used to match the content fragment, so that the individual data items are found automatically and the metadata tags defined in the rules are assigned to them automatically. Once the rules exist, all of these operations can be carried out by a computer, which improves the efficiency of content labeling.
In addition, the rules [Rf, Rt] of this embodiment form a linearly ordered set. A rule template of this kind has a simple structure, so users can easily create one for content documents of any business type, and a computer can apply the linearly ordered rules one after another with an algorithm that is simple to implement and reasonably efficient.
Fig. 3 shows a flowchart of matching the rules [Rf, Rt] against the content data of [Sf, St] according to a preferred embodiment of the invention, comprising:
1. Set the current rule Rc to Rf;
2. Execute the Rc match with Sf as the start point, to obtain the data items matched by Rc, a success flag and an end position Sr; label the data items matched by Rc with the metadata tag of Rc, giving a mapping list Mr;
3. Determine whether the success flag is valid;
4. If it is, add Mr to M; otherwise end the processing;
5. Determine whether Rc is Rt; if it is, end the processing;
6. Otherwise determine whether Sr is St; if it is, end the processing;
7. Otherwise set Sf to Sr, set Rf to the rule that follows Rc, and return to step 1.
Because [Rf, Rt] is a linearly ordered set of rules, this preferred embodiment can use a simple loop that automatically applies all the rules of [Rf, Rt], in order, to the content data of the content fragment [Sf, St]. The procedure is simple and easy to implement on a computer.
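The loop of Fig. 3 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the `Rule` class, the `take_while` helper and the two sample rules are hypothetical, and a rule is assumed to be a metadata tag plus a function that returns the end offset of its match, or `None` when the success flag would be invalid.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rule:
    """Hypothetical rule: a metadata tag plus a matcher (text, start) -> end or None."""
    tag: str
    match: Callable[[str, int], Optional[int]]


def match_rules(text: str, rules: List[Rule]) -> List[Tuple[str, str]]:
    """Apply the rules in order over the text and return the mapping list M."""
    mapping: List[Tuple[str, str]] = []      # M
    s_f, s_t = 0, len(text)                  # Sf and St
    for rule in rules:                       # Rc runs from Rf to Rt (steps 1 and 5)
        s_r = rule.match(text, s_f)          # step 2: execute Rc starting at Sf
        if s_r is None:                      # steps 3-4: success flag invalid, stop
            break
        mapping.append((rule.tag, text[s_f:s_r]))
        if s_r >= s_t:                       # step 6: Sr reached St, stop
            break
        s_f = s_r                            # step 7: next rule starts at Sr
    return mapping


def take_while(pred: Callable[[str], bool]) -> Callable[[str, int], Optional[int]]:
    """Build a matcher that consumes the run of characters satisfying pred."""
    def matcher(text: str, start: int) -> Optional[int]:
        end = start
        while end < len(text) and pred(text[end]):
            end += 1
        return end if end > start else None
    return matcher


# Illustrative rule template: a run of letters followed by a run of digits.
rules = [Rule("word", take_while(str.isalpha)), Rule("number", take_while(str.isdigit))]
result = match_rules("abc123", rules)
```

With these sample rules, `result` pairs "word" with "abc" and "number" with "123"; matching stops at the first rule that fails, as in step 4.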
Preferably, Rc includes a data matching condition, and executing the Rc match comprises: using the data matching condition to match the individual data items in the content data of [Sf, St], and setting the success flag accordingly. This preferred embodiment provides a condition matching rule, which identifies the data items of a content fragment by testing a condition.
Preferably, Rc also includes an end position flag. When the flag is invalid, the data matching condition is an interval condition; when it is valid, the data matching condition is a position condition. An interval condition specifies a format rule for data over a continuous interval: the corresponding data item is the continuous run of data that starts at the end position of the previous data item and satisfies the format rule. A position condition specifies a format rule for the data at the end position: the corresponding data item is the data between the end position of the previous data item (the start point) and the position that satisfies the format rule (the end point). A format rule describes a regular feature exhibited by the data. This preferred embodiment thus gives condition matching rules both interval conditions and position conditions: when the user can characterize the business content over a continuous interval, an interval condition is used for matching; when the user can characterize it at a particular position, a position condition is used. The embodiment can therefore meet the labeling needs of many different types of business content.
Preferably, the format rules include at least one of the following: content format rules, presentation format rules, markup format rules and the Any rule. A content format rule describes a regular feature of the data in the document content; a presentation format rule describes a regular feature of the data in the page layout; a markup format rule describes a regular feature of the data in the application logic; the Any rule states that any data satisfies the matching condition. Building on the preferred embodiment above, this embodiment offers several kinds of format rule and so better meets the labeling needs of different types of business content.
Preferably, Rc includes a repeat rule count, which indicates that that number of rules in [Rf, Rt] are to be applied repeatedly. This preferred embodiment provides a repeat matching rule. For a dictionary entry, for example, an entry normally contains only one headword, so identifying the headword clearly needs no repeat matching rule. An entry may, however, contain several senses, so the repeat matching rule of this embodiment is the better fit for identifying them.
Preferably, a rule includes: a minimum number of occurrences, with value N, indicating that data items are matched at least N times, where N is a non-negative integer; and a maximum number of occurrences, with value P, indicating that data items are matched at most P times, where P is a positive integer. When N is 0, P > N; when N is a positive integer, P ≥ N.
Preferably, the content labeling method further comprises: traversing every mapping in M and recording each metadata tag together with its data item to build metadata items; assembling the metadata items into a metadata item table; and attaching the metadata item table to the content fragment.
Preferably, the content labeling method further comprises: traversing every mapping in M and recording each metadata tag together with its data item to build metadata items; and attaching the metadata items to the content fragment according to the continuous interval or end position of the matched data items.
The two preferred embodiments above give two simple, practical ways of saving the mapping list that has been built.
Preferably, the metadata tags conform to XML, and an empty metadata tag indicates that the data items labeled with it are to be ignored when metadata items are attached. XML is currently a widely used language in the industry, so specifying metadata tags in XML makes the method more general. In addition, empty tags make it possible to deal with data in a content fragment that cannot be identified, which improves compatibility with content documents.
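As a rough sketch of how a mapping list M might be turned into an XML metadata item table, the Python below uses the standard `xml.etree.ElementTree` module and skips data items whose tag is empty. The function name, the English element names and the sample mapping list are assumptions made for illustration; the patent does not prescribe this representation.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def build_metadata(mapping: List[Tuple[str, str]]) -> ET.Element:
    """Build a <metadata> element from (tag, data item) pairs, ignoring empty tags."""
    root = ET.Element("metadata")
    for tag, value in mapping:
        if not tag:                      # empty tag: the data item is ignored
            continue
        ET.SubElement(root, tag).text = value
    return root


# Hypothetical mapping list: the "(" after the headword carries an empty tag.
m = [("headword", "开"), ("", "("), ("pinyin", "kāi")]
xml_text = ET.tostring(build_metadata(m), encoding="unicode")
```

Here `xml_text` contains only the `headword` and `pinyin` elements; the parenthesis is matched, so that later rules start at the right position, but it leaves no trace in the metadata.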
Preferably, the content labeling method further comprises: analyzing the regularities exhibited by the content fragments of the content document; and creating a rule template according to those regularities, the rule template including the rules [Rf, Rt]. With a rule template created in advance, several electronic documents of similar form can share one rule template, so the rules [Rf, Rt] need not be rebuilt each time, which improves the reusability of the labeling work.
Preferably, Rc includes a referenced template name, which indicates that the rule template with that name is to be referenced. The template reference rule of this preferred embodiment reduces development effort.
A rule template contains a linearly ordered set of rules and can be given a name, so that 1) the rule template can be stored and applied to other similar content fragments, much like a style; and 2) other rule templates can refer to an already defined rule template by that name.
Fig. 4 shows a schematic diagram of an entry rule template according to a preferred embodiment of the invention; the rule template "字条" (entry) contains 6 linearly ordered matching rules. Fig. 5 shows a schematic diagram of the entry metadata obtained by matching the entry of Fig. 1 against the entry rule template of Fig. 4.
In this preferred embodiment, rules fall into three classes: condition matching rules, repeat matching rules and template reference rules. Every rule, whatever its class, has the following attributes: a metadata tag, a minimum number of occurrences and a maximum number of occurrences.
Any rule can specify a minimum and a maximum number of occurrences; a match is considered successful when minimum occurrences <= number of matched items <= maximum occurrences.
For example, in some exercise collections the answer to a multiple-choice question appears as formatted text like this:
Answer: AC
In this case a continuous condition rule ({capital letter, answer option, 1..*}) can identify each selected answer ("answer option" = "A", "answer option" = "C").
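The answer-option example can be sketched as a continuous condition with occurrence bounds. In this illustrative Python, the function name, its parameters and the use of a regular-expression character class to express "capital letter" are all assumptions; `max_occurs=None` stands in for the unbounded `*` of `1..*`.

```python
import re
from typing import List, Optional, Tuple


def match_continuous(text: str, start: int, char_class: str,
                     min_occurs: int, max_occurs: Optional[int] = None
                     ) -> Tuple[bool, List[str], int]:
    """Collect one item per consecutive character matching char_class.

    Returns (success flag, matched items, end position). The match succeeds
    when min_occurs <= number of items <= max_occurs.
    """
    items: List[str] = []
    pos = start
    while pos < len(text) and re.fullmatch(char_class, text[pos]):
        if max_occurs is not None and len(items) == max_occurs:
            break                        # never collect more than max_occurs items
        items.append(text[pos])
        pos += 1
    return len(items) >= min_occurs, items, pos


# Rule {capital letter, answer option, 1..*} applied to the answer text "AC".
ok, options, end = match_continuous("AC", 0, "[A-Z]", min_occurs=1)
```

Each matched letter would then be paired with the "answer option" tag, giving one mapping for "A" and one for "C".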
Condition rules can be further divided into two kinds: continuous condition rules and end condition rules.
Continuous condition rules have the following attributes:
End condition rules have the following attributes:
Note that the "include end position" flag is not what distinguishes a continuous condition rule from an end condition rule; it indicates whether the range of the matched item includes the data at the end position.
Take the stroke-count rule of the example:
When "include end position" is TRUE ({text: "画,", TRUE, stroke count, 1}), the matched item is "4画," and the end position is after ",", so "画," is included;
When "include end position" is FALSE ({text: "画,", FALSE, stroke count, 1}), the matched item is "4" and the end position is after "4", so "画," is not included.
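The effect of the "include end position" flag can be illustrated with a small Python function. This is a sketch under stated assumptions: `match_until` is a hypothetical name, the end condition is taken to be a literal terminator string, and the sample text "4画,一部" is invented for the demonstration.

```python
from typing import Optional, Tuple


def match_until(text: str, start: int, terminator: str,
                include_end: bool) -> Optional[Tuple[str, int]]:
    """End condition: match from start up to the terminator.

    Returns (matched item, end position), or None if the terminator is absent.
    include_end decides whether the terminator belongs to the matched item.
    """
    pos = text.find(terminator, start)
    if pos == -1:
        return None
    end = pos + len(terminator) if include_end else pos
    return text[start:end], end


sample = "4画,一部"                                   # invented sample text
with_end = match_until(sample, 0, "画,", True)       # terminator included
without_end = match_until(sample, 0, "画,", False)   # terminator excluded
```

With the flag set, the item is "4画," and the next rule starts after the comma; without it, the item is "4" and the next rule starts at "画".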
Repeat rules have the following attributes:
Template reference rules have the following attributes:
A template reference rule specifies that the rule interval [Rf-template, Rt-template] identified by the referenced template name is to be applied; it is a form of nested application.
For example, take the rule template "字条" (entry) of the example and change its last rule, the definition rule, into a template reference rule ({template reference: "释义", definition, 1}). When the definition rule is applied, the rule interval of the rule template "释义" (definition), shown in Fig. 9, is located automatically and used for matching.
The resulting metadata looks as follows (元数据 = metadata, 字目 = headword, 繁体字形 = traditional form, 拼音 = pinyin, 笔画 = stroke count, 部首 = radical, 释义 = definition, 义项 = sense):
<元数据>
<字目>开</字目>
<繁体字形>開</繁体字形>
<拼音>kāi</拼音>
<笔画>4</笔画>
<部首>一</部首>
<释义>
<义项>打开:~门|~幕|公~|网~一面。</义项>
<义项>打通;开辟:~路|~矿|~掘|~拓。</义项>
</释义>
</元数据>
The matching process in detail is as follows:
(1) Observe and analyze the data of the content fragments, find their regularities, and create the rule template "字条" (entry) shown in Fig. 4;
(2) Select the content fragment of the entry "开";
(2) The rule template contains an ordered set of matching rules;
(3) Analyze the content fragment and, following the rule template "字条" (entry), identify the matching data items and map them to the associated metadata tags.
(4) From the identified data items and their associated metadata tags, build the metadata shown in Fig. 5. This metadata can be attached as a whole to the content fragment of the entry "开".
Step (3) above can be refined further; identifying matches with the rule template comprises the following steps:
(3.1) Set the start position Sf to the start of the content fragment (the character "开" at the beginning of the paragraph) and the end position St to the end of the content fragment (the full stop at the end of the paragraph);
(3.2) Set the start rule Rf to the first rule of the rule template (rule 1, "headword") and the end rule Rt to the last rule of the rule template (rule 6, "definition").
(3.3) Match the rules of the interval [Rf, Rt] against the content data of the interval [Sf, St] to obtain the mapping list M.
Step (3.3) above can be refined further, comprising the following steps:
(3.3.1) Set the current rule Rc to the start rule Rf (rule 1, "headword");
(3.3.2) With the start position Sf (the character "开" at the beginning of the paragraph) as the start point, execute the match of rule Rc against the content data of the interval [Sf, St]; obtain the success flag of the match (valid), the end position Sr of the match (the "(" after "开") and the matched mapping list Mr ("headword" = "开");
(3.3.3) If the success flag obtained in step (3.3.2) is valid, record the matched mapping list Mr in the mapping list M;
(3.3.4) Determine whether matching needs to continue (here it does); if so, go to step (3.3.5); otherwise end the processing;
(3.3.5) Set the start position Sf to the end position Sr obtained in step (3.3.2) (the "(" after "开"); set the start rule Rf to the rule that follows the current rule Rc (rule 2, "traditional form"); go to step (3.3.1) and match the rules of the interval [Rf, Rt] against the content data of the interval [Sf, St].
Step (3.3.2) above can be refined further; matching a single rule comprises the following steps:
(3.3.2.1) If the current rule Rc is a condition matching rule, go to step (3.3.2.2); if the current rule Rc is a repeat matching rule, go to step (3.3.2.4); otherwise the current rule Rc is a template reference rule, so go to step (3.3.2.5);
(3.3.2.2) Using the condition, the "include end position" flag and the occurrence counts of the current condition rule Rc, obtain the list of matching data items that satisfy the rule, together with the success flag;
(3.3.2.3) For each matching data item in the list obtained in step (3.3.2.2), create a mapping to the metadata tag of the current condition rule Rc and add it to the mapping list Mr; end the processing.
(3.3.2.4) Using the repeat rule count and the occurrence counts of the current repeat rule Rc, repeatedly match the rules of the interval [Rc-n, Rc-1], where n is the repeat count given in Rc, against the content data of the interval [Sf, St], to obtain the mapping list Mr; end the processing.
(3.3.2.5) Using the referenced template name and the occurrence counts of the current template reference rule Rc, match the rules of the interval [Rf-template, Rt-template] identified by the referenced template name against the content data of the interval [Sf, St], to obtain the mapping list Mr; end the processing.
Take rule 1, "headword", as an example. It is a continuous condition rule (that is, it uses an interval condition). All content starting at "开" whose font size is size 3 satisfies the condition, so the matched data item is the "开" at the beginning of the paragraph.
Take rule 4, "stroke count", as an example. It is an end condition rule (that is, it uses a position condition). All content that starts after the pinyin letters and ends with the string "画," satisfies the condition, so the matched data item is "4画,".
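The dispatch over the three rule classes in steps (3.3.2.1) to (3.3.2.5) can be sketched as a small recursive matcher. The Python below is an illustration under several assumptions, not the patented algorithm: the class names, the `TEMPLATES` registry and the sample text are hypothetical; a condition rule is reduced to "read up to a terminator and skip it"; and a repeat rule is read as re-applying the `count` rules immediately preceding it, at most `max_occurs` times.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union

Mapping = List[Tuple[str, str]]


@dataclass
class Condition:            # condition matching rule (simplified end condition)
    tag: str
    terminator: str


@dataclass
class Repeat:               # repeat rule: re-apply the previous `count` rules
    count: int
    max_occurs: int


@dataclass
class TemplateRef:          # template reference rule
    name: str


Rule = Union[Condition, Repeat, TemplateRef]
TEMPLATES: Dict[str, List[Rule]] = {}     # hypothetical registry of named templates


def run(rules: List[Rule], text: str, pos: int) -> Optional[Tuple[Mapping, int]]:
    """Apply rules in order from pos; return (mapping list, end) or None on failure."""
    mapping: Mapping = []
    for i, rule in enumerate(rules):
        if isinstance(rule, Condition):
            end = text.find(rule.terminator, pos)
            if end == -1:
                return None
            mapping.append((rule.tag, text[pos:end]))
            pos = end + len(rule.terminator)
        elif isinstance(rule, Repeat):
            block = rules[i - rule.count:i]       # the rules being repeated
            for _ in range(rule.max_occurs):
                result = run(block, text, pos)
                if result is None:                # no further repetition matches
                    break
                sub, pos = result
                mapping.extend(sub)
        else:                                     # TemplateRef: nested application
            result = run(TEMPLATES[rule.name], text, pos)
            if result is None:
                return None
            sub, pos = result
            mapping.extend(sub)
    return mapping, pos


TEMPLATES["senses"] = [Condition("sense", ";"), Repeat(count=1, max_occurs=10)]
entry = [Condition("headword", ":"), TemplateRef("senses")]
matched = run(entry, "open:unfasten;begin;", 0)
```

On the invented sample text, the entry template yields the headword and then hands over to the "senses" template, whose repeat rule picks up the second sense. Minimum occurrence counts are left out to keep the sketch short.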
上述优选实施例建立的规则模板,还适用于对符合该规律的其他内容片段。例如字条“锎”,其内容片段和元数据信息,如图6和图7所示,图6示出了字典的另一个字条,图7示出了图4的字条规则模板对图6的字条进行规则匹配获得的字条元数据信息的示意图。The rule template established in the above preferred embodiment is also applicable to other content segments conforming to the rule. For example, the word entry "californium", its content fragments and metadata information, as shown in Figure 6 and Figure 7, Figure 6 shows another word entry of the dictionary, Figure 7 shows the effect of the word entry rule template in Figure 4 on the word entry in Figure 6 Schematic diagram of entry metadata information obtained by rule matching.
优选地,本内容标注方法还包括:将所匹配的数据项作为一个内容片段,继续执行在[Sf,St]的内容数据上匹配规则[Rf,Rt]的步骤。该优选实施例提供了一种嵌套机制,可以处理更加复杂的内容结构,从而可以满足各种业务类型的内容文档的标注需求。Preferably, the content tagging method further includes: taking the matched data item as a content segment, and continuing to perform the step of matching the rule [R f , R t ] on the content data of [S f , S t ]. This preferred embodiment provides a nesting mechanism that can handle more complex content structures, thereby meeting the labeling requirements of content documents of various business types.
The metadata information established in step (4) above can also be attached to the data intervals of the content fragment in an embedded manner. As shown in Figure 8, the matched metadata tags are all marked inside the content fragment of the entry "开". Figure 9 is a schematic diagram of an entry rule template according to a preferred embodiment of the present invention, and Figure 10 is a schematic diagram of the entry metadata information obtained by applying the entry rule template of Figure 9 to the entry of Figure 8.
As can be seen, the embedded metadata tags divide the content of the entry "开" in Figure 8 into finer content fragments. These lower-level content fragments can in turn be labeled with the method of the present invention. Taking the content fragment "开的释义" (the definitions of "开") as an example, its rule template and metadata information are shown in Figures 9 and 10.
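One possible way to embed matched metadata tags into a content fragment is sketched below. The text does not specify the embedded format, so the angle-bracket markup, the function name `embed_tags`, and the sample spans are assumptions chosen only for illustration.

```python
from typing import List, Tuple


def embed_tags(text: str, spans: List[Tuple[str, int, int]]) -> str:
    """Wrap each non-overlapping (tag, start, end) span of `text` in <tag>...</tag>."""
    out = []
    pos = 0
    for tag, start, end in sorted(spans, key=lambda s: s[1]):
        out.append(text[pos:start])  # untagged text before this span
        out.append(f"<{tag}>{text[start:end]}</{tag}>")
        pos = end
    out.append(text[pos:])  # untagged remainder
    return "".join(out)


# Made-up entry text and spans.
spans = [("headword", 0, 1), ("pinyin", 1, 4), ("strokes", 4, 7)]
print(embed_tags("开kai4画，", spans))
# <headword>开</headword><pinyin>kai</pinyin><strokes>4画，</strokes>
```

Each tagged span is then itself a smaller content fragment that can be matched against a further rule template. The sketch does not escape markup characters that may occur in the content.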
Users can apply the content labeling method layer by layer, refining progressively, to label content documents or fragments as completely as possible and at the finest granularity, so as to achieve a satisfactory structuring result.
Figure 11 shows a schematic diagram of a content labeling device according to an embodiment of the present invention, including:
An acquisition module 10, configured to acquire a content fragment;
A matching module 20, configured to match the rules [Rf, Rt] on the content data of [Sf, St] and to label each matched data item with the metadata tag of the rule it matched, so as to obtain a mapping relationship list M, where Sf is the start of the content fragment, St is the end of the content fragment, Rf is the first rule of the rule template, Rt is the last rule of the rule template, and the rule template contains a linearly ordered set of rules from Rf to Rt.
The content labeling device improves the efficiency of content labeling.
Preferably, the matching module 20 includes:
A current-setting module, configured to set the current rule Rc to Rf;
A current-matching module, configured to perform the matching of Rc with Sf as the starting point, so as to obtain the data item matched by Rc, a success flag, and an end position Sr, and to label the data item matched by Rc with the metadata tag in Rc, obtaining a mapping relationship list Mr;
An adding module, configured to add Mr to M if the success flag is valid;
A judging module, configured to judge whether the success flag is valid, whether Rc is not Rt, and whether Sr is not St;
A loop module, configured to, if all of the above judgments are yes, set Sf to Sr, set Rf to the rule following Rc, and then continue with the above steps; otherwise, terminate processing.
The content labeling device has a simple structure and is easy to implement on a computer.
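The cooperation of the sub-modules of the matching module 20 can be sketched as a single loop. This is a minimal sketch under stated assumptions, not the implementation of the invention: each rule's condition is stood in for by a regular expression, and the names `Rule` and `match_rules`, the tag names, and the sample text are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rule:
    tag: str      # metadata tag attached to whatever this rule matches
    pattern: str  # regular expression standing in for the rule's condition


def match_rules(text: str, s_f: int, s_t: int,
                rules: List[Rule]) -> List[Tuple[str, str]]:
    """Match the ordered rules [Rf, Rt] on text[s_f:s_t] and return the mapping list M."""
    m: List[Tuple[str, str]] = []
    pos = s_f  # current start point Sf
    for rule in rules:  # current rule Rc runs from Rf towards Rt
        hit = re.compile(rule.pattern).match(text, pos, s_t)
        if hit is None:  # success flag not valid: terminate
            break
        m.append((rule.tag, hit.group(0)))  # add Mr to M
        pos = hit.end()  # end position Sr becomes the next Sf
        if pos == s_t:  # Sr is St: nothing left to match
            break
    return m  # the loop also ends once Rc was Rt


rules = [Rule("headword", r"[\u4e00-\u9fff]"),
         Rule("pinyin", r"[a-z]+"),
         Rule("strokes", r"\d+画，")]
text = "开kai4画，一部"
print(match_rules(text, 0, len(text), rules))
# [('headword', '开'), ('pinyin', 'kai'), ('strokes', '4画，')]
```

The three termination conditions checked by the judging module correspond to the `hit is None` test, the exhaustion of the `for` loop, and the `pos == s_t` test.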
Preferably, the content labeling device further includes: a metadata item module, configured to traverse each mapping relationship in M and record each metadata tag together with its corresponding data item, so as to construct metadata items; a metadata item table module, configured to build a metadata item table from the metadata items; and an attaching module, configured to attach the metadata item table to the content fragment. This preferred embodiment provides two simple and practical schemes for saving the established mapping relationship list.
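The three modules above can be sketched in a few lines. The class names `MetadataItem` and `ContentFragment`, the function name `attach_metadata`, and the representation of M as a list of (tag, data item) pairs are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MetadataItem:
    tag: str   # metadata tag taken from the matched rule
    data: str  # data item the rule matched


@dataclass
class ContentFragment:
    text: str
    metadata: List[MetadataItem] = field(default_factory=list)  # attached item table


def attach_metadata(fragment: ContentFragment,
                    m: List[Tuple[str, str]]) -> ContentFragment:
    """Traverse M, build one metadata item per mapping, and attach the table to the fragment."""
    table = [MetadataItem(tag, data) for tag, data in m]
    fragment.metadata = table
    return fragment


# Made-up mapping list for a made-up entry.
fragment = attach_metadata(ContentFragment("开kai4画，"),
                           [("headword", "开"), ("strokes", "4画，")])
print([item.tag for item in fragment.metadata])  # ['headword', 'strokes']
```

Keeping the table beside the fragment leaves the original text untouched, in contrast to embedding the tags into the text itself.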
Preferably, the content labeling device further includes: an analysis module, configured to analyze the presentation pattern of each content fragment in the content document; and a creation module, configured to create a rule template according to the presentation pattern, the rule template including the rules [Rf, Rt]. This preferred embodiment improves the reusability of content labeling work.
The embodiments of the present invention can be combined with batch processing or macro commands, so that batches of content fragments with a specific regular pattern can be quickly identified, matched, and given metadata information.
As can be seen from the above description, the embodiments of the present invention allow users to attach metadata information to regularly patterned content fragments conveniently, flexibly, efficiently, and accurately. Combined with a metadata tagging system, the above embodiments are applicable to various application fields, such as articles, papers, test questions, and character (word) dictionaries, and meet the different business needs of users.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (17)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201010578057.1A CN102486767B (en) | 2010-12-02 | 2010-12-02 | Method and device for labeling content |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201010578057.1A CN102486767B (en) | 2010-12-02 | 2010-12-02 | Method and device for labeling content |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102486767A CN102486767A (en) | 2012-06-06 |
CN102486767B true CN102486767B (en) | 2015-03-25 |
Family
ID=46152261
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201010578057.1A Active CN102486767B (en) | 2010-12-02 | 2010-12-02 | Method and device for labeling content |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102486767B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20140037535A (en) * | 2012-09-19 | 2014-03-27 | 삼성전자주식회사 | Method and apparatus for creating e-book including user effects |
WO2015006275A1 (en) * | 2013-07-09 | 2015-01-15 | 3M Innovative Properties Company | Systems and methods for note content extraction and management using segmented notes |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101123532A (en) * | 2006-08-07 | 2008-02-13 | 华为技术有限公司 | A system and method for generating communication user description information |
CN101158953A (en) * | 2007-10-08 | 2008-04-09 | 上海聆众商务咨询有限公司 | Network document information processing method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101571859B (en) * | 2008-04-28 | 2013-01-02 | 国际商业机器公司 | Method and apparatus for labelling document |
Also Published As
Publication number | Publication date |
---|---|
CN102486767A (en) | 2012-06-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108717406B (en) | Text emotion analysis method and device and storage medium | |
WO2019227584A1 (en) | Method for parsing and processing resume data information, device, apparatus, and storage medium | |
CN108415887B (en) | A method of converting PDF file to OFD file | |
CN109933796B (en) | Method and device for extracting key information from announcement text | |
CN112541359B (en) | Document content identification method, device, electronic equipment and medium | |
CN113051356A (en) | Open relationship extraction method and device, electronic equipment and storage medium | |
CN108664474B (en) | Resume analysis method based on deep learning | |
CN107145584B (en) | A resume parsing method based on n-gram model | |
CN110598203A (en) | A method and device for extracting entity information of military scenario documents combined with dictionaries | |
US11599727B2 (en) | Intelligent text cleaning method and apparatus, and computer-readable storage medium | |
CN110770735A (en) | Transcoding of documents with embedded mathematical expressions | |
CN102541948A (en) | Method and device for extracting document structure | |
CN112347765B (en) | Entity tagging method, module and device based on dictionary matching | |
CN111597302B (en) | Text event acquisition method and device, electronic equipment and storage medium | |
CN113987125A (en) | Text structured information extraction method based on neural network and related equipment thereof | |
CN105808523A (en) | Method and apparatus for identifying document | |
CN104965821A (en) | Data annotation method and apparatus | |
CN114495143A (en) | A text object recognition method, device, electronic device and storage medium | |
CN105426355A (en) | Syllabic size based method and apparatus for identifying Tibetan syntax chunk | |
CN115934955A (en) | Electric power standard knowledge graph construction method, knowledge question answering system and device | |
CN102982028A (en) | Method and device for extracting document structure | |
CN109902299B (en) | A text processing method and device | |
CN112990091A (en) | Research and report analysis method, device, equipment and storage medium based on target detection | |
CN114398492B (en) | Knowledge graph construction method, terminal and medium in digital field | |
CN102486767B (en) | Method and device for labeling content |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20220621. Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031. Patentee after: New founder holdings development Co.,Ltd.; Beijing Beida Founder Electronics Co., Ltd. Address before: 100871, Beijing, Haidian District Cheng Fu Road 298, founder building, 5 floor. Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.; Beijing Beida Founder Electronics Co., Ltd. |