CN102486767B - Method and device for labeling content

Info

Publication number: CN102486767B
Application number: CN201010578057.1A
Authority: CN
Inventors: 杨燕菲
Original assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: BEIJING BEIDA FOUNDER ELECTRONICS Co Ltd; New Founder Holdings Development Co ltd
Priority date: 2010-12-02
Filing date: 2010-12-02
Publication date: 2015-03-25
Anticipated expiration: 2030-12-02
Also published as: CN102486767A

Abstract

The present invention provides a content labeling method, comprising: obtaining a content fragment of a content document; creating a rule template, the rule template containing a linearly ordered set of rules [R_f, R_t] from R_f to R_t; and matching the rules [R_f, R_t] against the content data of [S_f, S_t], identifying the matched data items, and labeling each matched data item with the metadata tag of the rule it matched, to obtain a mapping list M, the mapping list M being structured content data, wherein S_f is the start of the content fragment, S_t is the end of the content fragment, R_f is the first rule of the rule template, and R_t is the last rule of the rule template. The invention also provides a content labeling device. The invention improves the efficiency of content labeling.

Description

Content labeling method and device for content document

Technical Field

The present invention relates to the field of digital typesetting, and in particular to a content labeling method and device.

Background

Computer software applications help users create all kinds of content documents. In recent years, structured data formats, including markup languages (such as XML) and markup standards required by other standards committees, have been used to label these content documents or content fragments and to describe the application structure of the content. Further managing, processing, and reusing content on the basis of that application structure has become an urgent need for many users.

Content documents in some business domains contain large numbers of regular content fragments, for example collections of papers, collections of test questions, and dictionaries. Fig. 1 shows one entry of a dictionary. A dictionary contains a large number of similar entries, and their regularity lies in the fact that each entry includes a headword, phonetic notation, definitions, and so on.

To convert the dictionary of Fig. 1 into structured data, the headword, phonetic notation, definitions, and so on of each entry must be labeled as metadata; that is, metadata information must be attached to the regular content fragments of the electronic book. The prior art labels content manually, so the operation is very tedious.

Summary of the Invention

The present invention aims to provide a content labeling method and device for content documents, so as to solve the problem that manual content labeling is tedious.

An embodiment of the present invention provides a content labeling method, comprising: obtaining a content fragment of a content document; creating a rule template, the rule template containing a linearly ordered set of rules [R_f, R_t] from R_f to R_t; and matching the rules [R_f, R_t] against the content data of [S_f, S_t], identifying the matched data items, and labeling each matched data item with the metadata tag of the rule it matched, to obtain a mapping list M, the mapping list M being structured content data, wherein S_f is the start of the content fragment, S_t is the end of the content fragment, R_f is the first rule of the rule template, and R_t is the last rule of the rule template; wherein the rules include conditional matching rules, repeat matching rules, and template reference rules; and each rule has the following attributes: a metadata tag, a minimum number of occurrences, and a maximum number of occurrences.

An embodiment of the present invention also provides a content labeling device for content documents, comprising: an acquisition module for obtaining a content fragment of a content document; a creation module for creating a rule template, the rule template containing a linearly ordered set of rules [R_f, R_t] from R_f to R_t; and a matching module for matching the rules [R_f, R_t] against the content data of [S_f, S_t], identifying the matched data items, and labeling each matched data item with the metadata tag of the rule it matched, to obtain a mapping list M, the mapping list M being structured content data, wherein S_f is the start of the content fragment, S_t is the end of the content fragment, R_f is the first rule of the rule template, and R_t is the last rule of the rule template; wherein the rules include conditional matching rules, repeat matching rules, and template reference rules; and each rule has the following attributes: a metadata tag, a minimum number of occurrences, and a maximum number of occurrences.

Because the content labeling method and device of the embodiments match content fragments automatically by means of rules, they overcome the tedium of manual content labeling and improve the efficiency of content labeling.

Description of the Drawings

The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 shows one entry of a dictionary;

Fig. 2 shows a flowchart of a content labeling method according to an embodiment of the present invention;

Fig. 3 shows a flowchart of matching the rules [R_f, R_t] against the content data of [S_f, S_t] according to a preferred embodiment of the present invention;

Fig. 4 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention;

Fig. 5 shows a schematic diagram of the entry metadata information obtained by matching the entry of Fig. 1 against the entry rule template of Fig. 4;

Fig. 6 shows another entry of the dictionary;

Fig. 7 shows a schematic diagram of the entry metadata information obtained by matching the entry of Fig. 6 against the entry rule template of Fig. 4;

Fig. 8 shows a schematic diagram of an entry after content labeling according to a preferred embodiment of the present invention;

Fig. 9 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention;

Fig. 10 shows a schematic diagram of the entry metadata information obtained by matching the entry of Fig. 8 against the entry rule template of Fig. 9;

Fig. 11 shows a schematic diagram of a content labeling device according to an embodiment of the present invention.

Detailed Description

The present invention is described in detail below with reference to the drawings and in combination with the embodiments.

Fig. 2 shows a flowchart of a content labeling method according to an embodiment of the present invention, comprising:

Step S10: obtain a content fragment;

Step S20: match the rules [R_f, R_t] against the content data of [S_f, S_t], and label each matched data item with the metadata tag of the rule it matched, to obtain a mapping list M, wherein S_f is the start of the content fragment, S_t is the end of the content fragment, R_f is the first rule of the rule template, R_t is the last rule of the rule template, and the rule template contains a linearly ordered set of rules from R_f to R_t.

The prior art labels content manually, so the operation is very tedious. In this embodiment, rules are built in advance and used to match the content fragment, so that the individual data items are matched automatically and the metadata tags defined in the rules are assigned to them automatically. Because the work is expressed as rules, all of these operations can be carried out by a computer, which improves the efficiency of content labeling.

In addition, the rules [R_f, R_t] in this embodiment are a linearly ordered set. A rule template of this kind has a simple structure, so users can easily create one for content documents of any business type. A computer applies the linearly ordered rules one by one, so the algorithm is simple to implement and efficient.

Fig. 3 shows a flowchart of matching the rules [R_f, R_t] against the content data of [S_f, S_t] according to a preferred embodiment of the present invention, comprising:

1. Set the current rule R_c to R_f;

2. Perform R_c matching with S_f as the starting point, to obtain the data items matched by R_c, a success flag, and an end position S_r; label the data items matched by R_c with the metadata tag in R_c to obtain a mapping list M_r;

3. Determine whether the success flag is valid;

4. If it is, add M_r to M; otherwise end the process;

5. Determine whether R_c is R_t; if so, end the process;

6. Otherwise, determine whether S_r is S_t; if so, end the process;

7. Otherwise, set S_f to S_r, set R_f to the rule following R_c, and return to step 1.
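The seven steps above can be sketched in Python as a single loop. This is only an illustrative sketch, not the patent's implementation: the `Rule` class, the matcher signature `(text, start, end) -> end position or None`, and the sample matchers `while_digits` and `up_to` are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical matcher signature: returns the end position S_r of the
# matched data item, or None when the success flag would be invalid.
Matcher = Callable[[str, int, int], Optional[int]]


@dataclass
class Rule:
    tag: str        # metadata tag attached to the matched data item
    match: Matcher


def match_rules(text: str, rules: List[Rule]) -> List[Tuple[str, str]]:
    """Apply rules R_f..R_t in order over [S_f, S_t]; return mapping list M."""
    s_f, s_t = 0, len(text)
    mapping: List[Tuple[str, str]] = []
    for rule in rules:                      # steps 1, 5, 7: R_c walks R_f..R_t
        s_r = rule.match(text, s_f, s_t)    # step 2: run R_c starting at S_f
        if s_r is None:                     # steps 3-4: flag invalid, stop
            break
        mapping.append((rule.tag, text[s_f:s_r]))   # step 4: add M_r to M
        if s_r >= s_t:                      # step 6: content exhausted
            break
        s_f = s_r                           # step 7: next rule starts at S_r
    return mapping


def while_digits(text: str, start: int, end: int) -> Optional[int]:
    """Sample matcher: a non-empty run of digits."""
    pos = start
    while pos < end and text[pos].isdigit():
        pos += 1
    return pos if pos > start else None


def up_to(marker: str) -> Matcher:
    """Sample matcher factory: everything up to and including a marker."""
    def matcher(text: str, start: int, end: int) -> Optional[int]:
        hit = text.find(marker, start, end)
        return None if hit < 0 else hit + len(marker)
    return matcher


rules = [
    Rule("year", while_digits),
    Rule("title", up_to(";")),
    Rule("rest", lambda t, s, e: e if e > s else None),
]
M = match_rules("2010Content labeling;Beijing", rules)
```

Here `M` pairs each tag with the slice of text its rule consumed, which mirrors how each rule starts where the previous one ended.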

Because [R_f, R_t] is a linearly ordered set of rules, this preferred embodiment uses a simple loop that automatically applies every rule of [R_f, R_t], in order, to the content data of the content fragment [S_f, S_t]. The process is simple and easy to implement on a computer.

Preferably, R_c includes a data matching condition, and performing R_c matching comprises: using the data matching condition to match individual data items in the content data of [S_f, S_t], and setting the success flag accordingly. This preferred embodiment provides a conditional matching rule, which identifies data items in the content fragment by testing a condition.

Preferably, R_c also includes an end-position flag. When the end-position flag is invalid, the data matching condition is an interval condition; when it is valid, the data matching condition is a position condition. An interval condition specifies a format rule for data over a continuous interval; the corresponding data item is the data in the continuous range that starts at the end position of the previous data item and satisfies the format rule. A position condition specifies a format rule for the data at the end position; the corresponding data item is the data between the end position of the previous data item (the starting point) and the position that satisfies the format rule (the end point). A format rule describes a regular feature exhibited by the data. This preferred embodiment thus offers both interval conditions and position conditions for conditional matching rules: when the user can characterize the business content over a continuous interval, an interval condition can be used; when the user can characterize the business content at a particular position, a position condition can be used. This meets the content labeling needs of many different types of business content.

Preferably, the format rules include at least one of the following: content format rules, presentation format rules, markup format rules, and the Any rule. A content format rule describes a regular feature of the data in the document content; a presentation format rule describes a regular feature of the data in the page layout; a markup format rule describes a regular feature of the data in the application logic; the Any rule indicates that any data satisfies the matching condition. Building on the preceding preferred embodiment, this embodiment specifies several kinds of format rules and so better meets the content labeling needs of different types of business content.

Preferably, R_c includes a repeat-rule count, which specifies how many rules in [R_f, R_t] are to be applied repeatedly. This preferred embodiment provides a repeat matching rule. For a dictionary entry, for example, an entry normally contains only one headword, so a repeat matching rule is clearly unnecessary for identifying the headword. An entry may, however, contain several senses, and the repeat matching rule of this embodiment is better suited to identifying them.

Preferably, a rule includes: a minimum number of occurrences, with value N, indicating that the data item must be matched at least N times, where N is a non-negative integer; and a maximum number of occurrences, with value P, indicating that the data item may be matched at most P times, where P is a positive integer, P > N when N is 0, and P ≥ N when N is a positive integer.
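The constraints on N and P can be checked mechanically. The following is a minimal sketch; the function names `valid_bounds` and `occurrences_ok` are hypothetical and chosen only for illustration.

```python
def valid_bounds(n: int, p: int) -> bool:
    """Check the stated constraints on minimum (N) and maximum (P) occurrences.

    N must be a non-negative integer and P a positive integer;
    P > N when N == 0, and P >= N when N is positive.
    """
    if n < 0 or p < 1:
        return False
    return p > n if n == 0 else p >= n


def occurrences_ok(count: int, n: int, p: int) -> bool:
    """A rule's match succeeds when N <= number of matched items <= P."""
    return n <= count <= p
```

For example, `valid_bounds(0, 1)` describes an optional item, while `valid_bounds(2, 1)` is rejected because the maximum is below the minimum.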

Preferably, the content labeling method further comprises: traversing each mapping in M and recording each metadata tag together with its corresponding data item to build metadata items; assembling the metadata items into a metadata item table; and attaching the metadata item table to the content fragment.

Preferably, the content labeling method further comprises: traversing each mapping in M and recording each metadata tag together with its corresponding data item to build metadata items; and attaching each metadata item to the content fragment according to the continuous interval or end position of its matched data item.

The two preferred embodiments above give two simple and practical ways of saving the mapping list once it has been built.

Preferably, the metadata tags conform to XML, and an empty metadata tag indicates that the data item carrying it is to be ignored when metadata items are attached. XML is widely used in the industry, so specifying metadata tags in XML makes the method more generally applicable. In addition, empty tags make it possible to handle data in the content fragment that cannot be identified, which improves compatibility with content documents.

Preferably, the content labeling method further comprises: analyzing the regularities exhibited by the content fragments in the content document; and creating a rule template according to those regularities, the rule template containing the rules [R_f, R_t]. Because the rule template is created in advance, several electronic documents with similar presentation can share one rule template instead of rebuilding the rules [R_f, R_t] each time, which makes the content labeling work more reusable.

Preferably, R_c includes a referenced template name, which indicates that the rule template with that name is to be referenced. This preferred embodiment establishes a template reference rule, which reduces development effort.

A rule template contains a linearly ordered set of rules and can be given a name. 1) A rule template can be stored and applied to other similar content fragments, much like a style; 2) other rule templates can reference an already defined rule template by its name.

Fig. 4 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention. The rule template "entry" contains six linearly ordered matching rules. Fig. 5 shows a schematic diagram of the entry metadata information obtained by matching the entry of Fig. 1 against the entry rule template of Fig. 4.

In this preferred embodiment, rules fall into three categories: conditional matching rules, repeat matching rules, and template reference rules. Every rule, whatever its category, has the following attributes:

Attribute | Description
Metadata tag | The metadata tag used to label the matched data items.
Minimum occurrences | The minimum number of times a data item must be matched.
Maximum occurrences | The maximum number of times a data item may be matched.

Any rule can specify a minimum and a maximum number of occurrences; a match is considered successful when minimum occurrences <= number of matched items <= maximum occurrences.

For example, in some exercise collections the answer to a multiple-choice question appears as text in the following format:

Answer: AC

In this case a continuous-condition rule ({capital letter, answer option, 1..*}) can identify each selected answer ("answer option" = "A", "answer option" = "C").
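The answer-option example can be sketched as a continuous-condition matcher with occurrence bounds. This is an illustrative sketch under assumptions: the function `match_run`, its return shape `(items, success, end position)`, and the helper `is_capital` are hypothetical names, and `None` stands in for the unbounded maximum written as `*`.

```python
from typing import Callable, List, Optional, Tuple


def match_run(
    text: str,
    start: int,
    predicate: Callable[[str], bool],
    tag: str,
    min_occurs: int = 1,
    max_occurs: Optional[int] = None,   # None models the unbounded "*"
) -> Tuple[List[Tuple[str, str]], bool, int]:
    """Match consecutive characters that satisfy the format rule.

    Returns (mapping items, success flag, end position).
    """
    items: List[Tuple[str, str]] = []
    pos = start
    while (
        pos < len(text)
        and predicate(text[pos])
        and (max_occurs is None or len(items) < max_occurs)
    ):
        items.append((tag, text[pos]))   # each character is one matched item
        pos += 1
    success = len(items) >= min_occurs
    return (items, True, pos) if success else ([], False, start)


def is_capital(ch: str) -> bool:
    return "A" <= ch <= "Z"


line = "Answer: AC"
prefix = "Answer: "
items, ok, end = match_run(line, len(prefix), is_capital, "answer option")
```

Each capital letter becomes its own mapping, so a two-letter answer yields two "answer option" items.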

Conditional rules can be further divided into two kinds: continuous-condition rules and end-condition rules.

Every continuous-condition rule has the following attribute:

Attribute | Description
Format rule (condition) | Specifies the format rule that the match must satisfy (over a continuous interval).

Every end-condition rule has the following attributes:

Attribute | Description
Format rule (condition) | Specifies the format rule that the match must satisfy (at the end position).
Include end position | Specifies whether the range of the match includes the data at the end position.

Note that the include-end-position flag is not what distinguishes a continuous-condition rule from an end-condition rule; it indicates whether the range of the match includes the data at the end position.

Take the strokes rule in the example:

When "include end position" is TRUE ({text: "画，", TRUE, strokes, 1}), the match is "4画，" and the end position is after the "，", so "画，" is included;

When "include end position" is FALSE ({text: "画，", FALSE, strokes, 1}), the match is "4" and the end position is after the "4", so "画，" is not included.
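The effect of the include-end-position flag can be sketched with a small end-condition matcher. This is a hedged illustration only: the function `match_until` and its return shape are assumptions, and the format rule is simplified to a literal text marker.

```python
from typing import Optional, Tuple


def match_until(
    text: str, start: int, marker: str, include_end: bool
) -> Tuple[Optional[str], bool, int]:
    """End-condition match: take data from `start` up to the marker.

    Returns (matched item, success flag, end position). When include_end
    is True the marker itself is part of the matched item.
    """
    hit = text.find(marker, start)
    if hit < 0:
        return None, False, start
    end = hit + len(marker) if include_end else hit
    return text[start:end], True, end


fragment = "4画，"
with_end = match_until(fragment, 0, "画，", include_end=True)
without_end = match_until(fragment, 0, "画，", include_end=False)
```

With the flag set, the next rule would start after the marker; without it, the next rule would start at the marker and has to consume it.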

Every repeat rule has the following attribute:

Attribute | Description
Repeat-rule count | Specifies how many of the preceding matching rules are applied repeatedly.

Every template reference rule has the following attribute:

Attribute | Description
Referenced template name | Specifies the already defined rule template to apply.

A template reference rule applies the rule interval [R_f-template, R_t-template] identified by the referenced template name; it is a form of nested application.

For example, take the rule template "entry" from the example and change its last rule, the definitions rule, into a template reference rule ({template reference: "definitions", definitions, 1}). When the definitions rule is applied, the rule interval of the rule template "definitions" (shown in Fig. 9) is looked up automatically and used for matching.

The resulting metadata is as follows:

<metadata>
  <headword>开</headword>
  <traditional_form>開</traditional_form>
  <pinyin>kāi</pinyin>
  <strokes>4</strokes>
  <radical>一</radical>
  <definitions>
    <sense>打通；开辟：～路|～矿|～掘|～拓。</sense>
  </definitions>
</metadata>
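A metadata block like the one above can be produced from a mapping list with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch under assumptions: the English tag names, the flat structure (no nested senses), and the sample mapping list `M` are illustrative, and items with an empty tag are skipped in line with the empty-tag behavior described earlier.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def build_metadata(mapping: List[Tuple[str, str]]) -> ET.Element:
    """Turn a mapping list of (metadata tag, data item) pairs into XML."""
    root = ET.Element("metadata")
    for tag, item in mapping:
        if not tag:             # empty tag: ignore this data item
            continue
        ET.SubElement(root, tag).text = item
    return root


# Hypothetical mapping list for the entry "开"; brackets carry empty tags.
M = [
    ("headword", "开"),
    ("", "("),
    ("traditional_form", "開"),
    ("", ")"),
    ("pinyin", "kāi"),
    ("strokes", "4"),
    ("radical", "一"),
]
xml_text = ET.tostring(build_metadata(M), encoding="unicode")
```

Tag names must be valid XML names, which is why the sketch uses `traditional_form` rather than a name containing a space.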

The matching process is as follows:

(1) Observe and analyze the data of the content fragments, find the regularities they exhibit, and create the rule template "entry" shown in Fig. 4;

(2) Select the content fragment of the entry "开";

(2) The rule template contains an ordered set of matching rules;

(3) Analyze the content fragment, identify the matched data items according to the rule template "entry", and map them to the associated metadata tags.

(4) Build the metadata information shown in Fig. 5 from the identified data items and their associated metadata tags. This metadata information can be attached as a whole to the content fragment of the entry "开".

Step (3) above can be refined further. Matching against the rule template comprises the following steps:

(3.1) Set the start position S_f to the start of the content fragment (the character "开" at the beginning of the paragraph), and set the end position S_t to the end of the content fragment (the full stop at the end of the paragraph);

(3.2) Set the start rule R_f to the first rule of the rule template (rule 1, "headword"), and set the end rule R_t to the last rule of the rule template (rule 6, "definitions").

(3.3) On the content data of the interval [S_f, S_t], perform rule matching for the interval [R_f, R_t] to obtain the mapping list M.

Step (3.3) above can be refined further into the following steps:

(3.3.1) Set the current rule R_c to the start rule R_f (rule 1, "headword");

(3.3.2) Starting from the start position S_f (the character "开" at the beginning of the paragraph), perform matching of rule R_c on the content data of the interval [S_f, S_t]; obtain the success flag of the R_c match (valid), the end position S_r of the match (the "(" after the character "开"), and the resulting mapping list M_r ("headword" = "开");

(3.3.3) If the success flag obtained in step (3.3.2) is valid, record the mapping list M_r in the mapping list M;

(3.3.4) Determine whether matching needs to continue (here it does); if so, go to step (3.3.5); otherwise, end the process;

(3.3.5) Set the start position S_f to the end position S_r obtained in step (3.3.2) (the "(" after the character "开"); set the start rule R_f to the rule following the current rule R_c (rule 2, "traditional form"); go to step (3.3.1) and perform rule matching for the interval [R_f, R_t] on the content data of the interval [S_f, S_t].

Step (3.3.2) above can be refined further. Matching a single rule comprises the following steps:

(3.3.2.1) If the current rule R_c is a conditional matching rule, go to step (3.3.2.2); if the current rule R_c is a repeat matching rule, go to step (3.3.2.4); otherwise the current rule R_c is a template reference rule, so go to step (3.3.2.5);

(3.3.2.2) Using the condition, the include-end-position flag, and the occurrence counts in the current conditional rule R_c, obtain the list of matched data items that satisfy the rule, together with the success flag;

(3.3.2.3) For each matched data item in the list obtained in step (3.3.2.2), create a mapping between it and the metadata tag in the current conditional rule R_c, and add the mapping to the mapping list M_r; end the process.

(3.3.2.4) Using the repeat-rule count and the occurrence counts in the current repeat rule R_c, repeatedly perform rule matching for the interval [R_(c - repeat-rule count), R_(c-1)] on the content data of the interval [S_f, S_t] to obtain the mapping list M_r; end the process.

(3.3.2.5) Using the referenced template name and the occurrence counts in the current template reference rule R_c, perform rule matching for the interval [R_f-template, R_t-template] identified by the referenced template name on the content data of the interval [S_f, S_t] to obtain the mapping list M_r; end the process.

Take rule 1, "headword", as an example. It is a continuous-condition rule (that is, it uses an interval condition). All content starting from "开" whose font size is size 3 satisfies the condition, so the matched data item is the "开" at the beginning of the paragraph.

Take rule 4, "strokes", as an example. It is an end-condition rule (that is, it uses a position condition). All content that starts after the pinyin letters and ends with the string "画，" satisfies the condition, so the matched data item is "4画，".
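The dispatch over the three rule categories in steps (3.3.2.1) to (3.3.2.5) can be sketched as a small interpreter. This is a simplified illustration rather than the patented procedure: the classes `Cond`, `Repeat`, and `TemplateRef`, the `TEMPLATES` registry, and the sample matchers are hypothetical, the sample text is invented, and occurrence bounds are omitted (a repeat simply continues while the preceding rules keep matching).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple, Union

Mapping = List[Tuple[str, str]]


@dataclass
class Cond:                      # conditional matching rule
    tag: str
    match: Callable[[str, int], Optional[int]]   # end position or None


@dataclass
class Repeat:                    # repeat matching rule
    count: int                   # how many preceding rules to re-apply


@dataclass
class TemplateRef:               # template reference rule
    name: str


Rule = Union[Cond, Repeat, TemplateRef]
TEMPLATES: Dict[str, List[Rule]] = {}   # named rule templates


def run(rules: List[Rule], text: str, pos: int) -> Tuple[Mapping, bool, int]:
    """Apply rules in order; return (mapping list, success flag, end position)."""
    out: Mapping = []
    for i, rule in enumerate(rules):
        if isinstance(rule, Cond):
            end = rule.match(text, pos)
            if end is None:
                return out, False, pos
            out.append((rule.tag, text[pos:end]))
            pos = end
        elif isinstance(rule, Repeat):
            window = rules[i - rule.count:i]      # the preceding rules
            while True:
                found, ok, end = run(window, text, pos)
                if not ok or end == pos:          # stop when nothing more fits
                    break
                out.extend(found)
                pos = end
        else:                                     # TemplateRef: nested rules
            found, ok, end = run(TEMPLATES[rule.name], text, pos)
            if not ok:
                return out, False, pos
            out.extend(found)
            pos = end
    return out, True, pos


def run_of(pred: Callable[[str], bool]) -> Callable[[str, int], Optional[int]]:
    def matcher(text: str, pos: int) -> Optional[int]:
        end = pos
        while end < len(text) and pred(text[end]):
            end += 1
        return end if end > pos else None
    return matcher


def up_to(marker: str) -> Callable[[str, int], Optional[int]]:
    def matcher(text: str, pos: int) -> Optional[int]:
        hit = text.find(marker, pos)
        return None if hit < 0 else hit + len(marker)
    return matcher


TEMPLATES["sense"] = [Cond("number", run_of(str.isdigit)), Cond("gloss", up_to(";"))]
entry = [Cond("headword", run_of(str.isalpha)), TemplateRef("sense"), Repeat(1)]
sample = "open1unlock;2start;"
M, ok, end = run(entry, sample, 0)
```

In this sketch the repeat rule re-applies the template reference before it, so every numbered sense in the sample is picked up by the same two nested rules.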

上述优选实施例建立的规则模板，还适用于对符合该规律的其他内容片段。例如字条“锎”，其内容片段和元数据信息，如图6和图7所示，图6示出了字典的另一个字条，图7示出了图4的字条规则模板对图6的字条进行规则匹配获得的字条元数据信息的示意图。The rule template established in the above preferred embodiment is also applicable to other content segments conforming to the rule. For example, the word entry "californium", its content fragments and metadata information, as shown in Figure 6 and Figure 7, Figure 6 shows another word entry of the dictionary, Figure 7 shows the effect of the word entry rule template in Figure 4 on the word entry in Figure 6 Schematic diagram of entry metadata information obtained by rule matching.

Preferably, the content labeling method further includes: taking a matched data item as a content fragment, and continuing to perform the step of matching the rules [R_f, R_t] on the content data of [S_f, S_t]. This preferred embodiment provides a nesting mechanism that can handle more complex content structures, and can therefore meet the labeling needs of content documents of various business types.

The metadata information established in step (4) above can also be attached, in embedded form, to the data intervals of the content fragment. As shown in Figure 8, all of the matched metadata tags are marked inside the content fragment of the entry "开". Figure 9 is a schematic diagram of an entry rule template according to a preferred embodiment of the present invention, and Figure 10 is a schematic diagram of the entry metadata information obtained by applying the entry rule template of Figure 9 to the entry of Figure 8.

As can be seen, the embedded metadata tags further divide the content of the entry "开" in Figure 8 into finer content fragments. These lower-level content fragments can in turn be labeled with the method of the present invention. Taking the content fragment "definition of 开" as an example, its rule template and metadata information are shown in Figure 9 and Figure 10.

Users can apply this content labeling method layer by layer, refining progressively, so as to label content documents or fragments as completely as possible and at the finest granularity, and thereby achieve a satisfactory structuring result.
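The layer-by-layer application described above can be sketched as a recursive function in Python. This is a simplified illustration, not the patented implementation: the `(tag, regex)` rule format, the convention that a tag naming another template triggers nested labeling, and the English sample text are all assumptions made for the example.

```python
import re

def match_rules(rules, text):
    """Match (tag, regex) rules in order; return (tag, data item) pairs."""
    out, pos = [], 0
    for tag, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        if m is None:
            break  # stop at the first rule that fails
        out.append((tag, m.group(0)))
        pos = m.end()
    return out

def label_nested(text, template_name, templates):
    """Label `text`, then treat each matched data item as a new content
    fragment if a template exists for its tag."""
    result = []
    for tag, item in match_rules(templates[template_name], text):
        children = []
        # Recurse only on strictly smaller fragments so the recursion ends.
        if tag in templates and item != text:
            children = label_nested(item, tag, templates)
        result.append({"tag": tag, "data": item, "children": children})
    return result

# Hypothetical templates and text, for illustration only.
templates = {
    "entry": [("headword", r"[A-Za-z]+"), ("sep", r": "), ("definition", r".+")],
    "definition": [("sense", r"[^;]+"), ("sep", r"; "), ("sense", r".+")],
}
tree = label_nested("open: to unfold; to start", "entry", templates)
```

In this sketch the "definition" data item produced by the outer template is itself labeled by a second template, which mirrors how the "definition of 开" fragment is refined with its own rule template.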

Figure 11 is a schematic diagram of a content labeling device according to an embodiment of the present invention, which includes:

an acquisition module 10, configured to acquire a content fragment;

a matching module 20, configured to match the rules [R_f, R_t] on the content data of [S_f, S_t] and to mark each matched data item with the metadata tag of the rule it matched, to obtain a mapping relationship list M, wherein S_f is the start of the content fragment, S_t is the end of the content fragment, R_f is the first rule of the rule template, R_t is the last rule of the rule template, and the rule template comprises a set of linearly ordered rules from R_f to R_t.

This content labeling device improves the efficiency of content labeling.

Preferably, the matching module 20 includes:

a current setting module, configured to set the current rule R_c to R_f;

a current matching module, configured to perform R_c matching with S_f as the start point, to obtain the data item matched by R_c, a success flag, and an end position S_r, and to mark the data item matched by R_c with the metadata tag in R_c, yielding a mapping relationship list M_r;

an adding module, configured to add M_r to M if the success flag is valid;

a judging module, configured to judge whether the success flag is valid, whether R_c is not R_t, and whether S_r is not S_t;

a loop module, configured to, if all of the above judgments are yes, set S_f to S_r, set R_f to the next rule after R_c, and then continue to execute the above steps; otherwise, terminate processing.

This content labeling device has a simple structure and is easy to implement on a computer.
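The loop formed by the modules of the matching module 20 can be sketched in Python. This is a minimal illustration, not the patented implementation: the `Rule` class, the use of a regular expression as the rule's format rule, and the sample entry text are assumptions made for the example, and minimum and maximum occurrences, repetition rules, and template references are left out.

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    tag: str      # metadata tag to attach to the matched data item
    pattern: str  # assumption: a regex stands in for the rule's format rule

def match_rule(rule, text, start):
    """Match one rule at `start`; return (success flag, end position S_r, M_r)."""
    m = re.compile(rule.pattern).match(text, start)
    if m is None or m.end() == start:
        return False, start, []
    return True, m.end(), [(rule.tag, m.group(0))]

def match_template(rules, text):
    """Apply the ordered rules [R_f, R_t] to the content data [S_f, S_t]."""
    mappings = []                    # mapping relationship list M
    if not rules:
        return mappings
    s_f, s_t = 0, len(text)
    current = 0                      # index of the current rule R_c, initially R_f
    while True:
        ok, s_r, m_r = match_rule(rules[current], text, s_f)
        if ok:
            mappings.extend(m_r)     # adding module
        # Judging module: success flag valid, R_c is not R_t, S_r is not S_t.
        if not (ok and current != len(rules) - 1 and s_r != s_t):
            break
        s_f, current = s_r, current + 1   # loop module
    return mappings

# Hypothetical entry text and rules, for illustration only.
entry = "开k\u0101i4画，"
rules = [
    Rule("headword", r"[\u4e00-\u9fff]"),
    Rule("pinyin", r"[a-z\u0100-\u024f]+"),
    Rule("strokes", r"\d+画，"),
]
result = match_template(rules, entry)
```

Processing stops as soon as a rule fails, the last rule has been applied, or the end of the fragment is reached, which corresponds to the three conditions checked by the judging module.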

Preferably, the content labeling device further includes: a metadata item module, configured to traverse each mapping relationship in M and record each metadata tag together with its corresponding data item, to construct metadata items; a metadata item table module, configured to construct a metadata item table from the metadata items; and an appending module, configured to append the metadata item table to the content fragment. This preferred embodiment provides two simple and practical schemes for saving the established mapping relationship list.
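The two ways of saving the mapping relationship list, as an attached table or as embedded tags, can be sketched in Python. The sketch rests on several assumptions that are not stated in this passage: M is taken to be a list of `(tag, data item)` pairs covering the fragment in order, the fragment is represented as a plain dictionary, and data items with an empty tag are skipped in the table and left unwrapped in the embedded form. The function names are hypothetical.

```python
from xml.sax.saxutils import escape

def build_metadata_table(mappings):
    """Traverse M and record each metadata tag with its data item.
    Assumption: items with an empty tag are not recorded."""
    return [{"tag": tag, "data": data} for tag, data in mappings if tag]

def attach_table(fragment_text, mappings):
    """Scheme 1: append the metadata item table to the content fragment."""
    return {"content": fragment_text, "metadata": build_metadata_table(mappings)}

def embed_tags(mappings):
    """Scheme 2: embed each tag around its data item, XML style.
    Assumption: untagged data items are kept as plain text."""
    parts = []
    for tag, data in mappings:
        text = escape(data)  # escape &, < and > in the data item
        parts.append(f"<{tag}>{text}</{tag}>" if tag else text)
    return "".join(parts)

# Hypothetical mapping relationship list, for illustration only.
M = [("headword", "开"), ("pinyin", "k\u0101i"), ("", "，")]
attached = attach_table("开k\u0101i，", M)
embedded = embed_tags(M)
```

The table form leaves the fragment text untouched, while the embedded form splits it into tagged pieces that can themselves be labeled again.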

Preferably, the content labeling device further includes: an analysis module, configured to analyze the presentation patterns of the content fragments in the content document; and a creation module, configured to create a rule template according to the presentation patterns, the rule template comprising the rules [R_f, R_t]. This preferred embodiment improves the reusability of content labeling work.

The embodiments of the present invention can be combined with batch processing or macro commands, so that batches of content fragments sharing a specific pattern can be quickly identified, matched, and given metadata information.

As can be seen from the above description, the embodiments of the present invention allow users to attach metadata information to regularly structured content fragments conveniently, flexibly, efficiently, and accurately. Combined with a metadata tag system, the above embodiments are applicable to a variety of fields, such as articles, papers, test questions, and character or word dictionaries, and can meet users' different business needs.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; alternatively, they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A content labeling method for a content document, characterized by comprising:

obtaining a content fragment of the content document;

creating a rule template, the rule template comprising a set of linearly ordered rules [R_f, R_t] from R_f to R_t;

matching the rules [R_f, R_t] on the content data of [S_f, S_t], identifying and obtaining the matched data items, and marking each matched data item with the metadata tag of the rule it matched, to obtain a mapping relationship list M, the mapping relationship list M being structured content data, wherein S_f is the start of the content fragment, S_t is the end of the content fragment, R_f is the first rule of the rule template, and R_t is the last rule of the rule template;

wherein the rules include: condition matching rules, repeated matching rules, and template reference rules;

and each rule has the following attributes: a metadata tag, a minimum number of occurrences, and a maximum number of occurrences.

2. The method according to claim 1, wherein matching the rules [R_f, R_t] on the content data of [S_f, S_t] comprises:

setting the current rule R_c to R_f;

performing R_c matching with S_f as the start point, to obtain the data item matched by R_c, a success flag, and an end position S_r, and marking the data item matched by R_c with the metadata tag in R_c, to obtain a mapping relationship list M_r;

if the success flag is valid, adding M_r to M;

judging whether the success flag is valid, whether R_c is not R_t, and whether S_r is not S_t;

if all of the above judgments are yes, setting S_f to S_r, setting R_f to the next rule after R_c, and then continuing to execute all steps from the step of setting the current rule R_c to R_f through the step of judging whether the success flag is valid; otherwise, terminating processing.

3. The method according to claim 2, wherein R_c includes a data matching condition, and performing R_c matching comprises: using the data matching condition to match data items on the content data of [S_f, S_t], and setting the success flag accordingly.

4. The method according to claim 3, wherein R_c further includes an end position flag; when the end position flag is invalid, it indicates that the data matching condition is an interval condition; when the end position flag is valid, it indicates that the data matching condition is a position condition,

the interval condition indicates a format rule for data, set on a continuous interval, wherein the corresponding data item is the data within the continuous range that starts from the end position of the previous data item and satisfies the format rule;

the position condition indicates a format rule for data, set at the end position, wherein the corresponding data item is the data between the end position of the previous data item, taken as the start point, and the position satisfying the format rule, taken as the end point,

wherein the format rule indicates a regularity exhibited by the data.

5. The method according to claim 4, wherein the format rules include at least one of the following: content format rules, display format rules, markup format rules, and Any rules,

the content format rule indicates a regularity of the data in the document content;

the display format rule indicates a regularity of the data in the layout presentation;

the markup format rule indicates a regularity of the data in the application logic;

the Any rule indicates that any data satisfies the matching condition.

6. The method according to claim 2, wherein R_c includes a repetition rule count, the repetition rule count indicating how many of the rules in [R_f, R_t] are applied repeatedly.

7. The method according to claim 1, wherein the rules comprise:

a minimum number of occurrences, with value N, indicating that the matched data item occurs at least N times, N being a non-negative integer;

a maximum number of occurrences, with value P, indicating that the matched data item occurs at most P times, P being a positive integer, wherein P > N when N is 0, and P ≥ N when N is a positive integer.

8. The method according to claim 1, further comprising:

traversing each mapping relationship in M, and recording each metadata tag together with its corresponding data item, to construct metadata items;

constructing a metadata item table from the metadata items;

appending the metadata item table to the content fragment.

9. The method according to claim 4, further comprising:

appending the metadata item to the content fragment according to the continuous interval or the end position corresponding to the matched data item.

10. The method according to claim 8 or 9, wherein the metadata tag conforms to XML, and wherein an empty metadata tag indicates that, when metadata items are attached, the data item whose metadata tag is the empty tag is ignored.

11. The method according to claim 1, further comprising:

analyzing the presentation patterns of the content fragments in the content document;

creating the rule template according to the presentation patterns, the rule template comprising the rules [R_f, R_t].

12. The method according to claim 11, wherein R_c includes a referenced template name, which indicates a reference to the rule template having that template name.

13. The method according to claim 1, further comprising: taking a matched data item as a content fragment, and continuing to perform the step of matching the rules [R_f, R_t] on the content data of [S_f, S_t].

14. A content labeling device for a content document, characterized by comprising:

an acquisition module, configured to acquire a content fragment of the content document;

a creation module, configured to create a rule template, the rule template comprising a set of linearly ordered rules [R_f, R_t] from R_f to R_t;

a matching module, configured to match the rules [R_f, R_t] on the content data of [S_f, S_t], identify and obtain the matched data items, and mark each matched data item with the metadata tag of the rule it matched, to obtain a mapping relationship list M, the mapping relationship list M being structured content data, wherein S_f is the start of the content fragment, S_t is the end of the content fragment, R_f is the first rule of the rule template, and R_t is the last rule of the rule template.

15. The device according to claim 14, wherein the matching module comprises:

a current setting module, configured to set the current rule R_c to R_f;

a current matching module, configured to perform R_c matching with S_f as the start point, to obtain the data item matched by R_c, a success flag, and an end position S_r, and to mark the data item matched by R_c with the metadata tag in R_c, to obtain a mapping relationship list M_r;

an adding module, configured to add M_r to M if the success flag is valid;

a judging module, configured to judge whether the success flag is valid, whether R_c is not R_t, and whether S_r is not S_t;

a loop module, configured to, if all of the above judgments are yes, set S_f to S_r, set R_f to the next rule after R_c, and then continue to execute, in sequence, the steps performed by the current setting module, the current matching module, the adding module, and the judging module; otherwise, terminate processing.

16. The device according to claim 14, further comprising:

a metadata item module, configured to traverse each mapping relationship in M and record each metadata tag together with its corresponding data item, to construct metadata items;

a metadata item table module, configured to construct a metadata item table from the metadata items;

an appending module, configured to append the metadata item table to the content fragment.

17. The device according to claim 14, further comprising:

an analysis module, configured to analyze the presentation patterns of the content fragments in the content document;

wherein the creation module creates the rule template according to the presentation patterns, the rule template comprising the rules [R_f, R_t].