CN101968819B

CN101968819B - Audio and video intelligent cataloging information acquisition method facing wide area network

Info

Publication number: CN101968819B
Application number: CN2010105371067A
Authority: CN
Inventors: 隋爱娜; 王永滨; 伏文龙
Original assignee: Communication University of China
Current assignee: Communication University of China
Priority date: 2010-11-05
Filing date: 2010-11-05
Publication date: 2012-05-30
Anticipated expiration: 2030-11-05
Also published as: CN101968819A

Abstract

The invention relates to a method for obtaining audio and video intelligent cataloging information for a wide area network, and belongs to the field of computer application. The invention is characterized in that a weighting algorithm based on the position factor of keyword feature items is proposed, and different weighting factors are assigned to feature items at different positions in the document, so as to more accurately calculate the subject similarity of web page content; the three factors of web page content similarity, URL directory hierarchy information of hyperlinks, and anchor text information of hyperlinks are comprehensively utilized to optimize the selection of links with higher subject similarity. For the searched subject pages, cataloging information is automatically extracted using an information extraction method based on ontology and HTML. The extracted cataloging information is standardized using an improved semantic similarity calculation method. The invention can intelligently and automatically provide cataloging item information to catalogers, reduce manual labor, improve cataloging efficiency, meet the different needs of professional and non-professional catalogers, and adapt to wide area network environments.

Description

Method for Acquiring Audio and Video Intelligent Cataloging Information for Wide Area Network

Technical Field

The invention belongs to the field of computer application technology and relates to the cataloging of digital audio and video materials in a wide area network environment. It provides an efficient, automatic, and intelligent cataloging method for professional and non-professional content producers and catalogers working in such an environment. It addresses the problems of current cataloging systems, which are oriented to local area networks, demand a high level of expertise, involve heavy workloads and much repetitive labor, and offer little automation. It can significantly improve cataloging efficiency and reduce manual labor.

Background

The quality and speed of cataloging digital audio and video materials directly affect the level of resource management, cost-effectiveness, and resource reuse.

Large-scale video cataloging in China essentially began in 2002. Because building a media asset management platform requires considerable technical and material resources, only a few organizations in China have established large-scale media asset management systems and large-scale cataloging production. The largest of these, with the highest annual output, is the CCTV Audio-Visual Archive, whose cataloging software is mainly produced by two companies, Sobey and Zhongke Dayang.

The existing literature includes some research on cataloging for libraries, news, and other media, and on cataloging automation. For example, the "News Video Cataloging Method and System" developed by the Institute of Automation, Chinese Academy of Sciences, automatically catalogs news video based on caption bars, anchors, and audio silence points in news programs; it determines cataloging information only by analyzing, separating, and matching certain local content of the news video itself. Other research addresses automatic frame extraction and shot segmentation. For example, the "automatic cataloging system" within the "Network Media Intelligent Cataloging System" developed by Beijing Nufront Network Technology Co., Ltd. can perform video structural analysis, face analysis, subtitle analysis, station logo analysis, and audio analysis on media files to produce the corresponding cataloging information. These methods operate on the video content itself: their algorithms are relatively complex, their adaptability is weak, their accuracy is low, and they are strongly affected by the quality of the video content.

The cataloging of digital audio and video materials currently has the following main problems. First, there are many cataloging items, which catalogers must enter by hand one by one using their professional knowledge; the workload is heavy and errors are likely. Second, the degree of intelligence and automation is low: cataloging information cannot be obtained automatically, so efficiency is poor. Third, the cataloging environment is usually a local area network, which limits the sources and channels from which cataloging information can be obtained automatically. Fourth, catalogers are required to be highly specialized, yet as audio and video capture devices become widespread, content production is becoming personal and public, and cataloging is increasingly done by non-professionals.

Summary of the Invention

To overcome the above problems of existing cataloging systems, the invention proposes a method for extracting intelligent audio and video cataloging information over a wide area network. The method can intelligently and automatically provide description item information to catalogers, reduce manual labor, and improve cataloging efficiency, and it can adapt to the different needs of professional and non-professional catalogers and to the wide area network environment.

The invention is characterized in that a weighting algorithm based on the position factor of keyword feature items is proposed, which assigns different weight factors to feature items at different positions in a document and thereby calculates the topic similarity of web page content more accurately. Three factors are combined to preferentially select links with higher topic similarity: web page content similarity, the URL directory hierarchy information of hyperlinks, and the anchor text of hyperlinks. For the topic pages found, cataloging information is extracted automatically using an information extraction method based on ontology and HTML. The extracted cataloging information is normalized using an improved semantic similarity calculation method.

The overall flow of the invention is shown in Figure 1. The invention provides the user with a friendly cataloging interface. After opening the interface, the user selects and plays the audio or video file to be cataloged and then begins description in the cataloging input boxes. During description, the title proper and keywords are entered first as input values, and cataloging information is then extracted automatically on a computer in the following steps:

1. Topic crawler searches web pages

Because traditional search engines struggle to meet the requirements of retrieving specific cataloging information, the invention uses a vertical search engine to search for network resources related to a specific topic. The title proper and keywords entered by the cataloger serve as the crawler's topic set.

The topic crawler designed in the invention searches web pages as follows:

(1) Page document preprocessing

Fetch and parse the web page corresponding to the initial seed URL, segment its title text and body text into words to form a set of keyword feature items, and match that set against the topic set to obtain a feature item vector with the same dimension as the topic vector.

(2) Keyword feature item weight calculation

The invention improves on the TF weighting algorithm of the traditional vector space model. Traditional TF weighting considers only how often a keyword feature item occurs in a web page. When a page is read, however, title text and body text clearly differ in importance. Because TF weighting ignores where a keyword feature item appears in the page, the similarity between the keyword vector and the topic vector is inaccurate. The invention proposes a "weighting algorithm based on the position factor of keyword feature items", calculated as follows:

a) Define the different positions at which feature items appear, and assign a different position weight factor to feature items at each position.

The positions at which a keyword feature item appears are divided into three classes: the Title tag, the heading (H1-H6) tags, and other positions in the body. The importance of these three classes to a feature item decreases in that order.

A position weight factor PG is then introduced to express the importance of a feature item at each position; the larger PG is, the more important a feature item at that position. Define $PG_i$ ($i = 1, 2, 3$) as the weight factor for each position, where $i$ indexes the three classes above. Because importance decreases across the three classes, it is required that $PG_i \ge PG_{i+1}$ ($1 \le i \le 2$).

Define $TF'_i$ ($i = 1, 2, 3$) as the frequency with which feature item $t$ appears at each position.

b) Calculate the position-based feature item weight.

Based on the positions of a feature item, the weight DWeight(t) of a keyword feature item t in page document D is calculated as:

$DWeight(t) = \sum_{i=1}^{3} (TF'_i \times PG_i)$ (Formula 1)
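
As an illustration only, Formula 1 might be sketched in Python as follows. The numeric values in `PG` are hypothetical (the text fixes only their ordering), and treating $TF'_i$ as a relative frequency within each position is an assumption; a raw count would fit the text equally well. The function name `dweight` and the dictionary layout are also invented for this sketch.

```python
from collections import Counter

# Hypothetical position weight factors with PG_1 >= PG_2 >= PG_3.
# The text specifies only the ordering, not these values.
PG = {"title": 3.0, "heading": 2.0, "body": 1.0}


def dweight(term, tokens_by_position, pg=PG):
    """Formula 1: sum over the three positions of TF'_i * PG_i for one term."""
    total = 0.0
    for position, tokens in tokens_by_position.items():
        if not tokens:
            continue  # an empty position contributes nothing
        # Assumption: TF'_i is the term's relative frequency at this position.
        tf = Counter(tokens)[term] / len(tokens)
        total += tf * pg[position]
    return total


# Tokens already segmented and grouped by where they appeared in the page.
page = {
    "title": ["beijing", "olympics"],
    "heading": ["olympics", "schedule"],
    "body": ["the", "olympics", "opened", "in", "beijing", "today"],
}
print(dweight("olympics", page))
```

With these weights, one occurrence in the Title tag contributes more than one occurrence in the body, which is the effect the position factor is meant to produce.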

(3) Page content topic similarity calculation

Using the position-based weight of each keyword feature item, the similarity Sim(D) between the keyword feature items of the searched page document D and the topic Topic is calculated as follows:

$Sim(D) = \frac{\sum_{j=1}^{z} DWeight(j) \times TopicWeight(j)}{\sqrt{\sum_{j=1}^{z} DWeight(j)^{2}} \times \sqrt{\sum_{j=1}^{z} TopicWeight(j)^{2}}}$ (Formula 2)

In Formula 2, the keyword feature items of the searched page document D and the topic Topic have the same dimension, denoted z. DWeight(j) is the weight of the j-th keyword feature item in document D, and TopicWeight(j) is the weight of the j-th topic item in Topic, with 1 ≤ j ≤ z.

According to experimental analysis, a similarity threshold TH in the range 0.4 to 0.6 identifies topic-related web pages more accurately. If the similarity between the web document and the topic satisfies Sim(D) ≥ TH, the document is judged similar to the topic; otherwise it is judged not similar.
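
Formula 2 is the cosine of the angle between the document weight vector and the topic weight vector. A minimal Python sketch of that calculation and of the threshold test is given below. The function names are invented for illustration, and returning 0 for an all-zero vector is an assumption, since the text does not cover that case.

```python
import math


def topic_similarity(doc_weights, topic_weights):
    """Formula 2: cosine similarity of two weight vectors of dimension z."""
    if len(doc_weights) != len(topic_weights):
        raise ValueError("both vectors must have the same dimension z")
    dot = sum(d * t for d, t in zip(doc_weights, topic_weights))
    norm_d = math.sqrt(sum(d * d for d in doc_weights))
    norm_t = math.sqrt(sum(t * t for t in topic_weights))
    if norm_d == 0 or norm_t == 0:
        return 0.0  # assumption: a vector with no weight matches nothing
    return dot / (norm_d * norm_t)


def is_on_topic(doc_weights, topic_weights, th=0.4):
    """Apply the threshold TH (0.4 to 0.6 in the text) to Sim(D)."""
    return topic_similarity(doc_weights, topic_weights) >= th


print(is_on_topic([2.0, 0.5, 0.0], [1.0, 1.0, 1.0]))
```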

(4) Link similarity calculation

This step determines the search direction of the topic crawler. Using the topic search strategy designed in the invention, URL links are judged for similarity and ranked by priority, so that the crawler is given the best URL links and crawling efficiency improves.

The invention combines three kinds of information to calculate the topic similarity of a candidate URL: the content similarity of the current page, the URL directory hierarchy information of hyperlinks in the page, and the anchor text of hyperlinks in the page. The calculation steps are:

a) Calculate the topic similarity of the current page content, that is, use the result of step (3);

b) For each link on the current page, use the URL directory information to determine whether the link and the current page are adjacent in directory position. If so, estimate the similarity of the target page from the topic similarity of the current page content; if not, evaluate the similarity of the link by analyzing its anchor text. The calculation formula is:

(Formula 3)

where:

D: the current web page;

L: the anchor text of a hyperlink in web page D;

C: the target web page that L points to;

Sim(D): the topic similarity of web page D;

Sim(L): the topic similarity of the anchor text of link L;

Sim(C): the estimated topic similarity of the target web page C that L points to;

x: the influence coefficient, with x between 0 and 1. x adjusts how weight is divided between the topic similarities of D and L: the larger x is, the more the formula favors the anchor text; the smaller x is, the more it favors the topic similarity of the parent page. Based on experiments, x can be set to 0.7 to 0.8.

If the similarity between the link and the topic satisfies Sim(C) ≥ TH, the link is judged similar to the topic; otherwise it is judged not similar.
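
Formula 3 itself is not reproduced in this text, so the following Python sketch is only one plausible reading of the description, not the patented formula. It assumes that a directory-adjacent link inherits Sim(D), and that otherwise Sim(C) is the linear blend x·Sim(L) + (1 − x)·Sim(D), which matches the stated role of x. The adjacency test (same host, same or parent/child directory) and both function names are likewise assumptions.

```python
from posixpath import dirname
from urllib.parse import urlsplit


def directories_adjacent(page_url, link_url):
    """Hypothetical adjacency test: same host, and the two directory paths
    are equal or one is the direct parent of the other."""
    a, b = urlsplit(page_url), urlsplit(link_url)
    if a.netloc != b.netloc:
        return False
    dir_a = dirname(a.path) or "/"
    dir_b = dirname(b.path) or "/"
    return dir_a == dir_b or dirname(dir_a) == dir_b or dirname(dir_b) == dir_a


def estimate_link_similarity(sim_d, sim_l, adjacent, x=0.7):
    """Assumed form of Sim(C); the source does not show Formula 3."""
    if adjacent:
        return sim_d  # adjacent directory: inherit the page's similarity
    # Otherwise weigh anchor text (x) against the parent page (1 - x).
    return x * sim_l + (1 - x) * sim_d


page = "http://example.com/news/2008/a.html"
link = "http://example.com/sports/b.html"
adjacent = directories_adjacent(page, link)
print(adjacent, estimate_link_similarity(0.6, 0.9, adjacent))
```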

(5) Download the web pages that are similar to the topic into a database and index them for the cataloging information extraction in the next step.

2. Cataloging information extraction based on ontology and HTML

For the topic pages with high similarity found in step 1, cataloging information is extracted using an extraction method based on ontology and HTML.

In ordinary Web information extraction, the structure of HTML pages changes easily and page content lacks semantic description. To address this, the invention combines HTML-structure-based and ontology-based information extraction: the HTML-structure-based principle is used to locate the information block that contains the information to be extracted, and the ontology-based principle is used to extract the specific information, which resolves the semantic problem of describing extracted items.

The execution process is shown in Figure 2, with the following steps:

(1) Build the ontology: build a multimedia content extraction ontology. Concepts in the ontology are defined as the cataloging description items of the content description information of multimedia files; properties are defined as the relations between concepts; and the label property of a concept defines the various terms that correspond to that description item in each extraction data source.

(2) Parse the web page: clean the HTML page, correct page errors, remove redundant information, convert the page into an XHTML document, and then parse that document into a DOM tree.

(3) Generate extraction rules: the information to be extracted from each Web source is usually concentrated in one contiguous information block. The system locates information by combining tree paths with text content, generates XPath paths, and produces extraction rules.

(4) Read the extraction rules: read the extraction rules generated in step (3).

(5) Read the ontology: read the multimedia content extraction ontology and operate on its classes, properties, and instances.

(6) Run the extraction algorithm: with the extraction rules from step (4) and the ontology from step (5) as input, run the extraction algorithm. Specifically: split the information in the block to be extracted, within the DOM tree produced by HTML parsing, into key-value pairs; read the concepts in the multimedia content extraction ontology and the label property values of those concepts; if a key in the DOM tree matches a label property value of a concept in the ontology, save the concept and the corresponding value to an XML file. In this way all data contained in the page's information block is extracted. The extracted information can be added to the extraction ontology as instances of its concepts, extending the ontology model.

(7) Information fusion and storage: an extraction result is produced for each web page data source, so multiple data sources yield multiple result files, whose contents partly overlap and partly differ. The system compares and analyzes the information in the result files, fuses it, and finally generates a single extraction result file.
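
Steps (6) and (7) above can be sketched in Python as follows, under simplifying assumptions: the ontology is reduced to a hypothetical dictionary from concept name to label terms, the key-value pairs are assumed to have been pulled from the DOM already, and results are kept in dictionaries rather than written to XML files. The concept names, labels, and the fusion rule (keep every distinct value) are invented for illustration.

```python
# Hypothetical ontology: concept -> label terms seen on source pages.
ONTOLOGY = {
    "Director": {"director", "directed by"},
    "ReleaseDate": {"release date", "released"},
}


def extract(pairs, ontology=ONTOLOGY):
    """Step (6): keep key-value pairs whose key matches a concept label."""
    result = {}
    for key, value in pairs:
        # Normalize keys such as "Director:" before matching labels.
        normalized = key.strip().rstrip(":").strip().lower()
        for concept, labels in ontology.items():
            if normalized in labels:
                result[concept] = value.strip()
    return result


def fuse(results):
    """Step (7), assumed rule: merge sources, keeping each distinct value."""
    merged = {}
    for result in results:
        for concept, value in result.items():
            values = merged.setdefault(concept, [])
            if value not in values:
                values.append(value)
    return merged


source_a = extract([("Director:", " A. Example "), ("Runtime", "120 min")])
source_b = extract([("Directed by", "A. Example"), ("Released", "2008")])
print(fuse([source_a, source_b]))
```

Keys with no matching label, such as "Runtime" here, are simply dropped, which mirrors the rule that only keys corresponding to an ontology label are saved.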

3. Natural-language-based normalization of cataloging information

First, the invention initializes a standard vocabulary of description items. Then, for the cataloging information extracted in step 2, it runs a HowNet-based semantic similarity algorithm to generate normalized cataloging description items.

The system reads the cataloging item information extracted in step 2, together with the HowNet lexicon and sememe tree files, finds the two words to be matched, and then calculates their semantic similarity. A Chinese word consists of one or more senses (concepts), and the similarity of two words is the maximum of the semantic similarities of their concepts. The similarity of two Chinese words thus reduces to the similarity of two concepts, and every concept is ultimately expressed in sememes. The semantic similarity of two Chinese words is therefore calculated step by step as follows, starting from sememe similarity.

1) Calculate the semantic similarity of sememes

In the tree-shaped hierarchy of sememes, suppose the path distance between two sememes X and Y is dis (a positive integer). The semantic similarity Sim(X, Y) of the two sememes is calculated by Formula 4:

$Sim(X, Y) = \frac{\alpha}{dis + \alpha}$ (Formula 4)

where α is the path length at which the similarity equals 0.5; based on experimental analysis, α is set to 1.6.

Formula 4 is applied separately to the four types of sememes: the first independent sememe, the other independent sememes, the relational sememes, and the symbolic sememes.
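
A small Python sketch of Formula 4 follows. The toy sememe tree, given as a child-to-parent mapping, is hypothetical and is not taken from HowNet. The helper `path_distance` counts edges through the nearest common ancestor, which is an assumed reading of "path distance".

```python
ALPHA = 1.6  # value given in the text


def path_distance(a, b, parent):
    """Number of edges between a and b in a tree given as child -> parent."""
    steps_from_a = {}
    node, steps = a, 0
    while node is not None:  # record every ancestor of a with its depth
        steps_from_a[node] = steps
        node, steps = parent.get(node), steps + 1
    node, steps = b, 0
    while node is not None:  # climb from b until a shared ancestor is hit
        if node in steps_from_a:
            return steps_from_a[node] + steps
        node, steps = parent.get(node), steps + 1
    raise ValueError("the two sememes share no ancestor")


def sememe_similarity(dis, alpha=ALPHA):
    """Formula 4: alpha / (dis + alpha)."""
    if dis < 0:
        raise ValueError("path distance cannot be negative")
    return alpha / (dis + alpha)


# Hypothetical three-level tree, for illustration only.
toy_parent = {"animal": "entity", "plant": "entity", "human": "animal"}
dis = path_distance("human", "plant", toy_parent)
print(dis, sememe_similarity(dis))
```

Setting dis equal to α gives a similarity of exactly 0.5, which is how the text characterizes α.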

2) Calculate the semantic similarity of two concepts

The similarity of two concepts is obtained as a weighted average of the similarities of the four types of sememes above. The semantic similarity $Sim(S_1, S_2)$ between concepts $S_1$ and $S_2$ is therefore calculated by Formula 5:

$Sim(S_1, S_2) = \sum_{k=1}^{4} \beta_k \prod_{q=1}^{k} Sim_q(X, Y)$ (Formula 5)

where X and Y denote two sememes; $Sim_q(X, Y)$ is the semantic similarity of the q-th type of sememe, $1 \le q \le 4$; and $\beta_k$ ($1 \le k \le 4$) are the weights of the four types of sememes, representing how strongly each type influences concept similarity, with $\beta_1 + \beta_2 + \beta_3 + \beta_4 = 1$ and $\beta_1 \ge \beta_2 \ge \beta_3 \ge \beta_4$.

In general, the first type of sememe reflects the most important feature of a concept, so its weight is larger, and its similarity is usually larger as well. In practice, however, the similarity of the first type is sometimes smaller than that of the third or fourth type, which would make the overall concept similarity larger than it should be. In Formula 5, sememe types with small weights but potentially large similarities are handled by multiplying several sememe similarities together, which lowers the overall similarity. In other words, the similarity of the primary sememe constrains that of the secondary sememes: if the primary sememe similarity is low, the contribution of the secondary sememe similarities to the overall similarity is reduced as well.
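
The damping effect of the running product in Formula 5 can be seen in a short Python sketch. The weights in `BETA` are hypothetical; they merely satisfy the stated constraints (they sum to 1 and are non-increasing).

```python
# Hypothetical weights: sum to 1 and beta_1 >= beta_2 >= beta_3 >= beta_4.
BETA = (0.5, 0.2, 0.17, 0.13)


def concept_similarity(sims, beta=BETA):
    """Formula 5: sum over k of beta_k times the product of Sim_1..Sim_k."""
    if len(sims) != 4 or len(beta) != 4:
        raise ValueError("expected four sememe similarities and four weights")
    total, running_product = 0.0, 1.0
    for weight, sim in zip(beta, sims):
        running_product *= sim  # product of Sim_1 through Sim_k
        total += weight * running_product
    return total


# A low primary similarity drags down every later term.
print(concept_similarity((0.2, 1.0, 1.0, 1.0)))
print(concept_similarity((1.0, 1.0, 1.0, 0.2)))
```

In the first call the primary similarity of 0.2 scales all four terms, whereas in the second call the same low value in the fourth position affects only the last, lowest-weighted term.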

3) Calculate the semantic similarity of two Chinese words

For two Chinese words $W_1$ and $W_2$, suppose $W_1$ has n concepts $S_{11}, S_{12}, \ldots, S_{1n}$ and $W_2$ has m concepts $S_{21}, S_{22}, \ldots, S_{2m}$. Steps 1) and 2) are used to calculate the semantic similarity of every pair of concepts from the two words, and the maximum of the results is taken. That is, the semantic similarity $Sim(W_1, W_2)$ of words $W_1$ and $W_2$ is the maximum of the concept similarities, calculated by Formula 6:

$Sim(W_1, W_2) = \max_{v = 1 \ldots n;\ w = 1 \ldots m} Sim(S_{1v}, S_{2w})$ (Formula 6)

where $S_{1v}$ is the v-th concept of word $W_1$, $1 \le v \le n$, and n is the number of concepts of $W_1$; $S_{2w}$ is the w-th concept of word $W_2$, $1 \le w \le m$, and m is the number of concepts of $W_2$; and $Sim(S_{1v}, S_{2w})$ is the semantic similarity of concepts $S_{1v}$ and $S_{2w}$. Taking the maximum over all concept pairs gives the semantic similarity of the two words.
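
Formula 6 is a maximum over all concept pairs, which a short Python sketch makes concrete. The concept similarity function is passed in as a parameter; the toy function used in the example is a stand-in for illustration, not the HowNet-based calculation. Returning 0 when either word has no concepts is an assumption.

```python
from itertools import product


def word_similarity(concepts_1, concepts_2, concept_sim):
    """Formula 6: maximum concept similarity over all pairs (S_1v, S_2w)."""
    return max(
        (concept_sim(s1, s2) for s1, s2 in product(concepts_1, concepts_2)),
        default=0.0,  # assumption: no concepts on either side means 0
    )


def toy_sim(s1, s2):
    """Stand-in concept similarity used only for this example."""
    return 1.0 if s1 == s2 else 0.25


print(word_similarity(["sense_a", "sense_b"], ["sense_c", "sense_b"], toy_sim))
```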

The invention can intelligently and automatically provide description item information to catalogers, reduce manual labor, and improve cataloging efficiency, and it can adapt to the different needs of professional and non-professional catalogers and to the wide area network environment.

Brief Description of the Drawings

Figure 1: Overall flow of the wide-area-network-oriented method for acquiring intelligent audio and video cataloging information

Figure 2: Execution flow of cataloging information extraction based on ontology and HTML

Figure 3: Implementation of the topic crawler's web page search

Detailed Description

An embodiment of the invention is configured according to Figure 1. The computer used in this embodiment is a "DELL E02S, CPU: Intel Xeon 5504 2 GHz, 4 GB memory, operating system Server 2003 Standard Edition SP2".

The cataloger first opens the cataloging system interface and selects and plays the video file to be cataloged, then performs description in the description area on the right side of the interface. Once the "title proper" and "keyword" description items have been entered, the system automatically invokes the topic crawler to find related topic pages. Cataloging information is extracted from these pages by the ontology-and-HTML-based extraction function; this information, which may not be in standard form, is then normalized by the natural-language-based normalization function. Finally, the system automatically suggests the results in the corresponding description text boxes for the user to consult and select. The specific steps are as follows:

1. Topic crawler searches web pages

By improving the key algorithms of the topic crawler, the invention designs and implements an efficient topic crawler for information collection.

With the title proper and keywords entered by the cataloger as the crawler's topic set, the following steps are performed (as shown in Figure 3):

(1) Put the initial seed URLs into the to-crawl queue, where they wait to be crawled by the topic crawler.

(2) Fetch and parse the web page corresponding to an initial seed URL, obtain the text content of its <p> and <title> tags, and obtain all URL links on the page by analyzing its <a>, <area>, and <base> tags.

(3) Segment the page's title text and body text with the Lucene Chinese tokenizer, deduplicate the result to form a set of keyword feature items, match that set against the topic set, and use term frequency calculation to obtain a feature item vector with the same dimension as the topic vector.

(4) Using the "weighting algorithm based on the position factor of keyword feature items" proposed in the invention, calculate the position-based weight of each feature item and then the similarity between the page document and the topic. If the similarity is greater than or equal to the threshold (set to 0.4), download the page to the database and index it; otherwise discard it. Then put the URL link into the completed queue.

(5) Parse all links contained in the page and check each against the URLs in the completed queue. Discard duplicates. For each link that is not a duplicate, calculate its topic similarity with the link topic similarity algorithm designed in the invention (Formula 3), order the links for visiting by topic similarity, and put them into the to-crawl queue.

(6) Crawling ends when the to-crawl queue is empty and all threads are idle. The crawled web page links are used for the cataloging information extraction in the next step.
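
The queue handling in steps (1) to (6) can be sketched as a single-threaded best-first loop in Python. This is a simplification under stated assumptions: the embodiment uses several threads, whereas the sketch uses one; `fetch`, `page_similarity`, and `link_similarity` are hypothetical callables supplied by the caller; every non-duplicate link is queued and merely ordered by its estimated similarity; and `max_pages` is a safety cap added for the sketch.

```python
import heapq


def crawl(seeds, fetch, page_similarity, link_similarity, th=0.4, max_pages=100):
    """Best-first topic crawl.

    fetch(url) returns (page, links) or None on failure;
    page_similarity(page) returns Sim(D);
    link_similarity(sim_d, link) returns the estimated Sim(C).
    """
    # heapq is a min-heap, so priorities are negated to pop the best first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    queued = set(seeds)  # every URL ever queued, for duplicate checks
    done, kept = set(), []
    while frontier and len(done) < max_pages:
        _, url = heapq.heappop(frontier)
        done.add(url)  # the "completed queue" of step (4)
        fetched = fetch(url)
        if fetched is None:
            continue  # timed out or failed
        page, links = fetched
        sim_d = page_similarity(page)
        if sim_d >= th:
            kept.append(url)  # would be downloaded and indexed
        for link in links:
            if link in queued:
                continue  # duplicate, discard
            queued.add(link)
            heapq.heappush(frontier, (-link_similarity(sim_d, link), link))
    return kept
```

A priority queue is used here so that the link with the highest estimated similarity is always visited next, which is one straightforward way to realize the ordering described in step (5).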

To evaluate topic crawling performance, a crawler example on the topic "2008 Beijing Olympics" was designed as follows:

The parameters were first configured as: number of threads = 3, crawl depth = 1, number of initial seeds = 3. Topic crawling was then carried out using the method designed in the invention as described above.

A general topic crawler uses the traditional vector space model to judge topic similarity and a single search strategy based on content evaluation. The invention instead uses the "weighting algorithm based on the position factor of keyword feature items" to calculate the topic similarity of page content, and combines three factors (page content similarity, the URL directory hierarchy information of hyperlinks in the page, and the anchor text of hyperlinks in the page) to calculate the topic similarity of each link, so that topically similar pages are found more accurately.

This experiment compares a general topic crawler with the topic crawler designed in the invention; the results are given in Table 1.

Table 1: Comparison of a general topic crawler and the topic crawler of the invention

| Metric | General topic crawler | Topic crawler of the present invention |
| --- | --- | --- |
| Total links analyzed | 1280 | 1419 |
| Topic-related links | 413 | 494 |
| Timed-out links | 35 | 45 |
| Discarded links | 832 | 880 |
| Total time (seconds) | 3855.843 | 2191.344 |
| Crawl rate (links/second) | 0.332 | 0.648 |

Analysis of the experimental data shows that the topic crawler designed in the present invention has the following advantages:

(1) In total number of links, the topic crawler of the present invention analyzed more links than the general topic crawler (1419 versus 1280). This indicates that the search strategy adopted by the present invention, while analyzing as many topic-related links as possible, alleviates the "myopia" problem of topic crawlers and improves the crawler's recall;

(2) In number of topic-related links, the topic crawler of the present invention also obtained more than the general topic crawler (494 versus 413). This indicates that, compared with the topic similarity determination algorithm based on the traditional vector space model, the algorithm of the present invention obtains topic-related Web documents more accurately, improves the crawler's precision, and avoids misjudging topic-related documents;

(3) In crawl time and rate, the topic crawler of the present invention is markedly more efficient than the general topic crawler: total time fell from 3855.843 s to 2191.344 s, and the crawl rate rose from 0.332 to 0.648 links per second.
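The crawl rates reported in Table 1 can be re-derived from the other rows, and the share of analyzed links that are topic-related can be computed the same way. The sketch below uses only figures copied from Table 1; treating that share as a rough precision indicator is an assumption, since the text does not state how precision was measured, and `summarize` is a hypothetical helper name.

```python
# Figures copied from Table 1: (total links analyzed, topic-related links, seconds)
runs = {
    "general topic crawler": (1280, 413, 3855.843),
    "topic crawler of the invention": (1419, 494, 2191.344),
}


def summarize(total: int, relevant: int, seconds: float) -> tuple[float, float]:
    """Return (share of analyzed links that are topic-related, links per second)."""
    return relevant / total, total / seconds


for name, row in runs.items():
    share, rate = summarize(*row)
    print(f"{name}: topic-related share {share:.1%}, rate {rate:.3f} links/s")
```

The computed rates round to the 0.332 and 0.648 links per second listed in the table, and the topic-related share is higher for the crawler of the invention.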

For the topic pages with high similarity found in step 1, cataloging information is extracted with an extraction method based on ontology and HTML. The HTML-structure-based extraction method analyzes and locates content within the HTML page structure and generates extraction rules, while the ontology-based extraction method resolves semantic issues. The specific processing steps are as follows:

(1) Use Protégé, the ontology construction tool developed at Stanford University, to build a multimedia content extraction ontology. Concepts in the ontology are defined as the cataloging items of the content description information of multimedia files, properties are defined as the relations between concepts, and the label attribute of each concept defines the polysemous words that correspond to that cataloging item in each extraction data source. The ontology is stored as a file in OWL format.

(2) Use the HTML Tidy toolkit to clean the HTML page, correcting errors and removing redundant information, and convert it into an XHTML document. Then use an XML parser to parse the document into a DOM tree.

(3) Locate the information using a positioning method that combines the tree path with the text content, generate the XPath path, and generate the extraction rules.

(4) Read the extraction rules generated in step (3).

(5) Use Jena, the Java-based open-source Semantic Web toolkit developed by HP, to read the multimedia content extraction ontology and operate on the classes, properties, and instances in the ontology.

(6) Take the extraction rules from step (4) and the ontology from step (5) as input and execute the extraction algorithm. Specifically: split the information in the block to be extracted, located in the DOM tree produced by HTML parsing, into key-value pairs; read the concepts in the multimedia content extraction ontology and the label attribute values of those concepts; if a key in the DOM tree corresponds to a label attribute value of a concept in the ontology, save the ontology concept and the corresponding value to an XML file.

(7) For the result files extracted from multiple web page data sources, compare and analyze the information for each cataloging item, fuse the information, and finally generate a single extraction result file (an XML document).
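The matching in step (6) can be illustrated with a small Python sketch. It is a stand-in under stated assumptions, not the Jena-based implementation: the ontology is replaced by a plain dictionary, the concept names and label values in `ONTOLOGY_LABELS` are hypothetical, and the rule that a key and its value are separated by a colon is assumed for illustration.

```python
import re
import xml.etree.ElementTree as ET
from typing import Iterable, Optional

# Hypothetical stand-in for the OWL ontology: concept -> label attribute values,
# i.e. the variant field names different websites use for the same cataloging item.
ONTOLOGY_LABELS = {
    "Director": {"导演", "Director", "Directed by"},
    "Language": {"语言", "语种", "Language"},
}


def split_key_value(text: str) -> Optional[tuple[str, str]]:
    """Split 'key: value' (ASCII or full-width colon) into a (key, value) pair."""
    parts = re.split(r"\s*[:：]\s*", text.strip(), maxsplit=1)
    if len(parts) != 2:
        return None
    return parts[0], parts[1]


def extract(block_lines: Iterable[str], labels=ONTOLOGY_LABELS) -> ET.Element:
    """Keep only the pairs whose key matches a concept label; emit them as XML."""
    root = ET.Element("catalog")
    for line in block_lines:
        pair = split_key_value(line)
        if pair is None:
            continue                      # not a key-value line
        key, value = pair
        for concept, names in labels.items():
            if key in names:              # key corresponds to a concept label
                ET.SubElement(root, concept).text = value
                break
    return root
```

Calling `ET.tostring(extract(lines), encoding="unicode")` would serialize the result, with each matched concept as an element name. Because the labels live in the ontology rather than in the code, a new site that names a field differently only requires adding a label value.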

To evaluate cataloging information extraction, ten film and television drama videos were used. For each, nine of the topic pages found in step 1 were analyzed and extracted with the cataloging information extraction method of the present invention. The website sources of these nine pages are Tencent, Sohu, Sina, Douban, Mtime, Tudou, Youku, Ku6, and Xunlei Kankan, with each website source providing one film or television drama video.

The extraction results of nearly a hundred extraction instances from the nine website sources were analyzed statistically as follows:

Table 2. Analysis of cataloging information extraction results based on ontology and HTML

where

Recall R = correctly extracted information / correct information that should be extracted

Precision P = correctly extracted information / extracted information

The F index, a weighted geometric mean of the recall R and the precision P, is used to evaluate extraction performance as a whole. In it, δ is the relative weight of recall and precision: when δ = 1, recall and precision are equally important; when δ > 1, precision is more important; when δ < 1, recall is more important. In the experiment, δ = 1 was used so that recall and precision are equally important.

The data in the table above show that recall is almost always above 80% and mostly close to 90%, and that the F index is also generally above 80%. This indicates that the method proposed in the present invention achieves good extraction results both on pages that are simple to extract and on pages with complex semantic descriptions, adapts well to changes in page structure, and is stable.

First, the HowNet lexicon and the sememe relation tree are stored as two text files, which reduces database storage time and makes it easier for users to extend and modify the lexicon themselves. The system then reads both text files automatically, matches each item of cataloging information extracted in step 2 against the lexicon, parses the matched string, and separates out the word's part of speech and its sememe string. Next, using the algorithm designed in the present invention, it calculates the similarity of the sememes (with α = 1.6), then the similarity between all concepts contained in the two words (with β₁ = 0.5, β₂ = 0.2, β₃ = 0.17, β₄ = 0.13), and takes the maximum as the semantic similarity of the two words. Normalized cataloging items whose similarity exceeds the threshold are displayed in the system interface. Statistical analysis of a large number of experiments shows that a threshold of 0.7 gives more accurate results.

Based on the extraction results of nearly a hundred extraction instances from the nine website sources in step 2, a cataloging information normalization experiment was carried out with the word similarity algorithm of the present invention. The results show that the words selected by the present invention according to similarity agree with the actual situation.

Take the "audience" cataloging item as an example. The content extracted for this item in step 2 is "儿童" (children), while in the normalized lexicon established in step 3 the entries for "audience" are "未成年" (minors), "成年" (adults), and "大众" (general public). Applying the semantic similarity algorithm designed in the present invention, the semantic similarity between "儿童" and each of the three normalized entries is calculated, and "未成年" is found to be semantically closest to "儿童". The system therefore automatically displays the result "未成年" in the text box of the "audience" cataloging item.

Take the "language" cataloging item as a second example. The content extracted for this item in step 2 is "汉语" (Chinese language). Applying the semantic similarity algorithm designed in the present invention, the semantic similarity between "汉语" and each "language" entry in the normalized lexicon is calculated, and "中文" (Chinese) in the normalized lexicon is found to be semantically closest to "汉语". The system therefore automatically displays the result "中文" in the text box of the "language" cataloging item.
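The selection logic in these two examples, taking the normalized entry with the highest similarity and showing it only if the similarity exceeds the threshold of 0.7, can be sketched as follows. The function name `normalize` and its parameters are hypothetical, and the similarity function is passed in as an argument, so this sketch does not reproduce the HowNet-based calculation itself.

```python
from typing import Callable, Optional, Sequence


def normalize(extracted: str,
              normalized_entries: Sequence[str],
              similarity: Callable[[str, str], float],
              threshold: float = 0.7) -> Optional[str]:
    """Return the normalized entry closest to `extracted`, or None if none exceeds the threshold."""
    best_entry: Optional[str] = None
    best_score = threshold            # an entry must score strictly above the threshold
    for entry in normalized_entries:
        score = similarity(extracted, entry)
        if score > best_score:
            best_entry, best_score = entry, score
    return best_entry
```

Returning `None` when no entry exceeds the threshold leaves the decision to the cataloger instead of forcing a poor match into the text box.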

The present invention uses a weighting algorithm based on the position factors of keyword feature items, assigning different weight factors to feature items at different positions in a document so that the topic similarity of web page content is calculated more accurately. It combines three factors, web page content similarity, the URL directory hierarchy information of hyperlinks, and the anchor text information of hyperlinks, to preferentially select links with higher topic similarity. For the topic pages found, cataloging information is extracted automatically with an information extraction method based on ontology and HTML. The extracted cataloging information is normalized with an improved semantic similarity calculation method. The invention can intelligently and automatically provide cataloging item information to catalogers, reduce manual labor, and improve cataloging efficiency; it meets the different needs of professional and non-professional catalogers and is suited to wide area network environments.

Claims

1. A method for acquiring intelligent audio and video cataloging information for a wide area network, characterized by comprising the following steps:

(1) searching web pages with a topic crawler

calculating the topic similarity of web page content with a weighting algorithm based on the position factors of keyword feature items; calculating link topic similarity by combining three factors: web page content similarity, the URL directory hierarchy information of hyperlinks, and the anchor text information of hyperlinks;

the weighting algorithm based on the position factors of keyword feature items comprises the following specific steps:

1) defining the different positions where feature items appear, and assigning different position weight factors to feature items at different positions;

the positions at which a keyword feature item appears are defined as 3 types: subject tag, title tag, and other positions in the body text; the importance of these 3 types of positions to the feature item decreases in that order;

a position weight factor PG is then introduced to express the importance of feature items at different positions: PG_i (i = 1, 2, 3) is the weight factor of a feature item at each position, where i denotes the 3 types of positions above, and PG_i ≥ PG_{i+1} (1 ≤ i ≤ 2);

TF'_i (i = 1, 2, 3) is the frequency with which the feature item t appears at each position;

2) calculating the weight of the feature item based on the position factors;

according to the positions of the feature item, the weight DWeight(t) of a keyword feature item t in the page document D is calculated by the following formula:

(formula 1)
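Formula 1 is not reproduced in this text, so the following is only a sketch under a stated assumption: that DWeight(t) is the sum over the three position types of PG_i multiplied by TF'_i, which is one natural reading of the definitions above. The numeric values in `POSITION_WEIGHTS` are hypothetical and chosen only to satisfy PG_1 ≥ PG_2 ≥ PG_3.

```python
from typing import Sequence

# Hypothetical position weight factors PG_1 >= PG_2 >= PG_3
# (subject tag, title tag, other positions in the body text).
POSITION_WEIGHTS = (3.0, 2.0, 1.0)


def dweight(tf_by_position: Sequence[float],
            pg: Sequence[float] = POSITION_WEIGHTS) -> float:
    """Assumed form of formula 1: DWeight(t) = sum_i PG_i * TF'_i."""
    if len(tf_by_position) != len(pg):
        raise ValueError("one frequency is required per position type")
    if any(a < b for a, b in zip(pg, pg[1:])):
        raise ValueError("PG_i must be non-increasing (PG_i >= PG_{i+1})")
    return sum(w * tf for w, tf in zip(pg, tf_by_position))
```

Under this assumed form, a keyword that occurs once in the subject tag and four times in the body text receives a higher weight than one that occurs four times in the body text only.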

The steps of calculating the similarity of the linking topics are as follows:

1) calculating the topic similarity of the current page content:

(formula 2)

in formula 2, D is the page document found, z is the dimension of the topic Topic, DWeight(j) is the weight of the j-th keyword feature item in D, Topicweight(j) is the weight of the j-th feature item in the topic Topic, and 1 ≤ j ≤ z;

2) for each link in the current page, judging from the URL directory information whether the link is adjacent to the current page in directory position; if so, estimating the similarity of the target page from the topic similarity of the current page's content; if not, evaluating the similarity of the link by analyzing its anchor text; the calculation formula is as follows:

(formula 3)

wherein:

D: the current web page;

L: the anchor text of a hyperlink in web page D;

C: the target web page pointed to by L;

sim(D): the topic similarity of web page D;

sim(L): the topic similarity of the anchor text of link L;

sim(C): the predicted topic similarity of the target web page C pointed to by L;

x: an influence coefficient with a value between 0 and 1, used to adjust how the weight is distributed between the topic similarity of page D and that of anchor text L; the larger x is, the more the formula leans toward the anchor text, and the smaller x is, the more it leans toward the topic similarity of the parent page;
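Formulas 2 and 3 are not reproduced in this text, so the sketch below rests on two labeled assumptions. For formula 2 it assumes the usual vector space cosine between the DWeight vector and the topic weight vector. For formula 3 it assumes that sim(C) equals sim(D) when the link is directory-adjacent, and otherwise equals x·sim(L) + (1 − x)·sim(D), which matches the stated role of x. The adjacency test (same host, same directory or a parent/child directory) and all function names are hypothetical.

```python
import math
import posixpath
from typing import Sequence
from urllib.parse import urlsplit


def cosine(dweights: Sequence[float], topic_weights: Sequence[float]) -> float:
    """Assumed form of formula 2: cosine of the page vector and the topic vector."""
    dot = sum(d * t for d, t in zip(dweights, topic_weights))
    norm = (math.sqrt(sum(d * d for d in dweights))
            * math.sqrt(sum(t * t for t in topic_weights)))
    return dot / norm if norm else 0.0


def directory(url: str) -> tuple[str, str]:
    """Return (host, directory part of the path) for a URL."""
    parts = urlsplit(url)
    return parts.netloc, posixpath.dirname(parts.path) or "/"


def adjacent(url_a: str, url_b: str) -> bool:
    """Hypothetical adjacency rule: same host, and same or parent/child directory."""
    host_a, dir_a = directory(url_a)
    host_b, dir_b = directory(url_b)
    if host_a != host_b:
        return False
    return (dir_a == dir_b
            or posixpath.dirname(dir_a) == dir_b
            or posixpath.dirname(dir_b) == dir_a)


def link_similarity(page_url: str, link_url: str,
                    sim_d: float, sim_l: float, x: float = 0.5) -> float:
    """Assumed form of formula 3: predicted topic similarity sim(C) of the link target."""
    if adjacent(page_url, link_url):
        return sim_d                       # inherit the parent page's similarity
    return x * sim_l + (1 - x) * sim_d     # blend anchor text and parent page
```

With x close to 1 a distant link is judged almost entirely by its anchor text, and with x close to 0 it inherits most of the parent page's score, so x controls how far the crawler trusts pages outside the current directory neighborhood.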

(2) extracting cataloging information based on ontology and HTML

analyzing and locating content within the HTML page structure with an information extraction method based on HTML structure, generating extraction rules, and resolving semantic issues with an information extraction method based on ontology; the specific processing steps are as follows:

1) constructing the ontology: constructing a multimedia content extraction ontology, wherein concepts in the ontology are defined as the cataloging items of the content description information of multimedia files, properties are defined as the relations between concepts, and the label attribute of each concept defines the polysemous words that correspond to that cataloging item in each extraction data source;

2) parsing the web page: cleaning the HTML page, correcting errors in the page, removing redundant information, converting it into an XHTML document, and then parsing the document into a DOM tree;

3) generating extraction rules: locating the information by combining the tree path with the text content, and generating the XPath path and the extraction rules;

4) reading the extraction rules: reading the extraction rules generated in step 3);

5) reading the ontology: reading the multimedia content extraction ontology, and operating on the classes, properties, and instances in the ontology;

6) executing the extraction algorithm: taking the extraction rules of step 4) and the ontology of step 5) as input, and executing the extraction algorithm; specifically: splitting the information in the block to be extracted, located in the DOM tree produced by HTML parsing, into key-value pairs; reading the concepts in the multimedia content extraction ontology and the label attribute values of those concepts; if a key in the DOM tree corresponds to a label attribute value of a concept in the ontology, saving the ontology concept and the corresponding value to an XML file, whereby all data contained in the web page information block is extracted; the extracted information can be added to the extraction ontology as instances of its concepts, thereby extending the ontology model;

7) fusing and storing information: an extraction result is produced for each web page data source, so that multiple data sources yield multiple extraction result files whose contents partly coincide and partly differ; the system fuses the information by comparing and analyzing the contents of the extraction result files, and finally generates a single extraction result file;
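The claim does not specify how the comparison in step 7) resolves conflicting values, so the following sketch substitutes a simple, clearly assumed rule: for each cataloging item, keep the value reported by the most sources, with ties going to the value seen first. The function name `fuse` and the representation of each result file as a dictionary are hypothetical.

```python
from collections import Counter


def fuse(results: list[dict[str, str]]) -> dict[str, str]:
    """Merge per-source results; assumed rule: majority vote per cataloging item."""
    fields: list[str] = []
    for result in results:              # collect items in first-seen order
        for field in result:
            if field not in fields:
                fields.append(field)

    fused: dict[str, str] = {}
    for field in fields:
        values = [r[field] for r in results if r.get(field)]  # skip missing/empty
        if values:
            # most_common keeps first-encountered order among equal counts
            fused[field] = Counter(values).most_common(1)[0][0]
    return fused
```

An item that only one source provides is kept as is, which reflects the point of fusion in the step above: sources that differ in coverage complement one another.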

(3) normalizing cataloging information based on natural language

using a semantic similarity calculation method with an improved calculation of the similarity of concept semantic expressions, calculating the similarity between the cataloging information extracted in step (2) and the entries of the normalized lexicon, and thereby determining the content of the normalized cataloging items; the specific process is as follows:

the system reads in the cataloging information extracted in step (2), the HowNet lexicon, and the sememe tree file, finds the two words to be matched, and calculates their semantic similarity; a Chinese word consists of one or more concepts, and the similarity of two words is the maximum of the semantic similarities between their concepts, so the similarity between two Chinese words reduces to the similarity between two concepts; since all concepts are ultimately expressed by sememes, the semantic similarity of two Chinese words is calculated step by step, starting from the similarity of sememes, as follows;

1) calculating semantic similarity of sememes

in the tree hierarchy formed by the sememes, assuming that the path distance between two sememes X and Y is dis, a positive integer, the semantic similarity Sim(X, Y) between the two sememes is calculated according to formula 4:

(formula 4)

wherein α is a parameter denoting the path length at which the similarity is 0.5;

the semantic similarity of four classes of sememes, namely the first independent sememe, the other independent sememes, the relational sememes, and the symbol sememes, is calculated according to formula 4;

2) calculating semantic similarity of two concepts

the similarity of two concepts is obtained as a weighted combination of the similarities of the four classes of sememes; the semantic similarity Sim(S₁, S₂) between concept S₁ and concept S₂ is calculated according to formula 5:

(formula 5)

wherein X and Y denote two sememes, and Sim_q(X, Y) denotes the semantic similarity of the q-th class of sememes, 1 ≤ q ≤ 4; β_k (1 ≤ k ≤ 4) are the weights of the four classes of sememes, expressing how strongly each class influences the semantic similarity of the concepts, with β₁ + β₂ + β₃ + β₄ = 1 and β₁ ≥ β₂ ≥ β₃ ≥ β₄;

in formula 5, for sememes that carry a smaller weight but may have a larger similarity, the product of the similarities of several sememe classes lowers their contribution to the overall similarity; that is, the similarity of the primary sememes constrains the similarity of the secondary sememes: if the similarity of the primary sememes is low, the contribution of the secondary sememes to the overall similarity is reduced as well;

3) calculating semantic similarity of two Chinese words

for two Chinese words W₁ and W₂, if W₁ has n concepts S₁₁, S₁₂, ..., S_1n and W₂ has m concepts S₂₁, S₂₂, ..., S_2m, the semantic similarity of each pair of concepts of the two words is calculated by steps 1) and 2), and the maximum of the results is taken; that is, the semantic similarity Sim(W₁, W₂) of word W₁ and word W₂ is the maximum of the similarities over all concept pairs, calculated by formula 6:

(formula 6)

wherein S_1v denotes the v-th concept of word W₁, 1 ≤ v ≤ n, n being the number of concepts in W₁; S_2w denotes the w-th concept of word W₂, 1 ≤ w ≤ m, m being the number of concepts in W₂; Sim(S_1v, S_2w) denotes the semantic similarity of concepts S_1v and S_2w; the semantic similarity of the two words is the maximum of the similarities over all concept pairs.
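Formulas 4 to 6 are not reproduced in this text. The sketch below therefore assumes forms that are consistent with the surrounding definitions rather than quoting the claimed formulas: formula 4 as α / (dis + α), which yields 0.5 when dis equals α; formula 5 as the sum over k of β_k times the product of Sim_1 to Sim_k, which gives the constraining effect of the primary sememes described above; and formula 6 as the maximum over all concept pairs. The parameter values α = 1.6 and β = (0.5, 0.2, 0.17, 0.13) are those given in the description; the function names are hypothetical.

```python
from itertools import product
from math import prod
from typing import Callable, Sequence, TypeVar

ALPHA = 1.6                       # value used in the description
BETAS = (0.5, 0.2, 0.17, 0.13)    # beta_1..beta_4 from the description; they sum to 1

Concept = TypeVar("Concept")


def sememe_similarity(dis: float, alpha: float = ALPHA) -> float:
    """Assumed form of formula 4: alpha / (dis + alpha); 0.5 when dis == alpha."""
    return alpha / (dis + alpha)


def concept_similarity(class_sims: Sequence[float],
                       betas: Sequence[float] = BETAS) -> float:
    """Assumed form of formula 5: sum_k beta_k * prod_{q<=k} Sim_q.

    class_sims holds Sim_1..Sim_4 for the four sememe classes, primary class first.
    """
    return sum(b * prod(class_sims[:k]) for k, b in enumerate(betas, start=1))


def word_similarity(concepts_1: Sequence[Concept],
                    concepts_2: Sequence[Concept],
                    concept_sim: Callable[[Concept, Concept], float]) -> float:
    """Formula 6 as described in the text: maximum over all concept pairs."""
    return max(concept_sim(s1, s2) for s1, s2 in product(concepts_1, concepts_2))
```

The product in the assumed formula 5 is what makes a low first-class similarity pull down every later term: with Sim_1 = 0.5 and the other three classes identical, the result is 0.5 rather than 0.75.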