CN105528411B - Device and method for full-text retrieval of ship equipment interactive electronic technical manual - Google Patents

Info

Publication number
CN105528411B
Authority
CN
China
Prior art keywords
module
retrieval
database
abbreviation
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510884252.XA
Other languages
Chinese (zh)
Other versions
CN105528411A (en)
Inventor
马良荔
覃基伟
苏凯
许国鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Naval University of Engineering PLA
Original Assignee
Naval University of Engineering PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Naval University of Engineering PLA
Priority to CN201510884252.XA
Publication of CN105528411A
Application granted
Publication of CN105528411B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/338: Presentation of query results
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/313: Selection or weighting of terms for indexing
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3325: Reformulation based on results of preceding query
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a full-text retrieval device for interactive electronic technical manuals (IETMs) of ship equipment. The device comprises a common source database, a professional vocabulary extraction module, an abbreviation extraction module, a first word segmentation module, a technical information term database, an equipment component name database, an abbreviation database, a general vocabulary database, a retrieval record database, a user retrieval command communication module, a retrieval module, a second word segmentation module, an index database and an index module. The invention combines the element-tag characteristics of data module documents with their content, queries with professional vocabulary, and raises the weight of professional vocabulary in both documents and search keywords. The system can therefore query at a certain semantic level and return results that are closer to the user's retrieval intent, which supports high recall and precision.

Description

Device and method for full-text retrieval of ship equipment interactive electronic technical manual

Technical field

The invention relates to the technical field of information retrieval, and in particular to a full-text retrieval device and method for interactive electronic technical manuals of ship equipment.

Technical background

At present, most technical data for ship equipment exists on paper. As a result, managing it is increasingly burdensome, duplication and redundancy grow, updates are difficult, and data interoperability, real-time delivery and sharing are hard to achieve. To address these problems, technical data is usually managed by compiling an Interactive Electronic Technical Manual (IETM): a technical publication prepared in a standard digital format that uses text, graphics, tables, audio and video to present the equipment's basic principles, operation and maintenance support through human-computer interaction. Because an IETM system holds a large amount of information, users usually rely on information retrieval to find the content they need quickly, and full-text retrieval is one of the most commonly used methods. Earlier IETM full-text retrieval methods mostly adopted general-domain retrieval schemes that did not fully account for the characteristics of specialized technical data, which led to unsatisfactory retrieval results.

Full-text retrieval is a retrieval method that matches all of the text in a document against the search keywords. In Chinese, words are not separated by spaces and carry no explicit boundary markers, so a Chinese string must be split into individual words according to some convention before a computer can interpret the sentence and match document text against search keywords. Chinese word segmentation has therefore become the core technology of Chinese full-text retrieval. Among the segmentation methods in common use, string-matching-based segmentation is the most widely applied: the string to be segmented is matched against a lexicon according to a given strategy to obtain the segmentation result. In a specialized field, if the lexicon lacks professional vocabulary, string-matching-based segmentation cannot achieve good results; the amount of professional vocabulary in the lexicon directly affects segmentation accuracy.
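The effect of a missing domain term can be seen in a minimal sketch of string-matching segmentation. The text does not say which matching strategy is meant, so forward maximum matching is used here as one common choice; the function name and the two tiny lexicons are illustrative assumptions.

```python
def forward_max_match(text, lexicon, max_len=8):
    """Greedy forward maximum matching: at each position take the longest
    lexicon entry that matches, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # single-character fallback
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


general = {"维修", "包络"}               # hypothetical general-purpose lexicon
with_domain = general | {"维修包络图"}   # same lexicon plus one domain term

print(forward_max_match("维修包络图", general))      # term is split into pieces
print(forward_max_match("维修包络图", with_domain))  # term is kept whole
```

Without the domain entry, "maintenance envelope diagram" (维修包络图) is broken into general words and a stray character, so a query for the whole term cannot match it as a unit.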

In the field of ship equipment IETMs there are two main classes of professional vocabulary. One is the names of ship equipment components, such as "SMR-7200 marine radar" and "05106 current-type propeller anemometer". The other is technical information terms, such as "tactical and technical specifications", "amplitude-comparison direction-finding principle" and "maintenance envelope diagram". Acquiring these two classes of professional vocabulary is therefore the first problem that IETM full-text retrieval must solve. Only by segmenting and matching data module (DM, Data Module) documents with both professional and general vocabulary can users quickly find the equipment technical information they need.

The full names of ship equipment are structurally complex and often mix digits, symbols, letters and other character types, so users commonly substitute an abbreviation for the full name. For example, for the equipment name "H1604A 'Ilkos Dignity' bulk carrier" (H1604A‘伊尔科斯尊严’号散货轮), users usually write "H1604A bulk carrier" (H1604A散货轮) or "Ilkos Dignity" (伊尔科斯尊严) instead. It is therefore not enough for the lexicon to contain only the full equipment names; abbreviation handling is a problem that segmentation and matching for ship equipment IETMs cannot avoid. An abbreviation of an equipment name is derived from the original term mainly in two ways: condensation and truncation. Condensation splits the original term into several parts and combines the characters or words that best represent the meaning of each part, as in "H1604A bulk carrier" above. Truncation takes one contiguous substring of the original term, as in "Ilkos Dignity" above.
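The two abbreviation forms can be told apart with simple string tests. The sketch below is a simplification under stated assumptions: truncation is modeled as a contiguous substring, and condensation as an in-order subsequence that is not contiguous. The function names are illustrative, and the subsequence test ignores the requirement that the chosen characters best represent each part.

```python
def is_truncation(full_name, abbr):
    """Truncation: the abbreviation is one contiguous substring of the full name."""
    return bool(abbr) and abbr != full_name and abbr in full_name


def is_condensation(full_name, abbr):
    """Condensation: characters are picked from several parts of the full name,
    so the abbreviation is an in-order subsequence but not a contiguous run."""
    if not abbr or abbr in full_name:
        return False
    remaining = iter(full_name)
    # `ch in remaining` consumes the iterator, which enforces the ordering.
    return all(ch in remaining for ch in abbr)


full = "H1604A‘伊尔科斯尊严’号散货轮"
print(is_condensation(full, "H1604A散货轮"))  # designation plus ship type
print(is_truncation(full, "伊尔科斯尊严"))     # the quoted common name
```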

Even once professional vocabulary has been acquired, existing segmentation methods do not match according to the characteristics of that vocabulary, so segmentation quality remains problematic. A segmentation method specific to this field therefore needs to be designed around the characteristics of the extracted vocabulary in order to obtain the best matching results.

Once the required information has been retrieved, how to rank the various results is another core problem for a full-text retrieval device and method. Data module documents contain many kinds of elements of differing importance, documents themselves differ in importance, and query keywords differ in importance as well. All three factors must be considered together to design a sound ranking method that yields results satisfying the user.

As the above shows, professional vocabulary acquisition, abbreviation acquisition, word segmentation and result ranking are the four major problems that a full-text retrieval device and method for ship equipment IETMs currently needs to solve.

Summary of the invention

The object of the invention is to provide a full-text retrieval device and method for interactive electronic technical manuals of ship equipment that let users quickly and accurately find the ship equipment technical information they need.

To this end, the invention provides a full-text retrieval device for interactive electronic technical manuals of ship equipment, comprising databases and functional modules. The databases comprise a common source database, a technical information term database, an equipment component name database, an abbreviation database, a general vocabulary database, a retrieval record database and an index database. The functional modules comprise a professional vocabulary extraction module, an abbreviation extraction module, a first word segmentation module, a user retrieval command communication module, a retrieval module, a second word segmentation module and an index module. The common source database provides the vocabulary extraction source for the professional vocabulary extraction module and the abbreviation extraction module, and provides the content to be segmented to the first word segmentation module. The professional vocabulary extraction module extracts vocabulary and stores it in the technical information term database and the equipment component name database. The abbreviation extraction module extracts vocabulary and stores it in the abbreviation database. The first word segmentation module passes the segmented content to the index module for processing;

The index module builds the index and stores it in the index database. The index database receives the search content segmented by the second word segmentation module, performs the matching lookup, and returns the matched result set to the retrieval module for ranking. The retrieval module sends the user's search content to the second word segmentation module for segmentation; it also receives retrieval commands from the user retrieval command communication module and returns the ranked result set to that module. The user retrieval command communication module sends the user's retrieval commands to the retrieval record database, which serves as a vocabulary extraction source for the abbreviation extraction module;

The technical information term database, the equipment component name database, the abbreviation database and the general vocabulary database each provide the matching word sets used during segmentation by the first word segmentation module and the second word segmentation module.

A retrieval method using the above full-text retrieval device for interactive electronic technical manuals of ship equipment comprises the following steps:

Step 1: Data module documents edited according to the selected IETM authoring standard (namely the S1000D standard) are imported into the common source database. Following the requirements of that standard, the professional vocabulary extraction module extracts two classes of professional vocabulary, technical information terms and equipment component names, from the data module documents in the common source database. It establishes a mapping between each term and the data module code information of the corresponding data module document, and stores both classes of vocabulary and their mappings in the technical information term database and the equipment component name database respectively;
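A minimal sketch of step 1 follows. The text does not state which element tags the extraction module reads, so the element names (`dmCode`, `techName`, `infoName`) are an assumption modeled on S1000D Issue 4.x data module headers, the sample document is invented, and the joined code string is a simplified stand-in for a full data module code.

```python
import xml.etree.ElementTree as ET

# Invented, heavily reduced data module; real S1000D documents carry many
# more elements and dmCode attributes.
SAMPLE = """<dmodule>
  <dmCode modelIdentCode="SHIPX" systemCode="72" infoCode="040"/>
  <dmTitle><techName>SMR-7200船用雷达</techName><infoName>比幅测向原理</infoName></dmTitle>
</dmodule>"""


def extract_terms(xml_text):
    """Return (component name -> code, technical term -> code) mappings."""
    root = ET.fromstring(xml_text)
    attrs = root.find(".//dmCode").attrib
    dm_code = "-".join(attrs[k] for k in ("modelIdentCode", "systemCode", "infoCode"))
    component_names = {el.text: dm_code for el in root.iter("techName") if el.text}
    info_terms = {el.text: dm_code for el in root.iter("infoName") if el.text}
    return component_names, info_terms


names, terms = extract_terms(SAMPLE)
print(names)
print(terms)
```

Keeping the code alongside each term is what later allows a matched term to be traced back to the data module it came from.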

Step 2: The abbreviation extraction module extracts, from the equipment component names in the common source database, the feature quantities of the corresponding abbreviations. A feature quantity is the numeric designation or the common-name part of an equipment component name;
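Feature extraction in step 2 might look like the sketch below. How the designation and common name are recognized is not specified in the text, so both patterns are assumptions: a designation is taken to be a run of ASCII letters, digits and hyphens that contains at least one digit, and a common name is taken to be the text inside quotation marks.

```python
import re

# Assumed heuristics, not taken from the source text.
CODE_RE = re.compile(r"[A-Za-z0-9-]*\d[A-Za-z0-9-]*")
QUOTED_RE = re.compile(r"[‘“']([^’”']+)[’”']")


def extract_features(component_name):
    """Return the designation-like and quoted common-name parts of a name."""
    codes = CODE_RE.findall(component_name)
    common_names = QUOTED_RE.findall(component_name)
    return codes + common_names


print(extract_features("H1604A‘伊尔科斯尊严’号散货轮"))
print(extract_features("SMR-7200船用雷达"))
print(extract_features("05106电流型螺旋桨风速仪"))
```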

Step 3: The abbreviation extraction module matches the feature quantities against the data module documents in the common source database and the user retrieval records in the retrieval record database, and determines the exact position of each feature-quantity element in those documents and records;
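Locating a feature quantity, as in step 3, reduces to finding every occurrence of a literal string. The sketch below assumes documents and retrieval records are available as plain strings; the function name and sample texts are illustrative.

```python
import re


def locate(feature, texts):
    """Return (text_index, start_offset) for every occurrence of `feature`."""
    pattern = re.compile(re.escape(feature))  # treat the feature as a literal
    return [(idx, m.start())
            for idx, text in enumerate(texts)
            for m in pattern.finditer(text)]


texts = ["H1604A散货轮主机", "查询H1604A散货轮"]  # one document sentence, one query
print(locate("H1604A", texts))
```

`re.escape` matters here because designations such as "SMR-7200" may contain characters that are special in regular expressions.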

Step 4: The abbreviation extraction module determines the leading and trailing strings of the abbreviation that contains the feature quantity and identifies the boundary fragments of that abbreviation, so that the abbreviation recognized is complete. The complete abbreviation is taken as a candidate abbreviation;

Step 5: The abbreviation extraction module calculates the weight of each candidate abbreviation using Formula 1:

[Formula 1: expression for $W_a$; not reproduced in this text.]

where $n_{mic}$ is the number of times the candidate abbreviation occurs in the specific content, the specific content being the data module document content whose equipment type identification code is the same as that of the equipment component name, together with the search keywords in the retrieval records for that data module document content; $n_{all}$ is the number of occurrences of the candidate abbreviation in all data module documents plus its number of occurrences in all retrieval records in the retrieval record database; $D_{all}$ is the total number of data module documents plus the total number of retrieval records; $D_{mic}$ is the number of data module documents containing the candidate abbreviation plus the number of retrieval records containing it; and $W_a$ is the weight of the candidate abbreviation, which measures how well the candidate characterizes a topic. The threshold for $W_a$ is a given value. When the weight of a candidate abbreviation is greater than or equal to the threshold, the candidate is regarded as a formal abbreviation and stored in the abbreviation database; when the weight is below the threshold, the candidate is not processed further;
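The four statistics defined above can be gathered as in the following sketch. Because Formula 1 itself is not available in this text, the weight function is passed in as a parameter rather than guessed; the lambda in the usage lines is a purely hypothetical stand-in and is not Formula 1. Tagging every text with an equipment type code is also a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    n_mic: int  # occurrences in texts sharing the equipment type code
    n_all: int  # occurrences in all documents and all retrieval records
    d_all: int  # number of documents plus number of retrieval records
    d_mic: int  # number of documents and records containing the candidate


def collect_stats(candidate, texts, equipment_code):
    """texts: (equipment_type_code, text) pairs covering both data module
    documents and retrieval records (a simplifying assumption)."""
    n_mic = sum(t.count(candidate) for code, t in texts if code == equipment_code)
    n_all = sum(t.count(candidate) for _, t in texts)
    d_mic = sum(1 for _, t in texts if candidate in t)
    return Stats(n_mic, n_all, len(texts), d_mic)


def accept(candidate, texts, equipment_code, weight_fn, threshold):
    """Keep the candidate when its weight reaches the given threshold."""
    return weight_fn(collect_stats(candidate, texts, equipment_code)) >= threshold


texts = [
    ("H1604A", "H1604A散货轮主机维护"),  # data module sentence
    ("H1604A", "H1604A散货轮"),          # retrieval record
    ("SMR", "SMR-7200船用雷达"),
]
placeholder = lambda s: s.n_mic / s.n_all if s.n_all else 0.0  # NOT Formula 1
print(collect_stats("H1604A散货轮", texts, "H1604A"))
print(accept("H1604A散货轮", texts, "H1604A", placeholder, threshold=0.5))
```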

Step 6: The first word segmentation module and the second word segmentation module segment, respectively, the data module documents and the user search keywords supplied by the retrieval module. The segmentation procedure is as follows:

Let the string to be segmented be $S_1 = w_1 w_2 w_3 \dots w_i \dots w_n$, where $S_1$ is either the user's search keyword string or one sentence of a data module document, $w_i$ is a single character of $S_1$, $n \ge 1$ is the length of the string, and $i$ is the character index between 1 and $n$;

The string $S_1$ is scanned using the abbreviation database. Whenever an abbreviation is hit, the matching substring of $S_1$ is restored to the corresponding original term, and this continues until all of $S_1$ has been scanned. The result is the string $S_2 = u_1 u_2 \dots u_i \dots u_m$, where $u_i$ is a single character of $S_2$ and $m$ is the length of $S_2$;
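The restoration scan can be sketched as a single left-to-right pass. The abbreviation database is modeled as a plain dictionary, and preferring the longest abbreviation at each position is an assumption; the text only says that hits are restored. The function also records where each restored term lands in $S_2$, which is useful when the restored terms are later treated as words.

```python
def restore_abbreviations(s1, abbr_to_full):
    """Replace each abbreviation hit in s1 with its original term.
    Returns S2 and the half-open (start, end) spans of restored terms in S2."""
    keys = sorted((k for k in abbr_to_full if k), key=len, reverse=True)
    out, spans = [], []
    i = 0    # read position in s1
    pos = 0  # write position in S2
    while i < len(s1):
        hit = next((k for k in keys if s1.startswith(k, i)), None)
        if hit is None:
            out.append(s1[i])
            pos += 1
            i += 1
        else:
            full = abbr_to_full[hit]
            spans.append((pos, pos + len(full)))
            out.append(full)
            pos += len(full)
            i += len(hit)
    return "".join(out), spans


full = "H1604A‘伊尔科斯尊严’号散货轮"
s2, spans = restore_abbreviations("查询H1604A散货轮主机", {"H1604A散货轮": full})
print(s2)
print(spans)
```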

In the first and second word segmentation modules, a directed acyclic graph $G$ with $m+1$ nodes is built from the string $S_2$, where $m$ is the length of the string. The nodes of $G$ are numbered $v_0, v_1, v_2, \dots, v_m$. A directed edge $\langle v_k, v_{k+1} \rangle$ is created between each pair of adjacent nodes $v_k$ and $v_{k+1}$, and the word corresponding to this edge is $u_{k+1}$ ($k = 0, 1, 2, \dots, m-1$). If two nodes of $G$ are directly connected by a directed edge, the distance between them is taken to be 1. If a substring $h_1 = u_p u_{p+1} \dots u_q$ ($1 \le p < q$) of $S_2$ is an original term restored from an abbreviation, a directed edge $\langle v_{p-1}, v_q \rangle$ is created with $v_{p-1}$ as its start node and $v_q$ as its end node, and the word corresponding to this edge is the substring $h_1$;

The string $S_2$ is matched against the technical information term database and the equipment component name database in turn. If there is a matching maximum-length substring $h_2 = u_a u_{a+1} \dots u_b$ ($1 \le a < b$), no directed edge $\langle v_{a-1}, v_b \rangle$ yet exists between nodes $v_{a-1}$ and $v_b$, and $a \ge p+1$ or $b \le q-1$ holds, then a directed edge $\langle v_{a-1}, v_b \rangle$ is created with $v_{a-1}$ as its start node and $v_b$ as its end node, and the word corresponding to this edge is the maximum-length substring $h_2$;

The string $S_2$ is matched against the general vocabulary database. If there is a matching string $h_3 = u_c u_{c+1} \dots u_d$ ($1 \le c < d$) and no directed edge $\langle v_{c-1}, v_d \rangle$ exists between nodes $v_{c-1}$ and $v_d$, a directed edge $\langle v_{c-1}, v_d \rangle$ is created with $v_{c-1}$ as its start node and $v_d$ as its end node, and the word corresponding to this edge is $h_3$. If a directed edge $\langle v_{c-1}, v_d \rangle$ already exists and its string type is the maximum-length substring $h_2$, then $h_2$ also occurs in the general vocabulary database, so the type of that edge is changed from $h_2$ to $h_4$;
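The edge-building rules above can be sketched as follows, using 0-based node indices so that an edge `(start, end)` covers `s2[start:end]`. This is a simplified reading: the positional condition relating $h_2$ to a restored span ($a \ge p+1$ or $b \le q-1$) is omitted, the lexicons are plain Python sets, and the type labels and `max_len` bound are illustrative assumptions.

```python
def build_edges(s2, restored_spans, domain_terms, general_words, max_len=20):
    """Return {(start, end): type} over nodes v_0..v_m of s2.
    Types: 'char' single character, 'h1' restored abbreviation, 'h2' domain
    term, 'h3' general word, 'h4' domain term also in the general lexicon."""
    m = len(s2)
    edges = {(k, k + 1): "char" for k in range(m)}
    for span in restored_spans:
        edges[span] = "h1"
    # Domain terms: only the longest match starting at each position.
    for a in range(m):
        for b in range(min(m, a + max_len), a + 1, -1):
            if s2[a:b] in domain_terms:
                edges.setdefault((a, b), "h2")  # never overrides an h1 edge
                break
    # General words: every match of length two or more.
    for c in range(m):
        for d in range(c + 2, min(m, c + max_len) + 1):
            if s2[c:d] in general_words:
                if edges.get((c, d)) == "h2":
                    edges[(c, d)] = "h4"
                else:
                    edges.setdefault((c, d), "h3")
    return edges


s2 = "维修包络图说明"
edges = build_edges(s2, [], {"维修包络图"}, {"维修", "说明"})
print({e: t for e, t in edges.items() if t != "char"})
```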

Once all directed edges have been generated, the $N$ shortest paths from node $v_0$ to node $v_m$ in $G$ are found, in order of increasing path length, with $N$ chosen as 3. The shortest path considers directed edges of all types. The second-shortest and third-shortest paths ignore directed edges of types $h_1$ and $h_2$ and consider only edges whose word type is $h_3$ or $h_4$; that is, the non-optimal paths use only the matches from the general lexicon. Duplicate directed edges across the three paths are removed, and the words corresponding to the remaining directed edges of each path are output. The resulting set is the final segmentation result;
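One way to read the path-selection rule is sketched below. Several points are interpretation rather than source: path length is counted in edges, the runner-up paths are taken as the shortest paths of the restricted graph that differ from the best path, and single-character edges stay available in the restricted graph so that a path always exists. The edge dictionary in the usage lines is hand-written for illustration.

```python
import heapq


def k_shortest_paths(m, edges, k):
    """k shortest v_0 -> v_m paths by edge count. Nodes are already in
    topological order, so a top-k list per node is sufficient."""
    best = [[] for _ in range(m + 1)]
    best[0] = [(0, [])]
    incoming = {}
    for a, b in edges:
        incoming.setdefault(b, []).append(a)
    for j in range(1, m + 1):
        candidates = [(length + 1, path + [(a, j)])
                      for a in incoming.get(j, [])
                      for length, path in best[a]]
        best[j] = heapq.nsmallest(k, candidates)
    return [path for _, path in best[m]]


def segment(s2, edges, n=3):
    """edges: {(start, end): type} with types char/h1/h2/h3/h4."""
    m = len(s2)
    paths = k_shortest_paths(m, set(edges), 1)  # best path, all edge types
    general_only = {e for e, t in edges.items() if t in ("char", "h3", "h4")}
    runner_ups = [p for p in k_shortest_paths(m, general_only, n) if p not in paths]
    paths += runner_ups[:n - 1]
    seen, words = set(), []
    for path in paths:
        for a, b in path:
            if (a, b) not in seen:  # drop edges repeated across paths
                seen.add((a, b))
                words.append(s2[a:b])
    return words


s2 = "维修包络图说明"
edges = {(k, k + 1): "char" for k in range(len(s2))}
edges.update({(0, 5): "h2", (0, 2): "h3", (5, 7): "h3"})
print(segment(s2, edges))
```

Emitting words from the runner-up paths means the index holds both the whole domain term and its general-vocabulary pieces, so a query for either can match.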

Step 7: The first word segmentation module stores the final segmentation results in the corresponding fields of the index documents in the index database and sets a weight for each field. The fields of an index document are the title field, path field, link text field, subtitle field and body field;

Step 8: The weights of the index documents in the index database are set, and multiple index documents are grouped into segments that together form the index file. Index document weighting consists of a standard numbering system (SNS) code weight and an information code weight, each set according to the coding characteristics of data module documents. For the SNS code, the lower the equipment hierarchy level the code denotes, the higher the weight factor. For the information code, sub-category information codes receive a higher weight than main-category ones. The SNS code weight and the information code weight are then multiplied to obtain the weight of the index document;
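The document weight rule in step 8 is a product of two factors. The sketch below uses invented weight tables, since the text gives only the ordering (lower hierarchy level and sub-category codes weigh more) and no numeric values; hierarchy level is modeled as a depth, where a larger depth means a lower level in the equipment breakdown.

```python
# Hypothetical values; only their ordering follows the text.
SNS_DEPTH_WEIGHT = {1: 1.0, 2: 1.2, 3: 1.5}   # 1 = system, 3 = lowest level
INFO_CODE_WEIGHT = {"main": 1.0, "sub": 1.3}  # sub-category outranks main


def document_weight(sns_depth, info_code_kind):
    """Index document weight = SNS code weight x information code weight."""
    return SNS_DEPTH_WEIGHT[sns_depth] * INFO_CODE_WEIGHT[info_code_kind]


print(document_weight(1, "main"))
print(document_weight(3, "sub"))
```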

Step 9: The retrieval module provides the full-text retrieval service to users. It receives the user's retrieval request and invokes the query procedure: the user's search keywords are segmented as in step 6 and then matched against the segmented content of each field of the documents in the index built in step 7, and all matching documents are collected as the result set.
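Steps 7 to 9 can be tied together in a small matching and ranking sketch. The scoring rule (document weight times the sum of field weight times hit count) and all numeric weights are assumptions made for illustration; the text states that fields and documents carry weights but does not give a ranking formula here. Index documents are modeled as plain dictionaries.

```python
# Hypothetical field weights; the five field names follow step 7.
FIELD_WEIGHT = {"title": 3.0, "path": 1.0, "link_text": 1.5,
                "subtitle": 2.0, "body": 1.0}


def search(query_tokens, index_docs):
    """Return ids of matching documents, highest score first."""
    results = []
    for doc in index_docs:
        score = 0.0
        for field, tokens in doc["fields"].items():
            hits = sum(tokens.count(t) for t in query_tokens)
            score += FIELD_WEIGHT[field] * hits
        if score > 0:  # only matching documents enter the result set
            results.append((score * doc["weight"], doc["id"]))
    return [doc_id for _, doc_id in sorted(results, reverse=True)]


index_docs = [
    {"id": "A", "weight": 1.0, "fields": {"title": ["维修包络图"], "body": ["说明"]}},
    {"id": "B", "weight": 1.0, "fields": {"body": ["维修包络图", "维修包络图"]}},
    {"id": "C", "weight": 1.0, "fields": {"body": ["雷达"]}},
]
print(search(["维修包络图"], index_docs))
```

In this toy index a single title hit outranks two body hits, which is the kind of effect per-field weighting is meant to produce.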

The invention addresses the problems that arise when existing full-text retrieval devices and methods are used in the specialized field of ship equipment IETMs: professional vocabulary and its abbreviations are missing, no suitable segmentation algorithm is available, and the ranking of results is not optimized. By analyzing the data module document structure and the characteristics of specific element tags under the selected IETM authoring standard (namely the S1000D standard), together with the types and characteristics of the professional vocabulary found in ship equipment technical data, the invention extracts professional vocabulary and its abbreviations. It designs a segmentation algorithm tailored to the characteristics of the several vocabulary classes, segments data module document content and stores it in an index for fast lookup, and sets weights for the various factors to solve the result-ranking problem, thereby completing the IETM full-text retrieval device and method. The device and method combine the element-tag characteristics of data module documents with their content, query with professional vocabulary, and raise the weight of professional vocabulary in both documents and search keywords. The system can therefore query at a certain semantic level and return results that are closer to the user's retrieval intent, which supports high recall and precision for the retrieval device.

Description of drawings

Fig. 1 is a schematic structural diagram of the full-text retrieval device for interactive electronic technical manuals of ship equipment according to the invention.

In the figure: 1: common source database; 2: professional vocabulary extraction module; 3: abbreviation extraction module; 4: first word segmentation module; 5: technical information term database; 6: equipment component name database; 7: abbreviation database; 8: general vocabulary database; 9: retrieval record database; 10: user retrieval command communication module; 11: retrieval module; 12: second word segmentation module; 13: index database; 14: index module.

Detailed description

The invention is described in further detail below with reference to the drawing and a specific embodiment:

The full-text retrieval device for interactive electronic technical manuals of ship equipment shown in Fig. 1 comprises databases and functional modules. The databases comprise a common source database 1, a technical information term database 5, an equipment component name database 6, an abbreviation database 7, a general vocabulary database 8, a retrieval record database 9 and an index database 13. The functional modules comprise a professional vocabulary extraction module 2, an abbreviation extraction module 3, a first word segmentation module 4, a user retrieval command communication module 10, a retrieval module 11, a second word segmentation module 12 and an index module 14. The common source database 1 provides the vocabulary extraction source for the professional vocabulary extraction module 2 and the abbreviation extraction module 3, and provides the content to be segmented to the first word segmentation module 4. The professional vocabulary extraction module 2 extracts vocabulary and stores it in the technical information term database 5 and the equipment component name database 6. The abbreviation extraction module 3 extracts vocabulary and stores it in the abbreviation database 7. The first word segmentation module 4 passes the segmented content to the index module 14 for processing;

The index module 14 builds the index and stores it in the index database 13. The index database 13 receives the retrieval content segmented by the second word segmentation module 12, performs the matching lookup, and returns the matched result set to the retrieval module 11 for sorting. The retrieval module 11 sends the user's retrieval content to the second word segmentation module 12 for segmentation; it also receives retrieval commands from the user retrieval command communication module 10 and returns the sorted result set to that module for the user to view. The user retrieval command communication module 10 sends the user's retrieval commands to the retrieval record database 9, which serves as a vocabulary extraction source for the abbreviation extraction module 3;

The technical information term database 5, the equipment part name database 6, the abbreviation database 7 and the general vocabulary database 8 each provide the matching word sets used by the first word segmentation module 4 and the second word segmentation module 12 during segmentation.

A retrieval method using the above full-text retrieval device for ship equipment interactive electronic technical manuals comprises the following steps:

Step 1: Data module documents authored according to the selected IETM authoring standard (S1000D in this embodiment) are imported into the common source database 1. Following the requirements of that standard, the professional vocabulary extraction module 2 extracts two classes of professional vocabulary, technical information terms and equipment part names, from the data module (DM) documents in the common source database 1. It establishes the mapping between each extracted term and the data module code information of the corresponding data module document, and stores the two classes of vocabulary together with their mappings in the technical information term database 5 and the equipment part name database 6 respectively;

Step 2: The abbreviation extraction module 3 extracts, from each equipment part name (full name) in the common source database 1, the feature of its corresponding abbreviations. The feature is the numeric designation or the common-name part of the equipment part name. For example, for the full equipment name 'H1604A "伊尔科斯尊严" bulk carrier', any abbreviation must contain the number "1604", the common name "伊尔科斯尊严", or both. Such features can therefore be used to locate the positions where abbreviations may occur. The remaining characters of the full name are then matched against the strings before and after the feature to identify the boundary of the abbreviation, so that the identified abbreviation is the longest matching word. The weight of the abbreviation is computed and compared with a threshold, and the mapping between the full equipment name and the abbreviation is built and stored in the abbreviation dictionary, which completes the extraction;

The abbreviation extraction module 3 extracts the abbreviation feature from an equipment part name (full name) in the common source database 1 as follows. Because each class of ship equipment follows a fixed naming rule, the rule can be used to determine the type of an equipment name and to split the name into its components, which yields the feature. Let the full equipment name be W_0 = w_1 w_2 … w_n, where w_i is its i-th character. First, regular expressions for the naming rules of each equipment class are written with a grammar tool such as JAPE (Java Annotation Patterns Engine). These expressions are used to determine the name type of every W_0 in the equipment part name lexicon built in step 1, and W_0 is split according to the rule it matches, giving the abbreviation feature W_1 = w_p … w_q, 1 ≤ p < q ≤ n;
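The rule-based feature extraction above can be sketched as follows. This is a minimal illustration only: the two naming patterns, their rule names and the second sample name are hypothetical and are not rules taken from the patent, and Python's `re` module stands in for JAPE.

```python
import re

# Hypothetical naming rules (assumptions for illustration only).
# Each rule identifies an equipment class and captures the feature parts
# (numeric designation and/or common name) as named groups.
NAMING_RULES = {
    "hull": re.compile(r'^[A-Z](?P<number>\d+)[A-Z]?\s*"(?P<common>[^"]+)"'),
    "device": re.compile(r'^[A-Z]+-(?P<number>\d+)'),
}


def extract_features(full_name):
    """Return (rule_name, features) for the first matching rule, else (None, [])."""
    for rule_name, pattern in NAMING_RULES.items():
        match = pattern.search(full_name)
        if match:
            # Keep only the groups that actually captured something.
            features = [value for value in match.groupdict().values() if value]
            return rule_name, features
    return None, []


print(extract_features('H1604A "伊尔科斯尊严" bulk carrier'))
print(extract_features("HMZ-360 navigation radar"))
```

Keeping one pattern per equipment class means the rule that matches also tells the caller which name type W_0 belongs to, which is the classification the text describes.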

Step 3: The abbreviation extraction module 3 matches the feature against the data module documents in the common source database 1 and the user retrieval records in the retrieval record database 9 to determine where the feature occurs in them. Specifically, let a matched string be W_2, so that W_2 = W_1. To keep strings from unrelated IETM systems from becoming abbreviation candidates, the equipment type identification code (MIC) of the data module document containing W_2, or of the accessed link associated with the retrieval record containing W_2, must be the same as the MIC mapped to the full name W_0 from which W_1 was derived;

Step 4: The abbreviation extraction module 3 determines the first and last characters of the abbreviation containing the feature and identifies its boundary, so that the identified abbreviation is complete; this complete abbreviation becomes a candidate abbreviation. For example, in the sentence "HMZ-360 radar identifies the target", "360" is the feature and "HMZ-360 radar" is the longest form of the abbreviation; identifying only "HMZ-360" or "360 radar" would be an incomplete identification;

Step 5: The abbreviation extraction module 3 computes the weight of each candidate abbreviation using formula 1:
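The formula itself is not reproduced in this text. One form consistent with the definitions that follow, a frequency quotient multiplied by a logarithmic term in the style of TF-IDF, is sketched here as an assumption; the exact expression and the base of the logarithm in the original may differ:

```latex
W_a = \frac{n_{mic}}{n_{all}} \times \log\frac{D_{all}}{D_{mic}} \qquad (1)
```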

In formula 1, n_mic is the number of occurrences of the candidate abbreviation in the specific content, namely the data module documents whose equipment type identification code (MIC, model identification code) is the same as that of the equipment part name, together with the search keywords in the retrieval records for those documents. n_all is the total number of occurrences of the candidate abbreviation in all data module documents and in all retrieval records of the retrieval record database 9. The quotient of the two measures the term frequency of the candidate abbreviation: the higher it is, the more often the candidate appears in the specific IETM system. D_all is the total number of data module documents plus the total number of retrieval records. D_mic is the number of data module documents containing the candidate abbreviation plus the number of retrieval records containing it. The logarithmic term measures how widespread the candidate abbreviation is: the higher its value, the more the candidate is concentrated in a small number of data module documents. W_a is the weight of the candidate abbreviation and measures its ability to characterize a topic. The threshold for W_a is a preset value, here set to 2. When the weight of a candidate abbreviation is greater than or equal to the threshold (indicating a strong association with the topic of the IETM system of the specific equipment), the candidate is treated as a formal abbreviation and stored in the abbreviation database 7. When the weight is below the threshold, the candidate is not processed further;

Step 6: The first word segmentation module 4 and the second word segmentation module 12 segment, respectively, the data module documents and the user search keywords supplied by the retrieval module 11. The vocabularies produced by the professional vocabulary extraction module 2 and the abbreviation extraction module 3 contain compound terms made up of several simple words, and such terms admit more than one correct segmentation path; for example, the equipment name "radar test device" can be further split into "radar / test / device". If only a single segmentation were kept for such compounds, many valid matches would be discarded and the segmentation result would not meet users' retrieval needs. The invention therefore builds on the N-shortest-path segmentation method and, exploiting the characteristics of the generated professional lexicons and the existing general lexicon, performs three lexicon matching passes during segmentation. First, the abbreviation lexicon obtained in step 2 is used to scan for abbreviations in the technical information and restore them to the corresponding full equipment part names. Second, the technical information term lexicon and the equipment part name lexicon obtained in step 1 are matched against the text not yet hit. Third, the general lexicon is matched against the entire text after restoration. When matching is complete, the N qualifying paths are output, and the result set formed by these paths is the final segmentation result. The detailed procedure is as follows:

Let the string to be segmented be S_1 = w_1 w_2 w_3 … w_i … w_n, where S_1 is either the user's search keyword string or one sentence of a data module document, w_i is a single character of S_1, n ≥ 1 is the length of the string, and i is the character index from 1 to n;

The string S_1 is scanned against the abbreviation database 7. Whenever an abbreviation is hit, the matched substring of S_1 is restored to the corresponding full name, and scanning continues until all of S_1 has been processed. The result is the string S_2 = u_1 u_2 … u_i … u_m, where u_i is a single character of S_2 and m is the length of S_2;
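The restoration pass can be sketched as a left-to-right scan with longest-match-first lookup. This is a minimal sketch: the dictionary contents are hypothetical, and the longest-match policy is an assumption, since the text does not say how overlapping abbreviations are resolved.

```python
def restore_abbreviations(text, abbreviation_map):
    """Scan text once, replacing each abbreviation hit with its full name."""
    # Try longer abbreviations first so a short one cannot shadow a longer one.
    keys = sorted(abbreviation_map, key=len, reverse=True)
    restored = []
    position = 0
    while position < len(text):
        for abbreviation in keys:
            if text.startswith(abbreviation, position):
                restored.append(abbreviation_map[abbreviation])
                position += len(abbreviation)  # skip past the hit; no re-scan
                break
        else:
            # No abbreviation starts here: copy one character through.
            restored.append(text[position])
            position += 1
    return "".join(restored)


# Hypothetical abbreviation dictionary entry.
ABBREVIATIONS = {"HMZ-360 radar": "HMZ-360 navigation radar"}
print(restore_abbreviations("check the HMZ-360 radar antenna", ABBREVIATIONS))
```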

In the first word segmentation module 4 and the second word segmentation module 12, a directed acyclic graph G with m+1 nodes, numbered v_0, v_1, v_2, …, v_m, is built from the string S_2. A directed edge <v_k, v_{k+1}> is created between each pair of adjacent nodes v_k and v_{k+1}, and the word of this edge is u_{k+1} (k = 0, 1, 2, …, m−1). Any two nodes of G joined directly by a directed edge are at distance 1. If a substring h_1 = u_p u_{p+1} … u_q (1 ≤ p < q) of S_2 is a full name produced by abbreviation restoration, a directed edge <v_{p−1}, v_q> is created with v_{p−1} as start node and v_q as end node, and the word of this edge is the substring h_1;

S_2 is then matched against the technical information term database 5 and the equipment part name database 6. If a longest matching substring h_2 = u_a u_{a+1} … u_b (1 ≤ a < b) exists, no directed edge <v_{a−1}, v_b> yet exists between nodes v_{a−1} and v_b, and a ≥ p+1 or b ≤ q−1 holds, then a directed edge <v_{a−1}, v_b> is created with v_{a−1} as start node and v_b as end node, and the word of this edge is h_2;

S_2 is then matched against the general vocabulary database 8. If a matching string h_3 = u_c u_{c+1} … u_d (1 ≤ c < d) exists and no directed edge <v_{c−1}, v_d> exists between nodes v_{c−1} and v_d, a directed edge <v_{c−1}, v_d> is created with v_{c−1} as start node and v_d as end node, and the word of this edge is h_3. If the edge <v_{c−1}, v_d> already exists and its string type is h_2, the substring h_2 also occurs in the general vocabulary database 8, so its type is changed from h_2 to h_4 to simplify later output processing;

After all directed edges have been generated, the N shortest paths from node v_0 to node v_m in G are found, in order of increasing length, with N set to 3. The shortest path considers directed edges of all types. The second and third shortest paths ignore edges of string types h_1 and h_2 and consider only edges whose words are of types h_3 and h_4; that is, the non-optimal paths use only the matches from the general lexicon. (This guards against the case where three segmentations from the N-shortest-path method still fail to meet retrieval needs, and avoids needing a large N to reach a good segmentation granularity.) Duplicate directed edges across the three paths are removed, and the words of the remaining edges of each path are output; the resulting set is the final segmentation result;
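The graph construction and path selection can be sketched as below. This is a simplified sketch under stated assumptions, not the patent's full procedure: it computes one shortest path over all edges and one shortest path over general-lexicon edges only (rather than N = 3 paths), it does not track the edge types h_1 to h_4, and it removes duplicate words rather than duplicate edges. The lexicon contents are hypothetical; "雷达测试装置" is the Chinese term for "radar test device".

```python
def shortest_path_words(text, vocabulary):
    """Segment text along a shortest path of the word graph (every edge costs 1)."""
    n = len(text)
    distance = [0] * (n + 1)     # distance[i]: fewest edges from node i to node n
    next_node = [0] * (n + 1)    # next_node[i]: successor of node i on that path
    for i in range(n - 1, -1, -1):
        best = i + 1             # the single-character edge always exists
        for j in range(i + 2, n + 1):
            if text[i:j] in vocabulary and distance[j] < distance[best]:
                best = j
        distance[i] = distance[best] + 1
        next_node[i] = best
    words, i = [], 0
    while i < n:
        words.append(text[i:next_node[i]])
        i = next_node[i]
    return words


def segment(text, professional, general):
    """Merge the best path over all lexicons with the best general-only path."""
    best_path = shortest_path_words(text, professional | general)
    general_path = shortest_path_words(text, general)
    result = []
    for word in best_path + general_path:
        if word not in result:   # drop words already output by an earlier path
            result.append(word)
    return result


professional_lexicon = {"雷达测试装置"}        # hypothetical equipment name entry
general_lexicon = {"雷达", "测试", "装置"}      # hypothetical general entries
print(segment("雷达测试装置", professional_lexicon, general_lexicon))
```

The merged output keeps both the whole compound and its parts, so a query for either the full equipment name or one of its component words can match the indexed text.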

Step 7: The first word segmentation module 4 stores the final segmentation results in the respective fields of an index document in the index database 13 and sets a weight for each field, which provides parameters for ranking the final retrieval results. Multiple documents form a segment, and segments ultimately form the index file, which is stored on disk or in memory. The fields of an index document are the title field, path field, link text field, subtitle field and body field;

Step 8: A weight is set for each index document in the index database 13; multiple index documents form segments and ultimately the index file, which is stored on disk or in memory. The index document weight comprises a Standard Numbering System (SNS) code weight and an information code weight, both set according to the coding characteristics of data module documents. The SNS code weight follows the rule that the lower the equipment hierarchy level of the SNS code, the higher the weight factor. The information code weight follows the rule that a subcategory information code receives a higher weight than a main category. The index document weight is the product of the SNS code weight and the information code weight;

Step 9: The retrieval module 11 provides the full-text retrieval service to the user. It receives the user's retrieval request and invokes the query procedure: the user's search keywords are segmented as in step 6 and then matched against the segmented content of each field of the documents in the index built in step 7, and all matching documents are collected as the result set.

In step 7 of the above technical solution, the fields of the index document and their weights are set as follows:

The title field stores the segmentation result of the data module title <dmtitle>. Terms appearing in the title field reflect the topic of the whole data module document, and the title field weight is set to 10;

The path field identifies the document access path and stores the data module code information for that purpose. It takes no part in segmentation or retrieval and needs no weight;

The link text field stores the segmentation result of the text restored from data module code links. As in a web page, data module content contains links; a link appears as a data module code, and the user can click it to open another data module. Step 1 builds a mapping between data module codes and vocabulary, and that mapping is used here to restore a code to vocabulary text, which is then segmented. The field thus enables retrieval over link anchor text: when a search keyword hits the link text field, the data module document the link points to may be what the user is looking for. The link text field weight is set to 3;

The subtitle field stores the segmentation result of the <title> elements, which hold local topic information; the subtitle field weight is set to 5;

The body field stores the segmentation result of the remaining technical information in the data module document, that is, the body content other than subtitles and link information; the body field weight is set to 1.

Step 1 of the above technical solution comprises the following steps:

Step 101: Specific text content is selected from which to extract the two classes of professional vocabulary, equipment part names and technical information terms. The specific elements are the technical name <techname> and the information name <infoname>. Within the data module title, <techname> describes the equipment part name and <infoname> describes the technical information term, so extracting the text of these two elements completes the extraction of professional vocabulary;

Step 102: Mappings are established between the professional vocabulary and the corresponding data module code (DMC) information, namely between Standard Numbering System (SNS) codes and equipment part names, and between information codes <incode> and technical information terms. Link access information is an important resource for retrieval, but links in data module documents carry no anchor text; they reference a data module code instead. The code information must therefore be restored to text before it can enter the scope of retrieval. The SNS sub-element of the data module code describes the hierarchical position, within the whole equipment, of the component described by the current data module document, so it can be mapped to the equipment part name given in <techname>, which allows SNS codes to be retrieved through equipment part names. Likewise, a mapping is established between the information code sub-element <incode> of the DMC and the information name <infoname>, so that information codes can be retrieved through technical information terms. Because the same technical information or equipment part name may have different codes in the IETM systems of different ship equipment, the corresponding equipment type identification code (model identification code, MIC) is attached to each information code and SNS code to prevent inconsistent mappings. The MIC defines the equipment name and model and is a code, assigned by an authoritative body, that uniquely identifies the equipment;

Step 103: The extracted vocabulary and the corresponding code information are stored in the equipment part name lexicon and the technical information term lexicon. The equipment part name lexicon holds equipment names or part names together with their SNS code information; the technical information term lexicon holds technical information terms together with their information code information.

In step 4 of the above technical solution, because ship equipment abbreviations arise in two ways, contraction and truncation, every character in an abbreviation must be a character of the full name (the original term that the abbreviation stands for), and the characters must keep the same relative order as in the full name. One character to the left or right of W_2 is read in as the candidate character w_c. It is then checked whether w_c occurs in W_0 and whether its order relative to W_2 is unchanged in W_0. If both conditions hold, w_c is a boundary character of the candidate abbreviation and W_2 is set to w_c W_2 or W_2 w_c. If not, w_c is not a character of the abbreviation, the scan in that direction stops and that boundary is fixed. The process repeats until the scan has stopped in both directions; W_2 is then the final candidate abbreviation.
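The boundary search can be sketched with an order-preserving subsequence test: a neighbouring character is accepted only if the extended string is still a subsequence of the full name W_0. This is a minimal sketch; reading the order condition as a subsequence check, the left-then-right scan order, and the sample full name "HMZ-360 navigation radar" are all assumptions made for illustration.

```python
def is_subsequence(candidate, full_name):
    """True if candidate's characters occur in full_name in the same order."""
    remaining = iter(full_name)
    return all(character in remaining for character in candidate)


def expand_abbreviation(text, feature, full_name):
    """Grow the feature hit in text to the longest order-preserving abbreviation."""
    start = text.find(feature)
    if start < 0:
        return None              # the feature does not occur in this text
    left, right = start, start + len(feature)
    # Extend to the left until the next character would break the ordering.
    while left > 0 and is_subsequence(text[left - 1:right], full_name):
        left -= 1
    # Then extend to the right in the same way.
    while right < len(text) and is_subsequence(text[left:right + 1], full_name):
        right += 1
    return text[left:right]


sentence = "the HMZ-360 radar identifies targets"
print(expand_abbreviation(sentence, "360", "HMZ-360 navigation radar"))
```

Here the scan stops on the left at the space before "HMZ" and on the right at the space after "radar", because neither extension can be matched in order inside the full name.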

In step 7 of the above technical solution, the index is used to locate the required text quickly and so avoid large numbers of read and write operations during retrieval; it uses specific data structures to locate terms rapidly. Building on the general-purpose full-text retrieval toolkit Lucene, the invention designs an index structure suited to the IETM full-text retrieval device and method. The Lucene index structure has five levels, from highest to lowest: index, segment, document, field and term. The term is the basic unit of the index and stores each string produced by segmentation. A field holds one of the separately indexed kinds of information in a single document, such as the title, body or links; fields can be designed by the user to support retrieval over different document types. The document is the basic unit of indexing; in the invention, one index document stores the processed information of one data module document. A segment consists of multiple documents and can be regarded as a small index; multiple segments together form the index.

In step 8 of the above technical solution, the Standard Numbering System (SNS) code weight is determined by the equipment hierarchy level that the code represents. The digits of the SNS code describe the equipment level of the component in the current data module: the SNS codes 00-00-00, 0a-00-00, 0a-b0-00, 0a-bd-00 and 0a-bd-fg (a ≠ 0, b ≠ 0, d ≠ 0, f ≠ 0 or g ≠ 0) describe components at the equipment level, system level, subsystem level, sub-subsystem level and lower breakdown levels respectively. When a search keyword hits a document, a data module document with a high-level SNS code may have only part of its content related to the information the user needs, whereas in a document with a lower-level SNS code a larger share of the content is relevant. Therefore, the lower the equipment level of the SNS code, the higher the weight factor of the document: the SNS code weights for the equipment level, system level, subsystem level, sub-subsystem level and lower breakdown levels are set to 1, 2, 3, 4 and 5 respectively;

The information code weight is determined by the breadth of the information category that the code describes. The information codes a00 and abc (b ≠ 0, c ≠ 0) describe a main category and a subcategory of technical information respectively. When a search keyword hits a document, a finer-grained information code is more likely to relate to the content the user needs, so subcategory information codes receive a higher weight than main categories: the invention sets the main category weight to 1 and the subcategory weight to 2.
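The document weight described in step 8 can be sketched as below. The level of an SNS code is inferred here from the position of its last non-zero digit, and any information code not of the form a00 is treated as a subcategory; both are simplifying assumptions that cover the listed patterns but also assign weights to codes the text does not mention. The sample codes are hypothetical.

```python
def sns_weight(sns_code):
    """Weight 1 to 5 for an SNS code such as '01-23-45' (deeper level, higher weight)."""
    digits = sns_code.replace("-", "")
    if len(digits) != 6:
        raise ValueError("expected an SNS code of the form xx-xx-xx")
    nonzero_positions = [i for i, ch in enumerate(digits) if ch != "0"]
    if not nonzero_positions:
        return 1                 # 00-00-00: equipment level
    last = nonzero_positions[-1]
    if last <= 1:
        return 2                 # 0a-00-00: system level
    if last == 2:
        return 3                 # 0a-b0-00: subsystem level
    if last == 3:
        return 4                 # 0a-bd-00: sub-subsystem level
    return 5                     # 0a-bd-fg: lower breakdown levels


def info_code_weight(info_code):
    """Weight 1 for a main category (a00), 2 otherwise (assumed subcategory)."""
    return 1 if info_code[1:] == "00" else 2


def document_weight(sns_code, info_code):
    """Index document weight: product of the SNS weight and the information code weight."""
    return sns_weight(sns_code) * info_code_weight(info_code)


print(document_weight("00-00-00", "500"))
print(document_weight("01-23-45", "521"))
```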

In step 9 of the above technical solution, the ranking of the result set is computed with the vector space model (VSM), using the following formulas:
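The formulas themselves are not reproduced in this text. The quantities defined below correspond closely to the factors of Lucene's classic TF-IDF practical scoring function, so one consistent reading is sketched here as an assumption rather than as the patent's exact expressions. In particular, the square-root length factor is Lucene's default, and Lucene squares the idf factor; whether the patent does either cannot be recovered from the text.

```latex
\begin{aligned}
S_{qd} &= \mathrm{coord}(q,d) \times \mathrm{querytnorm}(q) \times S_{dt} \\
S_{dt} &= \sum_{i=1}^{n} \mathrm{tf}(t_i,d) \times \mathrm{idf}(t_i) \times Boost_{t_i} \times \mathrm{norm}(t,d) \\
\mathrm{coord}(q,d) &= \frac{Num_{dt}}{Num_{qt}} \\
\mathrm{norm}(t,d) &= Boost_d \times Boost_f \times \frac{1}{\sqrt{Num_{term}}}
\end{aligned}
```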

Let d be a document in the index and q the user's search keyword, and let word segmentation split q into t_1/t_2/…/t_n, where n is the total number of terms after segmentation (n ≥ 1), t_i is a single keyword term, and i is the term index from 1 to n (S_dt is accumulated over i from 1 to n, t_n included). S_qd is the score with which index document d matches the search keyword q and is the ranking factor for the results: the higher its value, the earlier the document appears in the result set. coord(q,d) measures how many distinct query terms occur in index document d; it is the quotient of Num_dt, the number of distinct terms found in index document d, and Num_qt, the number of distinct terms in the search keyword q. querytnorm(q) is an adjustment factor that does not affect the ranking order and can be set to scale all scores as a whole. S_dt is the sum of the scores of all single keyword terms t_i hit in index document d. tf(t_i,d) is the term-frequency score of t_i in index document d. idf(t_i) reflects how many documents t_i has appeared in: the higher the value, the fewer documents contain t_i and the more strongly t_i is tied to a specific topic. Boost_ti is the weight of the single keyword term t_i, determined by the lexicon that t_i matched during segmentation. norm(t,d) combines the weight and length factors of index document d, where Boost_d is the weight of index document d, whose value is determined by the field weight settings of the index documents of the index module described in step 7; Boost_f is the weight of the field of index document d in which t_i was hit, whose value is likewise determined by the field weight settings of the index documents of the index module described in step 7; and Num_term is the total number of segmented terms in index document d: the larger this value, the lower the norm(t,d) score;
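The formula itself does not survive in this text, only the definitions of its factors. One arrangement that is consistent with those definitions is sketched below. The multiplicative composition, the unsquared idf term and the inverse-square-root length normalization are assumptions modelled on the classic TF-IDF scoring used by full-text engines such as Apache Lucene; only the factor names and their qualitative behaviour come from the description above.

```latex
\begin{aligned}
S_{qd} &= \mathrm{coord}(q,d)\cdot \mathrm{querytnorm}(q)\cdot S_{dt},\\
S_{dt} &= \sum_{i=1}^{n} \mathrm{tf}(t_i,d)\cdot \mathrm{idf}(t_i)\cdot \mathrm{Boost}_{t_i}\cdot \mathrm{norm}(t,d),\\
\mathrm{coord}(q,d) &= \frac{Num_{dt}}{Num_{qt}},\\
\mathrm{norm}(t,d) &= \mathrm{Boost}_{d}\cdot \mathrm{Boost}_{f}\cdot \frac{1}{\sqrt{Num_{term}}}.
\end{aligned}
```

In this form a longer document (larger Num_term) lowers norm(t,d), and querytnorm(q) scales every document's score by the same amount for a given query, which matches the statement that it does not change the ranking order.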

The weight of a search keyword term is determined by the type of lexicon it matched during word segmentation, on the following basis:

(1) Terms hit in the abbreviation lexicon, the technical information term lexicon or the equipment part name lexicon strongly reflect the user's search intent; their weight is set to 5.

(2) Terms matched in the general lexicon reflect the user's search intent only partially; their weight is set to 2.

(3) Single characters produced during word segmentation are too fine-grained and introduce too much noise during retrieval; their weight is set to 1.
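The three rules above amount to a lookup from the lexicon a term matched to its weight. A minimal sketch follows; the lexicon labels and the function name are hypothetical and chosen only for this example.

```python
# Term weights by matched lexicon, following rules (1)-(3) above.
# The string labels are hypothetical identifiers for the lexicons.
TERM_BOOST = {
    "abbreviation": 5,    # abbreviation lexicon
    "technical_term": 5,  # technical information term lexicon
    "part_name": 5,       # equipment part name lexicon
    "general": 2,         # general lexicon
    "single_char": 1,     # single character left over by segmentation
}


def term_boosts(segments):
    """Map (term, lexicon_label) pairs to (term, weight) pairs."""
    return [(term, TERM_BOOST[label]) for term, label in segments]
```

For example, a query segmented into a part name and a general word would yield weights 5 and 2, so a document hit on the part name contributes more to the score than one hit only on the general word.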

After ranking is complete, the retrieval module outputs the ranked result set in a set format: each result page returns ten search results; each result shows the text snippet containing the hit terms, with the hit terms highlighted in red, together with the title and Data Module Code (DMC) of the hit document; the user can open the original data module document by clicking the title hyperlink.
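The output behaviour described above (ten results per page, hit terms highlighted in red) can be sketched as follows. The HTML `span` markup, the function names and the 1-based page numbering are assumptions for illustration; the description only specifies the page size and the red highlighting.

```python
import html
import re

PAGE_SIZE = 10  # ten search results per page, as described above


def paginate(results, page):
    """Return the results shown on a 1-based page number."""
    start = (page - 1) * PAGE_SIZE
    return results[start:start + PAGE_SIZE]


def highlight(snippet, terms):
    """Escape the snippet for HTML and wrap every hit term in red markup."""
    escaped = html.escape(snippet)
    # Longest terms first so that a long term wins over a shorter term it contains.
    ordered = sorted((t for t in terms if t), key=len, reverse=True)
    if not ordered:
        return escaped
    pattern = "|".join(re.escape(html.escape(t)) for t in ordered)
    return re.sub(
        pattern,
        lambda m: '<span style="color:red">' + m.group(0) + "</span>",
        escaped,
    )
```

A result page would then be assembled from `paginate(ranked_results, page)`, with each entry rendered as its title link, its DMC and the highlighted snippet.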

Content not described in detail in this specification belongs to the prior art known to those skilled in the art.

Claims (7)

1. A full-text retrieval device for interactive electronic technical manuals of ship equipment, characterized in that it comprises databases and functional modules, wherein the databases comprise a common source database (1), a technical information term database (5), an equipment part name database (6), an abbreviation database (7), a general vocabulary database (8), a retrieval record database (9) and an index database (13), and the functional modules comprise a professional vocabulary extraction module (2), an abbreviation extraction module (3), a first word segmentation module (4), a user retrieval command communication module (10), a retrieval module (11), a second word segmentation module (12) and an index module (14); the common source database (1) provides the vocabulary extraction source for the professional vocabulary extraction module (2) and the abbreviation extraction module (3) and provides the content to be segmented for the first word segmentation module (4); the professional vocabulary extraction module (2) extracts vocabulary and stores it in the technical information term database (5) and the equipment part name database (6); the abbreviation extraction module (3) extracts vocabulary and stores it in the abbreviation database (7); the first word segmentation module (4) passes the segmented content to the index module (14) for processing;

the index module (14) builds the index and stores it in the index database (13); the index database (13) receives the retrieval content segmented by the second word segmentation module (12), performs matching, and returns the matched result set to the retrieval module (11) for ranking; the retrieval module (11) sends the user's retrieval content to the second word segmentation module (12) for word segmentation; the retrieval module (11) also receives retrieval commands from the user retrieval command communication module (10) and returns the ranked result set to the user retrieval command communication module (10); the user retrieval command communication module (10) sends the user's retrieval commands to the retrieval record database (9); the retrieval record database (9) provides a vocabulary extraction source for the abbreviation extraction module (3);

the technical information term database (5), the equipment part name database (6), the abbreviation database (7) and the general vocabulary database (8) provide the matching word sets used during word segmentation for the first word segmentation module (4) and the second word segmentation module (12), respectively.

2. A retrieval method using the full-text retrieval device for interactive electronic technical manuals of ship equipment according to claim 1, characterized in that it comprises the following steps:

Step 1: data module documents edited according to a selected interactive electronic technical manual authoring standard are imported into the common source database (1); the professional vocabulary extraction module (2), following the requirements of the selected authoring standard, extracts two classes of professional vocabulary, namely technical information terms and equipment part names, from the data module documents in the common source database (1), establishes the mapping between this vocabulary and the data module code information of the corresponding data module documents, and stores the two classes of professional vocabulary and the mapping in the corresponding technical information term database (5) and equipment part name database (6);

Step 2: the abbreviation extraction module (3) extracts, from the equipment part names in the common source database (1), the feature quantities of the corresponding abbreviations, a feature quantity being the numeric designation or the colloquial part of an equipment part name;

Step 3: the abbreviation extraction module (3) matches the feature quantities against the data module documents in the common source database (1) and the user retrieval records in the retrieval record database (9), and determines the position of each element of the feature quantities in the data module documents and the user retrieval records;

Step 4: the abbreviation extraction module (3) determines the leading and trailing strings of the abbreviation containing a feature quantity and identifies the boundary fragments of that abbreviation, so that the recognized abbreviation is complete; the complete abbreviation is taken as a candidate abbreviation;

Step 5: the abbreviation extraction module (3) calculates the weight of each candidate abbreviation by Formula 1, in which n_mic is the number of times the candidate abbreviation occurs in specific content, the specific content comprising the content of the data module documents whose equipment type identification code is the same as that of the equipment part name, and the search keywords in the retrieval records of that data module document content; n_all is the sum of the number of times the candidate abbreviation occurs in all data module documents and the number of times it occurs in all retrieval records in the retrieval record database (9); D_all is the sum of the total number of data module documents and the total number of retrieval records; D_mic is the sum of the number of data module documents containing the candidate abbreviation and the number of retrieval records containing the candidate abbreviation; W_a is the weight of the candidate abbreviation and measures its ability to characterize a topic; the threshold of W_a is a given value; when the weight of a candidate abbreviation is greater than or equal to the threshold, the candidate abbreviation is regarded as a formal abbreviation and stored in the abbreviation database (7); when the weight is below the threshold, the candidate abbreviation is not processed;

Step 6: the first word segmentation module (4) and the second word segmentation module (12) perform word segmentation on the data module documents and on the user search keywords supplied by the retrieval module (11), respectively; the word segmentation proceeds as follows:

let the string to be segmented be S_1 = w_1 w_2 w_3 … w_i … w_n, where S_1 is the string of the user's search keywords or a sentence of a data module document, w_i is a single character of S_1, n is the length of the string, n ≥ 1, and i is the character index from 1 to n;

the string S_1 is scanned with the abbreviation database (7); whenever an abbreviation is hit, the hit substring of S_1 is restored to the corresponding full form, until S_1 has been scanned completely, which yields the string S_2 = u_1 u_2 … u_i … u_m, where u_i is a single character of S_2 and m is the length of the string;

in the first word segmentation module (4) and the second word segmentation module (12), a directed acyclic graph G with m+1 nodes is built from the string S_2, its nodes being numbered v_0, v_1, v_2, …, v_m; a directed edge <v_k, v_k+1> is created between every two adjacent nodes v_k and v_k+1, the word corresponding to this edge being u_k+1, k = 0, 1, 2, …, m−1; if two nodes of G are directly connected by a directed edge, the distance between them is taken to be 1; if a substring h_1 = u_p u_p+1 … u_q of S_2 is a full form restored from an abbreviation, where 1 ≤ p < q, a directed edge <v_p−1, v_q> is created with v_p−1 as the start node and v_q as the end node, the word corresponding to this edge being the substring h_1;

the string S_2 is matched against the technical information term database (5) and the equipment part name database (6); if there is a matching maximum-length substring h_2 = u_a u_a+1 … u_b, 1 ≤ a < b, no directed edge <v_a−1, v_b> exists between the nodes v_a−1 and v_b, and a ≥ p+1 or b ≤ q−1 holds, a directed edge <v_a−1, v_b> is created with v_a−1 as the start node and v_b as the end node, the word corresponding to this edge being the maximum-length substring h_2;

the string S_2 is matched against the general vocabulary database (8); if there is a matching string h_3 = u_c u_c+1 … u_d, 1 ≤ c < d, and no directed edge <v_c−1, v_d> exists between the nodes v_c−1 and v_d, a directed edge <v_c−1, v_d> is created with v_c−1 as the start node and v_d as the end node, the word corresponding to this edge being the string h_3; if a directed edge <v_c−1, v_d> already exists and its string type is h_2, the maximum-length substring h_2 also exists in the general vocabulary database (8), and its type is therefore changed from h_2 to h_4;

once all directed edges have been generated, the first N paths from node v_0 to node v_m in G, ordered from shortest to longest, are determined, with N chosen as 3; the shortest path takes all directed edge types into account, whereas the second and third shortest paths ignore directed edges of types h_1 and h_2 and consider only directed edges whose words are of types h_3 and h_4, that is, the non-optimal paths consider only the matches from the general lexicon; duplicate directed edges among the three paths are removed, and the words corresponding to the remaining directed edges of each path are output; the resulting set is the final word segmentation result;

Step 7: the first word segmentation module (4) stores the final word segmentation results obtained above in the respective fields of the index documents in the index database (13) and sets the weight of each field; the fields of an index document comprise a title field, a path field, a link text field, a subtitle field and a body field;

Step 8: the weights of the index documents in the index database (13) are set, and multiple index documents are formed into segments and finally into an index file; the index document weight consists of a Standard Numbering System code weight and an information code weight, which are set for the different Standard Numbering System codes and information codes according to the coding characteristics of data module documents; the Standard Numbering System code weight follows the rule that the lower the equipment level of the code, the higher the corresponding weight factor; the information code weight follows the rule that subcategory information codes are given a higher weight than major-category codes; the Standard Numbering System code weight and the information code weight are then multiplied to obtain the weight of the index document;

Step 9: the retrieval module (11) provides the full-text retrieval service to the user; the retrieval module (11) receives the user's retrieval request and invokes the query procedure, which is as follows: the keywords entered by the user are segmented according to step 6 and then matched against the segmented content of each field of the documents in the index formed in step 7, and all matching documents are collected as the result set.

3. The retrieval method according to claim 2, characterized in that, in step 7, the title field stores the word segmentation result of the data module document name; terms appearing in the title field reflect the topic of the whole data module document, and the weight of the title field is set to 10.

4. The retrieval method according to claim 2, characterized in that, in step 7, the path field identifies the document access path and does so by storing the data module code information of the document; the path field does not take part in word segmentation or retrieval and needs no weight.

5. The retrieval method according to claim 2, characterized in that, in step 7, the link text field stores the word segmentation result of the text restored from data module code links and is also used to enable retrieval of link anchor text; when a search keyword hits in the link text field, the data module document that the link points to may be what the user is looking for, and the weight of the link text field is set to 3.

6. The retrieval method according to claim 2, characterized in that, in step 7, the subtitle field stores the word segmentation result of tags that reflect local topic information, and the weight of the subtitle field is set to 5.

7. The retrieval method according to claim 2, characterized in that, in step 7, the body field stores the word segmentation result of the other technical information in the data module document, and the weight of the body field is set to 1.
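The graph construction of step 6 in claim 2 can be illustrated with a simplified sketch. It restores abbreviations, builds the typed edges (single characters, h_1, h_2, h_3, h_4) and extracts only the shortest path by edge count. The function names and data structures are hypothetical, the condition a ≥ p+1 or b ≤ q−1 on h_2 edges is omitted, and the second and third shortest paths restricted to general-lexicon edges are not implemented, so this is a partial illustration rather than the claimed procedure.

```python
def expand_abbreviations(text, abbreviations):
    """Restore abbreviations to full forms; return S2 and the spans of restored forms."""
    keys = sorted(abbreviations, key=len, reverse=True)  # prefer longer abbreviations
    out, spans, i, pos = [], [], 0, 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                full = abbreviations[key]
                spans.append((pos, pos + len(full)))  # node indices (v_{p-1}, v_q)
                out.append(full)
                pos += len(full)
                i += len(key)
                break
        else:  # no abbreviation starts here: copy one character
            out.append(text[i])
            pos += 1
            i += 1
    return "".join(out), spans


def build_edges(s2, restored_spans, special_terms, general_terms):
    """Build the typed edges of the segmentation DAG as {(start, end): type}."""
    m = len(s2)
    edges = {(k, k + 1): "char" for k in range(m)}  # one edge per character
    for span in restored_spans:
        edges[span] = "h1"                           # restored full forms
    for a in range(m):                               # longest special-lexicon match
        for b in range(m, a + 1, -1):
            if s2[a:b] in special_terms:
                edges.setdefault((a, b), "h2")
                break
    for c in range(m):                               # general-lexicon matches
        for d in range(c + 2, m + 1):
            if s2[c:d] in general_terms:
                if (c, d) not in edges:
                    edges[(c, d)] = "h3"
                elif edges[(c, d)] == "h2":
                    edges[(c, d)] = "h4"             # also present in general lexicon
    return edges


def shortest_segmentation(s2, edges):
    """Return the words on the path from v_0 to v_m that uses the fewest edges."""
    m = len(s2)
    best = [0] + [m + 1] * m   # best[j]: fewest edges needed to reach node j
    prev = [0] * (m + 1)
    for i, j in sorted(edges):  # edges go forward, so start order is topological
        if best[i] + 1 < best[j]:
            best[j] = best[i] + 1
            prev[j] = i
    words, j = [], m
    while j > 0:
        words.append(s2[prev[j]:j])
        j = prev[j]
    return words[::-1]
```

As a toy example with made-up entries, `expand_abbreviations("Kcde", {"K": "ab"})` restores the string to `"abcde"`; with the special lexicon `{"cd"}` and the general lexicon `{"cd", "bc"}`, the edge for `"cd"` is retyped from h_2 to h_4 and the path with the fewest edges segments the string as `ab / cd / e`.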
CN201510884252.XA 2015-12-03 2015-12-03 Device and method for full-text retrieval of ship equipment interactive electronic technical manual Active CN105528411B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510884252.XA CN105528411B (en) 2015-12-03 2015-12-03 Device and method for full-text retrieval of ship equipment interactive electronic technical manual

Publications (2)

Publication Number Publication Date
CN105528411A CN105528411A (en) 2016-04-27
CN105528411B true CN105528411B (en) 2019-08-20

Family

ID=55770634

Country Status (1)

Country Link
CN (1) CN105528411B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107562716A (en) * 2017-07-18 2018-01-09 阿里巴巴集团控股有限公司 Term vector processing method, device and electronic equipment
CN107844472B (en) * 2017-07-18 2021-08-24 创新先进技术有限公司 Word vector processing method, device and electronic device
CN110851692B (en) * 2018-07-27 2024-09-06 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN112084290B (en) * 2019-06-13 2024-04-05 北京沃东天骏信息技术有限公司 Data retrieval method, device, equipment and storage medium
CN110990663B (en) * 2019-12-04 2023-03-24 中船黄埔文冲船舶有限公司 Ship process knowledge management method, device and system
CN111339244A (en) * 2020-02-29 2020-06-26 山东浪潮通软信息科技有限公司 Tax policy and regulation inquiry method, computer equipment and storage medium
CN111930879B (en) * 2020-07-10 2024-12-27 银盛支付服务股份有限公司 A full-text search engine method and system based on management system
CN115329086B (en) * 2022-08-29 2024-04-16 中铁四局集团电气化工程有限公司 Rail transit document retrieval system and retrieval method based on classification coding
CN115688690B (en) * 2022-11-16 2023-10-03 金航数码科技有限责任公司 Dynamic conversion method for converting Word document content into XML fragment conforming to S1000D standard
CN116204557A (en) * 2023-01-10 2023-06-02 西安法士特汽车传动有限公司 Signal instruction retrieval matching method and system for electric commercial vehicle
CN116226335B (en) * 2023-03-15 2025-06-27 阿维塔科技(重庆)有限公司 Keyword query method and device and electronic equipment
CN116227488B (en) * 2023-05-09 2023-07-04 北京拓普丰联信息科技股份有限公司 Text word segmentation method and device, electronic equipment and storage medium
CN119003750B (en) * 2024-07-29 2025-04-22 中国船舶集团有限公司第七一九研究所 Full-text retrieval method and system for interactive electronic technical manual of ship equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102023989A (en) * 2009-09-23 2011-04-20 阿里巴巴集团控股有限公司 Information retrieval method and system thereof
CN102810096A (en) * 2011-06-02 2012-12-05 阿里巴巴集团控股有限公司 Retrieval method and device based on separate character indexing system
CN103823799A (en) * 2012-11-16 2014-05-28 镇江诺尼基智能技术有限公司 New-generation industry knowledge full-text search method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant