CN103886034B - Method and apparatus for building an index and matching a user's query input information
- Publication number
- CN103886034B (application CN201410079818.7A)
- Authority
- CN
- China
- Prior art keywords
- word
- matching
- words
- information
- input information
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/31—Indexing; Data structures therefor; Storage structures
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
Technical Field
The present invention relates to the field of computer technology, and in particular to a technique for building an index and matching a user's query input information.
Background
When using a search engine, people often do not know which keywords express what they are looking for, and may instead enter a string of descriptive phrases, for example: 1) "Vomiting on getting up in the morning, frequent palpitations and shortness of breath, weak limbs: what illness are these symptoms of?" 2) "Introductions to songs that express longing for a loved one?" 3) "Songs containing 'say what you will about forgetting wealth and rank'" 4) "Which film is 'eating hot pot and singing songs' from, and who said it?" 5) "Verses describing diligent study" 6) "'It is hard to be a person, and hard to be a woman': who said it, and what is the full saying?" Other users may enter expressions with complex sentence structures. For categories of people, for instance, a user may ask "Which emperors and state presidents came from Anhui?" or "Introduction to the Politburo Standing Committee members from Shanxi in the current government". In such cases a search engine has difficulty finding suitable results.
The reason is that today's general-purpose search engines mainly index titles. Although these engines usually index content as well, factors such as weight adjustment make it hard for high-quality knowledge descriptions to be surfaced. For example, for resources such as songs and films, existing search engines usually index only the song title or film title. When a user cannot remember the title and recalls only some lyrics, lines, a synopsis, or a small part of a description, existing search engines cannot carry out an effective search. The same happens with resources in categories such as novels, poems, couplets, greetings, people, TV series, sentences, idioms, and diseases.
Encyclopedia-type resource knowledge is usually indexed around the entry word, so general search ranking algorithms have difficulty ranking a result highly when the matching keywords do not appear in its title. Yet because encyclopedia-type resource knowledge is authoritative, ranking such data near the top would satisfy users' needs well. For example, for diseases in an encyclopedia, if the symptoms are tagged and indexed, the corresponding resource knowledge can be provided to the user based on the symptoms the user describes.
Therefore, how to make effective use of existing resource knowledge, build an index for it, and obtain by matching the target text information that corresponds to a user's query input information has become one of the problems that those skilled in the art urgently need to solve.
Summary of the Invention
An object of the present invention is to provide a method and apparatus for building an index and matching a user's query input information.
According to one aspect of the present invention, a method for building an index based on text information is provided, the method comprising the following steps:
A. determining structured information from text information;
B. extracting a topic word from the structured information;
C. determining, from the text information, tag words corresponding to the topic to which the topic word corresponds;
D. building an index for the topic word and the tag words.
According to another aspect of the present invention, a method for matching a user's query input information against the index built as described above is also provided, the method comprising the following steps:
a. obtaining query input information entered by a user;
b. performing topic and tag analysis on the query input information to obtain the topic words and tag words corresponding to the query input information;
c. performing a matching query in the index built as described above, according to the topic words and tag words, to obtain candidate text information that matches the query input information;
d. determining, according to the degree of semantic match between the candidate text information and the query input information, the target text information that matches the query input information.
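Steps a to d can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the toy index, the cue-word table, the dictionary-lookup query analysis, and the tag-overlap score standing in for the "degree of semantic match" are all assumptions introduced here.

```python
# Hypothetical toy data; none of these values come from the patent text.
INDEX = {
    "disease": {"ID1", "ID2", "ID3", "ID4"},
    "palpitations": {"ID1", "ID2", "ID4"},
    "vomiting": {"ID3", "ID4"},
}
DOC_TAGS = {
    "ID1": {"palpitations"},
    "ID2": {"palpitations"},
    "ID3": {"vomiting"},
    "ID4": {"palpitations", "vomiting"},
}
TOPIC_CUES = {"illness": "disease", "disease": "disease"}  # assumed cue word -> topic word


def analyze_query(query):
    """Step b: find topic words and tag words in the query by dictionary lookup."""
    text = query.lower()
    topics = {topic for cue, topic in TOPIC_CUES.items() if cue in text}
    tags = {term for term in INDEX if term in text and term not in topics}
    return topics, tags


def match(query):
    """Steps c and d: look up candidates in the index, then rank them."""
    topics, tags = analyze_query(query)
    if not topics:
        return []
    # Step c: a candidate must be indexed under every topic word of the query ...
    candidates = set.intersection(*(set(INDEX.get(t, ())) for t in topics))
    if tags:
        # ... and under at least one tag word of the query.
        candidates &= set().union(*(INDEX[t] for t in tags))
    # Step d: tag overlap is used here as a stand-in for the semantic match degree.
    scored = [
        (doc, len(tags & DOC_TAGS[doc]) / len(tags) if tags else 0.0)
        for doc in candidates
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


results = match("Vomiting and palpitations in the morning, what illness is it?")
```

With this toy data, the document tagged with both symptoms is ranked first, because it covers all of the tag words found in the query.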
According to yet another aspect of the present invention, an index building device for building an index based on text information is also provided, the device comprising:
information determining means for determining structured information from text information;
topic extracting means for extracting a topic word from the structured information;
tag determining means for determining, from the text information, tag words corresponding to the topic to which the topic word corresponds;
index building means for building an index for the topic word and the tag words.
According to a further aspect of the present invention, a matching device for matching a user's query input information against the index built as described above is also provided, the device comprising:
query obtaining means for obtaining query input information entered by a user;
information analyzing means for performing topic and tag analysis on the query input information to obtain the topic words and tag words corresponding to the query input information;
matching query means for performing a matching query in the index built as described above, according to the topic words and tag words, to obtain candidate text information that matches the query input information;
text determining means for determining, according to the degree of semantic match between the candidate text information and the query input information, the target text information that matches the query input information.
According to a further aspect of the present invention, a system for building an index and matching a user's query input information is also provided, comprising the index building device described above and the matching device described above.
Compared with the prior art, the present invention determines structured information from text information; extracts a topic word from the structured information; determines, from the text information, tag words corresponding to the topic to which the topic word corresponds; and builds an index for the topic word and the tag words. Further, the present invention obtains query input information entered by a user; performs topic and tag analysis on the query input information to obtain the corresponding topic words and tag words; performs a matching query in the index so built, according to those topic words and tag words, to obtain candidate text information that matches the query input information; and determines, according to the degree of semantic match between the candidate text information and the query input information, the target text information that matches the query input information.
Based on encyclopedia-type resource knowledge, or other resource knowledge mined from the web, the present invention extracts topics and titles from it to form an effective description of the content of that resource knowledge. This surfaces such high-quality resource knowledge better, makes semantic search over it more efficient, meets the search needs of users whose complex descriptions cannot be expressed accurately in keywords, and improves the user experience.
Brief Description of the Drawings
Other features, objects, and advantages of the present invention will become more apparent on reading the detailed description of non-limiting embodiments given with reference to the following drawings:
Figure 1 shows a schematic diagram of a device for building an index based on text information according to one aspect of the present invention;
Figure 2 shows a schematic diagram of a device for building an index based on text information according to a preferred embodiment of the present invention;
Figure 3 shows a schematic diagram of a device for matching a user's query input information according to another aspect of the present invention;
Figure 4 shows a flowchart of a method for building an index based on text information according to yet another aspect of the present invention;
Figure 5 shows a flowchart of a method for building an index based on text information according to a further aspect of the present invention.
The same or similar reference numerals in the drawings denote the same or similar components.
Detailed Description
The present invention is described in further detail below with reference to the drawings.
Figure 1 shows a schematic diagram of a device for building an index based on text information according to one aspect of the present invention. The index building device 1 includes information determining means 101, topic extracting means 102, tag determining means 103, and index building means 104.
The information determining means 101 determines structured information from text information. Specifically, the information determining means 101 obtains text information, for example by interacting with a data source such as encyclopedia data, and then structures that text information, for example by analyzing the table-of-contents and sub-heading information it contains, to determine the structured information.
For example, the information determining means 101 interacts with encyclopedia data such as Baidu Baike and Hudong Baike to obtain encyclopedia-type resource knowledge as text information. It then structures the text information, for example by analyzing the headings and sub-headings of each piece of resource knowledge. For "disease" resource knowledge, it identifies the heading or sub-heading that corresponds to symptoms, the heading or sub-heading that corresponds to treatment methods, and so on.
As another example, the information determining means 101 mines resource knowledge from the Internet by data mining, uses it as text information, and structures it to determine structured information. For example, by mining vertical resource websites, the information determining means 101 obtains diseases together with their symptom descriptions, treatment methods, specialist hospitals, and similar information. Each resource is organized with the disease as its ID. For instance, some candidate seed words are first given for a category; for the disease category these might be coronary heart disease, myocarditis, gastritis, and so on. The URLs of websites that rank highly across the search results for these seed words are collected, and the structure of those websites is analyzed. From them, information on coronary heart disease, its symptoms, its treatment methods, and its specialist hospitals is extracted and merged under "diseases" of the coronary heart disease kind, and the coronary heart disease entry is organized into a name card and stored. "Coronary heart disease" then serves as the final text information, and the corresponding "symptoms of coronary heart disease", "treatment methods for coronary heart disease", "specialist hospitals for coronary heart disease", and similar information serve as the structured information corresponding to that text information.
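The "name card" organization described above can be pictured as one record per entity, keyed by the entity name, with the extracted sections merged under it. The sketch below is only an illustration; the `NameCard` class, its field names, and the `(entity, section, text)` input format are assumptions, not structures defined in the patent.

```python
from dataclasses import dataclass, field


@dataclass
class NameCard:
    """One stored record; the entity name acts as the resource ID."""
    entity: str
    category: str
    sections: dict = field(default_factory=dict)  # section name -> list of text snippets

    def merge(self, section, text):
        """Merge a snippet extracted from some website into the given section."""
        self.sections.setdefault(section, []).append(text)


def build_cards(category, extracted):
    """extracted: iterable of (entity, section, text) tuples from site-structure analysis."""
    cards = {}
    for entity, section, text in extracted:
        card = cards.setdefault(entity, NameCard(entity, category))
        card.merge(section, text)
    return cards


cards = build_cards("disease", [
    ("coronary heart disease", "symptoms", "chest tightness"),
    ("coronary heart disease", "treatment", "medication"),
    ("gastritis", "symptoms", "vomiting"),
])
```

Here the dictionary key plays the role of the text information ("coronary heart disease"), and `sections` plays the role of its structured information.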
Those skilled in the art should understand that the above ways of determining structured information are merely examples; other existing or future ways of determining structured information, where applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
The topic extracting means 102 extracts a topic word from the structured information. Specifically, the topic extracting means 102 takes the structured information determined by the information determining means 101 and extracts the topic word from it, for example using a topic classifier or another predetermined way of extracting topic words.
Here, the purpose of extracting the topic word is to obtain from the text information the topic that represents it, in support of building the semantic index and of the subsequent semantic matching computation.
Preferably, the index building device 1 further includes topic training means (not shown), which obtains, according to a predetermined topic system, a training corpus corresponding to that topic system, and trains a topic classifier on the training corpus; the topic extracting means 102 then extracts the topic word from the structured information using the topic classifier.
Specifically, the topic training means determines the predetermined topic system. For example, from statistics over the query sequences entered by a large number of web search users, it determines the search needs those users commonly have and, combining these with classification systems in common use today, such as those of Baike and Zhidao, determines a topic classification system for which there is real demand and uses it as the predetermined topic system. The topic training means then obtains a training corpus corresponding to the predetermined topic system. For example, if an article carries the location marker "医疗健康?内科" ("Medical Health?Internal Medicine"), that data is treated as training corpus for the disease category. The topic training means then trains a topic classifier on the training corpus, for example an SVM classification model.
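The two training steps, labeling pages by their category marker and fitting a classifier, might look like the sketch below. To stay self-contained it swaps in a simple multiclass perceptron over bag-of-words features in place of the SVM named in the text; the category paths, the ">" separator, and all function names are hypothetical.

```python
from collections import defaultdict

# Assumed mapping from a page's category marker to a topic of the predetermined system.
PATH_TO_TOPIC = {
    "Medical Health > Internal Medicine": "disease",
    "Entertainment > Music": "song",
}


def label_corpus(pages):
    """pages: iterable of (category_path, text). Keep pages whose path maps to a topic."""
    return [(text, PATH_TO_TOPIC[path]) for path, text in pages if path in PATH_TO_TOPIC]


def train_perceptron(corpus, epochs=10):
    """Linear stand-in for the SVM: update whenever the gold topic does not beat its best rival."""
    topics = sorted({topic for _, topic in corpus})  # needs at least two topics
    weights = defaultdict(lambda: defaultdict(float))  # topic -> word -> weight
    for _ in range(epochs):
        for text, gold in corpus:
            words = text.lower().split()
            scores = {t: sum(weights[t][w] for w in words) for t in topics}
            rival = max((t for t in topics if t != gold), key=scores.get)
            if scores[gold] <= scores[rival]:
                for w in words:
                    weights[gold][w] += 1.0
                    weights[rival][w] -= 1.0
    return topics, weights


def classify(model, text):
    """Return the topic with the highest linear score for the given text."""
    topics, weights = model
    words = text.lower().split()
    return max(topics, key=lambda t: sum(weights[t][w] for w in words))


corpus = label_corpus([
    ("Medical Health > Internal Medicine", "chest pain palpitations shortness of breath"),
    ("Entertainment > Music", "lyrics melody singer album"),
    ("Sports > Football", "goal match"),  # no mapped topic, so it is skipped
])
model = train_perceptron(corpus)
```

Calling `classify(model, "palpitations and chest pain")` then returns the topic learned for the medical page. A production system would use a proper SVM library and real word segmentation instead of whitespace splitting.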
Next, the topic extracting means 102 extracts the topic word from the structured information using the topic classifier trained by the topic training means. For example, the topic extracting means 102 feeds the entry word "coronary heart disease" together with its structured information, such as symptoms and treatment methods, into the topic classifier and obtains the topic "disease". As another example, for a newly arrived encyclopedia name card, the topic extracting means 102 feeds it into the topic classifier, such as the SVM classifier, and obtains the topic corresponding to the category of that name card.
Preferably, the topic extracting means 102 may also expand the extracted topic with synonymous expressions, for example expanding the topic "疾病" ("disease") by adding the synonymous topic "病" ("illness").
Those skilled in the art should understand that the above ways of extracting topic words are merely examples; other existing or future ways of extracting topic words, where applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
The tag determining means 103 determines, from the text information, tag words corresponding to the topic to which the topic word corresponds. Specifically, given the topic word extracted by the topic extracting means 102 and its topic, the tag determining means 103 determines from the text information the tag words that correspond to that topic. For example, for text information whose topic is disease, the tag determining means 103 determines tag words such as: palpitations and shortness of breath, chest tightness, diarrhea, vomiting, and weak limbs.
Preferably, the tag determining means 103 includes a candidate determining unit (not shown), a central word determining unit (not shown), and a tag determining unit (not shown). Specifically, the candidate determining unit determines, from the text information, at least one candidate tag word corresponding to the topic to which the topic word corresponds. For example, the candidate determining unit computes unigram, bigram, and trigram statistics over all page data organized by entry word, and extracts the words that appear in more than a certain number of pages as candidate tag words.
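The unigram, bigram, and trigram statistics can be sketched as a page-frequency count: each n-gram is counted once per page, and n-grams seen on more than a threshold number of pages become candidates. The function name, the pre-tokenized input, and the threshold value are assumptions for illustration.

```python
from collections import Counter


def candidate_tag_words(pages, page_threshold=1, max_n=3):
    """pages: list of token lists (already word-segmented).

    Returns the n-grams (n = 1..max_n) that appear in more than
    page_threshold pages.
    """
    page_freq = Counter()
    for tokens in pages:
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        page_freq.update(seen)  # each n-gram counts once per page
    return {gram for gram, count in page_freq.items() if count > page_threshold}


pages = [
    ["chest", "tightness", "vomiting"],
    ["chest", "tightness"],
    ["vomiting", "fever"],
]
candidates = candidate_tag_words(pages)
```

Counting pages rather than raw occurrences keeps a single long page from promoting its own phrases into the candidate set.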
The central word determining unit then determines the corresponding central word according to the at least one candidate tag word. Next, the tag determining unit determines the tag words corresponding to the topic according to the distance between the at least one candidate tag word and the central word.
For example, the central word determining unit merges all candidate tag words from the tag data counted earlier and computes offline statistics over them as follows: over a large-scale text collection, for example web-wide data, it counts how often the words co-occur within a document. For any two candidate tag words, the similarity between them is computed from their mutual information score.
Here, PMI(w′, w1) denotes the mutual information score between w′ and w1, defined in terms of the word probabilities, where P(w) denotes the probability of the counted word w.
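The text names PMI without giving its definition here, so the sketch below uses the standard pointwise mutual information, PMI(a, b) = log(P(a, b) / (P(a) P(b))), with probabilities estimated as document-level relative frequencies. Both the estimator and the function name are assumptions, and the similarity formula that builds on PMI is not reconstructed.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents, vocabulary):
    """documents: list of token lists; vocabulary: set of candidate tag words.

    Returns {(a, b): PMI} for every pair (in sorted order) that co-occurs
    in at least one document.
    """
    n_docs = len(documents)
    single = Counter()  # number of documents containing each word
    pair = Counter()    # number of documents containing both words of a pair
    for tokens in documents:
        present = sorted(set(tokens) & vocabulary)
        single.update(present)
        pair.update(combinations(present, 2))
    table = {}
    for (a, b), joint in pair.items():
        p_ab = joint / n_docs
        p_a = single[a] / n_docs
        p_b = single[b] / n_docs
        table[(a, b)] = math.log(p_ab / (p_a * p_b))
    return table


docs = [["vomiting", "nausea"], ["vomiting", "nausea"], ["fever"], ["cough"]]
table = pmi_table(docs, {"vomiting", "nausea", "fever"})
```

Pairs that never co-occur are simply absent from the table, which avoids taking the logarithm of zero.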
The central word determining unit then decides, according to the topic, which fields of the text information need to be analyzed, for example the symptom section for a disease, the poem itself and its explanation for a poem, or the description section for a person. From those fields it extracts all words that appear among the candidate tag words, together with their synonyms, and combines these words into a center, which serves as the central word corresponding to the at least one candidate tag word.
Next, the tag determining unit computes the distance between each of the at least one candidate tag word and the central word. For example, with T denoting the central word, the distance between a candidate tag word and the central word is obtained by a formula in which Num(T) denotes the number of words contained in the central word.
The tag determining unit then determines the tag words corresponding to the topic according to the distance between the at least one candidate tag word and the central word, for example by taking the candidate tag words whose distance to the central word is below a predetermined threshold as the tag words corresponding to the topic.
Preferably, as shown in Figure 3, the tag determining unit forms a series of the candidate tags' distances to the central word, ordered by rank; if the slope of the change between ranks exceeds a predetermined slope threshold, the subsequent nodes are cut off, as between the 5th and 6th ranked points in Figure 3.
Here, the slope threshold is set empirically, for example from statistics on the overall distribution of the scores.
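One way to read the slope-based cut-off is shown below: candidates are ranked by distance, the slope is taken as the increase in distance between consecutive ranks, and the list is cut at the first jump larger than the threshold. This interpretation, the function name, and the sample distances are all assumptions; the text does not spell out how the slope is computed.

```python
def truncate_by_slope(distances, slope_threshold):
    """distances: dict mapping candidate tag word -> distance to the central word.

    Returns the kept candidates, closest first, stopping before the first
    rank-to-rank jump in distance that exceeds slope_threshold.
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])
    kept = ranked[:1]
    for previous, current in zip(ranked, ranked[1:]):
        if current[1] - previous[1] > slope_threshold:
            break  # everything from this rank onward is cut off
        kept.append(current)
    return [word for word, _ in kept]


# Hypothetical distances with a sharp jump between the 5th and 6th ranks.
distances = {"a": 0.10, "b": 0.12, "c": 0.15, "d": 0.16, "e": 0.18, "f": 0.60, "g": 0.62}
tags = truncate_by_slope(distances, slope_threshold=0.2)
```

Compared with a fixed distance threshold, a cut at the largest jump adapts to topics whose candidate distances sit on different scales.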
Those skilled in the art should understand that the above ways of determining tag words are merely examples; other existing or future ways of determining tag words, where applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
More preferably, the central word determining unit filters the at least one candidate tag word according to a predetermined filtering rule to obtain at least one filtered candidate tag word, and determines the central word according to the at least one filtered candidate tag word; the predetermined filtering rule is determined based on at least any one of the following:
- the part of speech of the at least one candidate tag word;
- the wording rules of the at least one candidate tag word;
- the co-occurrence ratio of the at least one candidate tag word with the topic.
Specifically, noise may be introduced while the candidate tag words are being counted, so the candidate tag words need to be filtered. The central word determining unit filters the at least one candidate tag word according to a predetermined filtering rule to obtain at least one filtered candidate tag word.
For example, the central word determining unit filters the at least one candidate tag word according to its part of speech, for instance by filtering on the first word and the last word of the candidate tag word.
As another example, the central word determining unit filters the at least one candidate tag word according to its wording rules: the first character of a candidate tag word cannot be a character such as "把", "办", "被", or "比", and the last character cannot be a character such as "当", "到", or "得".
As a further example, the central word determining unit filters the at least one candidate tag word according to its co-occurrence ratio with the topic. For instance, it counts the co-occurrence ratio of each candidate tag word with the topic in search statistics logs and in web-wide titles, and keeps only the candidates that have co-occurred with the topic, or only those whose co-occurrence ratio with the topic exceeds a predetermined threshold.
Preferably, the central word determining unit filters the at least one candidate tag word by combining any two of the above predetermined filtering rules, or by taking all three into account.
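A sketch of combining two of the three rules follows. The forbidden first and last characters come from the description above; the definition of the co-occurrence ratio (the share of titles containing the word that also contain the topic) and all function names are assumptions. The part-of-speech rule is left out because it would need a tagger.

```python
BAD_FIRST = ("把", "办", "被", "比")  # characters a tag word may not start with
BAD_LAST = ("当", "到", "得")         # characters a tag word may not end with


def passes_wording_rule(word):
    """Wording rule: reject words with a forbidden first or last character."""
    return not word.startswith(BAD_FIRST) and not word.endswith(BAD_LAST)


def cooccurrence_ratio(word, topic, titles):
    """Assumed definition: fraction of titles containing the word that also contain the topic."""
    with_word = [title for title in titles if word in title]
    if not with_word:
        return 0.0
    return sum(topic in title for title in with_word) / len(with_word)


def filter_candidates(candidates, topic, titles, min_ratio=0.0):
    """Keep candidates that pass the wording rule and co-occur with the topic often enough."""
    return [
        word for word in candidates
        if passes_wording_rule(word) and cooccurrence_ratio(word, topic, titles) > min_ratio
    ]


titles = ["疾病 呕吐 怎么办", "呕吐 的 原因 疾病", "被子 清洗"]
kept = filter_candidates(["呕吐", "被子", "得到", "天气"], "疾病", titles)
```

With `min_ratio=0.0` the second rule reduces to "has co-occurred with the topic at least once"; raising it gives the thresholded variant.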
The central word determining unit then determines the central word according to the at least one filtered candidate tag word.
Those skilled in the art should understand that the above predetermined filtering rules are merely examples; other existing or future predetermined filtering rules, where applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
The index building means 104 builds an index for the topic word and the tag words. Specifically, the index building means 104 builds an index over the topic word extracted by the topic extracting means 102 and the tag words determined by the tag determining means 103.
For example, suppose the document for coronary heart disease is ID1, and the importance of a term x in that document is WC1(x), where x may be "disease", "palpitations and shortness of breath", and so on; the document for myocarditis is ID2, the document for gastritis is ID3, and the document for stroke is ID4. The index building means 104 builds a unified inverted index over topic words and tag words as follows:
disease (疾病) - ID1 (WC1(x)), ID2 (WC2(x)), ID3 (WC3(x)), ID4 (WC4(x))
palpitations and shortness of breath (心慌气短) - ID1 (WC1(x)), ID2 (WC2(x)), ID4 (WC4(x))
palpitations and shortness of breath, synonym (心悸气短) - ID1 (WC1(x)), ID2 (WC2(x)), ID4 (WC4(x))
vomiting (呕吐) - ID3 (WC3(x)), ID4 (WC4(x))
vomit (吐) - ID3 (WC3(x)), ID4 (WC4(x))
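The listing above is a standard inverted index: each term maps to the documents that contain it, with a per-document importance weight. A minimal sketch of building one is shown below; the input format and the weight values are hypothetical stand-ins for WC1(x) through WC4(x).

```python
from collections import defaultdict


def build_inverted_index(documents):
    """documents: dict mapping doc ID -> {term: importance weight in that document}.

    Returns {term: [(doc_id, weight), ...]} with postings sorted by doc ID.
    Topic words and tag words share the same index.
    """
    index = defaultdict(list)
    for doc_id, term_weights in documents.items():
        for term, weight in term_weights.items():
            index[term].append((doc_id, weight))
    for postings in index.values():
        postings.sort(key=lambda posting: posting[0])
    return dict(index)


# Hypothetical weights; the patent does not give numeric values.
documents = {
    "ID1": {"disease": 1.0, "palpitations": 0.8},
    "ID3": {"disease": 1.0, "vomiting": 0.9},
    "ID4": {"disease": 1.0, "palpitations": 0.5, "vomiting": 0.6},
}
index = build_inverted_index(documents)
```

Looking up a topic word and a tag word and intersecting their posting lists then yields the documents that carry both.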
Preferably, the index building device 1 further includes normalization means (not shown). If the tag words include several semantically equivalent tag words, the normalization means determines a normalized result for them; the index building means 104 then builds an index for the topic words, the tag words, and the normalized result.
Specifically, the tag words corresponding to the topic word "disease" may include several semantically equivalent tag words. For example, "throwing up" (吐) and "nausea and vomiting" (恶心呕吐) are semantically equivalent, so the normalization means determines "vomiting" (呕吐) as the normalized result of these two tag words; the index building means 104 then builds an index for the topic word "disease", the tag words "throwing up" and "nausea and vomiting", and the normalized result "vomiting".
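As a rough sketch of the normalization step, the snippet below collects the terms to index for one document: the topic word, each tag word, and the normalized form of any tag word that has one. The table `NORMALIZATION` and the function `index_terms` are hypothetical; the description does not say how semantic equivalence is detected, so a hand-written lookup table is assumed here.

```python
# Hypothetical normalization table: tag word -> normalized form.
NORMALIZATION = {
    "throwing up": "vomiting",
    "nausea and vomiting": "vomiting",
}

def index_terms(topic_word, tag_words, normalization):
    """Return the terms to index: topic word, tag words, normalized results."""
    terms = [topic_word]
    for tag in tag_words:
        terms.append(tag)
        normalized = normalization.get(tag)
        if normalized is not None:
            terms.append(normalized)
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(terms))

terms = index_terms("disease", ["throwing up", "nausea and vomiting"],
                    NORMALIZATION)
# terms == ["disease", "throwing up", "vomiting", "nausea and vomiting"]
```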
Those skilled in the art will understand that the above way of building an index is only an example; other existing or future ways of building an index, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
Conventionally, an index is built over keywords. Here, the index building device 1 additionally indexes topic words, tag words, and their normalized results, so that the user's query input information can be matched more closely to the resource knowledge.
Preferably, the means of the index building device 1 operate continuously. Specifically, the information determining means 101 determines structured information from the text information; the topic extraction means 102 extracts topic words from the structured information; the tag determination means 103 determines, from the text information, the tag words corresponding to the topic indicated by the topic words; and the index building means 104 builds an index for the topic words and the tag words. Those skilled in the art will understand that "continuously" means that each means of the index building device 1 performs its own task (determining structured information, extracting topic words, determining tag words, and building the index) according to a preset or dynamically adjusted operating mode, until the index building device 1 stops determining structured information for an extended period.
Here, the index building device 1 determines structured information from the text information, extracts topic words from the structured information, determines from the text information the tag words corresponding to the topic indicated by the topic words, and builds an index for the topic words and the tag words. Working from encyclopedia-type resource knowledge, or other resource knowledge mined from the network, the index building device extracts topics and titles to form an effective description of the resource content. This presents such high-quality resource knowledge better, makes subsequent semantic search over it more efficient, serves complex descriptive queries that users cannot express precisely with keywords, and improves the user experience.
Figure 2 is a schematic diagram of a device for matching a user's query input information according to another aspect of the present invention. The matching device 2 includes query acquisition means 201, information analysis means 202, matching query means 203, and text determination means 204.
The query acquisition means 201 acquires the query input information entered by the user. Specifically, the user enters query input information by interacting with user equipment, and the query acquisition means 201 acquires it by calling an application programming interface (API) provided by the user equipment, by calling a dynamic page technology such as JSP, ASP, or PHP, or through another agreed communication method.
Here, the query input information includes, but is not limited to, query input information submitted by the user through different input modes such as text input, voice input, and image input.
Those skilled in the art will understand that the above way of acquiring query input information is only an example; other existing or future ways of acquiring query input information, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
The information analysis means 202 performs topic and tag analysis on the query input information to obtain the topic words and tag words corresponding to it. Specifically, the information analysis means 202 analyzes the query input information acquired by the query acquisition means 201. For example, it feeds the query input information into the previously trained topic classifier to obtain the corresponding topic words, and it performs tag analysis on the query input information to obtain the corresponding tag words. The tag analysis here is the same as or similar to the way the tag determination means 103 determines the tag words of text information, so it is not repeated here and is incorporated herein by reference.
The matching query means 203 performs a matching query in the index built by the index building means 104, according to the topic words and tag words, to obtain candidate text information that matches the query input information. Specifically, based on the query input information acquired by the query acquisition means 201, the matching query means 203 queries the index using full or partial matching, for example, and takes the text information that hits the topic words or the tag words of the query input information as the candidate text information matching the query input information.
For example, suppose the user enters the query input information "palpitation and shortness of breath". The query acquisition means 201 acquires this query input information, and the information analysis means 202 performs tag analysis on it and obtains the tag word "palpitation and shortness of breath". The index built by the index building means 104 for this tag word is as follows:
Palpitation and shortness of breath (心慌气短) - ID1(WC1(x)), ID2(WC2(x)), ID4(WC4(x))
Here, ID1, ID2, and ID4 are the ID numbers of the text information containing the tag word "palpitation and shortness of breath", and WC1(x), WC2(x), and WC4(x) are the importance of that tag word in each of these texts.
The matching query means 203 then queries the index built by the index building means 104 with the tag word "palpitation and shortness of breath" and, from the index entry above, obtains the candidate text information for this query: text information ID1, ID2, and ID4.
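The lookup in this example can be sketched as follows. The function name `lookup_candidates` and the importance values are assumptions made for illustration; the description only specifies that the posting list of the tag word yields the candidate documents.

```python
def lookup_candidates(index, query_terms):
    """Return IDs of documents whose posting lists contain any query term."""
    seen = {}
    for term in query_terms:
        for doc_id, _importance in index.get(term, []):
            seen.setdefault(doc_id, None)  # a dict keeps first-seen order
    return list(seen)

# Posting list from the example; importance values are illustrative only.
index = {
    "palpitation and shortness of breath": [
        ("ID1", 0.7), ("ID2", 0.5), ("ID4", 0.4),
    ],
}
candidates = lookup_candidates(index, ["palpitation and shortness of breath"])
# candidates == ["ID1", "ID2", "ID4"]
```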
Those skilled in the art will understand that the above way of performing a matching query is only an example; other existing or future ways of performing a matching query, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
The text determination means 204 determines the target text information that matches the query input information according to the semantic matching degree between the candidate text information and the query input information.
Specifically, there is a semantic matching degree between the candidate text information and the query input information. It can be computed directly, or obtained by computing the degree of match between the index word set of the candidate text information and the matching word set of the query input information. The text determination means 204 uses this semantic matching degree to determine the target text information: for example, it takes the candidate text information with the highest semantic matching degree as the target, or it takes every candidate whose semantic matching degree exceeds a predetermined matching degree threshold.
Here, the predetermined matching degree threshold is the semantic matching degree used to decide whether candidate text information matches the query input information; its value may be preset and fixed, or adjusted according to actual conditions.
Preferably, the text determination means further includes a matching calculation unit (not shown) and a text determination unit (not shown). The matching calculation unit calculates the semantic matching degree between the candidate text information and the query input information; the text determination unit determines the target text information that matches the query input information according to the semantic matching degree and the predetermined matching degree threshold.
For example, the matching calculation unit calculates the semantic matching degree between the candidate text information and the user's query input information using an existing matching degree calculation method; when the semantic matching degree is greater than the predetermined matching degree threshold, the text determination unit takes the candidate text information as the target text information that matches the query input information.
Preferably, the text determination means may also determine the target text information from the index word set of the candidate text information and the matching word set of the query input information. Specifically, each candidate text has a corresponding index word set. In the example above, if the topic of candidate text ID1 is "coronary heart disease" and its index words include "disease" and "palpitation and shortness of breath", then these index words form the index word set of candidate text ID1. The user's query input information likewise has a matching word set: for example, word segmentation is applied to the query input information to obtain matching words, and the set of these matching words is the matching word set. If the user enters "palpitation and shortness of breath vomiting" (心慌气短呕吐), the matching device 2 segments it into the matching words "palpitation and shortness of breath" and "vomiting", and these two words form the matching word set of the query. The text determination means 204 determines the target text information from the index word sets and the matching word set: for example, it takes the text whose index word set hits the most matching words in the matching word set, or it takes every text whose index word set hits more matching words than a predetermined number threshold.
For example, for the candidate texts ID1, ID2, and ID4 above, the index word set of ID1 includes "disease" and "palpitation and shortness of breath"; the index word set of ID2 includes "palpitation and shortness of breath", "vomiting", and "disease"; and the index word set of ID4 includes "palpitation and shortness of breath". For the query input information "palpitation and shortness of breath vomiting", whose matching words are "palpitation and shortness of breath" and "vomiting", the index word set of ID2 hits the most matching words, so candidate text ID2 is taken as the target text information that best matches the query. Alternatively, if the predetermined number threshold is 0, the index word sets of ID1, ID2, and ID4 all hit more matching words than the threshold, so all three are taken as target text information matching the query. When the matching device 2 presents these results to the user, it may sort them by the importance of the corresponding index words in each candidate text.
Those skilled in the art will understand that the above way of determining target text information is only an example; other existing or future ways of determining target text information, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
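The two selection rules in this example (most hits, or hits above a number threshold) can be sketched as below. The function names are hypothetical; the index word sets and matching words are the ones given in the example.

```python
def hit_counts(index_word_sets, match_words):
    """Count, per document, how many matching words its index word set hits."""
    return {doc: len(words & match_words)
            for doc, words in index_word_sets.items()}

def best_matches(counts):
    """Documents whose index word set hits the most matching words."""
    top = max(counts.values())
    return sorted(doc for doc, count in counts.items() if count == top)

def above_threshold(counts, threshold):
    """Documents whose hit count is strictly greater than the threshold."""
    return sorted(doc for doc, count in counts.items() if count > threshold)

index_word_sets = {
    "ID1": {"disease", "palpitation and shortness of breath"},
    "ID2": {"palpitation and shortness of breath", "vomiting", "disease"},
    "ID4": {"palpitation and shortness of breath"},
}
match_words = {"palpitation and shortness of breath", "vomiting"}

counts = hit_counts(index_word_sets, match_words)
# best_matches(counts) == ["ID2"]
# above_threshold(counts, 0) == ["ID1", "ID2", "ID4"]
```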
Preferably, the means of the matching device 2 operate continuously. Specifically, the query acquisition means 201 acquires the query input information entered by the user; the information analysis means 202 performs topic and tag analysis on the query input information to obtain the corresponding topic words and tag words; the matching query means 203 performs a matching query in the index built by the index building means 104, according to the topic words and tag words, to obtain candidate text information; and the text determination means 204 determines the target text information according to the semantic matching degree between the candidate text information and the query input information. Those skilled in the art will understand that "continuously" means that each means of the matching device 2 performs its own task (acquiring query input information, analyzing topics and tags, querying for candidate text information, and determining target text information) according to a preset or dynamically adjusted operating mode, until the matching device 2 stops acquiring user query input information for an extended period.
Here, the means of the index building device 1 and the matching device 2 cooperate so that, given the query input information entered by the user, the corresponding target text information is obtained by matching. Working from encyclopedia-type resource knowledge, or other resource knowledge mined from the network, topics and titles are extracted to form an effective description of the resource content. This presents such high-quality resource knowledge better, makes semantic search over it more efficient, serves complex descriptive queries that users cannot express precisely with keywords, and improves the user experience.
Preferably, the topic words and tag words may also be treated as two different fields, a topic field and a tag field. The matching query means 203 then performs matching queries, according to the topic words and tag words, in the indexes corresponding to the topic field and the tag field respectively, to obtain candidate text information that matches the query input information.
Specifically, using the topic words and tag words that the information analysis means 202 obtained by analyzing the user's query input information, the matching query means 203 performs field-by-field matching: it queries the index of the topic field and the index of the tag field separately to obtain the candidate text information.
Here, the topic field and the tag field can be obtained by analyzing the query input information; for example, the aforementioned topic classifier is applied to the user's query input information to obtain its topic category.
Here, the indexes corresponding to the topic field and the tag field are the indexes built by the index building means 104. Tag words are extracted from the user's query input information using the previously established tags: any term that appears in the query input information and also belongs to the tag set is extracted. The tag words and the topic category are then used to retrieve candidate documents from the posting lists of the unified topic and tag index; documents containing the topic category or a tag are taken as the candidate text information for the query and take part in the subsequent calculation.
Preferably, the matching query means 203 may also take into account the weights assigned to the topic field and the tag field: it performs the matching query in the corresponding indexes and combines the two field weights to obtain the final candidate text information.
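One possible reading of field-weighted matching is sketched below: a hit in the topic field and a hit in the tag field contribute different weights to a candidate's score. The description does not specify how the weights are combined, so the additive scoring, the weight values, and the function name are all assumptions.

```python
def field_weighted_candidates(topic_index, tag_index, topic_terms, tag_terms,
                              w_topic=0.6, w_tag=0.4):
    """Rank candidates by summed field weights (assumed additive scheme)."""
    scores = {}
    for term in topic_terms:
        for doc_id in topic_index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_topic
    for term in tag_terms:
        for doc_id in tag_index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tag
    # Highest score first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

topic_index = {"disease": ["ID1", "ID3"]}   # hypothetical topic-field index
tag_index = {"vomiting": ["ID3", "ID4"]}    # hypothetical tag-field index
ranked = field_weighted_candidates(topic_index, tag_index,
                                   ["disease"], ["vomiting"])
```

With these inputs, ID3 is hit in both fields and ranks first, followed by ID1 (topic field only) and ID4 (tag field only).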
Preferably, the text determination means 204 determines a target index word set among the index word sets of the candidate text information according to the matching words in the matching word set, where the target index word set is the one that hits the most matching words in the matching word set. If the similarity between the target index word set and the matching word set is greater than a predetermined threshold, the text information corresponding to the target index word set is taken as the target text information that matches the query input information.
Specifically, the text determination means 204 counts how many matching words each candidate's index word set hits and takes the index word set with the most hits as the target index word set. It then calculates the similarity between the target index word set and the matching word set: for example, it calculates the similarity between each hit index word and its corresponding matching word, then combines these values by simple summation or weighted averaging. When the resulting similarity is greater than the predetermined threshold, the text determination means takes the text information corresponding to the target index word set as the target text information that matches the query input information.
Here, the predetermined threshold is the similarity threshold used to decide, from the similarity between the target index word set and the matching word set, whether the text information corresponding to the target index word set is taken as the target text information; its value may be fixed or adjusted according to actual conditions.
Preferably, the matching device 2 further includes word set determination means (not shown). The word set determination means performs word segmentation on the query input information to obtain segmented words, and merges the segmented words with the topic words and tag words obtained by the information analysis means 202 to obtain the matching word set corresponding to the query input information; the words in the matching word set serve as matching words. The matching calculation unit then calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set of the candidate text information.
Specifically, the word set determination means performs word segmentation on the query input information acquired by the query acquisition means 201. Preferably, it also filters the segmented words, for example by removing stop words, to obtain the final segmented words. It then merges the segmented words with the topic words and tag words obtained by the information analysis means 202 and removes redundant entries, which yields the matching word set corresponding to the query input information; the words in this set are the matching words for the query.
The matching calculation unit then calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set of the candidate text information.
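The construction of the matching word set can be sketched as below. Chinese word segmentation needs a dedicated segmenter, which the description does not name; whitespace splitting on an English query stands in for it here. The stop-word list and the function name `build_match_set` are likewise hypothetical.

```python
STOP_WORDS = {"a", "and", "the", "with"}  # hypothetical stop-word list

def build_match_set(query, topic_words, tag_words, stop_words=STOP_WORDS):
    """Segment the query, drop stop words, merge with topic and tag words."""
    # Whitespace splitting is a stand-in for a real word segmenter.
    tokens = [t for t in query.lower().split() if t not in stop_words]
    merged = tokens + list(topic_words) + list(tag_words)
    # Redundancy removal: drop duplicates, keep first-seen order.
    return list(dict.fromkeys(merged))

match_set = build_match_set("fever and cough", ["disease"], ["cough"])
# match_set == ["fever", "cough", "disease"]
```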
More preferably, the matching device 2 further includes subsequent processing means (not shown). The subsequent processing means performs subsequent processing on the matching words to update the matching word set, where the subsequent processing includes at least one of the following:
- determining mutually synonymous matching words among the matching words, and merging them into a subset of the matching word set;
- performing synonym expansion on the matching words, and taking the resulting synonyms together with the matching words as a subset of the matching word set.
Specifically, the subsequent processing means processes the matching words in the matching word set determined by the word set determination means in order to update that set. For example, it determines which matching words are mutually synonymous and merges them into a subset of the matching word set. Matching words may include mutual synonyms such as "vomiting" (呕吐) and "throwing up" (吐); the subsequent processing means merges such words into one subset of the matching word set.
For example, suppose the user's query input information is Q. After the word set determination means has segmented it and filtered it (for example by removing stop words), the matching word set in the tag field is Q = {a, b, c, d, e}, where a, b, c, d, and e are the matching words in the set. If the matching words a and b are mutual synonyms, the subsequent processing means merges them into a subset of the matching word set, and the updated matching word set is Q = {{a, b}, c, d, e}. Downstream means, such as the matching query means 203, then carry out the subsequent matching query.
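The merge from Q = {a, b, c, d, e} to Q = {{a, b}, c, d, e} can be sketched as follows. The function name and the representation (a list of frozensets, one per synonym group) are assumptions; the description does not say how synonym pairs are detected, so they are passed in explicitly.

```python
def merge_synonyms(words, synonym_pairs):
    """Group matching words into subsets of mutual synonyms."""
    groups = [{w} for w in dict.fromkeys(words)]  # one group per distinct word
    for x, y in synonym_pairs:
        gx = next((g for g in groups if x in g), None)
        gy = next((g for g in groups if y in g), None)
        if gx is None or gy is None or gx is gy:
            continue  # unknown word, or already in the same group
        gx |= gy
        groups = [g for g in groups if g is not gy]
    return [frozenset(g) for g in groups]

merged = merge_synonyms(["a", "b", "c", "d", "e"], [("a", "b")])
# merged == [frozenset({"a", "b"}), frozenset({"c"}),
#            frozenset({"d"}), frozenset({"e"})]
```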
As another example, the subsequent processing means performs synonym expansion on the matching words and takes the resulting synonyms together with the matching words as a subset of the matching word set. Specifically, it may expand the matching words in the matching word set of the query input information with their synonyms, for example expanding "heart palpitations with shortness of breath" (心悸气短) with its synonym "palpitation and shortness of breath" (心慌气短), and then take the synonyms obtained and the original matching word together as a subset of the matching word set.
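Synonym expansion can be sketched in the same representation, with each subset of the matching word set held as a frozenset. The dictionary `SYNONYMS` and the function `expand_synonyms` are hypothetical; the description does not specify the synonym resource.

```python
# Hypothetical synonym dictionary: word -> list of synonyms.
SYNONYMS = {
    "heart palpitations with shortness of breath":
        ["palpitation and shortness of breath"],
}

def expand_synonyms(groups, synonyms):
    """Add each word's synonyms to the subset that contains the word."""
    expanded = []
    for group in groups:
        new_group = set(group)
        for word in group:
            new_group.update(synonyms.get(word, ()))
        expanded.append(frozenset(new_group))
    return expanded

groups = [frozenset({"heart palpitations with shortness of breath"}),
          frozenset({"vomiting"})]
expanded = expand_synonyms(groups, SYNONYMS)
```

After expansion the first subset holds both phrasings, while "vomiting", which has no entry in the dictionary, is left unchanged.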
Continuing the example, for the matching word set Q = {{a, b}, c, d, e} obtained after synonym merging, the subsequent processing means may also perform synonym expansion on the set: it obtains the synonyms of the matching words a, b, c, d, and e, and takes each matching word together with its synonyms as a subset of the matching word set. For example, after several rounds of synonym expansion, the matching word set Q takes the following form:
[expression not reproduced in this text]
The matching query means 203 then performs a matching query, according to this matching word set, in the index built by the index building means 104; for example, through the inverted index it obtains the candidate text information containing […].
假设将命中匹配词集中最多的匹配词的索引词集表示为C,则C 为:Assuming that the index word set that hits the most matching words in the matching word set is denoted by C, then C is:
其中,C表示同义命中的最大w1i对应的位置语义 映射的词集合 where C represents the maximum number of synonymous hits The word set of the position semantic mapping corresponding to w 1i
The matching calculation unit then calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.
The semantic matching degree between Q and C can be calculated by the following formula: [formula not reproduced in this text]
where the weight of a word is expressed here as (log(TF) + 1) * log(N/DF), and Match(T_Q, T_C) indicates whether the index word set and the matching word set match in topic.
The value of Match(T_Q, T_C) can be defined as needed; for example, Match(T_Q, T_C) is 1 if the index word set and the matching word set match in topic, and 0.5 otherwise.
Then, if the calculated semantic matching degree is greater than a predetermined threshold, the text determination unit takes the text information corresponding to the index word set as the target text information matching the query input information.
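The two components of the score that survive in the text, the word weight (log(TF) + 1) * log(N/DF) and the factor Match(T_Q, T_C), can be sketched as below. The full matching-degree formula is not available here, so the sketch does not attempt to combine them. Reading TF as term frequency, DF as document frequency and N as the total number of documents is an assumption based on common usage; the function names are hypothetical.

```python
import math


def term_weight(tf: int, df: int, n_docs: int) -> float:
    """Word weight (log(TF) + 1) * log(N / DF); natural logarithm assumed."""
    if tf <= 0 or df <= 0:
        return 0.0  # a word that never occurs carries no weight
    return (math.log(tf) + 1.0) * math.log(n_docs / df)


def topic_match(query_topic: str, candidate_topic: str) -> float:
    """Match(T_Q, T_C): 1 when the topics agree, otherwise 0.5."""
    return 1.0 if query_topic == candidate_topic else 0.5


def is_target(matching_degree: float, threshold: float) -> bool:
    """A candidate becomes target text when its score exceeds the threshold."""
    return matching_degree > threshold


weight = term_weight(tf=1, df=10, n_docs=1000)  # (0 + 1) * log(100)
```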
FIG. 4 shows a flowchart of a method for building an index based on text information according to yet another aspect of the present invention.
In step S401, the index building device 1 determines structured information from text information. Specifically, in step S401 the index building device 1 obtains text information, for example by interacting with a data source such as encyclopedia data, and then structures the text information, for example by analyzing the directory information and sub-directory information it contains, to determine the structured information.
For example, in step S401 the index building device 1 interacts with encyclopedia data such as Baidu Baike and Hudong Baike (互动百科) to obtain the resource knowledge of these encyclopedias as text information. It then structures the text information, for example by analyzing the directories and sub-directories of each piece of resource knowledge; for resource knowledge about a "disease", it identifies the directory or sub-directory for symptoms, the directory or sub-directory for treatment methods, and so on.
As another example, in step S401 the index building device 1 mines resource knowledge from the Internet by data mining and uses it as text information, and then structures the text information to determine structured information. For example, the index building device 1 mines vertical resource websites to obtain diseases together with their symptom descriptions, treatment methods, specialist hospitals and similar information. Each resource is organized with the disease as its ID. For instance, some candidate seed words are first given by category, such as coronary heart disease, myocarditis and gastritis for the category of diseases; the URLs of the websites that rank highly across the search results are obtained, and the structure of those websites is analyzed to extract information on coronary heart disease, its symptoms, its treatment methods and its specialist hospitals. This information is merged under coronary heart disease as an instance of "disease", and the entry is organized into a card and stored. "Coronary heart disease" can then serve as the final text information, while the corresponding "symptoms of coronary heart disease", "treatment methods for coronary heart disease", "hospitals specializing in coronary heart disease" and similar information can serve as the structured information of that text information.
Those skilled in the art should understand that the above ways of determining structured information are merely examples; other existing or future ways of determining structured information, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
In step S402, the index building device 1 extracts topic words from the structured information. Specifically, in step S402 the index building device 1 extracts topic words from the structured information determined in step S401, for example using a topic classifier or another predetermined way of extracting topic words.
Here, the purpose of extracting topic words is to extract from the text information the topic that represents it, in service of building the semantic index and the subsequent semantic matching calculation.
Preferably, the method further includes step S405 (not shown). In step S405, the index building device 1 obtains, according to a predetermined topic system, training corpus corresponding to that topic system, and trains a topic classifier on the training corpus; in step S402, the index building device 1 then extracts the topic words from the structured information using the topic classifier.
Specifically, in step S405 the index building device 1 determines the predetermined topic system. For example, it determines the search needs common among web search users from statistics over the query sequences entered by a large number of such users, combines them with classification systems in common use, such as those of Baike and Zhidao, to determine a topic classification system that reflects real demand, and uses it as the predetermined topic system. The index building device 1 then obtains the training corpus corresponding to the predetermined topic system; for example, if an article carries the position identifier "医疗健康 ?内科" (medical and health, internal medicine), that data is treated as training corpus for the disease category. Finally, the index building device 1 trains the topic classifier on the training corpus, for example by training an SVM classification model to serve as the topic classifier.
Next, in step S402, the index building device 1 extracts topic words from the structured information using the topic classifier trained in step S405. For example, it inputs the term "coronary heart disease" together with structured information such as its symptoms and treatment methods into the topic classifier and obtains the topic "disease". As another example, for a newly arrived encyclopedia card, the index building device 1 inputs it into the topic classifier, such as the SVM classifier, and obtains the topic corresponding to the category of that card.
Preferably, in step S402 the index building device 1 may also expand the extracted topic with synonymous expressions, for example expanding the topic "疾病" (disease) by adding the synonymous topic "病" (illness).
Those skilled in the art should understand that the above ways of extracting topic words are merely examples; other existing or future ways of extracting topic words, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
In step S403, the index building device 1 determines, from the text information, the tag words corresponding to the topic to which the topic word belongs. Specifically, in step S403 the index building device 1 uses the topic word extracted in step S402 and its corresponding topic to determine the tag words for that topic from the text information. For example, for text information whose topic is disease, the index building device 1 determines tag words such as the following: 心慌气短 (palpitations and shortness of breath), chest tightness, diarrhea, vomiting, weakness of the limbs, and so on.
Preferably, step S403 further includes sub-steps S403a, S403b and S403c (none shown). Specifically, in sub-step S403a the index building device 1 determines, from the text information, at least one candidate tag word corresponding to the topic to which the topic word belongs. For example, it compiles unigram, bigram and trigram statistics over all page data organized by term, and extracts the words that appear in more than a certain number of pages as candidate tag words.
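The unigram, bigram and trigram statistics of sub-step S403a can be sketched as follows. The sketch assumes the pages are already segmented into tokens; the function names, the sample pages and the `min_pages` parameter are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def ngrams(tokens: List[str], max_n: int = 3) -> Set[str]:
    """Return the distinct 1-, 2- and 3-grams of one tokenized page."""
    found: Set[str] = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            found.add("".join(tokens[i:i + n]))
    return found


def candidate_tags(pages: Iterable[List[str]], min_pages: int) -> Set[str]:
    """Keep the n-grams that occur in more than `min_pages` pages."""
    page_counts: Counter = Counter()
    for tokens in pages:
        page_counts.update(ngrams(tokens))  # each n-gram counts once per page
    return {gram for gram, count in page_counts.items() if count > min_pages}


# Hypothetical, already-segmented pages.
pages = [
    ["心慌", "气短", "胸闷"],
    ["心慌", "气短", "呕吐"],
    ["胸闷", "腹泻"],
]
tags = candidate_tags(pages, min_pages=1)
```

With these three pages, only "心慌", "气短", "胸闷" and the bigram "心慌气短" appear on more than one page, so they form the candidate set.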
Then, in sub-step S403b, the index building device 1 determines the corresponding center word from the at least one candidate tag word. Next, in sub-step S403c, the index building device 1 determines the tag words corresponding to the topic according to the distance between each candidate tag word and the center word.
For example, in sub-step S403b the index building device 1 merges all candidate tag words based on the tag data collected above and computes offline statistics over them as follows: over large-scale text, such as data from the entire web, it counts how often the candidate tag words co-occur in the same document. For any two candidate tag words, the similarity between them is calculated by the following formula: [formula not reproduced in this text]
Here, PMI(w′, w_1) denotes the mutual information score between w′ and w_1, defined as [formula not reproduced in this text], and P(w) denotes the probability of the counted word w.
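Since the patent's own definition is not available in this text, the sketch below uses the standard pointwise mutual information, log(P(w1, w2) / (P(w1) P(w2))), with probabilities estimated from document-level co-occurrence counts. Whether the patent's formula has exactly this form is an assumption, and the function name and sample documents are hypothetical.

```python
import math
from typing import List, Set


def pmi(docs: List[Set[str]], w1: str, w2: str) -> float:
    """Standard PMI estimated from how often two words share a document."""
    n = len(docs)
    c1 = sum(1 for d in docs if w1 in d)
    c2 = sum(1 for d in docs if w2 in d)
    c12 = sum(1 for d in docs if w1 in d and w2 in d)
    if c12 == 0:
        return float("-inf")  # the two words never co-occur
    return math.log((c12 / n) / ((c1 / n) * (c2 / n)))


# Hypothetical documents, each reduced to the set of candidate tag words it contains.
docs = [{"心慌气短", "胸闷"}, {"心慌气短", "胸闷"}, {"呕吐"}, {"腹泻"}]
score = pmi(docs, "心慌气短", "胸闷")  # log(0.5 / (0.5 * 0.5)) = log 2
```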
Then, in sub-step S403b, the index building device 1 determines, according to the topic, which fields of the text information need to be analyzed, such as the symptom category for a disease, the poem itself and its interpretation for poetry, or the description section for a person. It then extracts from those fields all words that appear among the candidate tag words, together with their synonyms, and groups these words into a center that serves as the center word corresponding to the at least one candidate tag word.
Next, in sub-step S403c, the index building device 1 calculates the distance between each candidate tag word and the center word. For example, with T denoting the center word, the distance between a candidate tag word and the center word can be calculated by the following formula: [formula not reproduced in this text]
Here, Num(T) denotes the number of words contained in the center word.
Then, in sub-step S403c, the index building device 1 determines the tag words corresponding to the topic according to the distance between each candidate tag word and the center word; for example, candidate tag words whose distance from the center word is less than a predetermined threshold are taken as the tag words corresponding to the topic.
Preferably, as shown in FIG. 3, in sub-step S403c the index building device 1 arranges the distances between the candidate tag words and the center word as a series ordered by rank; if the slope of the change from one rank to the next is greater than a predetermined slope threshold, the subsequent nodes are cut off, as between the 5th and 6th ranked points in FIG. 3.
Here, the slope threshold is set empirically, for example from statistics on the overall distribution of scores.
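The slope-based cut-off can be sketched as follows, assuming the candidates are already sorted by ascending distance to the center word and that the slope is the difference in distance between consecutive ranks. The function name, the sample distances and the threshold value are hypothetical.

```python
from typing import List, Tuple


def cut_by_slope(ranked: List[Tuple[str, float]], slope_threshold: float) -> List[str]:
    """Keep candidates up to the first rank-to-rank jump above the threshold.

    `ranked` holds (tag word, distance to center word) in ascending distance.
    """
    kept: List[str] = []
    for i, (word, dist) in enumerate(ranked):
        if i > 0 and dist - ranked[i - 1][1] > slope_threshold:
            break  # the curve steepens here: drop this node and all later ones
        kept.append(word)
    return kept


# Hypothetical ranking with a sharp jump between the 5th and 6th points.
ranked = [("t1", 0.10), ("t2", 0.12), ("t3", 0.15),
          ("t4", 0.17), ("t5", 0.20), ("t6", 0.60)]
selected = cut_by_slope(ranked, slope_threshold=0.2)
```

In this example the first five candidates are kept and the sixth is cut off, which corresponds to the break between the 5th and 6th points described above.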
Those skilled in the art should understand that the above ways of determining tag words are merely examples; other existing or future ways of determining tag words, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
More preferably, in sub-step S403b the index building device 1 filters the at least one candidate tag word according to a predetermined filtering rule to obtain at least one filtered candidate tag word, and determines the center word from the at least one filtered candidate tag word; the predetermined filtering rule is determined based on at least any one of the following:
- the part of speech of the at least one candidate tag word;
- the wording rules of the at least one candidate tag word;
- the co-occurrence ratio between the at least one candidate tag word and the topic.
Specifically, noise may be introduced while the candidate tag words are being counted, so the candidate tag words need to be filtered. In sub-step S403b, the index building device 1 filters the at least one candidate tag word according to the predetermined filtering rule to obtain at least one filtered candidate tag word.
For example, in sub-step S403b the index building device 1 filters the at least one candidate tag word according to its part of speech, such as by filtering on the first word and the last word of each candidate tag word.
As another example, in sub-step S403b the index building device 1 filters the at least one candidate tag word according to wording rules: the first character of a candidate tag word cannot be a character such as "把", "办", "被" or "比", and the last character cannot be a character such as "当", "到" or "得".
As a further example, in sub-step S403b the index building device 1 filters the at least one candidate tag word according to its co-occurrence ratio with the topic. For instance, it counts, in search statistics logs and in titles across the entire web, the co-occurrence ratio between each candidate tag word and the topic; only candidates that have co-occurred with the topic are retained, or alternatively only candidates whose co-occurrence ratio with the topic is greater than a predetermined threshold are retained.
Preferably, in sub-step S403b the index building device 1 filters the at least one candidate tag word by combining any two of the above predetermined filtering rules or by considering all three together.
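Two of the three filtering rules, the wording rule and the co-occurrence ratio, can be combined as in the sketch below. The part-of-speech rule is left out because it needs a tagger. The character lists come from the example above; the function name, the sample ratios and the `min_ratio` value are hypothetical.

```python
from typing import Dict, Iterable, List

BAD_FIRST_CHARS = {"把", "办", "被", "比"}  # characters a tag word cannot start with
BAD_LAST_CHARS = {"当", "到", "得"}         # characters a tag word cannot end with


def filter_candidates(
    candidates: Iterable[str],
    cooccurrence_ratio: Dict[str, float],
    min_ratio: float,
) -> List[str]:
    """Drop candidates that break the wording rule or rarely co-occur with the topic."""
    kept: List[str] = []
    for word in candidates:
        if not word:
            continue
        if word[0] in BAD_FIRST_CHARS or word[-1] in BAD_LAST_CHARS:
            continue  # wording rule
        if cooccurrence_ratio.get(word, 0.0) <= min_ratio:
            continue  # co-occurrence ratio with the topic is too low
        kept.append(word)
    return kept


# Hypothetical co-occurrence ratios with the topic "疾病".
ratios = {"心慌气短": 0.4, "被感染": 0.3, "感觉到": 0.3, "天气": 0.0}
filtered = filter_candidates(["心慌气短", "被感染", "感觉到", "天气"], ratios, min_ratio=0.1)
```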
Then, in sub-step S403b, the index building device 1 determines the center word from the at least one filtered candidate tag word.
Those skilled in the art should understand that the above predetermined filtering rules are merely examples; other existing or future predetermined filtering rules, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
In step S404, the index building device 1 builds an index for the topic word and the tag words. Specifically, in step S404 the index building device 1 builds an index over the topic word extracted in step S402 and the tag words determined in step S403.
For example, suppose the document for coronary heart disease is ID1 and the importance of x in that document is WC1(x), where x may be, for example, "疾病" (disease) or "心慌气短" (palpitations and shortness of breath); the document for myocarditis is ID2, the document for gastritis is ID3, and the document for stroke is ID4. In step S404, the index building device 1 builds a unified inverted index over the topic words and tag words as follows:
疾病 (disease) - ID1 (WC1(x)), ID2 (WC2(x)), ID3 (WC3(x)), ID4 (WC4(x))
心慌气短 (palpitations and shortness of breath, colloquial) - ID1 (WC1(x)), ID2 (WC2(x)), ID4 (WC4(x))
心悸气短 (palpitations and shortness of breath, clinical) - ID1 (WC1(x)), ID2 (WC2(x)), ID4 (WC4(x))
呕吐 (vomiting) - ID3 (WC3(x)), ID4 (WC4(x))
吐 (to throw up) - ID3 (WC3(x)), ID4 (WC4(x))
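An inverted index of the shape listed above can be built as in the following sketch. The document contents and the importance values standing in for WC(x) are invented for illustration, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical documents: ID -> {index word: importance WC(x)}.
documents: Dict[str, Dict[str, float]] = {
    "ID1": {"疾病": 1.0, "心慌气短": 0.8},
    "ID2": {"疾病": 1.0, "心慌气短": 0.6},
    "ID3": {"疾病": 1.0, "呕吐": 0.7},
    "ID4": {"疾病": 1.0, "心慌气短": 0.5, "呕吐": 0.4},
}


def build_inverted_index(
    docs: Dict[str, Dict[str, float]]
) -> Dict[str, List[Tuple[str, float]]]:
    """Map each topic or tag word to the (document ID, importance) pairs that contain it."""
    index: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for doc_id in sorted(docs):
        for word, importance in docs[doc_id].items():
            index[word].append((doc_id, importance))
    return dict(index)


index = build_inverted_index(documents)
# index["心慌气短"] lists ID1, ID2 and ID4; index["呕吐"] lists ID3 and ID4.
```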
Preferably, the method further includes step S406 (not shown). In step S406, if the tag words include multiple semantically equivalent tag words, the index building device 1 determines a normalized result for them; in step S404, the index building device 1 then builds an index for the topic word, the tag words and the normalized result.
Specifically, the tag words corresponding to the topic word "疾病" (disease) may include several semantically equivalent tag words, such as "吐" (to throw up) and "恶心呕吐" (nausea and vomiting). In step S406 the index building device 1 determines that the normalized result of these two tag words is "呕吐" (vomiting); then, in step S404, it builds an index for the topic word "疾病", the tag words "吐" and "恶心呕吐", and the normalized result "呕吐".
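The normalization of step S406 can be sketched as a lookup that adds the normalized form alongside each tag word before indexing. The `NORMALIZED` table reproduces the example above; how such a table is obtained is not described in the text, so the table and the function name are assumptions.

```python
from typing import Dict, List

# Normalization table taken from the example; its construction is an assumption.
NORMALIZED: Dict[str, str] = {"吐": "呕吐", "恶心呕吐": "呕吐"}


def index_terms(tag_words: List[str]) -> List[str]:
    """Return the tag words plus their normalized results, without duplicates."""
    terms: List[str] = []
    for word in tag_words:
        for term in (word, NORMALIZED.get(word, word)):
            if term not in terms:
                terms.append(term)
    return terms


terms = index_terms(["吐", "恶心呕吐"])
# terms == ["吐", "呕吐", "恶心呕吐"]
```

Indexing all three terms means a query using any of the variants can reach the same documents.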
Those skilled in the art should understand that the above ways of building an index are merely examples; other existing or future ways of building an index, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
Conventionally, indexes are built over keywords. Here, the index building device 1 additionally indexes topic words, tag words and their normalized results, so that the user's query input information is matched more effectively against the resource knowledge.
Preferably, the steps of the index building device 1 operate continuously. Specifically, in step S401 the index building device 1 determines structured information from text information; in step S402 it extracts topic words from the structured information; in step S403 it determines, from the text information, the tag words corresponding to the topic to which the topic word belongs; and in step S404 it builds an index for the topic word and the tag words. Here, those skilled in the art should understand that "continuously" means that the steps of the index building device 1 determine structured information, extract topic words, determine tag words and build the index, each according to a set or real-time-adjusted operating mode, until the index building device 1 stops determining structured information for an extended period.
Here, the index building device 1 determines structured information from text information; extracts topic words from the structured information; determines, from the text information, the tag words corresponding to the topic to which the topic word belongs; and builds an index for the topic word and the tag words. Based on encyclopedia resource knowledge, or other resource knowledge mined from the web, the index building device extracts topics and titles to form an effective description of the content of the resource knowledge. This presents such high-quality resource knowledge better, makes subsequent semantic search over it more efficient, meets the needs of users whose complex descriptive queries cannot be expressed accurately in keywords, and improves the user experience.
FIG. 5 shows a flowchart of a method for building an index based on text information according to a further aspect of the present invention.
In step S501, the matching device 2 obtains the query input information entered by the user. Specifically, the user enters query input information by interacting with a user device, and in step S501 the matching device 2 obtains it by calling an application programming interface (API) provided by the user device, by using a dynamic page technology such as JSP, ASP or PHP, or through another agreed communication method.
Here, the query input information includes, but is not limited to, query input information submitted by the user through different input modes such as text input, voice input and image input.
Those skilled in the art should understand that the above ways of obtaining query input information are merely examples; other existing or future ways of obtaining query input information, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
In step S502, the matching device 2 performs topic and tag analysis on the query input information to obtain the topic words and tag words corresponding to it. Specifically, in step S502 the matching device 2 performs topic and tag analysis on the query input information obtained in step S501: for example, it inputs the query input information into the topic classifier obtained by the training described above to obtain the corresponding topic word, and it performs tag analysis on the query input information to obtain the corresponding tag words. The way the matching device 2 performs tag analysis in step S502 is the same as or similar to the way the index building device 1 determines the tag words of text information in step S403, so it is not repeated here and is incorporated herein by reference.
In step S503, the matching device 2 performs a matching query, according to the topic words and tag words, in the index built by the index building device 1 described above, to obtain candidate text information matching the query input information. Specifically, in step S503 the matching device 2 performs the matching query in that index according to the query input information obtained in step S501, for example by full matching or partial matching, and obtains the text information that hits the topic word corresponding to the query input information, or the text information that hits the tag word corresponding to the query input information, as the candidate text information matching the query input information.
For example, suppose the user enters the query input information "心慌气短" (palpitations and shortness of breath). In step S501 the matching device 2 obtains this query input information; in step S502 it performs tag analysis on it and obtains the tag word "心慌气短", for which the index building device 1 described above has built the following index:
心慌气短 - ID1 (WC1(x)), ID2 (WC2(x)), ID4 (WC4(x))
Here, ID1, ID2 and ID4 are the ID numbers of the text information containing the tag word "心慌气短", and WC1(x), WC2(x) and WC4(x) are the importance of that tag word in each of these pieces of text information.
Then, in step S503, the matching device 2 performs a matching query in the index built by the index building device 1 according to the tag word "心慌气短" corresponding to the user's query input information; from the above index it obtains the candidate text information corresponding to the query input information "心慌气短", namely text information ID1, ID2 and ID4.
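The matching query of step S503 can be sketched as a lookup over posting lists. Treating full matching as the intersection of the posting lists and partial matching as their union is an assumption made for this example; the index contents, the function name and the `require_all` parameter are hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical inverted index: index word -> IDs of the text information containing it.
index: Dict[str, List[str]] = {
    "疾病": ["ID1", "ID2", "ID3", "ID4"],
    "心慌气短": ["ID1", "ID2", "ID4"],
    "呕吐": ["ID3", "ID4"],
}


def lookup(index: Dict[str, List[str]], words: Iterable[str],
           require_all: bool = False) -> List[str]:
    """Return candidate IDs hit by any query word, or by all of them."""
    postings = [set(index.get(word, ())) for word in words]
    if not postings:
        return []
    combined = set.intersection(*postings) if require_all else set.union(*postings)
    return sorted(combined)


candidates = lookup(index, ["心慌气短"])
# candidates == ["ID1", "ID2", "ID4"]
```

For a two-word query such as ["心慌气短", "呕吐"], `require_all=True` returns only ID4, while the default returns all four IDs.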
本领域技术人员应能理解上述匹配查询的方式仅为举例,其他 现有的或今后可能出现的匹配查询的方式如可适用于本发明,也应 包含在本发明保护范围以内,并在此以引用方式包含于此。Those skilled in the art should understand that the above matching query methods are only examples, and other existing or possible future matching query methods, if applicable to the present invention, should also be included within the protection scope of the present invention, and are hereby referred to as References are included here.
在步骤S504中,匹配设备2根据所述候选文本信息与所述查询 输入信息的语义匹配度,确定与所述查询输入信息相匹配的目标文本 信息。In step S504, the matching device 2 determines the target text information matching the query input information according to the semantic matching degree between the candidate text information and the query input information.
具体地,候选文本信息与查询输入信息之间具有一定的语义匹配 度,该语义匹配度可以通过计算获得,或进一步通过计算该候选文本 信息对应的索引词集与该查询输入信息所对应的匹配词集间的匹配 度获得。在步骤S504中,匹配设备2根据该候选文本信息与用户的 查询输入信息的语义匹配度,确定与该查询输入信息相匹配的目标文 本信息,如将语义匹配度最高的候选文本信息作为与该查询输入信息 相匹配的目标文本信息,或者,将语义匹配度大于预定匹配度阈值的 候选文本信息作为与该查询输入信息相匹配的目标文本信息。Specifically, there is a certain degree of semantic matching between the candidate text information and the query input information, and the semantic matching degree can be obtained by calculation, or further by calculating the matching between the index word set corresponding to the candidate text information and the query input information The matching degree between word sets is obtained. In step S504, the matching device 2 determines the target text information matching the query input information according to the semantic matching degree between the candidate text information and the query input information of the user, for example, the candidate text information with the highest semantic matching degree is used as the candidate text information with the highest semantic matching degree. The target text information that matches the query input information, or the candidate text information whose semantic matching degree is greater than a predetermined matching degree threshold is used as the target text information matching the query input information.
Here, the predetermined matching degree threshold is the semantic matching degree above which candidate text information is judged to match the query input information. Its value may be preset and fixed, or adjusted according to actual conditions.
Preferably, step S504 further includes sub-step S504a (not shown) and sub-step S504b (not shown). In sub-step S504a, the matching device 2 calculates the semantic matching degree between the candidate text information and the query input information; in sub-step S504b, the matching device 2 determines, according to the semantic matching degree and a predetermined matching degree threshold, the target text information that matches the query input information.
For example, in sub-step S504a the matching device 2 calculates the semantic matching degree between the candidate text information and the user's query input information using an existing matching degree calculation method. When the semantic matching degree is greater than the predetermined matching degree threshold, in sub-step S504b the matching device 2 takes the candidate text information as the target text information matching the query input information.
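The two selection rules described for step S504 (take the best candidate, or take every candidate above a threshold) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_targets`, the candidate scores, and the threshold value are all assumptions made for the example, and the matching degrees are taken as already computed.

```python
from typing import Dict, List, Optional


def select_targets(scores: Dict[str, float],
                   threshold: Optional[float] = None) -> List[str]:
    """Pick target texts from candidates by semantic matching degree.

    scores maps a candidate text ID to its already computed matching degree.
    With no threshold, only the single best candidate is returned; otherwise
    every candidate whose degree is strictly greater than the threshold is
    returned, best first.
    """
    if not scores:
        return []
    if threshold is None:
        return [max(scores, key=scores.get)]
    kept = [doc for doc, s in scores.items() if s > threshold]
    return sorted(kept, key=lambda doc: scores[doc], reverse=True)


# Hypothetical matching degrees for three candidates.
example_scores = {"ID1": 0.42, "ID2": 0.91, "ID4": 0.35}
best = select_targets(example_scores)                  # highest degree only
above = select_targets(example_scores, threshold=0.4)  # all above threshold
```

Whether the comparison is strict is a design choice; the text says "greater than", so the sketch uses `>`.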
Preferably, in step S504 the matching device 2 may also determine the target text information from the index word set corresponding to the candidate text information and the matching word set corresponding to the query input information. Specifically, each piece of candidate text information has a corresponding index word set. Suppose, as in the example above, that the topic of candidate text information ID1 is "coronary heart disease" and that its index words include "disease" and "palpitation and shortness of breath"; the set formed by these index words is the index word set of ID1. The user's query input information likewise has a corresponding matching word set, obtained for example by segmenting the query input information into words and collecting the resulting matching words. If the user enters the query input information "palpitation, shortness of breath and vomiting", the matching device 2 segments it into the matching words "palpitation and shortness of breath" and "vomiting", and the set formed by these two matching words is the matching word set of the query input information. In step S504, the matching device 2 then determines the target text information from the index word sets and the matching word set: for example, it takes the text information whose index word set hits the most matching words in the matching word set, or it takes the text information of every index word set whose number of hit matching words is greater than a predetermined number threshold.
For example, for the candidate text information ID1, ID2 and ID4 in the example above, the index word set of ID1 contains the index words "disease" and "palpitation and shortness of breath"; the index word set of ID2 contains "palpitation and shortness of breath", "vomiting" and "disease"; and the index word set of ID4 contains "palpitation and shortness of breath". For the query input information "palpitation, shortness of breath and vomiting", the matching words are "palpitation and shortness of breath" and "vomiting". The index word set of ID2 hits the most matching words in the matching word set (two), so candidate text information ID2 is taken as the target text information that best matches the query input information. Alternatively, if the predetermined number threshold is 0, the index word sets of ID1, ID2 and ID4 all hit more matching words than the threshold, so ID1, ID2 and ID4 are all taken as target text information matching the query input information. When the matching device 2 provides these results to the user, it may sort them by the importance of the corresponding index words in each candidate text information.
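The ID1/ID2/ID4 example can be reproduced with plain set intersection. The sketch below uses English glosses of the index words and matching words from the text; the helper name `hit_counts` is an assumption made for illustration.

```python
from typing import Dict, Set

# Index word sets of the three candidates, as given in the example.
index_word_sets: Dict[str, Set[str]] = {
    "ID1": {"disease", "palpitation and shortness of breath"},
    "ID2": {"palpitation and shortness of breath", "vomiting", "disease"},
    "ID4": {"palpitation and shortness of breath"},
}

# Matching word set obtained by segmenting the user's query.
matching_words: Set[str] = {"palpitation and shortness of breath", "vomiting"}


def hit_counts(index_sets: Dict[str, Set[str]],
               query_words: Set[str]) -> Dict[str, int]:
    """Count how many matching words each index word set hits."""
    return {doc: len(words & query_words) for doc, words in index_sets.items()}


hits = hit_counts(index_word_sets, matching_words)

# Rule 1: the candidate whose index word set hits the most matching words.
best_doc = max(hits, key=hits.get)

# Rule 2: every candidate whose hit count exceeds the number threshold (0 here).
count_threshold = 0
above_threshold = sorted(doc for doc, n in hits.items() if n > count_threshold)
```

With these inputs ID2 hits two matching words while ID1 and ID4 hit one each, which matches the outcome described in the text.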
Those skilled in the art will understand that the methods of determining target text information described above are only examples; other existing or future methods of determining target text information, where applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
Preferably, the steps of the matching device 2 operate continuously. Specifically, in step S501 the matching device 2 acquires the query input information entered by the user; in step S502 it performs topic and tag analysis on the query input information to obtain the corresponding topic words and tag words; in step S503 it performs a matching query, according to the topic words and tag words, in the index built by the aforementioned index building device 1 to obtain candidate text information matching the query input information; in step S504 it determines the target text information matching the query input information according to the semantic matching degree between the candidate text information and the query input information. Here, those skilled in the art will understand that "continuously" means that the steps of the matching device 2 carry out, each according to a set or real-time-adjusted working mode, the acquisition of query input information, the topic and tag analysis, the matching query for candidate text information, and the determination of target text information, until the matching device 2 stops acquiring user query input information for an extended period.
Here, the steps of the index building device 1 and the matching device 2 cooperate to match, from the query input information entered by the user, the corresponding target text information. Topics and titles are extracted from encyclopedia-type resource knowledge, or from other resource knowledge mined from the network, to form an effective description of the content of that resource knowledge. This presents such high-quality resource knowledge better, makes semantic search over it more efficient, meets search needs that users can only describe in complex terms and cannot express accurately with keywords, and improves the user experience.
Preferably, the topic words and tag words may also be treated as two different domains, the topic domain and the tag domain. In step S503, the matching device 2 performs matching queries, according to the topic words and tag words, in the parts of the aforementioned index that correspond to the topic domain and the tag domain respectively, to obtain candidate text information matching the query input information.
Specifically, in step S503 the matching device 2 takes the topic words and tag words obtained in step S502 by analyzing the user's query input information and, using domain-by-domain matching, performs matching queries in the indexes corresponding to the topic domain and the tag domain respectively to obtain candidate text information.
Here, the topic domain and the tag domain can be obtained by analyzing the query input information; for example, the aforementioned topic classifier is applied to the user's query input information to obtain its topic category.
Here, the index corresponding to the topic domain and the tag domain is the index built by the aforementioned index building device 1. Tag words are extracted from the user's query input information according to the previously established tags: any word that appears in the query input information and also belongs to the tag set is extracted. The tag words and the topic category are then used to retrieve candidate documents from the posting lists of the corresponding unified topic-and-tag index. Documents containing the topic category or a tag are taken as the candidate text information for the query input information and take part in subsequent calculation.
Preferably, in step S503 the matching device 2 may also take into account the weights of the topic domain and the tag domain: it performs the matching query in the corresponding indexes, combines the weights of the two domains, and finally obtains the candidate text information.
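Domain-by-domain lookup with per-domain weights can be sketched as below. Everything concrete here is an assumption made for illustration: the layout of the unified index (one posting map per domain), the topic assigned to ID2, the weight values, and the rule of summing one domain weight per hit. The text only states that the two domains are queried separately and their weights combined.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set

# Hypothetical unified index: for each domain, a map from word to the IDs of
# the documents whose topic or tags contain that word (an inverted index).
index: Dict[str, Dict[str, Set[str]]] = {
    "topic": {
        "coronary heart disease": {"ID1"},
        "gastroenteritis": {"ID2"},  # assumed topic, not from the source
    },
    "tag": {
        "palpitation and shortness of breath": {"ID1", "ID2", "ID4"},
        "vomiting": {"ID2"},
    },
}


def retrieve_candidates(index: Dict[str, Dict[str, Set[str]]],
                        topic_words: Iterable[str],
                        tag_words: Iterable[str],
                        weights: Optional[Dict[str, float]] = None
                        ) -> Dict[str, float]:
    """Look up each domain separately and add that domain's weight per hit."""
    weights = weights or {"topic": 1.0, "tag": 1.0}
    scores: Dict[str, float] = defaultdict(float)
    for domain, words in (("topic", topic_words), ("tag", tag_words)):
        postings = index.get(domain, {})
        for word in words:
            for doc in postings.get(word, ()):
                scores[doc] += weights.get(domain, 0.0)
    return dict(scores)


candidates = retrieve_candidates(
    index,
    topic_words=["coronary heart disease"],
    tag_words=["palpitation and shortness of breath", "vomiting"],
    weights={"topic": 2.0, "tag": 1.0},
)
```

Any document that appears in at least one posting list becomes a candidate; the accumulated score is only a rough ordering before the semantic matching degree of step S504 is computed.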
Preferably, in step S504 the matching device 2 determines, according to the matching words in the matching word set, a target index word set among the index word sets corresponding to the candidate text information, where the target index word set is the one that hits the most matching words in the matching word set. If the similarity between the target index word set and the matching word set is greater than a predetermined threshold, the text information corresponding to the target index word set is taken as the target text information matching the query input information.
Specifically, in step S504 the matching device 2 counts how many matching words of the matching word set each candidate's index word set hits, and takes the index word set with the most hits as the target index word set. It then calculates the similarity between the target index word set and the matching word set, for example by calculating the similarity between each hit index word and its corresponding matching word and combining these values by simple addition or a weighted average. When the similarity is greater than the predetermined threshold, the matching device 2 takes the text information corresponding to the target index word set as the target text information matching the query input information.
Here, the predetermined threshold is the similarity threshold used to decide, from the similarity between the target index word set and the matching word set, whether the text information corresponding to the target index word set is taken as the target text information. Its value may be fixed or adjusted according to actual conditions.
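A sketch of this two-stage decision follows: pick the index word set with the most hits, then accept it only if its similarity to the matching word set exceeds the threshold. The per-word similarity used here (1.0 for identical words, 0.0 otherwise) is a stand-in assumption, as are the function names and the threshold value; the text leaves the word-level similarity measure open and mentions simple addition or a weighted average for combining the values.

```python
from typing import Dict, Optional, Set


def word_similarity(a: str, b: str) -> float:
    """Stand-in word similarity (assumption): 1.0 if identical, else 0.0."""
    return 1.0 if a == b else 0.0


def set_similarity(index_words: Set[str], matching_words: Set[str],
                   weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average, over the matching words, of each matching word's
    best similarity to any index word. Unlisted words get weight 1.0."""
    weights = weights or {}
    total, norm = 0.0, 0.0
    for m in matching_words:
        w = weights.get(m, 1.0)
        best = max((word_similarity(m, i) for i in index_words), default=0.0)
        total += w * best
        norm += w
    return total / norm if norm else 0.0


def pick_target(index_sets: Dict[str, Set[str]], matching_words: Set[str],
                threshold: float) -> Optional[str]:
    """Return the ID of the target index word set, or None if it fails the
    similarity threshold."""
    if not index_sets:
        return None
    target = max(index_sets, key=lambda d: len(index_sets[d] & matching_words))
    if set_similarity(index_sets[target], matching_words) > threshold:
        return target
    return None


index_sets = {
    "ID1": {"disease", "palpitation and shortness of breath"},
    "ID2": {"palpitation and shortness of breath", "vomiting", "disease"},
    "ID4": {"palpitation and shortness of breath"},
}
query = {"palpitation and shortness of breath", "vomiting"}
target = pick_target(index_sets, query, threshold=0.8)
```

Averaging rather than summing keeps the similarity in the range 0 to 1, which makes a fixed threshold easier to interpret across queries of different lengths.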
Preferably, the method further includes step S505 (not shown). In step S505, the matching device 2 performs word segmentation on the query input information to obtain segmented words, and merges the segmented words with the topic words and tag words obtained by the matching device 2 in step S502 to obtain the matching word set corresponding to the query input information, where the words in the matching word set serve as matching words. Then, in sub-step S504a, the matching device 2 calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.
Specifically, in step S505 the matching device 2 performs word segmentation on the query input information acquired in step S501 to obtain segmented words. Preferably, it may also filter the segmented words, for example by removing stop words, to obtain the final segmented words. The matching device 2 then merges these segmented words with the topic words and tag words obtained in step S502, removes redundancy, and so on, to obtain the matching word set corresponding to the query input information; the words in this set are the matching words corresponding to the query input information.
Then, in sub-step S504a, the matching device 2 calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.
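The construction of the matching word set in step S505 (segment, filter stop words, merge with topic and tag words, remove redundancy) can be sketched as follows. The whitespace segmenter and the stop-word list are stand-in assumptions: a real system handling Chinese queries would use a proper word segmenter, which the text does not specify.

```python
from typing import Iterable, List

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"and", "the", "of", "a"}


def segment(query: str) -> List[str]:
    """Stand-in segmenter (assumption): lowercase and split on whitespace."""
    return query.lower().split()


def build_matching_words(query: str,
                         topic_words: Iterable[str],
                         tag_words: Iterable[str]) -> List[str]:
    """Segment the query, drop stop words, merge with the topic and tag
    words from step S502, and remove duplicates while keeping order."""
    tokens = [t for t in segment(query) if t not in STOP_WORDS]
    merged: List[str] = []
    seen = set()
    for word in [*tokens, *topic_words, *tag_words]:
        if word not in seen:
            seen.add(word)
            merged.append(word)
    return merged


words = build_matching_words(
    "palpitation and vomiting",
    topic_words=["heart disease"],
    tag_words=["vomiting", "palpitation"],
)
```

Here the tag words duplicate two of the segmented words, so only the topic word is added by the merge.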
More preferably, the method further includes step S506 (not shown). In step S506, the matching device 2 performs subsequent processing on the matching words to update the matching word set, where the subsequent processing includes at least one of the following:
- determining the mutually synonymous matching words among the matching words, and merging them into a subset of the matching word set;
- performing synonym expansion on the matching words, and making the synonyms obtained by the expansion, together with the matching words, a subset of the matching word set.
Specifically, in step S506 the matching device 2 performs subsequent processing on the matching words in the matching word set determined in step S505 to update the matching word set. For example, the matching device 2 determines which matching words are synonyms of each other and merges them into a subset of the matching word set. The matching words may include mutual synonyms, such as "呕吐" (vomiting) and "吐" (throwing up); in step S506 the matching device 2 merges such words into a subset of the matching word set.
For example, suppose the query input information entered by the user is Q. After the matching device 2 segments the query input information in step S505 and applies filtering such as stop-word removal, the matching word set in the tag domain is written Q={a,b,c,d,e}, where a, b, c, d and e are the matching words in the set. If the matching words a and b are synonyms of each other, then in step S506 the matching device 2 merges a and b into a subset of the matching word set, and the updated matching word set is written Q={{a,b},c,d,e}. Subsequent steps, such as step S503, then carry out the matching query.
As another example, in step S506 the matching device 2 also performs synonym expansion on the matching words and makes the synonyms obtained by the expansion, together with the matching words, a subset of the matching word set. Specifically, the matching device 2 may expand the matching words in the matching word set of the query input information with their synonyms, for example expanding "心悸气短" (palpitations and shortness of breath) with its synonym "心慌气短" (flustered heartbeat and shortness of breath), and then makes the resulting synonyms and the original matching word a subset of the matching word set.
Continuing the example above, for the matching word set after synonym merging, Q={{a,b},c,d,e}, in step S506 the matching device 2 may also perform synonym expansion on the set: it obtains the synonyms of the matching words a, b, c, d and e and makes each group of synonyms, together with its matching word, a subset of the matching word set. After several rounds of synonym expansion, the matching word set Q takes the following form: [expression missing from the source text]
Then, in step S503, the matching device 2 performs a matching query, according to the matching word set, in the index built by the index building device 1; for example, by looking up the inverted index it obtains the candidate text information containing [expression missing from the source text].
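The synonym merging and synonym expansion of step S506 can be sketched as two small functions that turn a flat list of matching words into a list of subsets, mirroring the change from Q={a,b,c,d,e} to Q={{a,b},c,d,e}. The synonym table, the English example words, and the function names are assumptions made for illustration; the text does not say where synonyms come from.

```python
from typing import Dict, List, Set

# Hypothetical synonym table: each word maps to its known synonyms.
SYNONYMS: Dict[str, Set[str]] = {
    "vomiting": {"vomit", "throwing up"},
    "vomit": {"vomiting", "throwing up"},
    "palpitation": {"heart flutter"},
}


def merge_synonyms(words: List[str]) -> List[Set[str]]:
    """Group matching words that are synonyms of each other into subsets."""
    groups: List[Set[str]] = []
    for word in words:
        related = SYNONYMS.get(word, set())
        for group in groups:
            listed_by_member = any(word in SYNONYMS.get(g, ()) for g in group)
            if word in group or group & related or listed_by_member:
                group.add(word)
                break
        else:
            # No existing subset is synonymous with this word.
            groups.append({word})
    return groups


def expand_synonyms(groups: List[Set[str]]) -> List[Set[str]]:
    """Add the known synonyms of every member to its subset."""
    expanded = []
    for group in groups:
        out = set(group)
        for word in group:
            out |= SYNONYMS.get(word, set())
        expanded.append(out)
    return expanded


merged = merge_synonyms(["vomiting", "vomit", "palpitation", "fever"])
expanded = expand_synonyms(merged)
```

Keeping each subset together means a candidate that contains any member of a subset can later be counted as hitting that matching word.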
Suppose the index word set that hits the most matching words in the matching word set is denoted C; then C is: [expression missing from the source text]
where C denotes the word set obtained by semantic mapping at the positions corresponding to the $w_{1i}$ with the most synonym hits.
Then, in sub-step S504a, the matching device 2 calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.
The semantic matching degree between Q and C can be calculated by the following formula: [formula missing from the source text]
where the weight of each word is given by $(\log(\mathrm{TF}) + 1) \cdot \log(N/\mathrm{DF})$, and $\mathrm{Match}(T_Q, T_C)$ indicates whether the index word set and the matching word set match on topic.
Here, the value of $\mathrm{Match}(T_Q, T_C)$ can be defined as needed; for example, it is 1 if the index word set and the matching word set match on topic, and 0.5 otherwise.
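The word weight $(\log(\mathrm{TF}) + 1) \cdot \log(N/\mathrm{DF})$ and the topic factor (1 on a topic match, 0.5 otherwise) are stated in the text and can be written directly. The full matching formula is not reproduced in this text, so the way the sketch below combines them (weight of the hit words divided by the total weight of the matching words, scaled by the topic factor) is purely an assumption for illustration, as are the natural logarithm, the function names, and the statistics used.

```python
import math
from typing import Dict, List, Set, Tuple


def term_weight(tf: int, df: int, n_docs: int) -> float:
    """Word weight (log(TF) + 1) * log(N / DF); natural log is assumed."""
    if tf <= 0 or df <= 0 or n_docs <= 0:
        return 0.0
    return (math.log(tf) + 1.0) * math.log(n_docs / df)


def topic_match(topic_q: str, topic_c: str) -> float:
    """Topic factor: 1 when the topics match, 0.5 otherwise."""
    return 1.0 if topic_q == topic_c else 0.5


def semantic_match(query_words: List[str], hit_words: Set[str],
                   stats: Dict[str, Tuple[int, int]], n_docs: int,
                   topic_q: str, topic_c: str) -> float:
    """ASSUMED aggregation, not the patent's formula: share of the query's
    total word weight carried by the hit words, scaled by the topic factor.
    stats maps each query word to its (TF, DF) pair."""
    weights = {w: term_weight(*stats[w], n_docs) for w in query_words}
    total = sum(weights.values())
    if total == 0:
        return 0.0
    hit = sum(weights[w] for w in query_words if w in hit_words)
    return topic_match(topic_q, topic_c) * hit / total


# Hypothetical statistics: word -> (TF, DF), in a collection of 1000 documents.
stats = {"a": (1, 10), "b": (2, 100)}
same_topic = semantic_match(["a", "b"], {"a", "b"}, stats, 1000, "t1", "t1")
other_topic = semantic_match(["a", "b"], {"a", "b"}, stats, 1000, "t1", "t2")
```

With every matching word hit, the assumed score reduces to the topic factor alone, which makes the effect of a topic mismatch easy to see.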
Then, if the calculated semantic matching degree is greater than a predetermined threshold, in sub-step S504b the matching device 2 takes the text information corresponding to the index word set as the target text information matching the query input information.
It should be noted that the present invention may be implemented in software and/or a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the present invention may be executed by a processor to implement the steps or functions described above. Likewise, the software program of the present invention (including related data structures) may be stored in a computer-readable recording medium, for example RAM, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the present invention may be implemented in hardware, for example as a circuit that cooperates with a processor to perform the steps or functions.
In addition, part of the present invention may be applied as a computer program product, for example computer program instructions which, when executed by a computer, invoke or provide the method and/or technical solution according to the present invention through the operation of that computer. The program instructions that invoke the method of the present invention may be stored in a fixed or removable recording medium, and/or transmitted as a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, one embodiment of the present invention includes an apparatus comprising a memory for storing computer program instructions and a processor for executing them, where executing the computer program instructions on the processor triggers the apparatus to run the methods and/or technical solutions of the foregoing embodiments of the present invention.
It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments above, and that it can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-limiting. The scope of the present invention is defined by the appended claims rather than by the description above, and all changes that fall within the meaning and range of equivalents of the claims are intended to be included in the present invention. No reference sign in a claim should be construed as limiting that claim. Furthermore, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used as names and do not indicate any particular order.
Claims (17)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410079818.7A CN103886034B (en) | 2014-03-05 | 2014-03-05 | A kind of method and apparatus of inquiry input information that establishing index and matching user |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410079818.7A CN103886034B (en) | 2014-03-05 | 2014-03-05 | A kind of method and apparatus of inquiry input information that establishing index and matching user |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN103886034A (en) | 2014-06-25 |
| CN103886034B (en) | 2019-03-19 |
Family
ID=50954926
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410079818.7A Active CN103886034B (en) | 2014-03-05 | 2014-03-05 | A kind of method and apparatus of inquiry input information that establishing index and matching user |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN103886034B (en) |
Families Citing this family (26)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2017071370A1 (en) * | 2015-10-30 | 2017-05-04 | 华为技术有限公司 | Label processing method and device |
| CN106815262B (en) * | 2015-12-01 | 2020-07-03 | 北京国双科技有限公司 | Method and device for searching referee document |
| CN105786966A (en) * | 2016-01-26 | 2016-07-20 | 浪潮软件集团有限公司 | Text structuring method and device |
| CN107291783B (en) * | 2016-04-12 | 2021-04-30 | 芋头科技(杭州)有限公司 | Semantic matching method and intelligent equipment |
| CN109074363A (en) * | 2016-05-09 | 2018-12-21 | 华为技术有限公司 | Data query method, data query system determine method and apparatus |
| CN106021225B (en) * | 2016-05-12 | 2018-12-21 | 大连理工大学 | A kind of Chinese Maximal noun phrase recognition methods based on the simple noun phrase of Chinese |
| CN107391509B (en) * | 2016-05-16 | 2023-06-02 | 中兴通讯股份有限公司 | Label recommending method and device |
| CN107918778B (en) * | 2016-10-11 | 2022-03-15 | 阿里巴巴集团控股有限公司 | Information matching method and related device |
| CN108257676B (en) * | 2016-12-28 | 2020-03-03 | 北京搜狗科技发展有限公司 | Medical case information processing method, device and equipment |
| CN108536708A (en) * | 2017-03-03 | 2018-09-14 | 腾讯科技(深圳)有限公司 | A kind of automatic question answering processing method and automatically request-answering system |
| CN110678858B (en) * | 2017-06-01 | 2021-07-09 | 互动解决方案公司 | Retrieval data information storage device |
| US10318593B2 (en) * | 2017-06-21 | 2019-06-11 | Accenture Global Solutions Limited | Extracting searchable information from a digitized document |
| CN107436922B (en) | 2017-07-05 | 2021-06-08 | 北京百度网讯科技有限公司 | Text label generation method and device |
| CN107844596A (en) * | 2017-11-22 | 2018-03-27 | 福建中金在线信息科技有限公司 | A kind of article search method and system |
| CN108255985A (en) * | 2017-12-28 | 2018-07-06 | 东软集团股份有限公司 | Data directory construction method, search method and device, medium and electronic equipment |
| CN108416026B (en) * | 2018-03-09 | 2023-04-18 | 腾讯科技(深圳)有限公司 | Index generation method, content search method, device and equipment |
| CN110209804B (en) * | 2018-04-20 | 2023-11-21 | 腾讯科技(深圳)有限公司 | Target corpus determining method and device, storage medium and electronic device |
| CN110580276B (en) * | 2018-06-08 | 2022-06-28 | 百度在线网络技术(北京)有限公司 | Method and apparatus for processing information |
| CN109543001A (en) * | 2018-10-18 | 2019-03-29 | 华南理工大学 | A kind of scientific and technological entry abstracting method characterizing Scientific Articles research contents |
| CN109213937B (en) * | 2018-11-29 | 2020-07-24 | 深圳爱问科技股份有限公司 | Intelligent search method and device |
| CN111008265B (en) * | 2019-12-03 | 2023-03-28 | 腾讯云计算(北京)有限责任公司 | Enterprise information searching method and device |
| CN113268572A (en) * | 2020-02-14 | 2021-08-17 | 华为技术有限公司 | Question answering method and device |
| CN112765321A (en) * | 2021-01-22 | 2021-05-07 | 中信银行股份有限公司 | Interface query method and device, equipment and computer readable storage medium |
| CN113377922B (en) * | 2021-06-25 | 2024-04-02 | 北京百度网讯科技有限公司 | Methods, devices, electronic devices and media for matching information |
| CN115687579B (en) * | 2022-09-22 | 2023-08-01 | 广州视嵘信息技术有限公司 | Document tag generation and matching method, device and computer equipment |
| CN116578756A (en) * | 2023-04-07 | 2023-08-11 | 上海优司服信息科技有限公司 | A company portrait recommendation method, device and storage medium |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5694523A (en) * | 1995-05-31 | 1997-12-02 | Oracle Corporation | Content processing system for discourse |
| CN103177036A (en) * | 2011-12-23 | 2013-06-26 | 盛乐信息技术(上海)有限公司 | Method and system for label automatic extraction |
| CN103294780A (en) * | 2013-05-13 | 2013-09-11 | 百度在线网络技术(北京)有限公司 | Directory mapping relationship mining device and directory mapping relationship mining device |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7472115B2 (en) * | 2004-04-29 | 2008-12-30 | International Business Machines Corporation | Contextual flyout for search results |
| US20120166414A1 (en) * | 2008-08-11 | 2012-06-28 | Ultra Unilimited Corporation (dba Publish) | Systems and methods for relevance scoring |
Also Published As
| Publication number | Publication date |
|---|---|
| CN103886034A (en) | 2014-06-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN103886034B (en) | | A kind of method and apparatus of inquiry input information that establishing index and matching user |
| Eke et al. | | Sarcasm identification in textual data: systematic review, research challenges and open directions |
| CN106776711B (en) | | Chinese medical knowledge map construction method based on deep learning |
| US9558264B2 (en) | | Identifying and displaying relationships between candidate answers |
| CN104516947B (en) | | A kind of Chinese microblog emotional analysis method for merging dominant and recessive character |
| CN105528437B (en) | | A kind of question answering system construction method extracted based on structured text knowledge |
| CN112989208B (en) | | Information recommendation method and device, electronic equipment and storage medium |
| CN106156365A (en) | | A kind of generation method and device of knowledge mapping |
| CN102955772B (en) | | A kind of similarity calculating method based on semanteme and device |
| CN106126619A (en) | | A kind of video retrieval method based on video content and system |
| CN103425640A (en) | | Multimedia questioning-answering system and method |
| CN106886567B (en) | | Microblog emergency detection method and device based on semantic extension |
| CN106682411A (en) | | Method for converting physical examination diagnostic data into disease label |
| Al-Ghadhban et al. | | Arabic sarcasm detection in Twitter |
| CN106257455B (en) | | A kind of Bootstrapping method extracting viewpoint evaluation object based on dependence template |
| JP2013529805A5 (en) | | Search method, search system and computer program |
| WO2012178152A1 (en) | | Methods and systems for retrieval of experts based on user customizable search and ranking parameters |
| WO2015149533A1 (en) | | Method and device for word segmentation processing on basis of webpage content classification |
| CN104199833A (en) | | Network search term clustering method and device |
| Tiwari et al. | | Ensemble approach for twitter sentiment analysis |
| Tembhurnikar et al. | | Topic detection using BNgram method and sentiment analysis on twitter dataset |
| CN105447144B (en) | | Microblogging forwarding visual analysis method and system based on big data analysis technology |
| CN102929962B (en) | | A kind of evaluating method of search engine |
| JP2014219872A (en) | | Utterance selecting device, method and program, and dialog device and method |
| CN103942274A (en) | | Labeling system and method for biological medical treatment image on basis of LDA |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |