TW201131394A - Information indexing method and system - Google Patents

Info

Publication number
TW201131394A
Authority
TW
Taiwan
Prior art keywords
semantic
mode
query
semantic mode
sorting
Prior art date
Application number
TW99106781A
Other languages
Chinese (zh)
Other versions
TWI474197B (en)
Inventor
Cheng Peng
Jian Sun
Lei Hou
Qin Zhang
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to TW99106781A
Publication of TW201131394A
Application granted
Publication of TWI474197B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present invention discloses an information retrieval method and system. The method includes: determining the semantic pattern of each query phrase that appears in the historical query records and selecting the semantic patterns whose frequency of occurrence exceeds a preset threshold; establishing, from statistical analysis of user behavior, a correspondence between each selected semantic pattern and a filtering method and sorting method; receiving a query phrase and performing semantic analysis to determine its semantic tag; determining the semantic pattern of the query and, from it, the corresponding filtering method and sorting method; and processing the search results with that filtering method and sorting method. When a query phrase entered by the user is received, the embodiment analyzes the user's intent from the linguistic characteristics of search-box queries and from historical user behavior, and uses the statistically derived attributes of the semantic pattern that matches the query phrase to guide the search, thereby pinpointing the user's need and improving retrieval efficiency.

Description

VI. Description of the Invention

[Technical Field]

The present invention relates to the field of network technology and, more specifically, to an information retrieval method and system.

[Prior Art]

Since the Internet emerged, the volume of information has grown exponentially. How can users find what they need in this ocean of information? A search engine acts like a magic hand that draws a clear retrieval path out of the clutter. A search engine is a system that collects information according to a given strategy using specific computer programs, organizes and processes that information, and then provides a retrieval service to users. It analyzes the user's query request (a keyword or keyword group), returns the corresponding results from the system to the user, and so makes it easier for users to obtain information.

After years of development and exploration, search engine technology has advanced considerably, chiefly in the relevance of search results and in the volume of indexed data. Search relevance refers to how well the search results match what the user wants. Most existing work on queries still stays at the level of query classification. The most common approach classifies a query by domain category, deciding whether it belongs to finance, sports, automobiles, and so on; for example, "Michael Jordan" is classified as sports and "Obama" as news.

However, the main function of domain classification is to provide navigation among vertical search engines. This kind of classification is too simple to reflect user intent correctly. For example, for the query "mobile phone battery", the prior art can determine that it belongs to the IT domain but cannot tell whether the user intends to find a "mobile phone" or a "battery", so search relevance is low.

[Summary of the Invention]

In view of this, the present invention provides an information retrieval method and system to solve the problem of low search relevance in the prior art.
The information retrieval method provided by the present invention comprises the following.

A preprocessing step, comprising:

  • determining the semantic tag of each query phrase that appears in the historical query records, compiling statistics on semantic patterns according to the semantic tags, and selecting from the statistics the semantic patterns whose frequency of occurrence exceeds a predetermined threshold;
  • compiling statistics on the user behavior corresponding to each semantic pattern in the historical records, setting user query intent attributes that reflect that behavior, and setting the correspondence between each semantic pattern and the filtering method and sorting method specified by its user query intent attributes.

A retrieval step, comprising:

  • receiving a query phrase and performing semantic analysis to determine its semantic tag;
  • determining, according to the correspondence, the filtering method and sorting method corresponding to the semantic pattern of the query phrase;
  • processing the search results using the filtering method and sorting method.

Preferably, after the high-frequency semantic patterns are determined, the method further comprises screening the semantic patterns by coverage:

  • computing the number of query phrases that match a semantic pattern within a predetermined time period, and taking the ratio of that number to the total number of queries as the coverage of the semantic pattern;
  • extracting the semantic patterns whose coverage exceeds a predetermined threshold.

Preferably, after the high-frequency semantic patterns are determined, the method further comprises screening the semantic patterns by discrimination:

  • computing, within a predetermined time period, the entropy of the specific keyword groups belonging to the same semantic pattern with respect to the keyword groups of all queries, and taking it as the discrimination of the semantic pattern;
  • extracting the semantic patterns whose entropy exceeds a predetermined value.

Preferably, after the high-frequency semantic patterns are determined, the method further comprises screening the semantic patterns by both coverage and discrimination:

  • computing the number of queries that match the semantic pattern within a predetermined time period, and taking the ratio of that number to the total number of queries as the coverage of the semantic pattern;
  • computing, within a predetermined time period, the entropy of the specific keyword groups belonging to the same semantic pattern with respect to the keyword groups of all queries, and taking it as the discrimination of the semantic pattern;
  • extracting the semantic patterns whose coverage exceeds a predetermined threshold and whose entropy exceeds a predetermined value.

The present invention further provides an information retrieval method comprising:

  • receiving a query phrase and matching the corresponding semantic tag in a preset semantic tag library;
  • matching, according to the matched semantic tag, in the semantic pattern table to obtain the semantic pattern of the query phrase;
  • matching, according to the semantic pattern, in the correspondence table between semantic patterns and filtering and sorting methods, which is preset according to user query intent attributes, to obtain the filtering method and sorting method corresponding to the query phrase;
  • processing the search results of the query phrase using the filtering method and sorting method.
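The frequency, coverage, and entropy screening described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name `screen_patterns`, the tuple layout, and the threshold parameters are assumptions, and "entropy" is read here as the Shannon entropy of the PV distribution over the distinct queries covered by one pattern, which is one plausible reading of the text.

```python
import math
from collections import defaultdict


def screen_patterns(records, min_pv, min_coverage, min_entropy):
    """Keep patterns that pass frequency, coverage, and entropy thresholds.

    records: iterable of (query, pattern, pv) tuples (hypothetical layout).
    Returns {pattern: {"pv": ..., "coverage": ..., "entropy": ...}}.
    """
    records = list(records)
    total_pv = sum(pv for _, _, pv in records)
    if total_pv == 0:
        return {}

    # Accumulate PV per distinct query inside each pattern.
    pv_by_query = defaultdict(lambda: defaultdict(int))
    for query, pattern, pv in records:
        pv_by_query[pattern][query] += pv

    kept = {}
    for pattern, queries in pv_by_query.items():
        pattern_pv = sum(queries.values())
        if pattern_pv == 0:
            continue
        # Coverage: share of all query traffic that matches this pattern.
        coverage = pattern_pv / total_pv
        # Assumed discrimination measure: entropy of the PV distribution
        # over the queries covered by this pattern (higher = more even).
        entropy = -sum(
            (pv / pattern_pv) * math.log2(pv / pattern_pv)
            for pv in queries.values()
            if pv > 0
        )
        if pattern_pv > min_pv and coverage > min_coverage and entropy > min_entropy:
            kept[pattern] = {"pv": pattern_pv, "coverage": coverage, "entropy": entropy}
    return kept


if __name__ == "__main__":
    sample = [
        ("digital camera", "modifier+product", 13),
        ("mobile phone battery", "modifier+product", 13),
        ("repair motor", "intent word+product", 11),
    ]
    print(screen_patterns(sample, min_pv=10, min_coverage=0.2, min_entropy=0.5))
```

A pattern backed by a single query has zero entropy under this reading, so an entropy threshold removes patterns that are frequent only because one query is popular.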
The present invention also discloses an information retrieval system comprising:

  • a reference information storage unit, configured to store the correspondence between semantic patterns and filtering and sorting methods, where the semantic patterns are those, among the semantic patterns of the query phrases appearing in the historical query records, whose frequency of occurrence exceeds a predetermined threshold; the filtering and sorting methods are specified by user query intent attributes; and the user query intent attributes are set by compiling statistics on the user behavior corresponding to each semantic pattern in the historical records;
  • a receiving unit, configured to receive a query phrase;
  • a semantic pattern matching unit, configured to perform semantic analysis on the query phrase received by the receiving unit to determine its semantic tag;
  • a processing method determining unit, configured to determine, according to the information stored in the reference information storage unit, the semantic pattern of the query phrase and its corresponding filtering method and sorting method;
  • an execution unit, configured to process the search results using the filtering method and sorting method.

Preferably, the system further includes a first screening unit, configured to compute the number of query phrases that match a semantic pattern within a predetermined time period, take the ratio of that number to the total number of queries as the coverage of the semantic pattern, and extract the semantic patterns whose coverage exceeds a predetermined threshold. The semantic patterns stored by the reference information storage unit are then those whose frequency of occurrence exceeds a predetermined threshold and whose coverage exceeds a predetermined threshold.
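The core units above (storage, matching, and processing method determination) amount to a chain of table lookups. The sketch below illustrates that chain under stated assumptions: the three tables, their contents, and the rule that a non-final "product" term acts as a modifier are hypothetical, modeled only on the "mobile phone battery" example in the text, and the query is assumed to be already segmented into terms.

```python
# Hypothetical illustration data; none of these tables come from the patent.
TAG_LIBRARY = {
    "mobile phone": "product",
    "battery": "product",
    "digital": "modifier",
    "camera": "product",
}
PATTERN_TABLE = {
    ("modifier", "product"): "modifier+product",
}
CORRESPONDENCE_TABLE = {
    # pattern -> (filtering method, sorting method)
    "modifier+product": ("filter_by_modifier", "none"),
}


def tag_terms(terms):
    """Look up a semantic tag for every term; return None if any is unknown."""
    tags = []
    for index, term in enumerate(terms):
        tag = TAG_LIBRARY.get(term)
        if tag is None:
            return None
        is_head = index == len(terms) - 1
        # Assumption: the last term is the head word, so an earlier
        # "product" term only qualifies it and is treated as a modifier.
        if not is_head and tag == "product":
            tag = "modifier"
        tags.append(tag)
    return tuple(tags)


def resolve(terms):
    """Return (pattern, filtering method, sorting method), or None if no match."""
    tags = tag_terms(terms)
    if tags is None:
        return None
    pattern = PATTERN_TABLE.get(tags)
    if pattern is None or pattern not in CORRESPONDENCE_TABLE:
        return None
    filtering, sorting = CORRESPONDENCE_TABLE[pattern]
    return pattern, filtering, sorting


if __name__ == "__main__":
    print(resolve(["mobile phone", "battery"]))
    print(resolve(["digital", "camera"]))
```

Returning `None` for an unmatched query leaves the caller free to fall back to an ordinary search, which the text does not specify either way.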
Preferably, the system further includes a second screening unit, configured to compute, within a predetermined time period, the entropy of the specific keyword groups belonging to the same semantic pattern with respect to the keyword groups of all queries, take it as the discrimination of the semantic pattern, and extract the semantic patterns whose entropy exceeds a predetermined value. The semantic patterns stored by the reference information storage unit are then those whose frequency of occurrence exceeds a predetermined threshold and whose entropy exceeds a predetermined value.

Preferably, the system further includes a third screening unit, configured to compute the number of query phrases that match a semantic pattern within a predetermined time period and take the ratio of that number to the total number of queries as the coverage of the semantic pattern; to compute, within a predetermined time period, the entropy of the specific keyword groups belonging to the same semantic pattern with respect to the keyword groups of all queries; and to extract the semantic patterns whose frequency of occurrence exceeds a predetermined threshold, whose coverage exceeds a predetermined threshold, and whose entropy exceeds a predetermined value. The semantic patterns stored by the reference information storage unit are then those that satisfy all three conditions.

As the above technical solution shows, the embodiments of the present invention set semantic patterns according to the characteristics of natural language and users' habitual usage and, according to user intent, establish a correspondence between each semantic pattern and the filtering and sorting methods obtained by statistically analyzing the user behavior corresponding to that pattern.
As a result, when a query phrase entered by the user is received, the semantic pattern matching that query phrase is determined, and searching and processing then follow the corresponding filtering and sorting methods. On the one hand, not all data has to be searched, which reduces the workload; on the other hand, the user's intent is analyzed using historical experience, which raises the relevance of the search results to the user's intent and improves search precision.

[Embodiments]

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

An embodiment of the invention discloses an information retrieval method. It compiles statistics on the semantic patterns that occur frequently in the historical query records and establishes a correspondence between them and the filtering and sorting methods that reflect user intent. When the user enters a query phrase, the semantic pattern corresponding to the query phrase is determined; the corresponding filtering and sorting methods are then determined from the correspondence; and the search results are processed with that filtering method and result display (sorting) method before being provided to the user. This improves how well the returned results match the user's intent, that is, it improves search relevance.

Referring to FIG. 1, the preprocessing procedure of the information retrieval method provided by an embodiment of the invention includes the following steps.

Step S11: Determine the semantic tag of each query phrase appearing in the historical query records.

Historical query records from a period of time are selected, semantic analysis is performed on each query phrase, and the semantic tag of each query phrase is determined. For example, if the query word is "mobile phone", its semantic tag is "product".

The semantic tags are stored in a semantic tag library and the query phrases in a query lexicon. Both are stored in a database, and the semantic tags in the library correspond to the query phrases.

Step S12: Compile statistics, according to the semantic tags, on the semantic patterns to which the query phrases belong.

The longer the period covered by the historical query records, the more query records there are and the broader the coverage of the determined semantic patterns, which makes them more accurate.

Semantic patterns are summarized from the characteristics of natural language. For example, when a query phrase contains several query terms, its head word is determined according to natural language characteristics. For the query phrase "mobile phone battery", the head word is "battery" and the semantic pattern is "modifier + product"; likewise, the semantic pattern of "digital camera" is also "modifier + product".

Semantic patterns are stored in a semantic pattern table.

Step S13: From the semantic patterns determined in step S12, select those whose frequency of occurrence exceeds a predetermined threshold.

The query phrases in the historical query records are tagged in the following format:

[Query]\t[Semantic Pattern]\t[PV]

where Query is the query phrase, Semantic Pattern is the semantic pattern, and PV is the number of times the phrase was queried, as shown in Table 1:

Table 1

| Query | Semantic Pattern | PV |
|---|---|---|
| digital camera | modifier + product | 13 |
| mobile phone battery | modifier + product | 13 |

From the PV information, the semantic patterns that were queried more often than the predetermined threshold are determined. These patterns may be flagged in place or stored separately.

Step S14: Compile statistics on the user behavior corresponding to each semantic pattern in the historical records, and set user query intent attributes that reflect that behavior. User behavior here means which links the user clicked in the query results after querying with a given query term.

In the past, a user entered a query phrase and selected (clicked) certain results among those returned.
This behavior in itself reflects a way of filtering and sorting. Because every query action is recorded in the query log, the semantic patterns of the query phrases in the log can be tallied, and user query intent attributes reflecting the corresponding user behavior can be set and stored in a user query intent attribute table.

The user query intent attributes include degree of ambiguity, authority requirement, timeliness requirement, and regional requirement, as shown in Table 2:

Table 2

| Intent attribute of the semantic pattern | Attribute values |
|---|---|
| Degree of ambiguity | definite / general / precise |
| Authority requirement | authoritative results required / not required |
| Timeliness requirement | yes / no |
| Regional requirement | local / nearby / no regional requirement |

The settings of these attributes determine which filtering method and sorting method are chosen. The filtering and sorting methods are ways of processing the query results. Filtering may be by region, authority, or degree of ambiguity; sorting generally means arranging the results by some feature (such as time), with results closer to the query time placed first. Different attribute settings correspond to different filtering and sorting methods. For example, if a semantic pattern requires authoritative results, a suitable filtering method is chosen to select the authoritative results (such as authoritative information from authoritative websites); if a semantic pattern has a regional requirement, the results that satisfy it are selected from the result information. Alternatively, the search results are sorted by degree of ambiguity: the lower the ambiguity, the higher the position.
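To make the link between intent attributes and result processing concrete, here is a minimal sketch. The class names, field names, and the `process` function are illustrative assumptions; only the four attributes and their values come from Table 2. The sketch filters on authority and on the "local" regional requirement and sorts newest-first when timeliness is required; the "nearby" value and the ambiguity-based ordering are not modeled.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IntentAttributes:
    ambiguity: str         # "definite", "general", or "precise" (unused in this sketch)
    needs_authority: bool  # authoritative results required?
    time_sensitive: bool   # timeliness requirement?
    region: str            # "local", "nearby", or "none"


@dataclass(frozen=True)
class Result:
    title: str
    authoritative: bool
    region: str
    published: date


def process(results, attrs, user_region):
    """Filter, then sort, search results according to the intent attributes."""
    kept = [
        r for r in results
        if (not attrs.needs_authority or r.authoritative)
        and (attrs.region != "local" or r.region == user_region)
    ]
    if attrs.time_sensitive:
        # Results closest to the query time (newest) come first.
        kept.sort(key=lambda r: r.published, reverse=True)
    return kept


if __name__ == "__main__":
    hits = [
        Result("a", True, "Hangzhou", date(2008, 1, 1)),
        Result("b", False, "Hangzhou", date(2009, 1, 1)),
        Result("c", True, "Shanghai", date(2010, 1, 1)),
        Result("d", True, "Hangzhou", date(2009, 6, 1)),
    ]
    attrs = IntentAttributes("precise", True, True, "local")
    print([r.title for r in process(hits, attrs, "Hangzhou")])
```

Keeping the attributes in one immutable record means a semantic pattern can map to a single value that fully determines how results are handled.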
Step S15: Determine the user query intent attributes of each semantic pattern, and set the correspondence between the semantic pattern and the filtering method and sorting method specified by those attributes.

The filtering method is the way search results are screened; the sorting method is the way they are ordered.

The relationship between the filtering and sorting methods and the semantic patterns is expressed as a table, as follows:

Table 3

| User query intent attributes of the semantic pattern | Filtering method | Sorting method |
|---|---|---|
| Definite; authoritative results required; no timeliness requirement; regional requirement | Precise filtering (exact-match principle); obtain information from authoritative websites; distinguish by region | None |
| Fuzzy; authoritative results not required; timeliness requirement; regional requirement | Fuzzy filtering (fuzzy-match principle); obtain information from various websites; distinguish by region | Sort chronologically |
| ... | ... | ... |

The individual user query intent attributes are analyzed below.

Degree of ambiguity: how specifically the user understands the information being searched for. When the user has a specific understanding of it, the semantic pattern is considered a definite pattern; for example, the query corresponding to the pattern contains a specific name, a number, or a term expressing a specific restriction, such as "Nokia N92 original battery". Otherwise, the user can be taken to want only a rough understanding, that is, the search engine should return results from multiple angles, sources, and domains, and the semantic pattern is a general pattern, as with the query "Shanghai agent cooperation". When the semantic pattern contains information indicating a uniqueness requirement, it is a precise pattern, as with the queries "Alibaba Hangzhou phone number" or "Mao Zedong's birthday".

Note that the degree of ambiguity of a semantic pattern can be determined from the meaning of each query term in the corresponding specific query.
For example, the pattern corresponding to "mobile phone battery" is a general pattern, whereas the pattern corresponding to "Nokia N92 battery" is a definite pattern, because the scope of "Nokia N92" is much narrower than that of "mobile phone".

Authority requirement: whether the user needs an authoritative result. The authority requirement can be derived from the literal meaning. For example, the semantic pattern "year + policy" requires authoritative results, so for the query "2008 export tax rebate quota", results from authoritative sources (such as official websites) are returned first.

Timeliness requirement: the need to return results for a certain point or period in time. First, there are implicit time requirements; for instance, the query "banana price" should return information that is as current as possible. The timeliness requirement can also be derived from the literal meaning: if the semantic pattern involves a specific time field (year, month, day), the pattern has a timeliness requirement, as with the query "2008 export tax rebate quota". Certain words can also signal timeliness, such as "new" or "latest", so the query "new Nokia" would be configured as a semantic pattern with a timeliness requirement.

Regional requirement: whether the search target is limited to a geographic range. Based on users' search habits and prior knowledge, some query types are considered implicitly related to region. For example, the semantic pattern "product + transportation" indicates information about products shipped out from the local area or shipped in from elsewhere; a specific query is "coal transportation".

In addition, for certain product-related semantic patterns, a batch attribute can also be included to indicate whether the product is retail or wholesale.
For example, the query "rice agent" is generally regarded as a wholesale query, whereas "Dell D630" is regarded as a retail query.

In summary, a specific query such as "transport product" would be marked as "general", "authoritative results not required", "timeliness requirement", "regional requirement", and "batch size not fixed". The pattern can be stored in the following form:

[Pattern]\t[Ambiguity]\t[Authority]\t[Temporal]\t[Regional]\t[Batch]

Table 4 shows the intent analysis results for some specific queries (the user query intent attributes corresponding to their semantic patterns):

Table 4

| Query | Degree of ambiguity | Authority requirement | Timeliness requirement | Regional requirement | Batch size |
|---|---|---|---|---|---|
| Nokia N73 | precise | no | no | none | retail |
| rubber processing | general | no | no | local | |
| banana price | precise | yes | yes | | |

The correspondence between user query intent attributes and filtering and sorting methods can be stored as a table, for example in a "correspondence table" that serves as the data table holding the correspondence.

Accordingly, a correspondence is established between the filtering and sorting methods that satisfy the above intent attributes and the semantic pattern to which "transport product" belongs. When a user's query phrase matches that semantic pattern, the corresponding filtering method is determined from the correspondence and applied to the query results, which are then ordered by the corresponding sorting method. The procedure, shown in FIG. 2, includes the following steps.

Step S21: The search engine receives the query phrase.

Step S22: Perform semantic analysis on the query phrase to determine the semantic pattern to which it belongs.
Specifically, for example, according to the natural language feature, the corresponding semantic tag is matched into a preset semantic tag library, and then matched in the semantic mode table, such as Query and semantics of the specific content as “digital camera” or “mobile phone battery”. The pattern "modifier + product" matches. Step S23: Determine, according to the preset reference information, a filtering manner and a sorting manner corresponding to the semantic mode of the query phrase. The reference information is presented in a tabular manner (i.e., the correspondence data table described above). First, a semantic pattern consistent with the semantic pattern of the query phrase is searched in the correspondence table, and then the corresponding filtering manner and sorting manner are determined. The reference information is a correspondence between a semantic mode set by the foregoing preprocessing process and a filtering mode and a sorting mode. Step S24' After filtering the result information by using the filtering method, sorting and displaying according to the corresponding sorting manner. Specifically, the search is performed by using the query phrase, and then the search result is filtered by the filtering method, and finally, sorting and displaying are performed according to the sorting manner. For example, for the query phrase "mobile phone battery", the filtering method is: using "mobile phone" as a modification condition for screening results, and using ''battery' as a search subject input search engine for searching ^ in the above preprocessing process In order to reduce the complexity of extracting semantic patterns, the Query can be changed -17- 201131394

In the preprocessing described above, the Queries entered by users vary widely. To reduce the complexity of extracting semantic patterns, each Query can therefore be processed first: illegal characters and meaningless Queries (Chinese or English words that do not exist in the dictionary, garbled text, and so on) are removed; appropriate normalization is applied (redundant spaces are merged and meaningless symbols are filtered out); and word segmentation is performed (the specific segmentation method belongs to the prior art and is not described here). Only then is the semantic pattern determined.

In addition, to further improve the discrimination of the semantic patterns, words that directly express the user's intent, such as "agent", "wanted to buy", "purchase" and "franchise", can be collected during preprocessing. For convenience, such words are referred to below as intent words. The semantic patterns corresponding to the intent words, such as "intent word + product", are mined automatically from the intent word list, and filtering and sorting methods are established that match the intent attributes (degree of ambiguity, authority requirement, timeliness requirement, regional requirement and batch size) of each such pattern. In the subsequent retrieval process, a Query that contains one of these intent words can then be matched to the semantic pattern "intent word + product" or "product + intent word". The semantic patterns determined during preprocessing are therefore as shown in Table 5.
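The Query cleanup and intent-word matching described above can be sketched as follows. This is only an illustration under stated assumptions: the word lists are hypothetical English stand-ins, splitting on spaces replaces a real word segmenter, and lower-casing is an extra normalization step not mentioned in the text.

```python
import re
from typing import List, Optional

# Hypothetical word lists; a real system would load these from its dictionaries.
INTENT_WORDS = {"agent", "purchase", "franchise", "repair"}
PRODUCT_WORDS = {"camera", "battery", "motor"}

_SYMBOLS = re.compile(r"[^\w\s]")  # anything that is not a word character or whitespace
_SPACES = re.compile(r"\s+")


def normalize(query: str) -> str:
    """Filter out symbols, merge redundant spaces, and lower-case the Query."""
    without_symbols = _SYMBOLS.sub(" ", query)
    return _SPACES.sub(" ", without_symbols).strip().lower()


def segment(query: str) -> List[str]:
    """Stand-in for a real word segmenter: split a normalized Query on spaces."""
    return query.split(" ") if query else []


def tag(token: str) -> Optional[str]:
    """Return the semantic tag of a token, or None if it is not in the dictionary."""
    if token in INTENT_WORDS:
        return "intent word"
    if token in PRODUCT_WORDS:
        return "product"
    return None


def intent_pattern(raw_query: str) -> Optional[str]:
    """Return the intent-word pattern of a Query, or None if it has none."""
    tokens = segment(normalize(raw_query))
    tags = [tag(token) for token in tokens]
    if not tags or None in tags:
        return None  # empty or meaningless Query is dropped
    if "intent word" not in tags:
        return None  # no intent word, so no intent-word pattern
    return " + ".join(tags)
```

The order of the tags is preserved, so the same function yields either "intent word + product" or "product + intent word" depending on where the intent word appears in the Query.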

Table 5

Query                   Semantic Pattern        PV
Digital camera          modifier + product      13
Repair electric motor   intent word + product   11

It should be noted that, because the intent words are compiled outside the context of actual Queries, coverage is a problem: there is no guarantee that every semantic pattern involving an intent word will be discovered and determined. To address this, each Query can be expanded before the statistics are computed: its words are replaced by the semantic category to which they belong, and every resulting variant is counted toward the totals, so that intent-word patterns covering a high PV can be discovered and determined. For example, a Query with the content "chemical product transportation" can be expanded into "product transportation", "product intent-word" and "chemical product intent-word".

Because the expanded Queries and their patterns amount to a large volume of data, they can be merged by pattern on a distributed computing platform, and the results sorted by PV.
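The Query expansion and per-pattern merging can be sketched in a few lines of Python. This is a single-process illustration, not the distributed computation the text refers to. The tag table, the choice to skip the fully literal variant, and the cap on stored examples are assumptions made for the sketch.

```python
from collections import defaultdict
from itertools import product as cartesian
from typing import Dict, Iterable, List, Sequence, Tuple

# Hypothetical tag table (word -> semantic category).
TAGS: Dict[str, str] = {
    "chemical product": "product",
    "rubber": "product",
    "transportation": "intent word",
}


def expand(tokens: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return every variant in which at least one word is replaced by its category."""
    options = [[t, TAGS[t]] if t in TAGS else [t] for t in tokens]
    original = tuple(tokens)
    return [combo for combo in cartesian(*options) if combo != original]


def merge_by_pattern(
    query_log: Iterable[Tuple[Sequence[str], int]], max_examples: int = 3
) -> List[Tuple[str, int, int, List[str]]]:
    """Merge expanded Queries by pattern: (pattern, PV, unique count, examples)."""
    pv_by_pattern: Dict[str, int] = defaultdict(int)
    queries_by_pattern: Dict[str, set] = defaultdict(set)
    for tokens, pv in query_log:
        query = " ".join(tokens)
        for variant in expand(tokens):
            pattern = " ".join(variant)
            pv_by_pattern[pattern] += pv
            queries_by_pattern[pattern].add(query)
    rows = [
        (p, pv_by_pattern[p], len(queries_by_pattern[p]),
         sorted(queries_by_pattern[p])[:max_examples])
        for p in pv_by_pattern
    ]
    rows.sort(key=lambda row: (-row[1], row[0]))  # highest PV first
    return rows
```

Each row carries a pattern, its summed PV, the number of distinct Queries that produced it, and a few example Queries, in descending PV order.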
The result format can be as follows:

[Pattern]\t[PV]\t[Unique Count]\t(Examples)

It should also be noted that the semantic patterns determined during preprocessing in all of the above embodiments can be screened to identify good patterns. A good pattern is taken here to be one that evenly covers a certain number of Queries. Specifically, patterns can be evaluated as follows: a threshold is set for the number of Queries and the PV that a semantic pattern covers, and a threshold is set for the entropy of the PV distribution over the specific Queries that the pattern covers. The determined semantic patterns are then filtered first against the quantity threshold and then against the entropy threshold, which removes patterns with weak coverage or an uneven distribution. Intent analysis is then performed, and the correspondence between semantic patterns and classification targets is set.

In addition, once intent words have been introduced to improve the discrimination of semantic patterns, one Query may correspond to several semantic patterns. In that case a semantic pattern with a specific meaning is configured with a higher priority, and a semantic pattern with an abstract meaning with a lower priority. For example, when "banana price" corresponds to both "product + intent word" and "product + price", the pattern "product + price" is determined to be the one semantic pattern corresponding to "banana price".
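The priority rule for a Query that matches several patterns can be sketched as follows. The slot notation (`tag:` for a semantic category, `word:` for a literal word), the tag table and the numeric priorities are hypothetical; the text only says that more specific patterns are configured with a higher priority.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical tag table (word -> semantic category).
TAGS: Dict[str, str] = {
    "banana": "product",
    "price": "intent word",
    "purchase": "intent word",
}

# Hypothetical pattern table. A higher priority means a more specific pattern.
PATTERNS: List[dict] = [
    {"name": "product + intent word",
     "slots": ("tag:product", "tag:intent word"), "priority": 1},
    {"name": "product + price",
     "slots": ("tag:product", "word:price"), "priority": 2},
]


def slot_matches(slot: str, token: str) -> bool:
    """A 'tag:' slot matches by semantic category, a 'word:' slot by literal word."""
    kind, _, value = slot.partition(":")
    return TAGS.get(token) == value if kind == "tag" else token == value


def candidates(tokens: Sequence[str]) -> List[dict]:
    """All patterns whose slots match the tokens one by one."""
    return [
        p for p in PATTERNS
        if len(p["slots"]) == len(tokens)
        and all(slot_matches(s, t) for s, t in zip(p["slots"], tokens))
    ]


def resolve(tokens: Sequence[str]) -> Optional[str]:
    """Pick the single highest-priority pattern, or None if nothing matches."""
    matched = candidates(tokens)
    if not matched:
        return None
    return max(matched, key=lambda p: p["priority"])["name"]
```

With these tables, "banana price" matches both patterns and resolves to "product + price", while "banana purchase" matches only the abstract pattern.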
The embodiments of the present invention set semantic patterns according to natural-language characteristics and users' habitual usage and, according to user intent, establish a correspondence between semantic patterns and filtering and sorting methods. When a query phrase entered by a user is received, the semantic pattern matching that query phrase is determined, and processing then follows the corresponding filtering and sorting methods. On the one hand, this reduces the workload because not all data needs to be retrieved; on the other hand, because user intent is analyzed on the basis of historical experience, the relevance of the search results to the user's intent is improved, and with it the search accuracy.

The embodiments of the present invention also provide an information retrieval system that implements the above method. Its structure is shown in FIG. 3 and includes a reference storage unit 31, a receiving unit 32, a semantic pattern matching unit 33, a processing-method determining unit 34 and an execution unit 35, where:

The reference information storage unit 31 stores the correspondence between semantic patterns and filtering and sorting methods. The stored semantic patterns are those, among the semantic patterns of the query phrases appearing in the historical query records, whose frequency of occurrence exceeds a predetermined threshold. Semantic patterns are derived from natural-language characteristics: when a query phrase contains several query fields, its head word is determined according to natural-language characteristics. For example, for the query phrase "mobile phone battery" the head word is "battery" and the semantic pattern is "modifier + product"; likewise, the semantic pattern corresponding to "digital camera" is also "modifier + product".
Because the query log records the user behavior associated with each query phrase, the user behavior corresponding to each semantic pattern in the historical records can be analyzed statistically, and user query intent attributes reflecting that behavior can be set. The settings of the user query intent attributes determine the filtering method and the sorting method, so the correspondence between semantic patterns and filtering and sorting methods can be established.

The receiving unit 32 receives a query phrase entered by the user; the query phrase generally contains two or more keywords.

The semantic pattern matching unit 33 performs semantic analysis on the query phrase received by the receiving unit 32 to determine its semantic tags, and from them the semantic pattern to which it belongs.

The processing-method determining unit 34 determines, from the information stored in the reference information storage unit 31, the filtering method and sorting method corresponding to the semantic pattern of the query phrase.

The execution unit 35 processes the search results using the filtering method and the sorting method.

The semantic patterns whose frequency of occurrence exceeds the predetermined threshold can be screened further to identify the good patterns among them. A good pattern is taken here to be one that evenly covers a certain number of Queries with the same intent. The screening can therefore be based on coverage and/or entropy, as the following embodiments describe in detail.

FIG. 4 shows one structural form of the information retrieval system, which includes a reference storage unit 41, a receiving unit 42, a semantic pattern matching unit 43, a processing-method determining unit 44, an execution unit 45 and a first screening unit 46, where:

The functions of the receiving unit 42, the semantic pattern matching unit 43, the processing-method determining unit 44 and the execution unit 45 are basically the same as those of the receiving unit 32, the semantic pattern matching unit 33, the processing-method determining unit 34 and the execution unit 35.

The first screening unit 46 calculates the number of query phrases that conform to a semantic pattern within a predetermined time period, takes the ratio of that number to the total number of queries as the coverage of the semantic pattern, and extracts the semantic patterns whose coverage is greater than a predetermined threshold.

The reference storage unit 41 stores the correspondence between semantic patterns and filtering and sorting methods. The stored semantic patterns are those, among the semantic patterns of the query phrases appearing in the historical query records, whose frequency of occurrence exceeds a predetermined threshold and whose coverage is greater than a predetermined threshold.

FIG. 5 shows another structural form of the information retrieval system, which includes a reference storage unit 51, a receiving unit 52, a semantic pattern matching unit 53, a processing-method determining unit 54, an execution unit 55 and a second screening unit 56, where:

The functions of the receiving unit 52, the semantic pattern matching unit 53, the processing-method determining unit 54 and the execution unit 55 are basically the same as those of the receiving unit 32, the semantic pattern matching unit 33, the processing-method determining unit 34 and the execution unit 35.
The second screening unit 56 calculates, for a predetermined time period, the entropy of the specific key field groups belonging to the same semantic pattern with respect to the key field groups of all queries, takes it as the degree of discrimination of that semantic pattern, and extracts the semantic patterns whose entropy is greater than a predetermined value.

The reference storage unit 51 stores the correspondence between semantic patterns and filtering and sorting methods. The stored semantic patterns are those, among the semantic patterns of the query phrases appearing in the historical query records, whose frequency of occurrence exceeds a predetermined threshold and whose entropy is greater than a predetermined value.

FIG. 6 shows another structural form of the information retrieval system, which includes a reference storage unit 61, a receiving unit 62, a semantic pattern matching unit 63, a processing-method determining unit 64, an execution unit 65 and a third screening unit 66, where:

The functions of the receiving unit 62, the semantic pattern matching unit 63, the processing-method determining unit 64 and the execution unit 65 are basically the same as those of the receiving unit 32, the semantic pattern matching unit 33, the processing-method determining unit 34 and the execution unit 35.
The third screening unit 66 calculates the number of query phrases that conform to a semantic pattern within a predetermined time period and takes the ratio of that number to the total number of queries as the coverage of the semantic pattern. It also calculates, for the predetermined time period, the entropy of the specific key field groups belonging to the same semantic pattern with respect to the key field groups of all queries. It then extracts the semantic patterns whose frequency of occurrence exceeds a predetermined threshold, whose coverage is greater than a predetermined threshold and whose entropy is greater than a predetermined value.

The reference storage unit 61 stores the correspondence between semantic patterns and filtering and sorting methods. The stored semantic patterns are those, among the semantic patterns of the query phrases appearing in the historical query records, whose frequency of occurrence exceeds a predetermined threshold, whose entropy is greater than a predetermined value and whose coverage is greater than a predetermined threshold.
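The coverage and entropy screening performed by the screening units can be sketched as follows. This is one reading of the text rather than a definitive formula: entropy is taken as the Shannon entropy of the PV distribution over the specific Queries that match a pattern, the logarithm base 2 is an assumption, and the function names, sample data and thresholds are hypothetical.

```python
import math
from typing import Dict, List

# pattern name -> {specific Query -> PV within the time period}
PatternLog = Dict[str, Dict[str, int]]


def coverage(query_pv: Dict[str, int], total_pv: int) -> float:
    """Share of all queries in the period that conform to the pattern."""
    return sum(query_pv.values()) / total_pv if total_pv else 0.0


def entropy(query_pv: Dict[str, int]) -> float:
    """Shannon entropy (base 2) of the PV distribution over the pattern's Queries."""
    total = sum(query_pv.values())
    if total == 0:
        return 0.0
    probabilities = [pv / total for pv in query_pv.values() if pv > 0]
    return -sum(p * math.log2(p) for p in probabilities)


def screen(log: PatternLog, total_pv: int,
           min_coverage: float, min_entropy: float) -> List[str]:
    """Keep patterns whose coverage and entropy both exceed their thresholds."""
    return [
        name for name, query_pv in log.items()
        if coverage(query_pv, total_pv) > min_coverage
        and entropy(query_pv) > min_entropy
    ]
```

A pattern whose PV is spread evenly over many Queries has high entropy, while a pattern dominated by one Query has entropy close to zero, so the entropy threshold removes patterns that are really a single popular Query.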

Those skilled in the art will understand that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be regarded as going beyond the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two.
A software module can reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the invention. The present invention is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

[Brief description of the drawings]

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of the preprocessing in the information retrieval method provided by an embodiment of the present invention;
FIG. 2 is a flowchart of the retrieval in the information retrieval method provided by an embodiment of the present invention;
FIG. 3 is structural schematic diagram 1 of the information retrieval system provided by an embodiment of the present invention;
FIG. 4 is structural schematic diagram 2 of the information retrieval system provided by an embodiment of the present invention;
FIG. 5 is structural schematic diagram 3 of the information retrieval system provided by an embodiment of the present invention;
FIG. 6 is structural schematic diagram 4 of the information retrieval system provided by an embodiment of the present invention.

[Description of main component symbols]

31: reference storage unit
32: receiving unit
33: semantic pattern matching unit
34: processing-method determining unit
35: execution unit
41: reference storage unit
42: receiving unit
43: semantic pattern matching unit
44: processing-method determining unit
45: execution unit
46: first screening unit
51: reference storage unit
52: receiving unit
53: semantic pattern matching unit
54: processing-method determining unit
55: execution unit
56: second screening unit
61: reference storage unit
62: receiving unit
63: semantic pattern matching unit
64: processing-method determining unit
65: execution unit
66: third screening unit

Claims (1)

VII. Scope of the patent application:

1. An information retrieval method, characterized in that it comprises:
a preprocessing step, comprising:
determining the semantic tags of each query phrase appearing in historical query records, compiling statistics on semantic patterns according to the semantic tags, and selecting from the statistical results the semantic patterns whose frequency of occurrence exceeds a predetermined threshold;
setting, by statistically analyzing the user behavior corresponding to each semantic pattern in the historical records, user query intent attributes that reflect the user behavior, and setting a correspondence between the semantic pattern and the filtering method and sorting method specified by the user query intent attributes;
a retrieval step, comprising:
receiving a query phrase and performing semantic analysis to determine the semantic tags to which it belongs;
determining, according to the correspondence, the filtering method and sorting method corresponding to the semantic pattern to which the query phrase belongs;
processing the search results using the filtering method and the sorting method.

2. The method of claim 1, further comprising, after the high-frequency semantic patterns are determined, screening the semantic patterns according to coverage, the screening comprising:
calculating the number of query phrases that conform to a semantic pattern within a predetermined time period, and taking the ratio of that number to the total number of queries as the coverage of the semantic pattern;
extracting the semantic patterns whose coverage is greater than a predetermined threshold.

3. The method of claim 1, further comprising, after the high-frequency semantic patterns are determined, screening the semantic patterns according to degree of discrimination, the screening comprising:
calculating, for a predetermined time period, the entropy of the specific key field groups belonging to the same semantic pattern with respect to the key field groups of all queries, and taking it as the degree of discrimination of the semantic pattern;
extracting the semantic patterns whose entropy is greater than a predetermined value.

4. The method of claim 1, further comprising, after the high-frequency semantic patterns are determined, screening the semantic patterns according to coverage and degree of discrimination, the screening comprising:
calculating the number of queries that conform to the semantic pattern within a predetermined time period, and taking the ratio of that number to the total number of queries as the coverage of the semantic pattern;
calculating, for the predetermined time period, the entropy of the specific key field groups belonging to the same semantic pattern with respect to the key field groups of all queries, and taking it as the degree of discrimination of the semantic pattern;
extracting the semantic patterns whose coverage is greater than a predetermined threshold and whose entropy is greater than a predetermined value.

5. The method of claim 1, wherein the user query intent attributes comprise: a degree-of-ambiguity attribute, an authority requirement attribute, a timeliness requirement attribute, a regional requirement attribute and a batch attribute.
6. An information retrieval method, characterized in that it comprises:
receiving a query phrase and matching the corresponding semantic tags in a preset semantic tag library;
matching the matched semantic tags against a semantic pattern table to obtain the semantic pattern of the query phrase;
matching the semantic pattern against a correspondence table, preset according to user query intent attributes, between semantic patterns and filtering and sorting methods, to obtain the filtering method and sorting method corresponding to the query phrase;
processing the search results of the query phrase using the filtering method and the sorting method.

7. An information retrieval system, characterized in that it comprises:
a reference information storage unit for storing the correspondence between semantic patterns and filtering and sorting methods, the semantic patterns being those, among the semantic patterns of the query phrases appearing in historical query records, whose frequency of occurrence exceeds a predetermined threshold, the filtering and sorting methods being specified by user query intent attributes, and the user query intent attributes being set by statistically analyzing the user behavior corresponding to each semantic pattern in the historical records;
a receiving unit for receiving a query phrase;
a semantic pattern matching unit for performing semantic analysis on the query phrase received by the receiving unit to determine its semantic tags;
a processing-method determining unit for determining, according to the information stored in the reference information storage unit, the semantic pattern to which the query phrase belongs and the corresponding filtering method and sorting method;
an execution unit for processing the search results using the filtering method and the sorting method.

8. The system of claim 7, further comprising:
a first screening unit for calculating the number of query phrases that conform to a semantic pattern within a predetermined time period, taking the ratio of that number to the total number of queries as the coverage of the semantic pattern, and extracting the semantic patterns whose coverage is greater than a predetermined threshold;
wherein the semantic patterns stored by the reference information storage unit are those whose frequency of occurrence exceeds a predetermined threshold and whose coverage is greater than a predetermined threshold.

9. The system of claim 7, further comprising:
a second screening unit for calculating, for a predetermined time period, the entropy of the specific key field groups belonging to the same semantic pattern with respect to the key field groups of all queries, taking it as the degree of discrimination of the semantic pattern, and extracting the semantic patterns whose entropy is greater than a predetermined value;
wherein the semantic patterns stored by the reference information storage unit are those whose frequency of occurrence exceeds a predetermined threshold and whose entropy is greater than a predetermined value.

10. The system of claim 7, further comprising:
a third screening unit for calculating the number of query phrases that conform to a semantic pattern within a predetermined time period, taking the ratio of that number to the total number of queries as the coverage of the semantic pattern, calculating, for the predetermined time period, the entropy of the specific key field groups belonging to the same semantic pattern with respect to the key field groups of all queries, and extracting the semantic patterns whose frequency of occurrence exceeds a predetermined threshold, whose coverage is greater than a predetermined threshold and whose entropy is greater than a predetermined value;
wherein the semantic patterns stored by the reference information storage unit are those whose frequency of occurrence exceeds a predetermined threshold, whose coverage is greater than a predetermined threshold and whose entropy is greater than a predetermined value.
TW99106781A 2010-03-09 2010-03-09 Information retrieval methods and systems TWI474197B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW99106781A TWI474197B (en) 2010-03-09 2010-03-09 Information retrieval methods and systems

Publications (2)

Publication Number Publication Date
TW201131394A true TW201131394A (en) 2011-09-16
TWI474197B TWI474197B (en) 2015-02-21

Family

ID=50180357

Family Applications (1)

Application Number Title Priority Date Filing Date
TW99106781A TWI474197B (en) 2010-03-09 2010-03-09 Information retrieval methods and systems

Country Status (1)

Country Link
TW (1) TWI474197B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107274217A (en) * 2017-05-27 2017-10-20 冯小平 Determine user's current behavior and the method and apparatus for predicting user view

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5604899A (en) * 1990-05-21 1997-02-18 Financial Systems Technology Pty. Ltd. Data relationships processor with unlimited expansion capability
US8224087B2 (en) * 2007-07-16 2012-07-17 Michael Bronstein Method and apparatus for video digest generation

Also Published As

Publication number Publication date
TWI474197B (en) 2015-02-21

Similar Documents

Publication Publication Date Title
CN102012900B (en) An information retrieval method and system
CN107729336B (en) Data processing method, device and system
TWI544351B (en) Extended query method and system
CN111008265B (en) Enterprise information searching method and device
US9846748B2 (en) Searching for information based on generic attributes of the query
US20180260848A1 (en) Information processing method and apparatus
CN116450772B (en) Intelligent recommendation method and device for search results and unified search method
TW201348991A (en) Product search method and system
CN101464897A (en) Word matching and information query method and device
TW201546633A (en) Method and Apparatus of Matching Text Information and Pushing a Business Object
CN111444304A (en) Method and device for search ranking
CN107341268A (en) A kind of heat searches list sort method and system
CN106933800A (en) A kind of event sentence abstracting method of financial field
CN105653546A (en) Method and system for searching target theme
CN103136256A (en) Method and system for achieving information retrieval in network
CN111723296B (en) Search processing method, device and computer equipment
CN103092838B (en) A kind of method and device for obtaining English words
TWI446191B (en) Word matching and information query method and device
TWI474197B (en) Information retrieval methods and systems
CN107807964A (en) Digital content sort method, device and computer-readable recording medium
TWI483129B (en) Retrieval method and device
CN111737489A (en) A retrieval method, device, device and readable storage medium for building information
CN119884281B (en) Legal provision retrieval method and system
HK1151870B (en) Method and system for information searching
CN115828909A (en) Method, system, equipment and medium for extracting enterprise abbreviation

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees