TWI647578B

TWI647578B - Search engine based document indexing method, data query method and server

Info

Publication number: TWI647578B
Application number: TW099106787A
Authority: TW
Inventors: 魏磊; 沈加翔
Original assignee: 阿里巴巴集團控股有限公司
Priority date: 2010-03-09
Filing date: 2010-03-09
Publication date: 2019-01-11
Also published as: TW201131396A

Abstract

Embodiments of the present application disclose a search-engine-based document indexing method, a data query method, and servers. The document indexing method includes: obtaining a document to be indexed and segmenting the document into unigrams; and determining whether each unigram is a filter word. If the unigram is a filter word, it is combined with at least one sequentially adjacent unigram to form an n-gram, and the n-gram is indexed; if the unigram is not a filter word, the unigram is indexed directly. When indexing or querying, the embodiments combine a unigram that is a high-frequency word with at least one adjacent unigram to form an n-gram, so that search engine resources are not wasted at query time by indexing the high-frequency word itself, and query results are not made inaccurate by skipping the high-frequency word altogether.

Description

Search engine based document indexing method, data query method and server

The present application relates to the field of search engine technology, and in particular to a search-engine-based document indexing method, data query method, and server.

A search engine is a system that collects information from the Internet using specific computer programs according to a certain strategy, organizes and processes that information, and presents the processed information to users, thereby providing them with a retrieval service.

A search engine works as follows. First, it crawls web pages. Each independent search engine has its own crawler, commonly called a spider, which follows the hyperlinks in web pages and fetches pages continuously; the fetched pages are called web page snapshots. Because hyperlinks are used so widely on the Internet, in theory the vast majority of web pages can be collected by starting from a limited set of pages. Second, it processes the pages: after fetching a page, the search engine extracts keywords and builds an index file, which is required before it can provide a retrieval service. Finally, it provides the retrieval service: the user enters keywords, and the search engine finds the pages matching those keywords in the index database. To help the user judge the results, it returns a summary taken from the page and other information in addition to the page title and URL.

A Chinese search engine must perform Chinese word segmentation both when indexing and when querying. A commonly used method is unigram segmentation, which treats each Chinese character in a sentence as one unit. If the sentence to be indexed is “中國股市” (“Chinese stock market”), unigram segmentation yields four single characters: “中”, “國”, “股”, and “市”. Take “市” as an example: on a single search engine server that has indexed 6 million documents, the probability that “市” appears is as high as 93%. Consequently, when “中國股市” is queried using the unigram segmentation result, looking up “市” consumes a great deal of the server's resources. Search engines therefore store a high-frequency word list in advance and filter out the high-frequency words so that they are not queried. A search for “中國股市” is thus reduced to a search for “中國股”, which skips the lookup of the high-frequency character “市”.

While studying and practicing the prior art, the inventors found the following problem: when unigram segmentation is used for indexing and querying, skipping high-frequency words by means of a preset high-frequency word list makes the query results inaccurate. Taking the query “中國股市” again as an example, although the lookup of “市” is skipped, the returned results include a large number of results that contain “中國股”, such as “中國股民” (“Chinese stock investors”) and “中國股票” (“Chinese stocks”), so the results do not match what was actually queried.
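The prior-art behavior described above can be reproduced in a few lines. The following is a minimal Python sketch, assuming per-character segmentation and a one-entry filter list; the names `FILTER_WORDS`, `prior_art_query_terms`, and `matches` are hypothetical and chosen only for illustration.

```python
# Assumed high-frequency (filter) list; a real list would come from statistics.
FILTER_WORDS = {"市"}

def unigram_segment(text):
    """Unigram segmentation: every character is its own segment."""
    return list(text)

def prior_art_query_terms(query):
    """Prior-art behavior: simply drop every filter word from the query."""
    return [ch for ch in unigram_segment(query) if ch not in FILTER_WORDS]

def matches(document, terms):
    """A document matches when it contains every remaining query term."""
    return all(term in document for term in terms)

terms = prior_art_query_terms("中國股市")
# The query has collapsed to the characters of "中國股", so unrelated
# documents such as "中國股民" and "中國股票" match as well.
false_hits = [d for d in ("中國股民", "中國股票") if matches(d, terms)]
```

Both unrelated phrases end up in `false_hits`, which is the inaccuracy the application sets out to remove.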

The purpose of the embodiments of the present application is to provide a search-engine-based document indexing method, data query method, and server, so as to solve the problem that existing indexing and querying based on high-frequency word filtering produces inaccurate query results.

To solve the above technical problem, an embodiment of the present application provides a search-engine-based document indexing method, implemented as follows. The document indexing method includes: obtaining a document to be indexed and segmenting the document into unigrams; and determining whether each unigram is a filter word. If the unigram is a filter word, the unigram is combined with at least one sequentially adjacent unigram to form an n-gram, and the n-gram is indexed; if the unigram is not a filter word, the unigram is indexed directly.

To solve the above technical problem, an embodiment of the present application also provides a search-engine-based data query method, implemented as follows. The data query method uses the index built by the document indexing method and includes: obtaining data to be queried and segmenting the data into unigrams; determining whether each unigram is a filter word; if the unigram is a filter word, combining it with at least one sequentially adjacent unigram to form an n-gram and looking up the index with the n-gram; if the unigram is not a filter word, looking up the index with the unigram; and merging the query results obtained from the index lookups.

To solve the above technical problem, an embodiment of the present application further provides a search-engine-based document index server, implemented as follows. The document index server includes: an obtaining unit, configured to obtain a document to be indexed; a segmentation unit, configured to segment the document obtained by the obtaining unit into unigrams; a determining unit, configured to determine whether each unigram is a filter word; and an index unit, configured to, when the determining unit determines that the unigram is a filter word, combine the unigram with at least one sequentially adjacent unigram to form an n-gram and index the n-gram, and, when the determining unit determines that the unigram is not a filter word, index the unigram directly.

To solve the above technical problem, an embodiment of the present application further provides a search-engine-based data query server, implemented as follows. The data query server uses the index built by the document index server and includes: an obtaining unit, configured to obtain data to be queried; a segmentation unit, configured to segment the data obtained by the obtaining unit into unigrams; a determining unit, configured to determine whether each unigram is a filter word; a lookup unit, configured to, when the determining unit determines that the unigram is a filter word, combine the unigram with at least one sequentially adjacent unigram to form an n-gram and look up the index with the n-gram, and, when the determining unit determines that the unigram is not a filter word, look up the index with the unigram; and a merging unit, configured to merge the query results obtained by the lookup unit from the index.

As can be seen, when indexing or querying, the embodiments of the present application combine a unigram that is a high-frequency word with at least one adjacent unigram to form an n-gram. This ensures that search engine resources are not wasted at query time by indexing the high-frequency word, and that query results are not made inaccurate by skipping the high-frequency word. Taking the query “中國股市” as an example, the embodiments query with the segments “中”, “國”, “股”, and “股市”. The hit rate of the segment “股市” is far lower than that of the high-frequency character “市”, and, because “市” is no longer simply skipped, segments other than “股市” are not matched. Resource consumption is therefore reduced while correct query results are returned, which improves the performance of the search engine.

Embodiments of the present application provide a search-engine-based document indexing method, data query method, and server.

To help those skilled in the art better understand the technical solutions in the embodiments of the present application, and to make the above objects, features, and advantages of the embodiments clearer, the technical solutions are described in further detail below with reference to the accompanying drawings.

By function, a search engine usually consists of four parts: a search module, an index module, a query module, and a user interface module. The search module roams the Internet through a spider to discover and collect web page information. The index module extracts index terms from the pages collected by the search module; these terms represent the documents and are used to generate the index table of the document library. The query module retrieves documents from the index library according to the user's query, ranks the results to be output, and returns information appropriate to the user's query needs. The user interface module receives the user's query request and returns the query results to the user. The embodiments of the present application mainly describe how the indexing function and the query function of a search engine are implemented.

To help those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.

Referring to FIG. 1, which is a flowchart of a first embodiment of the search-engine-based document indexing method of the present application:

Step 101: Obtain a document to be indexed.

In the embodiments of the present application, the document to be indexed is usually web page information collected from the Internet by the search function of the search engine. This information is stored in the storage space of the search engine server (for example, on disk). When the server needs to index, it reads the web page information that has not yet been indexed from the storage space.

Step 102: Segment the document to be indexed into unigrams.

Note that if the document to be indexed contains only one character and that character is a high-frequency word, no index is built for the document.

Step 103: Determine whether each unigram is a filter word. If so, perform step 104; otherwise, perform step 105.

A filter word is a high-frequency word that the search engine ignores outright at query time. Because such words occur very frequently in documents, querying them consumes a large amount of system resources. The filter word list is therefore usually determined before indexing, by statistics or similar means, so that later queries can skip these high-frequency words. For example, if statistics show that “市” in “中國股市” is a high-frequency word, “市” can be written into the filter word list.

Segmenting a document usually yields several unigrams, so each unigram is checked in order to determine whether it is a filter word.

Step 104: Combine the unigram with at least one sequentially adjacent unigram to form an n-gram, index the n-gram, and end the current flow.

Preferably, the unigram is combined with a sequentially adjacent unigram to form a bigram. For a current unigram that has been determined to be a high-frequency word: if it is the first unigram in the document, it is combined with the following unigram to form a bigram; if it is the last unigram in the document, it is combined with the preceding unigram to form a bigram; if it is neither the first nor the last unigram in the document, it is combined with the preceding unigram and with the following unigram to form two separate bigrams.
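The three positional cases above can be sketched as a small helper. This is a minimal Python illustration, assuming the unigrams are held in a list and bigrams are formed by string concatenation; the function name `adjacent_bigrams` is hypothetical.

```python
def adjacent_bigrams(unigrams, i):
    """Return the bigrams formed by the filter word at position i.

    First position: only the bigram with the following unigram.
    Last position: only the bigram with the preceding unigram.
    Anywhere else: both bigrams, preceding one first.
    """
    bigrams = []
    if i > 0:
        bigrams.append(unigrams[i - 1] + unigrams[i])
    if i + 1 < len(unigrams):
        bigrams.append(unigrams[i] + unigrams[i + 1])
    return bigrams

middle = adjacent_bigrams(list("城市化"), 1)    # filter word in the middle
first = adjacent_bigrams(list("市場"), 0)       # filter word at the start
last = adjacent_bigrams(list("中國股市"), 3)    # filter word at the end
alone = adjacent_bigrams(["市"], 0)             # single-character document
```

A single-character document yields no bigram at all, which is consistent with the note that no index is built for a document consisting of one high-frequency character.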

The bigrams formed in this way are then indexed. As described above, two adjacent unigrams form one bigram. For example, “我” (“I”) and “的” (a possessive particle) are two unigrams, and combining them gives the bigram “我的” (“my”).

Likewise, an n-gram is formed from at least two adjacent unigrams. Besides bigrams there are trigrams, four-grams, and so on. For example, “中”, “國”, and “人” form the trigram “中國人” (“Chinese person”).

Step 105: Index the unigram directly, and end the current flow.

A unigram that is not a high-frequency word is indexed directly, in the same way as in the prior art.

Referring to FIG. 2, which is a flowchart of a second embodiment of the search-engine-based document indexing method of the present application. This embodiment describes the document indexing process in detail, taking the case where the n-gram is a bigram as an example:

Step 201: Set up the filter word list in advance.

The filter word list can be derived from statistics over a large number of documents on the Internet. For example, 6 million documents are each segmented into unigrams, the number of times each unigram occurs in each document is counted, the unigrams are then ranked by their occurrence counts, and the highest-ranked unigrams (for example, the top 10) are taken as high-frequency words to build the filter word list.
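The counting and ranking described above can be sketched with the standard library. This is a minimal Python sketch, assuming per-character segmentation and a small in-memory corpus; `build_filter_word_list` and its `top_n` parameter are hypothetical names, and a real system would run the count over millions of documents.

```python
from collections import Counter

def build_filter_word_list(documents, top_n=10):
    """Rank unigrams by total occurrence count and keep the top_n."""
    counts = Counter()
    for doc in documents:
        # Unigram segmentation: one segment per non-whitespace character.
        counts.update(ch for ch in doc if not ch.isspace())
    return [unigram for unigram, _ in counts.most_common(top_n)]

corpus = ["上海市", "北京市", "股市"]
filter_words = build_filter_word_list(corpus, top_n=1)
```

With this toy corpus, “市” occurs three times and every other character once, so it is the only entry when `top_n` is 1.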

Step 202: After loading the filter word list, obtain a document to be indexed.

Step 203: Segment the document to be indexed into unigrams.

Step 204: Traverse the unigrams.

Traversing the unigrams means obtaining each unigram in the order of the document's segmentation result and performing the subsequent filter-word check on each one.

Step 205: Determine, by looking up the filter word list, whether the current unigram is a filter word. If so, perform step 206; otherwise, perform step 208.

Step 206: Combine the unigram with a sequentially adjacent unigram to form a bigram.

Step 207: Index the bigram, then perform step 209.

Step 208: Index the unigram directly.

Step 209: Determine whether all the unigrams have been traversed. If so, end the current flow; otherwise, return to step 204.

In the second embodiment of the search-engine-based document indexing method described above, steps 204 to 208 may be implemented in pseudocode.
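A minimal Python sketch of steps 203 to 208 follows. It is an illustration under stated assumptions, not the pseudocode of the original filing: the index is modeled as an in-memory mapping from term to a set of document IDs, segmentation is per character, and `build_index` is a hypothetical name.

```python
from collections import defaultdict

def build_index(documents, filter_words):
    """Build an inverted index (term -> set of document IDs).

    documents: mapping of document ID to text (assumed in-memory here).
    filter_words: set of high-frequency unigrams that are never indexed alone.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        unigrams = list(text)                          # step 203: segment
        for i, unigram in enumerate(unigrams):         # step 204: traverse
            if unigram in filter_words:                # step 205: filter word?
                if i > 0:                              # steps 206-207
                    index[unigrams[i - 1] + unigram].add(doc_id)
                if i + 1 < len(unigrams):
                    index[unigram + unigrams[i + 1]].add(doc_id)
            else:
                index[unigram].add(doc_id)             # step 208
    return dict(index)

docs = {1: "中國股市", 2: "中國股民"}
index = build_index(docs, {"市"})
```

The resulting index has no entry for “市” on its own, while “股市” points only at document 1.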

An index built by applying the above document indexing embodiments in a search engine contains no entries for the high-frequency words themselves, but it does contain entries for the bigrams formed by combining each high-frequency word with its adjacent words, which safeguards the accuracy of subsequent data queries.

Corresponding to the embodiments of the search-engine-based document indexing method, the present application also provides embodiments of a search-engine-based data query method. The data query embodiments perform queries using the index built by the document indexing embodiments.

Referring to FIG. 3, which is a flowchart of a first embodiment of the search-engine-based data query method of the present application:

Step 301: Obtain data to be queried.

The data to be queried is usually what an Internet user enters through a website front end; the search engine receives this query input.

Step 302: Segment the data to be queried into unigrams.

Step 303: Determine whether each unigram is a filter word. If so, perform step 304; otherwise, perform step 305.

Segmenting the data usually yields several unigrams, so each unigram is checked in order to determine whether it is a filter word.

Step 304: Combine the unigram with at least one sequentially adjacent unigram to form an n-gram, look up the established index with the n-gram, then perform step 306.

Preferably, the unigram is combined with a sequentially adjacent unigram to form a bigram. For a current unigram that has been determined to be a high-frequency word: if it is the first unigram in the data to be queried, it is combined with the following unigram to form a bigram; if it is not the first unigram in the data to be queried, it is combined with either the preceding unigram or the following unigram to form a bigram.
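The query-side rule can be sketched as follows. This is a minimal Python illustration; because the text allows either neighbor for a non-initial filter word, the sketch assumes the preceding one is chosen, which reproduces the “中”, “國”, “股”, “股市” example given earlier. The name `query_terms` is hypothetical.

```python
def query_terms(unigrams, filter_words):
    """Expand a segmented query into the terms used for index lookup."""
    terms = []
    for i, unigram in enumerate(unigrams):
        if unigram not in filter_words:
            terms.append(unigram)                 # ordinary unigram
        elif i == 0:
            if len(unigrams) > 1:                 # first: pair with the next one
                terms.append(unigram + unigrams[1])
        else:
            # Assumption: prefer the preceding neighbor for non-initial filter words.
            terms.append(unigrams[i - 1] + unigram)
    return terms

stock_market = query_terms(list("中國股市"), {"市"})
market = query_terms(list("市場"), {"市"})
```

Each filter word contributes exactly one bigram at query time, whereas indexing a filter word in the middle of a document produces two.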

Step 305: Look up the established index with the unigram.

Step 306: Merge the query results obtained from the index lookups, and end the current flow.

Referring to FIG. 4, which is a flowchart of a second embodiment of the search-engine-based data query method of the present application. This embodiment describes the data query process in detail, taking the case where the n-gram is a bigram as an example:

Step 401: After loading the preset filter word list, obtain data to be queried.

Step 402: Segment the data to be queried into unigrams.

Step 403: Traverse the unigrams.

Traversing the unigrams means obtaining each unigram in the order of the segmentation result of the data to be queried and performing the subsequent filter-word check on each one.

Step 404: Determine, by looking up the filter word list, whether the current unigram is a filter word. If so, perform step 405; otherwise, perform step 407.

Step 405: Combine the unigram with a sequentially adjacent unigram to form a bigram.

Step 406: Look up the established index with the bigram, then perform step 408.

Step 407: Look up the established index with the unigram.

Step 408: Determine whether all the unigrams have been traversed. If so, perform step 409; otherwise, return to step 403.

Step 409: Merge all the query results obtained from the index lookups, and end the current flow.

The results obtained by looking up the index with each segment are merged. The final results to display can then be returned to the user according to a preset condition (for example, returning 20 results). This is the same as in the prior art and is not described further here.

In the second embodiment of the search-engine-based data query method described above, steps 403 to 407 may be implemented in pseudocode.
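A minimal Python sketch of steps 402 to 409 follows. It is an illustration, not the pseudocode of the original filing. Two points are assumptions: a non-initial filter word is paired with its preceding neighbor, and "merging" is modeled as set intersection of the per-term results, since the text does not say how the merge is performed. The name `search` and the dictionary-based index are hypothetical.

```python
def search(query, index, filter_words):
    """Look up each query term and merge the per-term results."""
    unigrams = list(query)                               # step 402: segment
    postings = []
    for i, unigram in enumerate(unigrams):               # step 403: traverse
        if unigram in filter_words:                      # step 404: filter word?
            if i > 0:
                term = unigrams[i - 1] + unigram         # step 405: bigram
            elif len(unigrams) > 1:
                term = unigram + unigrams[1]
            else:
                continue                                 # lone filter word: nothing to look up
        else:
            term = unigram
        postings.append(index.get(term, set()))          # steps 406 / 407: lookup
    if not postings:
        return set()
    # Step 409: merge. Intersection is an assumption made for this sketch.
    return set.intersection(*postings)

index = {"中": {1, 2}, "國": {1, 2}, "股": {1, 2}, "股市": {1}, "民": {2}}
hits = search("中國股市", index, {"市"})
```

With this toy index, document 2 (“中國股民”) is no longer returned for the query “中國股市”, because it has no posting for “股市”.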

When the above data query embodiments are applied in a search engine, the established index contains no high-frequency words on their own, and the high-frequency words have been indexed as bigrams formed with other words. Queries therefore do not waste search engine resources because a high-frequency word was indexed, and results are not made inaccurate because the lookup of a high-frequency word was skipped.

It should also be noted that when the above indexing and query embodiments are applied to a Chinese search engine, the unigrams obtained by segmentation are individual characters. Taking “中國股市” as an example, with “市” as the filter word preset from statistics, the unigrams obtained are “中”, “國”, “股”, and “市”. When the embodiments are applied to a foreign-language search engine, take the English phrase “Chinese Stock Market” as an example and assume that statistics show “Stock” to be a filter word. Segmentation according to the characteristics of English yields the unigrams “Chinese”, “Stock”, and “Market”, and the subsequent indexing and query processes are the same as for Chinese. That is, at indexing time, after “Stock” is filtered, the index entries are “Chinese”, “Chinese Stock”, “Stock Market”, and “Market”; at query time, “Chinese”, “Chinese Stock”, and “Market” can be queried. Details are not repeated here.
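The English example can be checked with a word-level variant of the same rule. The following is a minimal Python sketch, assuming whitespace tokenization and a space as the joiner inside a bigram; the function `expand` and its `both_sides` flag are hypothetical, with `both_sides=True` standing for indexing and `both_sides=False` for querying.

```python
def expand(tokens, filter_words, both_sides, sep=" "):
    """Replace each filter word by bigrams formed with its neighbors.

    both_sides=True  (indexing): pair with the previous and the next token.
    both_sides=False (querying): pair with the previous token, or with the
    next token only when there is no previous one.
    """
    terms = []
    for i, token in enumerate(tokens):
        if token not in filter_words:
            terms.append(token)
            continue
        has_prev = i > 0
        has_next = i + 1 < len(tokens)
        if has_prev:
            terms.append(tokens[i - 1] + sep + token)
        if has_next and (both_sides or not has_prev):
            terms.append(token + sep + tokens[i + 1])
    return terms

tokens = "Chinese Stock Market".split()
index_entries = expand(tokens, {"Stock"}, both_sides=True)
query_entries = expand(tokens, {"Stock"}, both_sides=False)
```

The two lists reproduce the index entries and the query terms listed in the paragraph above.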

Corresponding to the embodiments of the search-engine-based document indexing method and data query method, the present application also provides embodiments of a search-engine-based document index server and data query server.

Referring to FIG. 5, which is a block diagram of a first embodiment of the search-engine-based document index server of the present application:

The document index server includes an obtaining unit 510, a segmentation unit 520, a determining unit 530, and an index unit 540.

The obtaining unit 510 is configured to obtain a document to be indexed. The segmentation unit 520 is configured to segment the document obtained by the obtaining unit 510 into unigrams. The determining unit 530 is configured to determine whether each unigram is a filter word. The index unit 540 is configured to, when the determining unit 530 determines that the unigram is a filter word, combine the unigram with at least one sequentially adjacent unigram to form an n-gram and index the n-gram, and, when the determining unit 530 determines that the unigram is not a filter word, index the unigram directly.

參見圖6，為本申請案之基於搜索引擎的文檔索引伺服器的第二實施例方塊圖：該文檔索引伺服器包括：預置單元610、載入單元620、獲取單元630、分詞單元640、判斷單元650和索引單元660。 6 is a block diagram of a second embodiment of a search engine-based document index server according to the present application: the document index server includes: a preset unit 610, a loading unit 620, an obtaining unit 630, a word segmentation unit 640, The judging unit 650 and the index unit 660.

The preset unit 610 is configured to preset a filter word list. The loading unit 620 is configured to load the filter word list from the preset unit 610. The obtaining unit 630 is configured to obtain a document to be indexed. The word segmentation unit 640 is configured to perform a word segmentation operation on the document obtained by the obtaining unit 630 to obtain unary word segments. The judging unit 650 is configured to determine whether each unary word segment is a filter word. The indexing unit 660 is configured to, when the judging unit 650 determines that a unary word segment is a filter word, combine that unary word segment with at least one sequentially adjacent unary word segment into a multi-word segment and index the multi-word segment; and, when the judging unit 650 determines that the unary word segment is not a filter word, index the unary word segment directly. Preferably, the indexing unit 660 combines the unary word segment with a sequentially adjacent unary word segment into a binary word segment and indexes the binary word segment.

Specifically, the judging unit 650 may include (not shown in FIG. 6): a segment traversal unit configured to traverse the unary word segments; and a filter word lookup unit configured to determine whether each unary word segment is a filter word by looking it up in the filter word list.
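As a rough illustration of how loading the filter word list and judging each segment might fit together, the following Python sketch loads a preset list and then traverses the unary word segments, checking each against it. The names `load_filter_words` and `judge`, and the one-word-per-line list format, are hypothetical and not taken from the specification.

```python
def load_filter_words(lines):
    """Loading step: read a preset filter word list, one word per line (assumed format)."""
    return frozenset(line.strip() for line in lines if line.strip())

def judge(tokens, filter_words):
    """Traverse the unary word segments and flag each one that is a filter word."""
    for tok in tokens:
        # Set membership stands in for the filter word lookup.
        yield (tok, tok in filter_words)
```

Holding the list in a set keeps each lookup at roughly constant cost, which matters because the check runs once per segment of every document.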

The embodiments of the search engine-based data query server provided by the present application perform data queries using the index built by the embodiments of the document indexing server.

Referring to FIG. 7, which is a block diagram of a first embodiment of the search engine-based data query server of the present application: the data query server includes an obtaining unit 710, a word segmentation unit 720, a judging unit 730, a search unit 740, and a merging unit 750.

The obtaining unit 710 is configured to obtain the data to be queried. The word segmentation unit 720 is configured to perform a word segmentation operation on the data obtained by the obtaining unit to obtain unary word segments. The judging unit 730 is configured to determine whether each unary word segment is a filter word. The search unit 740 is configured to, when the judging unit 730 determines that a unary word segment is a filter word, combine that unary word segment with at least one sequentially adjacent unary word segment into a multi-word segment and search the index according to the multi-word segment; and, when the judging unit 730 determines that the unary word segment is not a filter word, search the index according to the unary word segment. The merging unit 750 is configured to merge the query results obtained by the search unit 740 from searching the index.
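The query side can be sketched in the same spirit. The Python example below is only an illustration under stated assumptions: the index is a plain dictionary from term to a set of document ids, segmentation is whitespace splitting, a filter word is combined with its following segment when one exists and otherwise with its preceding segment, and "merging" is taken to mean intersecting the per-term results. The names `query_terms` and `search` are hypothetical.

```python
def query_terms(tokens, filter_words):
    """Turn the query's unary word segments into the terms to look up."""
    terms = []
    for i, tok in enumerate(tokens):
        if tok not in filter_words:
            terms.append(tok)                        # look up the unary segment
        elif i + 1 < len(tokens):
            terms.append(tok + " " + tokens[i + 1])  # combine with the following segment
        elif i > 0:
            terms.append(tokens[i - 1] + " " + tok)  # combine with the preceding segment
    return terms

def search(index, query, filter_words):
    """Look up every term and merge the results (assumption: intersection)."""
    terms = query_terms(query.split(), filter_words)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

Because a filter word is only ever looked up as part of a binary word segment, the long posting list that the high-frequency word would have on its own is never read.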

Referring to FIG. 8, which is a block diagram of a second embodiment of the search engine-based data query server of the present application: the data query server includes a loading unit 810, an obtaining unit 820, a word segmentation unit 830, a judging unit 840, a search unit 850, and a merging unit 860.

The loading unit 810 is configured to load a preset filter word list. The obtaining unit 820 is configured to obtain the data to be queried. The word segmentation unit 830 is configured to perform a word segmentation operation on the data obtained by the obtaining unit 820 to obtain unary word segments. The judging unit 840 is configured to determine whether each unary word segment is a filter word. The search unit 850 is configured to, when the judging unit 840 determines that a unary word segment is a filter word, combine that unary word segment with at least one sequentially adjacent unary word segment into a multi-word segment and search the index according to the multi-word segment; and, when the judging unit 840 determines that the unary word segment is not a filter word, search the index according to the unary word segment. Preferably, the search unit 850 combines the unary word segment with a sequentially adjacent unary word segment into a binary word segment and searches the index according to the binary word segment. The merging unit 860 is configured to merge the query results obtained by the search unit 850 from searching the index.

Specifically, the judging unit 840 may include (not shown in FIG. 8): a segment traversal unit configured to traverse the unary word segments; and a filter word lookup unit configured to determine whether each unary word segment is a filter word by looking it up in the filter word list.

As can be seen from the above description of the embodiments, during search engine indexing and querying, the embodiments of the present application combine high-frequency unary word segments into a limited number of multi-word segments, thereby converting queries for high-frequency words into queries for low-frequency terms. This reduces the load on the search engine and improves its query performance while still returning correct query results.

As can be seen from the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments, or in certain parts of the embodiments, of the present application.

The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant details, refer to the corresponding description of the method embodiments.

The present application can be used in many general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.

The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

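Although the present application has been described by way of embodiments, those of ordinary skill in the art will appreciate that many variations and modifications of the present application are possible without departing from its spirit, and it is intended that the appended claims cover such variations and modifications without departing from the spirit of the present application.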

510‧‧‧Obtaining unit

520‧‧‧Word segmentation unit

530‧‧‧Judging unit

540‧‧‧Indexing unit

610‧‧‧Preset unit

620‧‧‧Loading unit

630‧‧‧Obtaining unit

640‧‧‧Word segmentation unit

650‧‧‧Judging unit

660‧‧‧Indexing unit

710‧‧‧Obtaining unit

720‧‧‧Word segmentation unit

730‧‧‧Judging unit

740‧‧‧Search unit

750‧‧‧Merging unit

810‧‧‧Loading unit

820‧‧‧Obtaining unit

830‧‧‧Word segmentation unit

840‧‧‧Judging unit

850‧‧‧Search unit

860‧‧‧Merging unit

To describe the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some of the embodiments recorded in the present application; those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of a first embodiment of the search engine-based document indexing method of the present application; FIG. 2 is a flowchart of a second embodiment of the search engine-based document indexing method of the present application; FIG. 3 is a flowchart of a first embodiment of the search engine-based data query method of the present application; FIG. 4 is a flowchart of a second embodiment of the search engine-based data query method of the present application; FIG. 5 is a block diagram of a first embodiment of the search engine-based document indexing server of the present application; FIG. 6 is a block diagram of a second embodiment of the search engine-based document indexing server of the present application; FIG. 7 is a block diagram of a first embodiment of the search engine-based data query server of the present application; FIG. 8 is a block diagram of a second embodiment of the search engine-based data query server of the present application.

Claims

A search engine-based document indexing method, comprising: obtaining a document to be indexed, and performing a word segmentation operation on the document to obtain unary word segments; determining whether each unary word segment is a filter word, wherein the filter words include high-frequency words that the search engine ignores and does not look up during a query; if the unary word segment is a filter word, combining the unary word segment with at least one sequentially adjacent unary word segment into a multi-word segment, and indexing the multi-word segment; and if the unary word segment is not a filter word, indexing the unary word segment directly.

The method of claim 1, further comprising: presetting a list of filter words.

The method of claim 2, wherein, before obtaining the document to be indexed, the method further comprises: loading the filter word list.

The method of claim 1, wherein combining the unary word segment with at least one sequentially adjacent unary word segment into a multi-word segment comprises: combining the unary word segment with a sequentially adjacent unary word segment into a binary word segment.

The method of claim 4, wherein combining the unary word segment with a sequentially adjacent unary word segment into a binary word segment comprises: when the unary word segment is the first unary word segment in the document, combining the unary word segment with its following unary word segment into a binary word segment; when the unary word segment is the last unary word segment in the document, combining the unary word segment with its preceding unary word segment into a binary word segment; and when the unary word segment is neither the first nor the last unary word segment in the document, combining the unary word segment with its preceding unary word segment and with its following unary word segment to form binary word segments.

A search engine-based data query method, characterized in that the method uses an index built by the document indexing method of claim 1, and comprises: obtaining data to be queried, and performing a word segmentation operation on the data to obtain unary word segments; determining whether each unary word segment is a filter word, wherein the filter words include high-frequency words that the search engine ignores and does not look up during a query; if the unary word segment is a filter word, combining the unary word segment with at least one sequentially adjacent unary word segment into a multi-word segment, and searching the index according to the multi-word segment; if the unary word segment is not a filter word, searching the index according to the unary word segment; and merging the query results obtained by searching the index.

The method of claim 6, wherein, before obtaining the data to be queried, the method further comprises: loading a preset filter word list.

The method of claim 6, wherein combining the unary word segment with at least one sequentially adjacent unary word segment into a multi-word segment comprises: combining the unary word segment with a sequentially adjacent unary word segment into a binary word segment.

The method of claim 8, wherein combining the unary word segment with a sequentially adjacent unary word segment into a binary word segment comprises: when the unary word segment is the first unary word segment in the data, combining the unary word segment with its following unary word segment into a binary word segment; and when the unary word segment is not the first unary word segment in the data, combining the unary word segment with its preceding unary word segment or its following unary word segment into a binary word segment.

A search engine-based document indexing server, comprising: an obtaining unit, configured to obtain a document to be indexed; a word segmentation unit, configured to perform a word segmentation operation on the document obtained by the obtaining unit to obtain unary word segments; a judging unit, configured to determine whether each unary word segment is a filter word, wherein the filter words include high-frequency words that the search engine ignores and does not look up during a query; and an indexing unit, configured to, when the judging unit determines that the unary word segment is a filter word, combine the unary word segment with at least one sequentially adjacent unary word segment into a multi-word segment and index the multi-word segment, and, when the judging unit determines that the unary word segment is not a filter word, index the unary word segment directly.

The server of claim 10, further comprising: a preset unit, configured to preset a filter word list.

The server of claim 11, further comprising: a loading unit, configured to load the filter word list before the obtaining unit acquires the document to be indexed.

A search engine-based data query server, characterized in that the server uses an index built by the document indexing server of claim 10, and comprises: an obtaining unit, configured to obtain data to be queried; a word segmentation unit, configured to perform a word segmentation operation on the data obtained by the obtaining unit to obtain unary word segments; a judging unit, configured to determine whether each unary word segment is a filter word, wherein the filter words include high-frequency words that the search engine ignores and does not look up during a query; a search unit, configured to, when the judging unit determines that the unary word segment is a filter word, combine the unary word segment with at least one sequentially adjacent unary word segment into a multi-word segment and search the index according to the multi-word segment, and, when the judging unit determines that the unary word segment is not a filter word, search the index according to the unary word segment; and a merging unit, configured to merge the query results obtained by the search unit from searching the index.

The server of claim 13, further comprising: a loading unit, configured to load a preset filter word list before the obtaining unit acquires the data to be queried.