WO2015010509A1 - One-dimensional linear space-based method for implementing Trie tree dictionary search

Info

Publication number: WO2015010509A1
Application number: PCT/CN2014/080179
Authority: WO
Inventors: 贾西贝; 王国印
Original assignee: 深圳市华傲数据技术有限公司
Priority date: 2013-07-03
Filing date: 2014-06-18
Publication date: 2015-01-29
Also published as: CN103365992B; CN103365992A

Abstract

A one-dimensional linear space-based method for implementing Trie tree dictionary search: Trie tree dictionary data is generated in a one-dimensional linear space; the entry key to be queried is determined on the basis of user input; and a query is carried out on the basis of the current state of the entry key. With the Trie tree dictionary data constructed in a one-dimensional linear space, dictionary loading and search speeds are increased, and all prefix terms of an entry can be retrieved rapidly. In addition, the method solves the conflict problem that arises in conventional Trie tree dictionary search when a new state is inserted during construction of the Trie tree, and thereby prevents the movement of large amounts of dictionary data that such conflicts cause.

Description

A dictionary retrieval method based on one-dimensional linear space to implement Trie tree

The invention relates to a dictionary retrieval method, and in particular to a dictionary retrieval method that implements a Trie tree based on a one-dimensional linear space.

Background

In the fields of information retrieval and natural language processing, and especially in dictionary-based applications, the dictionary is generally very large; this is particularly true of the inverted index of a search engine. Searching such massive dictionaries is currently implemented with an index data structure. Commonly used index structures include linear index tables, inverted tables, hash tables, and search trees. For a record whose key has length n, in a dictionary of size N (where N ≫ n), the time complexity of searching each index structure is analyzed as follows:

A linear index structure or an inverted table generally stores the records of a dictionary sequentially. A search generally traverses the records one by one, so the time complexity of each search is O(N) × l(n), where l(n) is the time taken to compare the keys of two records. This method can be improved by keeping the records of the dictionary ordered by key and using binary search, which gives a time complexity of O(log N) × l(n).
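The ordered-dictionary improvement can be illustrated with a short Java sketch. The class and method names are assumptions made for illustration and do not come from the application; `Collections.binarySearch` performs on the order of log N key comparisons, each of which costs up to l(n).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortedDictionaryLookup {
    // Returns the position of key in the sorted list, or -1 if it is absent.
    static int find(List<String> sortedKeys, String key) {
        int idx = Collections.binarySearch(sortedKeys, key);
        return idx >= 0 ? idx : -1;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "ab", "b"); // already sorted by key
        System.out.println(find(keys, "ab"));
        System.out.println(find(keys, "c"));
    }
}
```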

In dictionary-based natural language processing applications, such as dictionary-based Chinese word segmentation, dictionary-based word-to-speech conversion (Chinese characters converted to pinyin), dictionary-based named entity recognition, and dictionary-based annotation (including part-of-speech tagging and semantic tagging), the core operation is a large number of dictionary queries. To meet the requirements of such applications, an efficient dictionary search method is needed. Dictionary query methods based on a Trie tree stored in a two-dimensional array already exist, but in such methods the insertion of a new state during construction of the Trie tree may cause conflicts, and resolving those conflicts requires moving a large amount of data.

Summary of the invention

The present invention is directed to solving one of the above drawbacks.

Therefore, the present invention provides a dictionary search method that implements a Trie tree based on a one-dimensional linear space: Trie tree dictionary data is generated in a one-dimensional linear space; the entry key to be queried is determined according to user input; and the query is carried out according to the current state of the entry key. With the Trie tree dictionary data held in a one-dimensional linear space, dictionary loading and retrieval are faster, and all prefix words of an entry can be retrieved quickly. In addition, the method resolves the conflicts that arise in conventional Trie tree dictionary retrieval when a new state is inserted during construction of the Trie tree, and so avoids the movement of large amounts of dictionary data that such conflicts cause.

To this end, the present invention discloses a dictionary retrieval method for implementing a Trie tree based on a one-dimensional linear space, the method comprising the steps of: generating dictionary data of a one-dimensional linear space Trie tree; determining the entry key to be queried according to user input; and implementing the query according to the current state of the entry key.

Preferably, the key of the dictionary data is converted into a node and stored in a one-dimensional array, and the value of the one-dimensional array is used to identify whether the base value is unique.

Preferably, in the Trie tree of the one-dimensional linear space, all terminal nodes are changed into non-terminal nodes, a leaf node is added after the terminal node, and a check value of the leaf node is assigned to its storage location.

Preferably, the leaf node further includes: a base value of the leaf node to identify whether it is a terminal node.

Preferably, the query comprises the steps of: pointing the current node to the root node; making a state transition according to the currently input character to obtain the position of the direct successor state; verifying the predecessor of the current state to determine which state the current state was transferred from; and obtaining the value corresponding to the entry key.

Preferably, a single query of the entry key returns the results for all of its prefix words.

The invention provides a dictionary retrieval method that implements a Trie tree based on a one-dimensional linear space: Trie tree dictionary data is generated in a one-dimensional linear space; the entry key to be queried is determined according to user input; and the query is carried out according to the current state of the entry key. With the Trie tree dictionary data held in a one-dimensional linear space, dictionary loading and retrieval are faster, and all prefix words of an entry can be retrieved quickly. At the same time, the base value is adjusted during construction of the Trie tree so that none of a node's direct successors conflict, which avoids the backtracking caused by data movement or storage space reallocation.

It is to be understood that the foregoing general description

Brief description of the drawings

FIG. 1 is a flow chart of a method for implementing Trie tree dictionary search based on a one-dimensional linear space according to an embodiment of the present invention.

FIG. 2 is a flow chart of implementing a query according to the current state of the entry key according to an embodiment of the present invention.

Detailed description

The present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. A dictionary search method for implementing a Trie tree based on a one-dimensional linear space is provided in an embodiment of the present invention.

FIG. 1 shows a flowchart of the method for implementing Trie tree dictionary retrieval based on a one-dimensional linear space according to an embodiment of the present invention.

Step S110: Generate dictionary data of the one-dimensional linear space Trie tree.

After the dictionary data is obtained, generating the one-dimensional linear space Trie tree dictionary data includes the following specific steps:

Step S111: Sort all entries and their attribute information lexicographically by key, and merge the values that share the same key, so that no key is duplicated.

Iterate over the elements stored in _keys and _values, sort _keys and _values in key order, and merge the values of identical keys; the ordered key sequence and value sequence are stored in List<String> keys and Collection<String> attributes.
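The sort-and-merge step can be sketched in Java as follows. This is a minimal illustration under stated assumptions: the class name, the method name, and the "," separator used to join the values of duplicate keys are all hypothetical, since the application does not say how merged values are represented. A `TreeMap` keeps its keys in natural `String` order, which compares UTF-16 code units and therefore agrees with the `charAt`-based codes used when the tree is built.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

class SortAndMerge {
    // Sorts entries by key and joins the values of duplicate keys.
    // The "," separator is an assumption made for this sketch.
    static TreeMap<String, String> sortAndMerge(List<String> rawKeys, List<String> rawValues) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (int i = 0; i < rawKeys.size(); i++) {
            merged.merge(rawKeys.get(i), rawValues.get(i), (a, b) -> a + "," + b);
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeMap<String, String> merged = sortAndMerge(
                Arrays.asList("ab", "a", "ab"), Arrays.asList("v1", "v2", "v3"));
        List<String> keys = new ArrayList<>(merged.keySet());       // ordered, no duplicates
        List<String> attributes = new ArrayList<>(merged.values()); // same order as keys
        System.out.println(keys + " " + attributes);
    }
}
```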

The pseudo code for this step is as follows:

    array[0] = 1;                   // base value of the start state
    Node root_node = new Node();
    root_node.left = 0;             // first key index covered by the root (start)
    root_node.right = keys.size();  // one past the last key index (end)
    root_node.depth = 0;
    ArrayList<Node> siblings = new ArrayList<Node>();
    fetch(root_node, siblings);     // collect the direct successors of the root
    insert(siblings);               // place them, then recurse into each successor
    return keys.size();

Step S112: Define the starting state, numbered 0, whose information value is [code = 0, depth = 0, start = 0, end = N], where N is the size of the dictionary, that is, the number of keys.
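The state information [code, depth, start, end] can be pictured as a small Java class. This is only a sketch: the field names `left` and `right` follow the root-node listing and are assumed here to be the same quantities as start and end, and the factory method `startState` is a hypothetical helper.

```java
class Node {
    int code;   // character code of the transition into this state (0 for the start state)
    int depth;  // number of characters consumed on the path from the start state
    int left;   // first key index covered by this state ("start" in Step S112)
    int right;  // one past the last key index covered ("end" in Step S112)

    // Builds the start state [code = 0, depth = 0, start = 0, end = N].
    static Node startState(int dictionarySize) {
        Node root = new Node();
        root.code = 0;
        root.depth = 0;
        root.left = 0;
        root.right = dictionarySize;
        return root;
    }
}
```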

Step S113: Put the start state into position 0 of the double array and set base[0] = 1 (array[2*0] = array[0] = 1), marking the base value 1 as already occupied (this ensures that the base values of all states are unique), and set check[0] = 0 (array[2*0+1] = array[1] = 0).
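The index arithmetic in this step suggests that base and check are interleaved in a single array, with base[s] at array[2*s] and check[s] at array[2*s+1]. A minimal Java sketch of that layout, with hypothetical class and accessor names, might look like this:

```java
class InterleavedArray {
    final int[] array;

    InterleavedArray(int states) {
        array = new int[2 * states];
        setBase(0, 1);   // start state: base[0] = 1
        setCheck(0, 0);  // start state: check[0] = 0
    }

    int base(int s) { return array[2 * s]; }
    int check(int s) { return array[2 * s + 1]; }
    void setBase(int s, int value) { array[2 * s] = value; }
    void setCheck(int s, int value) { array[2 * s + 1] = value; }
}
```

Keeping both values of a state in adjacent slots means one state occupies one contiguous pair in the one-dimensional space.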

Step S114: The initial state is taken as the current state.

Step S115: Obtain the information of all direct successor states of the current state. If the list of direct successor nodes is empty, the current node is the terminal node "$", which indicates that the key formed from the starting node to the current node is exactly one complete entry in the dictionary; the base value of the current node (terminal node) is then set to the negation of the current key's sequence number in the dictionary, and processing of this path is complete. Otherwise, step S116 is performed. The pseudo code for this step is as follows:

    int fetch(Node parent, List<Node> siblings) {
        // Get all direct successor nodes of the current node parent and store them in siblings.
        // Returns the number of direct successors; 0 means the current node is a terminal node.
        int prev = 0;
        for (int i = parent.start; i < parent.end; i++) {
            if (keys.get(i).length() < parent.depth) {
                // This key has already been fully processed, so skip it. This handles
                // the terminal node, which has no successor nodes.
                continue;
            }
            String tmp = keys.get(i);
            int cur = 0;
            if (keys.get(i).length() != parent.depth) {
                cur = (int) tmp.charAt(parent.depth) + 1;
            }
            if (cur != prev || siblings.isEmpty()) {
                Node tmp_node = new Node();
                tmp_node.depth = parent.depth + 1;
                tmp_node.code = cur;
                tmp_node.start = i;
                if (!siblings.isEmpty()) {
                    siblings.get(siblings.size() - 1).end = i;
                }
                siblings.add(tmp_node);
            }
            prev = cur;
        }
        if (!siblings.isEmpty()) {
            siblings.get(siblings.size() - 1).end = parent.end;
        }
        return siblings.size();
    }

Step S116: Find a suitable base value for the current node such that the base value is unique and none of the direct successor nodes collides with a node already stored in the Trie tree. Insert the direct successor nodes of the current node into the Trie tree in turn, setting the check value of each to the base value of the current node; then take each direct successor node in turn as the current node and return to step S115.

The pseudo code for this step is as follows:

    int insert(List<Node> siblings) {
        // Find an eligible base value for the current node, insert all of its direct
        // successor nodes, and return the base value of the current node.

        // Find a suitable unused base value for the current node: it must be unique and
        // non-zero, and no direct successor node may conflict with a node already stored
        // in the Trie tree. To be compatible with the root node of the Trie tree, base
        // values start from 1.

        used[base] = 1; // mark this base value as used
        for (int i = 0; i < siblings.size(); i++) {
            // Set the check value of each direct successor of the current node to the
            // base value of the current node; this completes the insert operation.
            array[((base + siblings.get(i).code) << 1) + 1] = base;
        }
        for (int i = 0; i < siblings.size(); i++) {
            // Take each direct successor in turn as the current node, then recursively
            // fetch its direct successors and insert them.
            List<Node> new_siblings = new ArrayList<Node>();
            if (fetch(siblings.get(i), new_siblings) == 0) {
                // The current node is a terminal node (it has no direct successors):
                // set its base value to the negation of the key's dictionary number.
                array[(base + siblings.get(i).code) << 1] = -siblings.get(i).left - 1;
            } else {
                int ins = insert(new_siblings);
                // Store the base value found for the successor (ins) as its base.
                array[(base + siblings.get(i).code) << 1] = ins;
            }
        }
        return base;
    }
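The search for a suitable base value is described only in prose. One possible way to carry it out is sketched below in Java; it is an assumption for illustration, not the search used in the application. It scans upward from 1 and accepts the first value that is unused and whose target slots base + code are all free, treating a check value of 0 as "free" because every base value is at least 1. The helper name `findBase`, the boolean `used` array, and the fixed-capacity array are hypothetical simplifications.

```java
class BaseSearch {
    // Returns the smallest base >= 1 that is not yet used and whose target
    // slots base + code are all free (check == 0) in the interleaved array.
    static int findBase(int[] array, boolean[] used, int[] codes) {
        for (int base = 1; base < used.length; base++) {
            if (used[base]) continue;
            boolean allFree = true;
            for (int code : codes) {
                int checkIndex = ((base + code) << 1) + 1;
                if (checkIndex >= array.length || array[checkIndex] != 0) {
                    allFree = false;
                    break;
                }
            }
            if (allFree) return base;
        }
        throw new IllegalStateException("array too small for this sketch");
    }

    public static void main(String[] args) {
        int[] array = new int[2 * 128];
        boolean[] used = new boolean[128];
        array[0] = 1;              // base[0] = 1 for the start state
        used[1] = true;            // base value 1 is taken by the start state
        array[(99 << 1) + 1] = 1;  // check[99] = 1: child 'a' (code 98) of the start state
        // Successors of state "a" for the keys ["a", "ab"]: code 0 (end) and code 99 ('b' + 1).
        System.out.println(findBase(array, used, new int[] {0, 99}));
    }
}
```

A production implementation would also grow the array on demand rather than throw when capacity runs out.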

In the Trie tree of the one-dimensional linear space, all terminal nodes are changed into non-terminal nodes, a leaf node is added after the terminal node, and the check value of the leaf node is assigned to its storage location.

All terminal nodes are changed into non-terminal nodes, and a leaf node is added after each of them. The check value of the leaf node is set to the leaf's own storage location. The base value of the leaf node is set to the negation of the position, within the whole lexicographically ordered set of entries, of the complete entry formed by the path from the initial node (node 0) to the current leaf node. The sign of a node's base value therefore identifies whether the node is a leaf: leaf nodes are exactly those whose base value is less than 0.
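The sign convention for leaf nodes can be made concrete with a small Java sketch. It follows the insert listing, which stores -left - 1 rather than a plain negation so that the entry at position 0 still yields a negative base value; the class and method names are hypothetical.

```java
class LeafEncoding {
    // Encodes the sorted-dictionary position of a key as a negative base value.
    static int encode(int keyIndex) { return -keyIndex - 1; }

    // A node is a leaf exactly when its base value is negative.
    static boolean isLeaf(int base) { return base < 0; }

    // Recovers the sorted-dictionary position from a leaf's base value.
    static int decode(int base) { return -base - 1; }

    public static void main(String[] args) {
        int base = encode(0);
        System.out.println(base + " " + isLeaf(base) + " " + decode(base));
    }
}
```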

Step S120: Determine the entry key to be queried according to user input.

After the Trie tree is built, the next step is to query whether the entry entered by the user is present in the Trie tree, that is, whether it forms a complete path from the root node to a leaf node.

Step S130: Implement a query according to the current state of the entry key.

FIG. 2 shows the process of implementing a query according to the current state of the entry key. The specific steps are as follows:

Step S131: Point the current node to the root node.

Step S132: Perform a state transition according to the currently input character to obtain the position of the direct successor state.

Step S133: Verify the predecessor of the current state to determine which state the current state was transferred from.

Step S134: Obtain the value corresponding to the entry key.

When the current state is s, the input character is c, and the next state is t, the constraints of the query process in this method are modified to:

check[base[s] + c] = base[s]; base[s] + c = t; and the base[s] value of each state is unique.

If the current state s can transfer to a leaf node t, the constraints are base[s] = t and t = check[t], because the leaf is entered with code 0 and its check value equals its own storage location.

In addition, base[t] < 0, and the value of base[t] is the negation of the sequence number, among the lexicographically ordered entries, of the entry formed by the path from the initial node of the DFA to the current leaf.

In the embodiment of the present invention, all prefix words of the entry key can be obtained, and each retrieved result can be saved in a TrieResult object whose variables are as follows: length indicates the length of the current key; index is the storage number of the current key in the dictionary (recovered from the leaf's base value as -base - 1), which is also the storage location of the attribute corresponding to the current key.
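The query constraints and the prefix retrieval can be sketched together in Java. This is an illustration under stated assumptions rather than the application's implementation: the class `TrieQuery`, its methods, and the fields of `TrieResult` are hypothetical, the array is filled by hand for the sorted keys ["a", "ab"] following the construction listings, and characters are mapped to codes as charAt + 1 as in the fetch listing.

```java
import java.util.ArrayList;
import java.util.List;

class TrieQuery {
    // One retrieved result: the matched prefix length and the key's sorted-dictionary index.
    static final class TrieResult {
        final int length, index;
        TrieResult(int length, int index) { this.length = length; this.index = index; }
    }

    static int base(int[] a, int s) { return a[s << 1]; }
    static int check(int[] a, int s) { return a[(s << 1) + 1]; }

    // Returns the key index stored in the leaf below a state whose base value is b, or -1.
    static int leafIndex(int[] a, int b) {
        int t = b; // the leaf is entered with code 0, so t = base[s] + 0
        boolean inRange = (t << 1) + 1 < a.length;
        if (inRange && check(a, t) == b && base(a, t) < 0) return -base(a, t) - 1;
        return -1;
    }

    // Walks the key once and reports every dictionary entry that is a prefix of it.
    static List<TrieResult> commonPrefixSearch(int[] a, String key) {
        List<TrieResult> results = new ArrayList<>();
        int b = base(a, 0);                                    // S131: start at the root
        for (int i = 0; i < key.length(); i++) {
            int idx = leafIndex(a, b);
            if (idx >= 0) results.add(new TrieResult(i, idx));
            int t = b + key.charAt(i) + 1;                     // S132: t = base[s] + c
            if ((t << 1) + 1 >= a.length || check(a, t) != b)  // S133: check[t] must equal base[s]
                return results;
            b = base(a, t);
        }
        int idx = leafIndex(a, b);                             // S134: value of the full key
        if (idx >= 0) results.add(new TrieResult(key.length(), idx));
        return results;
    }

    // Returns the sorted-dictionary index of key, or -1 if key is not an entry.
    static int exactMatch(int[] a, String key) {
        for (TrieResult r : commonPrefixSearch(a, key))
            if (r.length == key.length()) return r.index;
        return -1;
    }

    // Hand-filled interleaved array for the sorted keys ["a", "ab"].
    static int[] demoArray() {
        int[] a = new int[2 * 128];
        a[0] = 1;                                // base[0] = 1, check[0] = 0 (start state)
        a[99 << 1] = 2;  a[(99 << 1) + 1] = 1;   // state "a" at 1 + ('a' + 1) = 99
        a[2 << 1] = -1;  a[(2 << 1) + 1] = 2;    // leaf for "a": base = -0 - 1
        a[101 << 1] = 3; a[(101 << 1) + 1] = 2;  // state "ab" at 2 + ('b' + 1) = 101
        a[3 << 1] = -2;  a[(3 << 1) + 1] = 3;    // leaf for "ab": base = -1 - 1
        return a;
    }

    public static void main(String[] args) {
        int[] a = demoArray();
        for (TrieResult r : commonPrefixSearch(a, "abc"))
            System.out.println("length=" + r.length + " index=" + r.index);
    }
}
```

Because the leaf check is performed at every step of a single walk, the prefixes "a" and "ab" of the query "abc" are both found without restarting from the root.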

The query speed of the dictionary search method that implements the Trie tree based on the one-dimensional linear space is 18.3 MB/s. The present invention therefore provides a dictionary search method that implements a Trie tree based on a one-dimensional linear space: Trie tree dictionary data is generated in a one-dimensional linear space; the entry key to be queried is determined according to user input; and the query is carried out according to the current state of the entry key. By constructing the Trie tree dictionary data in a one-dimensional linear space, dictionary loading and retrieval are made faster, and all prefix words of an entry can be retrieved quickly. At the same time, the base value is adjusted during construction of the Trie tree so that none of a node's direct successors conflict, which avoids the backtracking caused by data movement or storage space reallocation.

Claims

1. A dictionary retrieval method for implementing a Trie tree based on a one-dimensional linear space, the method comprising:

generating dictionary data of a one-dimensional linear space Trie tree;

determining the entry key to be queried according to user input; and

implementing the query according to the current state of the entry key.

2. The method according to claim 1, wherein the key of the dictionary data is converted into a node and stored in a one-dimensional array, and the value of the one-dimensional array is used to identify whether the base value is unique.

3. The method according to claim 1, wherein in the one-dimensional linear space Trie tree, all terminal nodes are changed into non-terminal nodes, a leaf node is added after each terminal node, and the check value of the leaf node is set to its storage location.

4. The method according to claim 1 or 3, wherein the leaf node further comprises: a base value of the leaf node that identifies whether it is a terminal node.

5. The method according to claim 1, wherein the query comprises the following steps:

pointing the current node to the root node;

making a state transition according to the currently input character to obtain the position of the direct successor state;

verifying the predecessor of the current state to determine which state the current state was transferred from; and

obtaining the value corresponding to the entry key.

6. The method according to claim 1 or claim 5, wherein the query comprises:

a single query based on the entry key returns the results for all of its prefix words.