[go: up one dir, main page]

WO2015010509A1 - One-dimensional liner space-based method for implementing trie tree dictionary search - Google Patents

One-dimensional liner space-based method for implementing trie tree dictionary search Download PDF

Info

Publication number
WO2015010509A1
WO2015010509A1 PCT/CN2014/080179 CN2014080179W WO2015010509A1 WO 2015010509 A1 WO2015010509 A1 WO 2015010509A1 CN 2014080179 W CN2014080179 W CN 2014080179W WO 2015010509 A1 WO2015010509 A1 WO 2015010509A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
trie tree
dictionary
key
value
Prior art date
Application number
PCT/CN2014/080179
Other languages
French (fr)
Chinese (zh)
Inventor
贾西贝
王国印
Original Assignee
深圳市华傲数据技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市华傲数据技术有限公司 filed Critical 深圳市华傲数据技术有限公司
Publication of WO2015010509A1 publication Critical patent/WO2015010509A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9027Trees

Definitions

  • the invention relates to a dictionary retrieval method, in particular to a dictionary retrieval method based on a one-dimensional linear space to implement a Trie tree. Background technique
  • index structures include linear index tables, inverted tables, hash tables, and search trees.
  • a linear index structure or an inverted table generally uses a sequential structure to store records in a dictionary.
  • the search for records in a dictionary generally traverses each record sequentially, so the time complexity of each search is 0 ( ⁇ ) ⁇ 1( ⁇ ) (where l(n) is the time it takes for the two recorded keywords to be compared).
  • the improvement of this method is that each record of the dictionary is ordered by key (key), and it is searched by half when searching, and its time complexity is 0 (logN) X l(n).
  • dictionary-based natural language processing applications such as dictionary-based Chinese word segmentation, dictionary-based word-to-speech conversion (Chinese characters converted to pinyin), dictionary-based named entity recognition, dictionary-based annotation (including part-of-speech tagging and semantic tagging, etc.)
  • dictionary-based natural language processing applications such as dictionary-based Chinese word segmentation, dictionary-based word-to-speech conversion (Chinese characters converted to pinyin), dictionary-based named entity recognition, dictionary-based annotation (including part-of-speech tagging and semantic tagging, etc.)
  • dictionary-based annotation including part-of-speech tagging and semantic tagging, etc.
  • the present invention is directed to solving one of the above drawbacks.
  • the present invention provides a dictionary search method for implementing a Trie tree based on a one-dimensional linear space, by generating a Trio tree dictionary data of a one-dimensional linear space; determining a query key to be queried according to user input; and implementing according to the current state of the entry key Inquire.
  • the dictionary data of the Trie tree in the one-dimensional linear space the dictionary loading and retrieval speed is improved, and all the prefix words of the entry can be quickly retrieved.
  • the dictionary search based on the one-dimensional linear space to implement the Trie tree can be solved.
  • the traditional Trie tree dictionary data retrieval conflicts in the construction process of the Tire tree due to the insertion of a new state, and can avoid the movement of a large amount of dictionary data caused by the conflict.
  • the present invention discloses a dictionary retrieval method for implementing a Trie tree based on a one-dimensional linear space, the method comprising the steps of: generating dictionary data of a one-dimensional linear space Trie tree; determining a query key to be queried according to user input; The current state of the entry key implements the query.
  • the key of the dictionary data is converted into a node and stored in a one-dimensional array, and the value of the one-dimensional array is used to identify whether the base value is unique.
  • all terminal nodes are changed into non-terminal nodes, a leaf node is added after the terminal node, and a check value of the leaf node is assigned to its storage location.
  • the leaf node further includes: a base value of the leaf node to identify whether it is a terminal node.
  • the query comprises the steps of: pointing the current node to the root node; making a state transition according to the currently input character, obtaining a position of the direct successor state; verifying the precursor of the current state, determining which state the current state is. Transfer from; get the value of the value corresponding to the entry key.
  • the query comprises: a query of the entry key can obtain the result of all of its prefix words.
  • the invention provides a dictionary retrieval method based on a one-dimensional linear space to implement a Trie tree, by generating a Trio tree dictionary data of a one-dimensional linear space; determining a to-be-queried key according to a user input; and implementing a query according to the current state of the entry key .
  • the dictionary loading and retrieval speed is improved, and all the prefix words of the entry can be quickly retrieved.
  • the base value is adjusted during the construction of the Trie tree so that all its direct successors do not conflict, thus avoiding the backtracking problem of data movement or storage space allocation.
  • FIG. 1 is a flow chart of a method for implementing a Trie tree dictionary search based on a one-dimensional linear space according to an embodiment of the present invention.
  • FIG. 2 is a flow chart of implementing a query according to the current state of the entry key according to an embodiment of the present invention. detailed description
  • a dictionary search method for implementing a Trie tree based on a one-dimensional linear space is provided in an embodiment of the present invention.
  • FIG. 1 it is a flowchart of a method for implementing a Trie tree dictionary retrieval method based on a one-dimensional linear space according to an embodiment of the present invention.
  • Step S110 Generate dictionary data of the one-dimensional linear space Trie tree.
  • Obtaining dictionary data, to generate one-dimensional linear space Trib tree dictionary data includes the following specific steps:
  • Step S111 Sort all the terms and attribute information in the lexicographic order with the key as the center, and merge the values having the same key value, so as to ensure that the key does not have a duplicate.
  • Root-node, left 0;
  • Root-node, depth 0;
  • Step S114 The initial state is taken as the current state.
  • Step S115 Obtain information of all direct successor states of the current state. If the direct successor node list is empty, that is, the current node is the terminal node "$", indicating that the key formed from the starting node to the current node is exactly A complete entry in the dictionary, the base value of the current node (terminal node) is assigned to the opposite of the current key dictionary sequence number, and the execution is completed on the path; otherwise, step S116 is performed.
  • the pseudo code for this step is as follows:
  • Step S116 Find a suitable base value for the current node, so that the base value is unique, and does not cause all direct successor nodes to collide with the nodes stored in the existing Trie tree.
  • the direct successor node of the current node is inserted into the Trie tree, and the check value is assigned to the base value of the current node, and then the direct successor node of the current node is sequentially used as the current node, and the process proceeds to step S. 115.
  • base value starts from 1;
  • Used[base] 1 ; II identifies that this value has been used
  • the current node is the terminal node (the direct successor is empty)
  • Step S120 Enter a query key to be queried according to the user.
  • the next step is to query whether the entry entered by the user has a Trie tree, that is, whether it is a complete path from the root node to the leaf node.
  • Step S130 Implement a query according to the current state of the entry key.
  • Step S131 Point the current node to the root node.
  • Step S132 Perform a state transition according to the currently input character to obtain a position of the direct successor state.
  • Step S133 Verify the precursor of the current state, and determine which state the current state is transferred from.
  • Step S134 Obtain the value of the value corresponding to the entry key.
  • Base[t] ⁇ 0 and the value of base[t] is the initial node of DFA.
  • the query speed of the dictionary search method based on the one-dimensional linear space to implement the Trie tree is 18.3 MB/s. Therefore, the present invention provides a dictionary search method for implementing a Trie tree based on a one-dimensional linear space, by generating a Trio tree dictionary data of a one-dimensional linear space; determining a query key to be queried according to user input; and implementing according to the current state of the entry key Inquire.
  • the dictionary loading and retrieval speed is improved, and all the prefix words of the entry can be quickly retrieved.
  • the base value is adjusted during the construction of the Trie tree, so that Conflicts do not occur with all direct successors, thus avoiding backtracking issues with data movement or storage space allocation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A one-dimensional liner space-based method for implementing trie tree dictionary search: one-dimensional liner space trie tree dictionary data is generated; a to-be-queried entry key is determined on the basis of a user input; and, a query is implemented on the basis of a current state of the entry key. In the trie tree dictionary data that is constructed in a one-dimensional linear space, dictionary loading and search speeds are increased, and rapid retrieval of all prefix terms of an entry is allowed. In addition, trie tree dictionary search implemented on the basis of one-dimensional linear space allows for solving of the problem of a conflict that is caused by insertion of a new state and is found in a process of trie tree construction of a conventional trie tree dictionary data search, thus allowing for prevention of the problem of a movement of a large amount of dictionary data caused by the conflict.

Description

一种基于一维线性空间实现 Trie树的词典检索方法 技术领域  A dictionary retrieval method based on one-dimensional linear space to implement Trie tree
本发明涉及一种词典检索方法, 尤其涉及一种基于一维线性空间实现 Trie树的词典检索 方法。 背景技术  The invention relates to a dictionary retrieval method, in particular to a dictionary retrieval method based on a one-dimensional linear space to implement a Trie tree. Background technique
在信息检索和自然语言处理领域, 特别是基于词典的技术应用中, 词典的规模一般都非 常大, 拥有成千上万甚至上亿条记录, 尤其是搜索引擎中倒排索引词的规模最为庞大。 对海 量数据词典的查找, 当前通常采用索引的数据结构来实现。常用的索引结构包括线性索引表、 倒排表、 散列 (hash) 表和搜索树等。 对于一个记录的关键字 (key) 长度为 n, 词典的规模 为 N的情况下 (其中 N»n) 各索引结构检索的时间复杂度分析如下:  In the field of information retrieval and natural language processing, especially in dictionary-based technology applications, the scale of the dictionary is generally very large, with thousands or even hundreds of records, especially the reverse index of search engines. . The search for massive data dictionaries is currently implemented using an indexed data structure. Commonly used index structures include linear index tables, inverted tables, hash tables, and search trees. For a record whose key (key) length is n, and the size of the dictionary is N (where N»n), the time complexity of each index structure search is analyzed as follows:
线性索引结构或倒排表一般采用顺序结构存储词典中的记录, 对词典中记录的查找一般 采用顺序遍历每一个记录, 因此, 每一次查找的时间复杂度为 0(Ν) Χ 1(η) (其中 l(n)为两条记 录的关键字因比较所花费的时间)。 此种方法的改进是让词典的每一条记录按关键字 (key) 有序, 在检索的时候用折半查找, 其时间复杂度为 0(logN)X l(n)。  A linear index structure or an inverted table generally uses a sequential structure to store records in a dictionary. The search for records in a dictionary generally traverses each record sequentially, so the time complexity of each search is 0 (Ν) Χ 1(η) (where l(n) is the time it takes for the two recorded keywords to be compared). The improvement of this method is that each record of the dictionary is ordered by key (key), and it is searched by half when searching, and its time complexity is 0 (logN) X l(n).
在基于词典的自然语言处理应用, 如基于词典的中文分词、 基于词典的字音转换 (汉字 转换成拼音)、 基于词典的命名实体识别、 基于词典的标注 (包括词性标注和语义标注等), 其中的核心部分是要进行大量的查询操作。 为了满足此类应用的要求, 就需要一种高效的词 典检索方法。 现今也有基于二维数组的 Trie树的词典查询方法, 但是基于二维数组的这种查 询方法在 Trie树的构建过程中会存在因插入新状态而引起的冲突, 导致要移动大量存在冲突 的数据问题。 发明内容  In dictionary-based natural language processing applications, such as dictionary-based Chinese word segmentation, dictionary-based word-to-speech conversion (Chinese characters converted to pinyin), dictionary-based named entity recognition, dictionary-based annotation (including part-of-speech tagging and semantic tagging, etc.) The core part is to do a lot of query operations. In order to meet the requirements of such applications, an efficient dictionary search method is needed. Nowadays, there is also a dictionary query method based on a two-dimensional array of Trie trees, but this kind of query method based on two-dimensional arrays may have conflicts caused by inserting new states during the construction of the Trie tree, resulting in moving a large number of conflicting data. problem. Summary of the invention
本发明旨在解决上述缺陷之一。  The present invention is directed to solving one of the above drawbacks.
因此, 本发明提供一种基于一维线性空间实现 Trie树的词典检索方法, 通过生成一维线 性空间的 Trie树词典数据; 根据用户输入确定待查询词条 key; 根据词条 key的当前状态实 现查询。 在一维线性空间下构建 Trie树的词典数据下, 提高了词典加载和检索速度, 并且能 够快速检索到词条的所有前缀词, 另外, 基于一维线性空间实现 Trie树的词典检索能够解决 传统 Trie树词典数据检索在 Tire树的构建过程中存在的因插入新状态而引起的冲突问题, 并 能避免该冲突导致的大量词典数据的移动问题。 Therefore, the present invention provides a dictionary search method for implementing a Trie tree based on a one-dimensional linear space, by generating a Trio tree dictionary data of a one-dimensional linear space; determining a query key to be queried according to user input; and implementing according to the current state of the entry key Inquire. Under the dictionary data of the Trie tree in the one-dimensional linear space, the dictionary loading and retrieval speed is improved, and all the prefix words of the entry can be quickly retrieved. In addition, the dictionary search based on the one-dimensional linear space to implement the Trie tree can be solved. The traditional Trie tree dictionary data retrieval conflicts in the construction process of the Tire tree due to the insertion of a new state, and can avoid the movement of a large amount of dictionary data caused by the conflict.
为此, 本发明公开了一种基于一维线性空间实现 Trie树的词典检索方法, 该方法包括以 下步骤: 生成一维线性空间 Trie树的词典数据; 根据用户输入确定待查询词条 key; 根据词 条 key的当前状态实现查询。  To this end, the present invention discloses a dictionary retrieval method for implementing a Trie tree based on a one-dimensional linear space, the method comprising the steps of: generating dictionary data of a one-dimensional linear space Trie tree; determining a query key to be queried according to user input; The current state of the entry key implements the query.
优选地, 词典数据的 key转化为节点后存储在一维数组中, 所述一维数组的值用来标识 base值是否唯一。  Preferably, the key of the dictionary data is converted into a node and stored in a one-dimensional array, and the value of the one-dimensional array is used to identify whether the base value is unique.
优选地, 所述一维线性空间的 Trie树中将所有终端结点变为非终端结点, 在所述终端结 点后面增加一个叶子节点, 并将叶子节点的 check值赋上其存储位置。  Preferably, in the Trie tree of the one-dimensional linear space, all terminal nodes are changed into non-terminal nodes, a leaf node is added after the terminal node, and a check value of the leaf node is assigned to its storage location.
优选地, 所述叶子节点还包括: 叶子节点的 base值用以标识其是否为终端结点。  Preferably, the leaf node further includes: a base value of the leaf node to identify whether it is a terminal node.
优选地, 所述查询包括以下步骤: 把当前结点指向根节点; 根据当前输入的字符做出状 态转移, 获取其直接后继状态的位置; 校验当前状态的前驱, 确定当前状态由哪一状态转移 而来; 获取词条 key对应的 value值。  Preferably, the query comprises the steps of: pointing the current node to the root node; making a state transition according to the currently input character, obtaining a position of the direct successor state; verifying the precursor of the current state, determining which state the current state is. Transfer from; get the value of the value corresponding to the entry key.
优选地, 所述查询包括: 词条 key的查询可以获得其所有前缀词的结果。  Preferably, the query comprises: a query of the entry key can obtain the result of all of its prefix words.
本发明提供的一种基于一维线性空间实现 Trie树的词典检索方法, 通过生成一维线性空 间的 Trie树词典数据; 根据用户输入确定待查询词条 key; 根据词条 key的当前状态实现查 询。 在一维线性空间下构建 Trie树的词典数据下, 提高了词典加载和检索速度, 并且能够快 速检索到词条的所有前缀词。 同时, 在 Trie树的构建过程中通过调整 base值, 使得其所有直 接后继均不会发生冲突, 因此避免了数据移动或存储空间分配的回溯问题。  The invention provides a dictionary retrieval method based on a one-dimensional linear space to implement a Trie tree, by generating a Trio tree dictionary data of a one-dimensional linear space; determining a to-be-queried key according to a user input; and implementing a query according to the current state of the entry key . Under the dictionary data of the Trie tree in the one-dimensional linear space, the dictionary loading and retrieval speed is improved, and all the prefix words of the entry can be quickly retrieved. At the same time, the base value is adjusted during the construction of the Trie tree so that all its direct successors do not conflict, thus avoiding the backtracking problem of data movement or storage space allocation.
应当理解, 以上总体说明和以下详细说明都是说明性和实例性的, 旨在提供对所要求的 本发明的进一步说明。 附图说明  It is to be understood that the foregoing general description DRAWINGS
图 1是本发明实施例一种基于一维线性空间实现 Trie树的词典检索方法的流程图。  1 is a flow chart of a method for implementing a Trie tree dictionary search based on a one-dimensional linear space according to an embodiment of the present invention.
图 2是本发明实施例根据词条 key的当前状态实现查询的流程框架图。 具体实施方式  2 is a flow chart of implementing a query according to the current state of the entry key according to an embodiment of the present invention. detailed description
为了使本发明的目的、 技术方案及优点更加清楚明白, 以下结合附图及实施例, 对本发 明进行进一步的详细说明。 应当理解, 此处所描述的具体实施例仅仅用于解释本发明, 并不 用于限定本发明。 本发明实施例提供的一种基于一维线性空间实现 Trie树的词典检索方法。 The present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. A dictionary search method for implementing a Trie tree based on a one-dimensional linear space is provided in an embodiment of the present invention.
如图 1所示, 是本发明实施例一种基于一维线性空间实现 Trie树的词典检索方法的流程 图。  As shown in FIG. 1, it is a flowchart of a method for implementing a Trie tree dictionary retrieval method based on a one-dimensional linear space according to an embodiment of the present invention.
步骤 S110: 生成一维线性空间 Trie树的词典数据。  Step S110: Generate dictionary data of the one-dimensional linear space Trie tree.
获取词典数据, 要生成一维线性空间 Trie树的词典数据包括以下具体步骤:  Obtaining dictionary data, to generate one-dimensional linear space Trib tree dictionary data includes the following specific steps:
步骤 S111 : 将所有词条和属性信息以 key为中心按词典顺序排序, 合并拥有相同 key值 的 values, 要保证 key不存在重复。  Step S111: Sort all the terms and attribute information in the lexicographic order with the key as the center, and merge the values having the same key value, so as to ensure that the key does not have a duplicate.
遍历 _Keys和_¥&1^5中存储的元素,按词典顺序将 Keys和_¥&1^5以 key为中心排序, 并合并相同 key的 value值, 将有序的 key序列和 value序列存储到 List<String> keys 禾口 Collection<String> attributes中;  Iterate over the elements stored in _Keys and _¥&1^5, sort Keys and _¥&1^5 in key order, and combine the value of the same key to store the ordered key sequence and value sequence. List<String> keys and Collection<String> attributes;
该步骤的伪代码如下:  The pseudo code for this step is as follows:
array[0] = 1;  Array[0] = 1;
Node root—node = new Node();  Node root-node = new Node();
root—node, left = 0;  Root-node, left = 0;
root—node. right: keys.size();  Root-node. right: keys.size();
root—node, depth = 0;  Root-node, depth = 0;
Array Li st<Node> siblings = new ArrayList<Node>();  Array Li st<Node> siblings = new ArrayList<Node>();
fetch(root_node, siblings);  Fetch(root_node, siblings);
insert(siblings);  Insert(siblings);
return keys.size(); 步骤 S112: 定义起始状态, 编号为 0, 其包含的信息值为 [code = 0, depth = 0, start = 0, end = N], 其中 N为词典的规模, 即 key的数量。  Return keys.size(); Step S112: Define the starting state, number 0, and the information value it contains is [code = 0, depth = 0, start = 0, end = N], where N is the size of the dictionary. That is the number of keys.
步骤 S113 : 将起始状态放入双数组第 0位置, 将其 base[0]=l (array[2*0]=array[0]=l ), 并标识 base 为 1 的值已经被占用 (保证所有状态的 base 值唯一), check[0]=0 (array[2*0+l]=array[l]=0)  Step S113: Put the start state into the 0th position of the double array, set its base[0]=l (array[2*0]=array[0]=l), and identify that the value of base is 1 is already occupied ( Ensure that the base value of all states is unique), check[0]=0 (array[2*0+l]=array[l]=0)
步骤 S114: 以起始状态作为当前状态。  Step S114: The initial state is taken as the current state.
步骤 S115: 获取当前状态的所有直接后继状态的信息, 若直接后继结点列表为空, 即当 前结点为终端结点 "$", 表示从起始结点到当前结点构成的 key恰好是词典中的一个完整词 条, 将当前结点 (终端结点) 的 base值赋上当前 key词典顺序序号的相反数, 该路径上执行 完毕; 否则执行步骤 S116。 该步骤的伪代码如下: Step S115: Obtain information of all direct successor states of the current state. If the direct successor node list is empty, that is, the current node is the terminal node "$", indicating that the key formed from the starting node to the current node is exactly A complete entry in the dictionary, the base value of the current node (terminal node) is assigned to the opposite of the current key dictionary sequence number, and the execution is completed on the path; otherwise, step S116 is performed. The pseudo code for this step is as follows:
int fetch(Node parent, List<Node> siblings) { Int fetch(Node parent, List<Node> siblings) {
II获取当前结点 parent的所有直接后继结点, 并将结果存储到 siblings列表内 II Get all direct successor nodes of the current node parent, and store the result in the siblings list
//返回当前结点所有直接后继结点的数量, 若为 0表示当前结点为终端结点 int prev = 0; / / Returns the number of all direct successor nodes of the current node, if 0 means the current node is the terminal node int prev = 0;
for (int i = parent, start; i < parent, end; i++) { For (int i = parent, start; i < parent, end; i++) {
if (key s . get(i ) .1 engthQ < parent, depth) {  If (key s . get(i ) .1 engthQ < parent, depth) {
//若当前 key已经处理完毕, 跳过此 key, 此处用于处理终端结点的情况 / / If the current key has been processed, skip this key, here used to handle the terminal node
// 即终端结点没有后继结点 // ie the terminal node has no successor nodes
continue;  Continue;
String tmp = keys.get(i); String tmp = keys.get(i);
int cur = 0;  Int cur = 0;
if (key s . get(i ) .1 ength() != parent. depth) {  If (key s . get(i ) .1 ength() != parent. depth) {
cur = (int) tmp . char At(parent. depth) + 1 ; if (cur != prev || siblings. isEmpty()) {  Cur = (int) tmp . char At(parent. depth) + 1 ; if (cur != prev || siblings. isEmpty()) {
Node tmp_node = new Node();  Node tmp_node = new Node();
tmp_node. depth = parent, depth + 1 ;  Tmp_node. depth = parent, depth + 1 ;
tmp_node.code = cur;  Tmp_node.code = cur;
tmp_node. start = i;  Tmp_node. start = i;
if (! siblings. isEmptyO) {  If (! siblings. isEmptyO) {
(siblings. get( siblings. sizeQ - l)).end = i; siblings. add(tmp_node); prev = cur; if (! siblings. isEmptyO) {  (siblings. get( siblings. sizeQ - l)).end = i; siblings. add(tmp_node); prev = cur; if (! siblings. isEmptyO) {
(siblings. get( siblings. sizeQ - l)).end = parent. end; return siblings. sizeQ; 步骤 S 116: 为当前结点寻找一个合适的 base值, 使得 base值唯一, 且不会导致所有直接 后继结点与现有 Trie树存储的结点冲突。 依次将当前结点的直接后继结点插入 Trie树中, 并 将其 check值赋上当前结点的 base值, 再依次把当前结点的直接后继结点作为当前结点, 跳 转到步骤 S 115。 (siblings. get( siblings. sizeQ - l)).end = parent. end; Return siblings. sizeQ; Step S116: Find a suitable base value for the current node, so that the base value is unique, and does not cause all direct successor nodes to collide with the nodes stored in the existing Trie tree. In turn, the direct successor node of the current node is inserted into the Trie tree, and the check value is assigned to the base value of the current node, and then the direct successor node of the current node is sequentially used as the current node, and the process proceeds to step S. 115.
该步骤的伪代码如下:  The pseudo code for this step is as follows:
int insert(List<Node> siblings) {  Int insert(List<Node> siblings) {
II为当前结点寻找一个符合条件的 base值, 插入其所有直接后继结点  II Find an eligible base value for the current node, insert all its direct successor nodes
II并返回当前结点的 base值  II and return the base value of the current node
为当前结点寻找一个合适的未被使用的 base值, 使得 base值唯一且不为 0, 也不会导致 所有直接后继结点与现有 Trie树存储的结点冲突, 为了兼容 Trie树的根结点, base值从 1开 始;  Find a suitable unused base value for the current node, so that the base value is unique and not 0, and will not cause all direct successor nodes to conflict with the existing Trie tree stored nodes, in order to be compatible with the root of the Trie tree. Node, base value starts from 1;
used[base] = 1 ; II标识此值已被使用  Used[base] = 1 ; II identifies that this value has been used
for (int i = 0; i < siblings. size(); i++) {  For (int i = 0; i < siblings. size(); i++) {
II将当前结点的直接后继结点的 check值赋上当前结点的 base值, 完成插入操作 array[(((int) base + (siblings. get(i)). code) « 1) + 1] = base; for (int i = 0; i < siblings. size(); i++) {  II assigns the check value of the immediate successor node of the current node to the base value of the current node, and completes the insert operation array[(((int) base + (siblings. get(i)). code) « 1) + 1 ] = base; for (int i = 0; i < siblings. size(); i++) {
//依次把当前结点的直接后继结点作为当前结点  / / In turn, the current successor node of the current node as the current node
//递归做获取直接后继结点集合和插入后继结点集合操作  / / Recursively to get the direct successor node set and insert the subsequent node set operation
List<Node> new—siblings = new ArrayList<Node>();  List<Node> new—siblings = new ArrayList<Node>();
if (fetch(( siblings. get(i)), new—siblings) == 0) {  If (fetch(( siblings. get(i)), new—siblings) == 0) {
II此时当前结点为终端结点 (直接后继为空)  II At this time, the current node is the terminal node (the direct successor is empty)
II将其 base值赋上该 key词典序号的相反数  II assigns its base value to the opposite of the key dictionary number
array[((int)base+(int)(siblings.get(i)).code) « 1] =(int)(-(siblings.get(i)).left - 1); } else {  Array[((int)base+(int)(siblings.get(i)).code) « 1] =(int)(-(siblings.get(i)).left - 1); } else {
int ins = (int) insert(new_siblings);  Int ins = (int) insert(new_siblings);
//将找到合适的 base: ins赋给当前结点的 base值  / / Will find the appropriate base: ins assigned to the base value of the current node
array[((int) base + (siblings. get(i)). code) « 1] = ins; return base; Array[((int) base + (siblings. get(i)). code) « 1] = ins; Return base;
所述一维线性空间的 Trie树中将所有终端结点变为非终端结点,在所述终端结点后面增 加一个叶子节点, 并将叶子节点的 check值赋上其存储位置。 In the Trie tree of the one-dimensional linear space, all terminal nodes are changed into non-terminal nodes, a leaf node is added after the terminal node, and the check value of the leaf node is assigned to its storage location.
将所有的终端结点变为非终端结点, 并在其后面增加一个叶子结点, 叶子结点的 check 值赋上自己的存储位置, 叶子结点的 base值赋上从初始结点 (0结点) 到当前叶子结点路径 上的输入字符组成的完整词条在整个有序词条集合中位置的相反数 (即叶子结点所处的 key 中在所有按词典排序的词条集中序号的相反数), 因此结点的 base值的符号用于标识是否是 终端结点 (叶子结点, 其 base值小于 0)。  Change all terminal nodes to non-terminal nodes, and add a leaf node behind them. The check value of the leaf node is assigned its own storage location, and the base value of the leaf node is assigned from the initial node (0 Node) The opposite of the position of the complete entry into the current leaf node path in the entire ordered set of terms (ie, the key in the list of all the lexicographically ordered terms in the key of the leaf node) The opposite is true, so the sign of the base value of the node is used to identify whether it is a terminal node (the leaf node whose base value is less than 0).
步骤 S 120: 根据用户输入待查询词条 key。  Step S120: Enter a query key to be queried according to the user.
Trie树构建好之后, 接下来就是查询用户输入的词条是否存在 Trie树, 即是否是从根结 点到叶子结点的一条完整路径。  After the Trie tree is built, the next step is to query whether the entry entered by the user has a Trie tree, that is, whether it is a complete path from the root node to the leaf node.
步骤 S 130: 根据词条 key的当前状态实现查询。  Step S130: Implement a query according to the current state of the entry key.
如图 2所示, 是根据词条 key的当前状态实现查询的流程框架图, 具体步骤如下: 步骤 S 131 : 把当前结点指向根节点。  As shown in FIG. 2, it is a process framework diagram for implementing a query according to the current state of the entry key. The specific steps are as follows: Step S131: Point the current node to the root node.
步骤 S 132: 根据当前输入的字符做出状态转移, 获取其直接后继状态的位置。  Step S132: Perform a state transition according to the currently input character to obtain a position of the direct successor state.
步骤 S 133 : 校验当前状态的前驱, 确定当前状态由哪一状态转移而来。  Step S133: Verify the precursor of the current state, and determine which state the current state is transferred from.
步骤 S 134: 获取词条 key对应的 value值。  Step S134: Obtain the value of the value corresponding to the entry key.
在当前状态为 s, 输入的字符为 c, 下一状态为 t的条件下, 其本方法的查询过程的约束 条件修改为:  In the current state is s, the input character is c, and the next state is t, the constraint condition of the query process of this method is modified to:
check[base[s] + c ] = base[s] ; base[s]+c=t ; 每个状态的 base[s]值唯一。  Check[base[s] + c ] = base[s] ; base[s]+c=t ; The base[s] value of each state is unique.
若当前状态 s可以转移到叶子结点 t中, 则其约束条件为- base[s]=t ; t=check[t]。  If the current state s can be transferred to the leaf node t, its constraint is -base[s]=t; t=check[t].
base[t]<0 且 base[t]的值为 DFA的初始结点 0到当前叶子结点经过的字符组成的词条在 所有按词典顺序排序的词条集中位置的相反数。  Base[t]<0 and the value of base[t] is the initial node of DFA. The opposite of the number of entries in the current lexicographically ordered term in the lexicographically ordered entry.
本发明实施例中可以获取词条 key的所有前缀词, 并可以将检索到的每条结果信息保存 在一个对象 TrieResult中, 该表存储的变量作如下描述: Length表示当前 key的长度; Index 当前 key在词典中的存储序号 -1, 即当前 key对应属性的存储位置。  In the embodiment of the present invention, all the prefix words of the entry key can be obtained, and each retrieved result information can be saved in an object TrieResult, and the variables stored in the table are described as follows: Length indicates the length of the current key; Index current The storage number of the key in the dictionary is -1, which is the storage location of the current key corresponding to the attribute.
本发明实施例一种基于一维线性空间实现 Trie树的词典检索方法的查询速度 18.3MB/s。 因此, 本发明提供一种基于一维线性空间实现 Trie树的词典检索方法, 通过生成一维线 性空间的 Trie树词典数据; 根据用户输入确定待查询词条 key; 根据词条 key的当前状态实 现查询。 在一维线性空间下构建 Trie树的词典数据下, 提高了词典加载和检索速度, 并且能 够快速检索到词条的所有前缀词, 同时, 在 Trie树的构建过程中通过调整 base值, 使得其所 有直接后继均不会发生冲突, 因此避免了数据移动或存储空间分配的回溯问题。 The query speed of the dictionary search method based on the one-dimensional linear space to implement the Trie tree is 18.3 MB/s. Therefore, the present invention provides a dictionary search method for implementing a Trie tree based on a one-dimensional linear space, by generating a Trio tree dictionary data of a one-dimensional linear space; determining a query key to be queried according to user input; and implementing according to the current state of the entry key Inquire. By constructing the dictionary data of the Trie tree in the one-dimensional linear space, the dictionary loading and retrieval speed is improved, and all the prefix words of the entry can be quickly retrieved. At the same time, the base value is adjusted during the construction of the Trie tree, so that Conflicts do not occur with all direct successors, thus avoiding backtracking issues with data movement or storage space allocation.

Claims

权 利 要 求 书 Claim
1.一种基于一维线性空间实现 Trie树的词典检索方法, 其特征在于, 该方法包括: 生成一维线性空间 Trie树的词典数据;  A dictionary retrieval method for implementing a Trie tree based on a one-dimensional linear space, the method comprising: generating dictionary data of a one-dimensional linear space Trie tree;
根据用户输入确定待查询词条 key;  Determining the term to be queried according to user input;
根据词条 key的当前状态实现查询。  The query is implemented according to the current state of the entry key.
2.根据权利要求 1所述的方法, 其特征在于, 所述词典数据的 key转化为节点后存储在 一维数组中, 所述一维数组的值用来标识 base值是否唯一。  The method according to claim 1, wherein the key of the dictionary data is converted into a node and stored in a one-dimensional array, and the value of the one-dimensional array is used to identify whether the base value is unique.
3.根据权利要求 1所述的方法, 其特征在于, 所述一维线性空间 Trie树中将所有终端结 点变为非终端结点, 在所述终端结点后面增加一个叶子节点, 并将叶子节点的 check值赋上 其存储位置。  The method according to claim 1, wherein in the one-dimensional linear space Trie tree, all terminal nodes are changed to non-terminal nodes, and a leaf node is added after the terminal node, and The leaf node's check value is assigned its storage location.
4.根据权利要求 1或 3所述的方法,其特征在于,所述叶子节点还包括:叶子节点的 base 值用以标识其是否为终端结点。  The method according to claim 1 or 3, wherein the leaf node further comprises: a base value of the leaf node to identify whether it is a terminal node.
5.根据权利要求 1所述的方法, 其特征在于, 所述查询包括以下步骤:  The method according to claim 1, wherein the query comprises the following steps:
把当前结点指向根节点;  Point the current node to the root node;
根据当前输入的字符做出状态转移, 获取其直接后继状态的位置;  Making a state transition based on the currently entered character, obtaining the position of its immediate successor state;
校验当前状态的前驱, 确定当前状态由哪一状态转移而来;  Verify the precursor of the current state and determine which state the current state is transferred from;
获取词条 key对应的 value值。  Get the value of the value corresponding to the entry key.
6.根据权利要求 1或权利要求 5所述的方法, 其特征在于, 所述查询包括:  The method according to claim 1 or claim 5, wherein the query comprises:
根据词条 key的查询可以获得其所有前缀词的结果。  A query based on the entry key can get the result of all its prefix words.
PCT/CN2014/080179 2013-07-03 2014-06-18 One-dimensional liner space-based method for implementing trie tree dictionary search WO2015010509A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310288821.5 2013-07-03
CN201310288821.5A CN103365992B (en) 2013-07-03 2013-07-03 Method for realizing dictionary search of Trie tree based on one-dimensional linear space

Publications (1)

Publication Number Publication Date
WO2015010509A1 true WO2015010509A1 (en) 2015-01-29

Family

ID=49367333

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/080179 WO2015010509A1 (en) 2013-07-03 2014-06-18 One-dimensional liner space-based method for implementing trie tree dictionary search

Country Status (2)

Country Link
CN (1) CN103365992B (en)
WO (1) WO2015010509A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365992B (en) * 2013-07-03 2017-02-15 深圳市华傲数据技术有限公司 Method for realizing dictionary search of Trie tree based on one-dimensional linear space
CN103365991B (en) * 2013-07-03 2017-03-08 深圳市华傲数据技术有限公司 A kind of dictionaries store management method realizing Trie tree based on one-dimensional linear space
CN107680588B (en) * 2017-05-10 2020-10-20 平安科技(深圳)有限公司 Intelligent voice navigation method, device and storage medium
CN107239549A (en) * 2017-06-07 2017-10-10 传神语联网网络科技股份有限公司 Method, device and the terminal of database terminology retrieval
CN107273360A (en) * 2017-06-21 2017-10-20 成都布林特信息技术有限公司 Chinese notional word extraction algorithm based on semantic understanding
CN108153907B (en) * 2018-01-18 2021-01-22 中国计量大学 Dictionary storage management method for realizing space optimization through 16-bit Trie tree
CN108197313B (en) * 2018-02-01 2021-06-25 中国计量大学 A space-optimized dictionary indexing method through a 16-bit Trie tree
CN108509419B (en) * 2018-03-21 2022-02-22 山东中医药大学 Chinese medicine ancient book document word segmentation and part of speech indexing method and system
CN109739948B (en) * 2018-12-28 2021-08-03 北京金山安全软件有限公司 Word list storage management method and device, electronic equipment and storage medium
CN109684439B (en) * 2018-12-28 2020-10-30 语联网(武汉)信息技术有限公司 Method and device for indexing prefix in word segmentation process
CN110688483B (en) * 2019-09-16 2022-10-18 重庆邮电大学 Dictionary-based noun visibility labeling method, medium and system in context conversion
CN112100132B (en) * 2020-09-24 2024-07-30 深圳软牛科技有限公司 Deleted file type identification method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1786962A (en) * 2005-12-21 2006-06-14 中国科学院计算技术研究所 Method for managing and searching dictionary with perfect even numbers group TRIE Tree
US20080208854A1 (en) * 2005-06-06 2008-08-28 3618633 Canada Inc. Method of Syntactic Pattern Recognition of Sequences
CN101398830A (en) * 2007-09-27 2009-04-01 阿里巴巴集团控股有限公司 Thesaurus fuzzy enquiry method and thesaurus fuzzy enquiry system
CN101499094A (en) * 2009-03-10 2009-08-05 焦点科技股份有限公司 Data compression storing and retrieving method and system
CN102651026A (en) * 2012-04-01 2012-08-29 百度在线网络技术(北京)有限公司 Method for optimizing word segmentation of search engine through precomputation and word segmenting device of search engine
CN103365992A (en) * 2013-07-03 2013-10-23 深圳市华傲数据技术有限公司 Method for realizing dictionary search of Trie tree based on one-dimensional linear space

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7096235B2 (en) * 2003-03-27 2006-08-22 Sand Technology Systems International, Inc. Computer implemented compact 0-complete tree dynamic storage structure and method of processing stored data
EP1702275A1 (en) * 2003-08-11 2006-09-20 France Telecom Trie memory device with a circular pipeline mechanism
CN101788990A (en) * 2009-01-23 2010-07-28 北京金远见电脑技术有限公司 Global optimization and construction method and system of TRIE double-array
JP5262864B2 (en) * 2009-03-10 2013-08-14 富士通株式会社 Storage medium, search method and search device
US8608063B2 (en) * 2010-05-28 2013-12-17 Securitymetrics, Inc. Systems and methods employing intermittent scanning techniques to identify sensitive information in data

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080208854A1 (en) * 2005-06-06 2008-08-28 3618633 Canada Inc. Method of Syntactic Pattern Recognition of Sequences
CN1786962A (en) * 2005-12-21 2006-06-14 中国科学院计算技术研究所 Method for managing and searching dictionary with perfect even numbers group TRIE Tree
CN101398830A (en) * 2007-09-27 2009-04-01 阿里巴巴集团控股有限公司 Thesaurus fuzzy enquiry method and thesaurus fuzzy enquiry system
CN101499094A (en) * 2009-03-10 2009-08-05 焦点科技股份有限公司 Data compression storing and retrieving method and system
CN102651026A (en) * 2012-04-01 2012-08-29 百度在线网络技术(北京)有限公司 Method for optimizing word segmentation of search engine through precomputation and word segmenting device of search engine
CN103365992A (en) * 2013-07-03 2013-10-23 深圳市华傲数据技术有限公司 Method for realizing dictionary search of Trie tree based on one-dimensional linear space

Also Published As

Publication number Publication date
CN103365992B (en) 2017-02-15
CN103365992A (en) 2013-10-23

Similar Documents

Publication Publication Date Title
WO2015010509A1 (en) One-dimensional liner space-based method for implementing trie tree dictionary search
CN102033954B (en) Extensible Markup Language Document Full Text Retrieval Query Indexing Method in Relational Database
CN104866593B (en) A kind of database search method of knowledge based collection of illustrative plates
CN106874422B (en) A Graph Query Method for Relational Databases
KR101646754B1 (en) Apparatus and Method of Mobile Semantic Search
JP5241738B2 (en) Method and apparatus for building tree structure data from tables
CN105701253A (en) Chinese natural language interrogative sentence semantization knowledge base automatic question-answering method
CN103646032A (en) Database query method based on body and restricted natural language processing
CN106528647B (en) One kind carrying out the matched method of term based on cedar even numbers group dictionary tree algorithm
CN112513836B (en) Structured record retrieval method, computing system and computer readable medium
CN110795526B (en) A method and system for creating mathematical formula index for retrieval system
WO2015010508A1 (en) One-dimensional linear space-based method for implementing trie tree dictionary storage and management
CN104346331A (en) Retrieval method and system for XML database
CN108846016A (en) A kind of searching algorithm towards Chinese word segmentation
JP2019040598A5 (en)
CN107239549A (en) Method, device and the terminal of database terminology retrieval
CN105808729B (en) Academic big data analysis method based on adduction relationship between paper
CN113590650B (en) Structured query statement identification method and device based on feature expression
US20090234852A1 (en) Sub-linear approximate string match
JP4712718B2 (en) Array generation method and array generation program
CN105824956A (en) Inverted index model based on link list structure and construction method of inverted index model
CN119293195B (en) A RAG optimization method for large language models based on tree neighbor context
CN103500222A (en) Method and device for searching for chat object through communication software
CN100565508C (en) Structured-document management apparatus, search equipment, storage and searching method
CN117573096B (en) Intelligent code completion method integrating abstract syntax tree structure information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14828829

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14828829

Country of ref document: EP

Kind code of ref document: A1