CN114756553A - Columnar database and method for retrieving sequence of events - Google Patents

Info

Publication number
CN114756553A
Authority
CN
China
Prior art keywords
event
data
sequence
field
retrieved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210437897.9A
Other languages
Chinese (zh)
Inventor
叶杨
陈伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhuochen Info Tech Co ltd
Original Assignee
Shanghai Zhuochen Info Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zhuochen Info Tech Co ltd
Priority to CN202210437897.9A
Publication of CN114756553A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468 Fuzzy queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Automation & Control Theory (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a columnar database and a method for retrieving event sequences, used to store and retrieve event sequences. The columnar database processes data as follows: step a: input the data into the columnar database; step b: set an index on the time field of the data; step c: determine the data type of the event field; if it is text and there is data with more than 8 characters, go to step d, otherwise go to step g; step d: deduplicate the event field; step e: perform hash encoding on the deduplication result and construct an event data serial-number dictionary; step f: traverse the columnar database and assign the corresponding hash code value to all text data in the event field, generating an event index column; step g: sort the columnar database in ascending order of time according to the index. The invention solves the problems that existing columnar databases are inconvenient for sorting and retrieving input data, and that retrieval is complicated when the data type of the event field is text.

Description

Columnar database and method for retrieving event sequences

Technical field

The present invention relates to a columnar database and a method for retrieving event sequences, and belongs to the field of data storage.

Background

A database stores a large amount of data, including a time, a subject, and an event that happened to that subject. When the volume of data is small, the events in the database can be evaluated to detect problems or trends in advance. As data volumes keep growing, however, retrieving a single event no longer gives an intuitive view of what happened or how things are developing, so the impact on the subject cannot be evaluated.

An event sequence is a sequence of multiple events that occur consecutively, one after another. An event sequence allows what happened and how it is developing to be evaluated accurately, but existing columnar databases can only retrieve single events. They cannot retrieve event sequences, let alone perform fuzzy retrieval of event sequences. In addition, when the data type of the event is text, existing retrieval methods can only search the text content directly, which is complicated and inconvenient.

In view of this, it is necessary to provide a columnar database and a method for retrieving event sequences that solve the above problems.

Summary of the invention

The purpose of the present invention is to provide a columnar database and a method for retrieving event sequences that solve the problems that existing columnar databases are inconvenient for sorting and retrieving input data, and that retrieval is complicated and inefficient when the data type of the event field is text.

To achieve this purpose, the present invention provides a columnar database for storing and retrieving event sequences. The columnar database processes data as follows:

Step a: input data into the columnar database; the data is time series data and includes at least a time field and an event field;

Step b: set an index on the time field of the data input in step a;

Step c: determine the data type of the event field in the data; if the data type of the event field is text and there is data with more than 8 characters, go to step d, otherwise go to step g;

Step d: deduplicate the content of the event field;

Step e: perform hash encoding on the deduplication result of step d and construct an event data serial-number dictionary;

Step f: traverse the columnar database, assign the corresponding hash code value from the event data serial-number dictionary to all text data in the event field, and generate the corresponding event index column in the columnar database;

Step g: sort the columnar database in ascending order of time according to the index set on the time field in step b.

As a further improvement of the present invention, the data in the columnar database also includes arbitrary attribute fields other than the event field and the time field.

The present invention also provides a method for retrieving event sequences, applied to the columnar database described above, which comprises the following steps:

Step 1: input the event sequence to be retrieved and convert it into a query statement;

Step 2: determine the data type of the event sequence to be retrieved; when it is numeric, a single character, or text in which every value has 8 characters or fewer, go to step 4; when it is text and there is data with more than 8 characters, go to step 3;

Step 3: hash encode the event sequence to be retrieved so that its data type becomes numeric, then go to step 4;

Step 4: select the query type. For an exact query, the event sequence to be retrieved is matched exactly as a single array. For a fuzzy query, fuzzy matching is performed according to a preset fuzzy-query distance: the event sequence to be retrieved is treated as a single array, its minimum distance to sequences formed from the stored event data is computed, and a sequence matches when that minimum distance is less than or equal to the preset fuzzy-query distance;

Step 5: build a result dictionary and add every matched sequence to it until all data in the event field has been traversed;

Step 6: output the result dictionary as the result.

As a further improvement of the present invention, the event sequence to be retrieved is a sequence of multiple events arranged in order.

As a further improvement of the present invention, the query statement in step 1 is an SQL-like statement.

As a further improvement of the present invention, the query statement includes a first parameter, a second parameter, and a third parameter set in sequence. The first parameter selects the fields to be queried, the second parameter specifies the storage address to be queried, and the third parameter specifies the retrieval conditions.

As a further improvement of the present invention, the third parameter includes the event sequence to be retrieved.

As a further improvement of the present invention, the minimum distance for the fuzzy query in step 4 is computed with the Levenshtein algorithm: when the data type is a string, it is the minimum number of steps needed to transform any string in the event sequence to be retrieved into any string in the event field by adding, deleting, or replacing characters.

As a further improvement of the present invention, the minimum number of steps is computed as follows:

Step 41: compute the length n of any string strA in the event sequence to be retrieved, and compute the length m of any string strB in the event sequence to be retrieved;

Step 42: if n=0, the minimum edit distance is m; if m=0, the minimum edit distance is n; if neither holds, go to step 43;

Step 43: construct an (m+1)*(n+1) matrix Arr and initialize its first row to 0~n and its first column to 0~m;

Step 44: in a double loop, traverse strA and, within it, traverse strB. If strA[i]=strB[j], then cost=0, otherwise cost=1. Take the minimum of Arr[j-1][i]+1, Arr[j][i-1]+1, and Arr[j-1][i-1]+cost, and assign it to Arr[j][i];

Step 45: when the loops finish, the last element of the matrix is the minimum edit distance, that is, the minimum number of steps.
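Steps 41 to 45 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `levenshtein` is hypothetical, and because Python sequences are 0-indexed, the comparison in step 44 is written as `str_a[i - 1] == str_b[j - 1]`, which is an assumption about the indexing the source intends.

```python
def levenshtein(str_a, str_b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn str_a into str_b (steps 41 to 45)."""
    # Step 41: lengths of the two inputs.
    n, m = len(str_a), len(str_b)

    # Step 42: if either input is empty, the distance is the other length.
    if n == 0:
        return m
    if m == 0:
        return n

    # Step 43: (m+1) x (n+1) matrix; first row is 0..n, first column is 0..m.
    arr = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(n + 1):
        arr[0][i] = i
    for j in range(m + 1):
        arr[j][0] = j

    # Step 44: double loop over str_a (columns) and str_b (rows).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if str_a[i - 1] == str_b[j - 1] else 1
            arr[j][i] = min(
                arr[j - 1][i] + 1,         # one edit in the row direction
                arr[j][i - 1] + 1,         # one edit in the column direction
                arr[j - 1][i - 1] + cost,  # match or substitution
            )

    # Step 45: the last element is the minimum edit distance.
    return arr[m][n]
```

Because the function only compares elements for equality, it accepts lists as well as strings, so the same routine can score a whole event sequence such as `[2, 0, 2, 2]` against a stored window.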

As a further improvement of the present invention, the result dictionary output in step 6 includes all matching sequences in the database that satisfy the conditions, and each matching sequence outputs at least the time field and the event field.

The beneficial effects of the present invention are as follows. Compared with the prior art, the columnar database of the present invention sets an index on the time field, which solves the problem that existing columnar databases are inconvenient for sorting and retrieving input data. It also hash encodes event fields whose data type is text, which solves the problem that retrieval in existing columnar databases is complicated and inefficient when the data type of the event field is text.

Description of the drawings

FIG. 1 is a step diagram of how the columnar database of the present invention processes data.

FIG. 2 is a flowchart of FIG. 1.

FIG. 3 is a schematic diagram of the time series data of step a in FIG. 1.

FIG. 4 is a step diagram of the event sequence retrieval method of the present invention.

FIG. 5 is a flowchart of the Levenshtein algorithm of step 4 in FIG. 4.

Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.

It should be noted that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the structures and/or processing steps closely related to the solution of the present invention, and omit other details of little relevance to it.

It should also be noted that the terms "include", "comprise", and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device.

Referring to FIG. 1 and FIG. 2, the present invention provides a columnar database for storing and retrieving event sequences. The columnar database processes data as follows:

Step a: input data into the columnar database; the data is time series data and includes at least a time field and an event field;

Step b: set an index on the time field of the data input in step a;

Step c: determine the data type of the event field in the data; if the data type of the event field is text and there is data with more than 8 characters, go to step d, otherwise go to step g;

Step d: deduplicate the content of the event field;

Step e: perform hash encoding on the deduplication result of step d and construct an event data serial-number dictionary;

Step f: traverse the columnar database, assign the corresponding hash code value from the event data serial-number dictionary to all text data in the event field, and generate the corresponding event index column in the columnar database;

Step g: sort the columnar database in ascending order of time.

Referring to FIG. 3, the data input in step a may arrive unordered. Table 1 shows an example in which the data type of the event field is long text:

Table 1

Time | Event
2022-01-04 | 发动机保养 (engine maintenance)
2022-01-01 | 加油 (refueling)
2022-01-03 | 补充玻璃水 (refill windshield washer fluid)
2022-01-02 | 雨刮器保养 (wiper maintenance)
2022-01-06 | 洗车 (car wash)
2022-02-04 | 加油 (refueling)
2022-02-06 | 补充玻璃水 (refill windshield washer fluid)
2022-02-05 | 加油 (refueling)

In Table 1, several records are input into the columnar database in arbitrary order, so some processing is required. In this example data, the data type of the event field is text, and the value "发动机保养" (engine maintenance) in the event field counts as 2*5, that is, 10 characters, which is more than 8. After the check in step c, steps d to g therefore have to be carried out in turn. Step d deduplicates the content of the event field, giving the set [发动机保养 (engine maintenance), 加油 (refueling), 补充玻璃水 (refill windshield washer fluid), 雨刮器保养 (wiper maintenance), 洗车 (car wash)]. Step e hash encodes the deduplicated text content of the event field and constructs the event data serial-number dictionary {发动机保养: 1, 加油: 2, 补充玻璃水: 3, 雨刮器保养: 4, 洗车: 5}. Step f then traverses the columnar database, assigns the corresponding hash code value from the event data serial-number dictionary to all text data in the event field, and generates the corresponding event index column in the columnar database. The data in the columnar database changes from the form of Table 1 to the form of Table 2:

Table 2

Time | Event | Event index
2022-01-04 | 发动机保养 (engine maintenance) | 1
2022-01-01 | 加油 (refueling) | 2
2022-01-03 | 补充玻璃水 (refill windshield washer fluid) | 3
2022-01-02 | 雨刮器保养 (wiper maintenance) | 4
2022-01-06 | 洗车 (car wash) | 5
2022-02-04 | 加油 (refueling) | 2
2022-02-06 | 补充玻璃水 (refill windshield washer fluid) | 3
2022-02-05 | 加油 (refueling) | 2
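Steps d to f can be sketched in Python with the Table 1 data. This is a minimal illustration under stated assumptions: the name `build_event_index` is hypothetical, and codes are assigned as serial numbers in order of first appearance. That choice reproduces the dictionary in the worked example, but the source calls the step hash encoding without naming a hash function, so the real encoding may differ.

```python
def build_event_index(events):
    """Steps d to f: deduplicate the event values, build the
    serial-number dictionary, and encode every row."""
    dictionary = {}
    for event in events:
        # Steps d and e: each distinct value gets the next serial number
        # the first time it is seen (assumed numbering scheme).
        if event not in dictionary:
            dictionary[event] = len(dictionary) + 1
    # Step f: one code per row, forming the event index column.
    index_column = [dictionary[event] for event in events]
    return dictionary, index_column


# Event column of Table 1, in input order.
TABLE_1_EVENTS = [
    "发动机保养",  # engine maintenance
    "加油",        # refueling
    "补充玻璃水",  # refill windshield washer fluid
    "雨刮器保养",  # wiper maintenance
    "洗车",        # car wash
    "加油",
    "补充玻璃水",
    "加油",
]

dictionary, index_column = build_event_index(TABLE_1_EVENTS)
print(dictionary)
print(index_column)  # [1, 2, 3, 4, 5, 2, 3, 2], the event index column of Table 2
```

Once the index column exists, sequence matching can compare small integers instead of long strings.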

The data stored in the columnar database includes at least an event field and a time field, and the event sequence to be retrieved is a sequence of multiple events arranged in order. Only identical or different events arranged in order of occurrence, from earliest to latest, can be called an event sequence. This is a further method of analysis that builds on time series: development trends or the impact on a subject can be evaluated only by retrieving and analyzing events that occur one after another. The columnar database must therefore contain an event field and a time field, and it must be sorted in ascending order of time at the outset so that retrieval is convenient. Step g is therefore performed last, sorting the columnar database in ascending order of time, at which point the data changes from the form of Table 2 to the form of Table 3:

Table 3

[Table 3 is rendered as an image in the original: the rows of Table 2 sorted in ascending order of time.]
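Step g can be sketched in Python by sorting the Table 2 rows on the time field. This is an illustration only: the names `TABLE_2` and `sort_by_time` are hypothetical, and a plain in-memory sort stands in for whatever index-based ordering the database actually uses. ISO-style date strings sort correctly as text, which is why no date parsing is needed here.

```python
# Rows of Table 2 as (time, event, event_index), in input order.
TABLE_2 = [
    ("2022-01-04", "发动机保养", 1),
    ("2022-01-01", "加油", 2),
    ("2022-01-03", "补充玻璃水", 3),
    ("2022-01-02", "雨刮器保养", 4),
    ("2022-01-06", "洗车", 5),
    ("2022-02-04", "加油", 2),
    ("2022-02-06", "补充玻璃水", 3),
    ("2022-02-05", "加油", 2),
]


def sort_by_time(rows):
    """Step g: return the rows in ascending order of the time field."""
    return sorted(rows, key=lambda row: row[0])


for time, event, index in sort_by_time(TABLE_2):
    print(time, event, index)
```

After sorting, the event index column reads 2, 4, 3, 1, 5, 2, 2, 3, which is the order in which a sequence query sees the events.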

The large volume of unordered data originally input into the columnar database is processed and sorted by the columnar database of the present application, yielding data in ascending order of time, which makes it convenient to search for the event sequence to be retrieved. Before that, the columnar database examines the text data in the event field. When the number of characters is less than or equal to 8, that is, when a value can be read and compared in a single operation on a 64-bit operating system, the data is considered not to affect retrieval speed and is left unprocessed. When the data type is text and the number of characters is greater than 8, long text is considered difficult to compare, and searching on the original text would make the retrieval system's overhead excessive and slow it down. Therefore, while data is being input into the columnar database, the content of event fields whose data type is text and whose character count exceeds 8 is hash encoded, which facilitates subsequent event sequence retrieval.
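The check in step c can be sketched in Python as follows. The worked example counts the five-character value "发动机保养" as 2*5 = 10, which suggests that "characters" means bytes in a two-byte encoding, so 8 characters correspond to one 64-bit word. The sketch therefore measures length in GBK bytes. That choice of encoding, the name `needs_encoding`, and its parameters are assumptions made for illustration and are not stated in the source.

```python
def needs_encoding(events, limit=8, encoding="gbk"):
    """Step c: return True when the event field holds text and at least
    one value is longer than `limit` bytes in the assumed encoding."""
    return any(
        isinstance(event, str) and len(event.encode(encoding)) > limit
        for event in events
    )


# "发动机保养" is 5 Chinese characters, i.e. 10 bytes in GBK, so the
# Table 1 data goes through steps d to f.
print(needs_encoding(["发动机保养", "加油", "洗车"]))  # True

# Numeric values, single characters, and short strings skip to step g.
print(needs_encoding([2, 0, 2, 2]))    # False
print(needs_encoding(["AA", "BB"]))    # False
```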

It should be noted that the content of the event field shown here is only an example and is not limiting.

Setting an index on the time field of the input data in step b likewise improves retrieval speed.

Referring to FIG. 4, the present invention provides a method for retrieving event sequences, applied to the columnar database, which comprises the following steps:

Step 1: input the event sequence to be retrieved and convert it into a query statement;

Step 2: determine the data type of the event sequence to be retrieved; when it is numeric, a single character, or text in which every value has 8 characters or fewer, go to step 4; when it is text and there is data with more than 8 characters, go to step 3;

Step 3: hash encode the event sequence to be retrieved so that its data type becomes numeric, then go to step 4;

Step 4: select the query type. For an exact query, the event sequence to be retrieved is matched exactly as a single array. For a fuzzy query, fuzzy matching is performed according to a preset fuzzy-query distance: the event sequence to be retrieved is treated as a single array, its minimum distance to sequences formed from the stored event data is computed, and a sequence matches when that minimum distance is less than or equal to the preset fuzzy-query distance;

Step 5: build a result dictionary and add every matched sequence to it until all data in the event field has been traversed;

Step 6: output the result dictionary as the result.
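The fuzzy branch of step 4, together with steps 5 and 6, can be sketched in Python as follows. The source does not say how candidate sequences are cut from the stored data, so this sketch assumes a sliding window of the same length as the query and scores each window with an edit distance over whole events. The names `edit_distance` and `fuzzy_find` and the shape of the result dictionary are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, keeping two rows."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_find(events, query, max_distance):
    """Steps 4 to 6 (fuzzy query): collect every window whose distance
    to the query is at most max_distance."""
    results = {}
    width = len(query)
    # Step 5: traverse all data in the event field, window by window.
    for start in range(len(events) - width + 1):
        window = events[start:start + width]
        distance = edit_distance(query, window)
        if distance <= max_distance:
            group = len(results) + 1
            results[group] = {"start": start, "events": window, "distance": distance}
    # Step 6: the result dictionary is the output.
    return results


stored = [0, 2, 0, 2, 2, 3, 2, 0, 2, 2, 0]
print(fuzzy_find(stored, [2, 0, 2, 2], max_distance=0))
```

With `max_distance=0` the function behaves like an exact query; raising the threshold admits windows that differ from the query by a few events.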

In step 1, the input event sequence to be retrieved must be converted into a query statement that the columnar database can recognize and execute. The query statement of the columnar database of the present application is an SQL-like statement, which is simpler and more convenient than a standard SQL statement. The query statement includes a first parameter, a second parameter, and a third parameter set in sequence. The first parameter selects the fields to be queried, the second parameter specifies the storage address to be queried, and the third parameter specifies the retrieval conditions and includes the event sequence to be retrieved. In the embodiment of the present application, the query statement has the form select ... from ... where .... The first parameter is the select parameter, whose value here is time, event, meaning that the fields to be queried are the time field and the event field in the database. The second parameter is the from parameter, whose value here is mydb.mydata, the storage address of the database to be queried. The third parameter is the where parameter, whose value includes event&list='****', where **** is the event sequence to be retrieved, that is, the retrieval condition.
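The three-part statement can be split into its parameters with a small Python sketch. The source gives only examples of the SQL-like syntax and no grammar, so the regular expressions below, the name `parse_query`, and the keys of the returned dictionary are assumptions made for illustration. The field names are written in English, as in the translated examples.

```python
import re

# select <fields> from <storage address> where <conditions>
QUERY_RE = re.compile(
    r"^\s*select\s+(?P<fields>.+?)\s+from\s+(?P<source>\S+)\s+where\s+(?P<where>.+?)\s*$",
    re.IGNORECASE,
)
# <field>&list='<comma-separated event sequence>'
LIST_RE = re.compile(r"(?P<field>\w+)&list\s*=\s*'(?P<items>[^']*)'")


def parse_query(statement):
    """Split an SQL-like statement into its three parameters and
    extract the event sequence from the where parameter."""
    match = QUERY_RE.match(statement)
    if match is None:
        raise ValueError("expected: select ... from ... where ...")
    where = match.group("where")
    sequence = LIST_RE.search(where)
    if sequence is None:
        raise ValueError("where parameter has no <field>&list='...' condition")
    return {
        "fields": [f.strip() for f in match.group("fields").split(",")],
        "source": match.group("source"),
        "where": where,
        "sequence_field": sequence.group("field"),
        "sequence": [item.strip() for item in sequence.group("items").split(",")],
    }


parsed = parse_query("select time, event from mydb.mydata where event&list='2,0,2,2'")
print(parsed["fields"], parsed["source"], parsed["sequence"])
```

The sequence items come back as strings. Converting them to numbers or to hash codes corresponds to steps 2 and 3 of the retrieval method.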

The result dictionary output in step 6 includes all matching sequences in the database that satisfy the conditions, and each matching sequence outputs at least the time field and the event field.

When the data type of the event sequence to be retrieved, as determined in step 2, is numeric, as shown in Table 4, the columnar database has two data fields, the time field and the event field. The data type of the event field is also numeric, so no processing is needed. Take as an example the event sequence to be retrieved formed by the four rows "2,0,2,2" taken as a whole. To find every event sequence equal to "2,0,2,2" in the columnar database, the event sequence to be retrieved is converted into the SQL-like query statement: select time, event from mydb.mydata where event&list='2,0,2,2', where event&list='2,0,2,2' means that '2,0,2,2' is searched for as one event sequence. In this embodiment of the present application there are two matching groups, underlined in Table 4. The result dictionary returned by the event sequence retrieval is: [{group 1: [{time: 01:01:02, event: 2}, {time: 01:01:03, event: 0}, {time: 01:01:04, event: 2}, {time: 01:01:05, event: 2}]}, {group 2: [{time: 01:01:07, event: 2}, {time: 01:01:08, event: 0}, {time: 01:01:09, event: 2}, {time: 01:01:10, event: 2}]}].

Table 4

Time | Event
01:01:01 | 0
<u>01:01:02</u> | <u>2</u>
<u>01:01:03</u> | <u>0</u>
<u>01:01:04</u> | <u>2</u>
<u>01:01:05</u> | <u>2</u>
01:01:06 | 3
<u>01:01:07</u> | <u>2</u>
<u>01:01:08</u> | <u>0</u>
<u>01:01:09</u> | <u>2</u>
<u>01:01:10</u> | <u>2</u>
01:01:11 | 0
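The exact query on the Table 4 data can be sketched in Python as follows. This is a minimal illustration that assumes matches are contiguous runs of rows in time order; the name `find_exact` and the keys `time` and `event` in the result are hypothetical.

```python
# Table 4, already sorted in ascending order of time.
TIMES = ["01:01:01", "01:01:02", "01:01:03", "01:01:04", "01:01:05", "01:01:06",
         "01:01:07", "01:01:08", "01:01:09", "01:01:10", "01:01:11"]
EVENTS = [0, 2, 0, 2, 2, 3, 2, 0, 2, 2, 0]


def find_exact(times, events, query):
    """Match the query as a single array against every contiguous run
    of rows and collect the matching groups in a result dictionary."""
    width = len(query)
    results = {}
    for start in range(len(events) - width + 1):
        if events[start:start + width] == query:
            group = len(results) + 1
            results[group] = [
                {"time": t, "event": e}
                for t, e in zip(times[start:start + width], events[start:start + width])
            ]
    return results


matches = find_exact(TIMES, EVENTS, [2, 0, 2, 2])
for group, rows in matches.items():
    print(group, [row["time"] for row in rows])
```

For the query `[2, 0, 2, 2]` this yields two groups, covering 01:01:02 to 01:01:05 and 01:01:07 to 01:01:10, in agreement with the underlined rows of Table 4.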

When the data type of the event sequence to be retrieved, as determined in step 2, is a single character, as shown in Table 5, the columnar database has two data fields, the time field and the event field. The data type of the event field is also a single character, so no processing is needed. Take as an example the event sequence to be retrieved formed by the four rows "CACC" taken as a whole. To find every "CACC" event sequence in the columnar database, the event sequence to be retrieved is converted into the SQL-like query statement: select time, event from mydb.mydata where event&list='C,A,C,C', where event&list='C,A,C,C' means that 'C,A,C,C' is searched for as one event sequence. In this embodiment of the present application there are two matching groups, underlined in Table 5. The result dictionary returned by the event sequence retrieval is: [{group 1: [{time: 01:01:02, event: C}, {time: 01:01:03, event: A}, {time: 01:01:04, event: C}, {time: 01:01:05, event: C}]}, {group 2: [{time: 01:01:07, event: C}, {time: 01:01:08, event: A}, {time: 01:01:09, event: C}, {time: 01:01:10, event: C}]}].

Table 5

Time | Character
01:01:01 | A
<u>01:01:02</u> | <u>C</u>
<u>01:01:03</u> | <u>A</u>
<u>01:01:04</u> | <u>C</u>
<u>01:01:05</u> | <u>C</u>
01:01:06 | D
<u>01:01:07</u> | <u>C</u>
<u>01:01:08</u> | <u>A</u>
<u>01:01:09</u> | <u>C</u>
<u>01:01:10</u> | <u>C</u>
01:01:11 | A

The columnar database may also include any attribute field other than the event field and the time field; such a field is configured to limit the query scope of the event sequence to be retrieved.

When the data type of the event sequence to be retrieved in step 3 is a string, as shown in Table 6, the columnar database holds three data fields: a time field, a device name field, and an event field. The device name field is an attribute field other than the event field and the time field, and it limits the query scope of the event sequence to be retrieved. The data in the event field are strings of 2 characters, so no processing is required. Take as an example the event sequence formed by the four rows "AA,AA,BB,BB" of Device 2 taken as a whole. To find every "AA,AA,BB,BB" sequence of Device 2 in the database, the event sequence to be retrieved is converted into the SQL-like query statement: select time, event from mydb.mydata where device name='Device 2' and event&list='AA,AA,BB,BB'. The where clause of the third parameter adds a restriction on the device name field, limiting the query to Device 2, and event&list='AA,AA,BB,BB' in the third parameter means that 'AA,AA,BB,BB' is retrieved as one event sequence. In this embodiment of the present application, one group of results matches; it is underlined in Table 6. The result dictionary returned by the event sequence retrieval is: [{Group 1: [{time: 01:01:05, event: AA}, {time: 01:01:08, event: AA}, {time: 01:01:11, event: BB}, {time: 01:01:14, event: BB}]}].

Table 6

Time | Device name | Event
01:01:01 | Device 1 | AA
01:01:02 | Device 2 | BB
01:01:03 | Device 3 | CC
01:01:04 | Device 1 | DD
<u>01:01:05</u> | <u>Device 2</u> | <u>AA</u>
01:01:06 | Device 3 | BB
01:01:07 | Device 1 | CC
<u>01:01:08</u> | <u>Device 2</u> | <u>AA</u>
01:01:09 | Device 3 | CC
01:01:10 | Device 1 | CC
<u>01:01:11</u> | <u>Device 2</u> | <u>BB</u>
01:01:12 | Device 3 | CC
01:01:13 | Device 1 | AA
<u>01:01:14</u> | <u>Device 2</u> | <u>BB</u>
01:01:15 | Device 3 | CC
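The Table 6 query combines an attribute restriction with sequence matching. The sketch below assumes that the attribute filter is applied first and that the sequence is then matched against the remaining rows in time order; this ordering is an inference from the returned result (the four matched rows are not adjacent in the full table), not something the application states. The function name `find_for_device` and the row layout are hypothetical.

```python
from typing import List, Sequence, Tuple

Row = Tuple[str, str, str]  # (time, device name, event); layout assumed


def find_for_device(rows: Sequence[Row], device: str,
                    pattern: Sequence[str]) -> List[List[Tuple[str, str]]]:
    """Filter rows by device, then match `pattern` on the remaining events."""
    if not pattern:
        return []
    # Step 1: narrow the query scope with the attribute field.
    scoped = [(t, e) for t, d, e in rows if d == device]
    # Step 2: slide a window over the scoped rows and compare whole sequences.
    width = len(pattern)
    hits = []
    for start in range(len(scoped) - width + 1):
        window = scoped[start:start + width]
        if [e for _, e in window] == list(pattern):
            hits.append(window)
    return hits


TABLE_6 = [
    ("01:01:01", "Device 1", "AA"), ("01:01:02", "Device 2", "BB"),
    ("01:01:03", "Device 3", "CC"), ("01:01:04", "Device 1", "DD"),
    ("01:01:05", "Device 2", "AA"), ("01:01:06", "Device 3", "BB"),
    ("01:01:07", "Device 1", "CC"), ("01:01:08", "Device 2", "AA"),
    ("01:01:09", "Device 3", "CC"), ("01:01:10", "Device 1", "CC"),
    ("01:01:11", "Device 2", "BB"), ("01:01:12", "Device 3", "CC"),
    ("01:01:13", "Device 1", "AA"), ("01:01:14", "Device 2", "BB"),
    ("01:01:15", "Device 3", "CC"),
]

matches = find_for_device(TABLE_6, "Device 2", ["AA", "AA", "BB", "BB"])
```

Filtering before matching is what lets the rows at 01:01:05, 01:01:08, 01:01:11 and 01:01:14 form one sequence even though rows of other devices lie between them.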

When the data type of the event sequence to be retrieved in step 3 is text, the data originally input into the columnar database has only two data fields, a time field and an event field. After the columnar database processes the data through steps a to g, the text data in the event field are assigned their hash code values according to the event data serial number dictionary obtained in step e. In this embodiment of the present application, the texts in the event field ("I am long text one", "I am long text two", "I am long text three", "I am long text four", "I am long text five", and "I am long text six") all have more than 8 characters and are therefore hash-coded as 1, 2, 3, 4, 5, and 6. After coding, the columnar database generates an event index column, as shown in Table 7. Take as an example the event sequence formed by the three rows "I am long text one, I am long text two, I am long text four" taken as a whole. To find every such event sequence in the database, the event sequence to be retrieved is converted into the SQL-like query statement: select time, event from mydb.mydata where event&list='I am long text one,I am long text two,I am long text four', where the event&list expression means that 'I am long text one, I am long text two, I am long text four' is retrieved as one event sequence. According to the event data serial number dictionary obtained in step e, the event sequence to be retrieved is assigned the hash code values "1,2,4"; that is, the actual matching in the columnar database searches the event index column for the sequence "1,2,4". In this embodiment of the present application, one group of results matches; it is underlined in Table 7. The result dictionary returned by the event sequence retrieval is: [{Group 1: [{time: 01:01:04, event: I am long text one}, {time: 01:01:05, event: I am long text two}, {time: 01:01:06, event: I am long text four}]}].

Table 7

[Table 7 appears as images in the original publication: Figure BDA0003613515360000121 and Figure BDA0003613515360000131.]

When the data type of the event sequence to be retrieved is text, searching the text data directly in the database makes the whole retrieval expensive: it cannot be performed efficiently and may even cause the retrieval process to stall or crash. For this reason, the columnar database of the present application performs steps d and f when the data are stored, hash-coding data whose type is text and whose number of characters is greater than 8. When an event sequence to be retrieved is input, its data type is checked during retrieval. If the data type is text, the event sequence to be retrieved is hash-coded according to the event data serial number dictionary in the columnar database, and the coded sequence is compared against the event index column of the coded columnar database. Retrieval over text data thus becomes retrieval over numeric data, which reduces retrieval complexity and increases retrieval speed.
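The store-time encoding and query-time translation described above can be sketched as follows. Several details are assumptions made for illustration: codes are assigned sequentially in order of first appearance (the application only says that a serial number dictionary is built after deduplication), the sample rows are invented because Table 7 is available only as an image, and all function names are hypothetical.

```python
from typing import Dict, List, Optional

LONG_TEXT_THRESHOLD = 8  # the source encodes text with more than 8 characters


def needs_encoding(events: List[str]) -> bool:
    """Check whether any event text is long enough to warrant encoding."""
    return any(len(text) > LONG_TEXT_THRESHOLD for text in events)


def build_event_dictionary(events: List[str]) -> Dict[str, int]:
    """Deduplicate the event texts and give each distinct text a serial number."""
    codes: Dict[str, int] = {}
    for text in events:
        if text not in codes:
            codes[text] = len(codes) + 1  # assumed numbering scheme
    return codes


def encode(events: List[str], codes: Dict[str, int]) -> Optional[List[int]]:
    """Translate texts to codes; None means a text never occurs in the column."""
    try:
        return [codes[text] for text in events]
    except KeyError:
        return None


def find_runs(index_column: List[int], query_codes: List[int]) -> List[int]:
    """Return the start offsets where the coded query occurs in the index column."""
    width = len(query_codes)
    if width == 0:
        return []
    return [s for s in range(len(index_column) - width + 1)
            if index_column[s:s + width] == query_codes]


# Invented sample column, not the contents of Table 7.
event_column = [
    "I am long text one", "I am long text two", "I am long text three",
    "I am long text one", "I am long text two", "I am long text four",
]
codes = build_event_dictionary(event_column)
index_column = encode(event_column, codes)
query_codes = encode(
    ["I am long text one", "I am long text two", "I am long text four"], codes)
starts = find_runs(index_column, query_codes)
```

A query text that is absent from the dictionary cannot match any row, so `encode` returning `None` lets a caller return an empty result without scanning the column.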

The columnar database and the method for retrieving event sequences of the present application support both exact retrieval and fuzzy retrieval. If fuzzy retrieval is selected in step 4, then, as shown in Table 8, the columnar database holds three data fields: a time field, a device name field, and an event field. Take as an example the event sequence formed by the four rows "A,A,B,C" of Device 2 taken as a whole, with a preset fuzzy query distance of 1. To find every sequence of Device 2 within an error distance of 1 from "A,A,B,C", the event sequence to be retrieved is converted into the SQL-like query statement: select time, event from mydb.mydata where device name='Device 2' and event&list='A,A,B,C'&dist=1, where device name='Device 2' limits the retrieval to Device 2 in the device name field, and event&list='A,A,B,C'&dist=1 means that 'A,A,B,C' is treated as one event sequence and retrieved with an error distance of 1. In this embodiment of the present application, "A,A,B,B" is a retrieval result whose error distance from the event sequence to be retrieved "A,A,B,C" is 1; it is underlined in Table 8. The result returned by the event sequence retrieval is: [{Group 1: [{time: 01:01:05, event: A}, {time: 01:01:08, event: A}, {time: 01:01:11, event: B}, {time: 01:01:14, event: B}]}].

Table 8

Time | Device name | Event
01:01:01 | Device 1 | A
01:01:02 | Device 2 | B
01:01:03 | Device 3 | C
01:01:04 | Device 1 | D
<u>01:01:05</u> | <u>Device 2</u> | <u>A</u>
01:01:06 | Device 3 | B
01:01:07 | Device 1 | C
<u>01:01:08</u> | <u>Device 2</u> | <u>A</u>
01:01:09 | Device 3 | C
01:01:10 | Device 1 | C
<u>01:01:11</u> | <u>Device 2</u> | <u>B</u>
01:01:12 | Device 3 | C
01:01:13 | Device 1 | A
<u>01:01:14</u> | <u>Device 2</u> | <u>B</u>
01:01:15 | Device 3 | C

Taking Table 3 again as an example, the columnar database now holds three data fields: a time field, an event field, and an event index field. Take as an example the event sequence formed by the two actions "engine maintenance, refueling" taken as a whole, with a preset fuzzy query distance of 1. To find every event sequence within an error distance of 1 from "engine maintenance, refueling", the event sequence to be retrieved is converted into the SQL-like query statement: select time, event from mydb.mydata where event&list='engine maintenance,refueling'&dist=1, where event&list='engine maintenance,refueling'&dist=1 means that 'engine maintenance, refueling' is treated as one event sequence and retrieved with an error distance of 1. In this embodiment of the present application, "engine maintenance, car wash, refueling", whose event indexes are "1,5,2", is a retrieval result whose error distance from the event sequence to be retrieved "engine maintenance, refueling", whose coded form is "1,2", is 1; it is underlined in Table 9. The result returned by the event sequence retrieval is: [{Group 1: [{time: 2022-01-04, event: engine maintenance}, {time: 2022-01-06, event: car wash}, {time: 2022-02-04, event: refueling}]}].

Table 9

Time | Event | Event index
2022-01-01 | Refueling | 2
2022-01-02 | Wiper maintenance | 4
2022-01-03 | Washer fluid refill | 3
<u>2022-01-04</u> | <u>Engine maintenance</u> | <u>1</u>
<u>2022-01-06</u> | <u>Car wash</u> | <u>5</u>
<u>2022-02-04</u> | <u>Refueling</u> | <u>2</u>
2022-02-05 | Refueling | 2
2022-02-06 | Washer fluid refill | 3

Referring to FIG. 5, the minimum distance for the fuzzy query in step 4 is calculated with the Levenshtein algorithm. When the data type is a string, the algorithm computes the minimum number of steps needed to convert any string in the event sequence to be retrieved into any string in the event field by adding, deleting, or replacing characters. This minimum number of steps is the minimum distance, and it is calculated as follows:

Step 41: Calculate the length n of any string strA in the event sequence to be retrieved, and calculate the length m of any string strB in the event sequence to be retrieved;

Step 42: If n=0, the minimum edit distance is m; if m=0, the minimum edit distance is n; if neither holds, go to step 43;

Step 43: Construct an (m+1)×(n+1) matrix Arr, and initialize its first row to 0~n and its first column to 0~m;

Step 44: In a double loop, traverse strA and, within it, traverse strB. If strA[i]=strB[j], then cost=0; otherwise cost=1. Take the minimum of Arr[j-1][i]+1, Arr[j][i-1]+1, and Arr[j-1][i-1]+cost, and assign it to Arr[j][i];

Step 45: After the loops finish, the last element of the matrix is the minimum edit distance, that is, the minimum number of steps.
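Steps 41 to 45 can be sketched directly in code. The function name `levenshtein` is hypothetical, and two details are assumptions: the loop indices are 1-based over the matrix, so characters are read at offsets i-1 and j-1, and the function accepts any sequence so that it can also compare lists of event codes, as the Table 9 example appears to require.

```python
from typing import List, Sequence


def levenshtein(str_a: Sequence, str_b: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions between two sequences."""
    n, m = len(str_a), len(str_b)          # step 41
    if n == 0:                             # step 42
        return m
    if m == 0:
        return n
    # Step 43: (m+1) x (n+1) matrix, first row 0..n, first column 0..m.
    arr: List[List[int]] = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(n + 1):
        arr[0][i] = i
    for j in range(m + 1):
        arr[j][0] = j
    # Step 44: fill the matrix; indices are shifted by one relative to the strings.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if str_a[i - 1] == str_b[j - 1] else 1
            arr[j][i] = min(
                arr[j - 1][i] + 1,         # edit touching str_b only
                arr[j][i - 1] + 1,         # edit touching str_a only
                arr[j - 1][i - 1] + cost,  # match or substitution
            )
    return arr[m][n]                       # step 45: last element of the matrix


distance_table_8 = levenshtein("AABB", "AABC")
distance_table_9 = levenshtein([1, 2], [1, 5, 2])
```

Both example distances evaluate to 1: one substitution for the Table 8 sequences and one insertion for the Table 9 code sequences.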

In the retrieval performed on Table 8 in the present application, "A,A,B,B" can be converted into "A,A,B,C" in a minimum of one step, so it is a retrieval result whose error distance from the event sequence to be retrieved "A,A,B,C" is 1. In the retrieval performed on Table 9, the event sequence to be retrieved "engine maintenance, refueling" becomes "1,2" after hash code values are assigned. Its minimum number of steps to "1,5,2", the hash-coded form of the event sequence "engine maintenance, car wash, refueling", is 1, so that sequence is a retrieval result that satisfies the preset fuzzy query distance of 1.

In summary, by setting an index on the time field, the columnar database of the present invention solves the problem that existing columnar databases are inconvenient for sorting and retrieving input data. By hash-coding event fields whose data type is text and whose number of characters is greater than 8, it solves the problem that retrieval in existing columnar databases is complex and inefficient when the event field is of text type. The method for retrieving event sequences of the present invention converts the event sequence to be retrieved into a query statement and, depending on the data type, hash-codes text data before performing the retrieval. For a fuzzy query, the event sequence to be retrieved is treated as one array, its minimum distance to sequences formed from the data in the event field is calculated, and sequences whose minimum distance is less than or equal to the preset fuzzy query distance are returned as fuzzy matches. This solves the problem that existing columnar databases are inconvenient for exact and fuzzy retrieval of event sequences and that retrieval is complex when the data type is text.

The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention may be modified or equivalently replaced without departing from the spirit and scope of those technical solutions.

Claims (10)

1. A columnar database for storing and retrieving event sequences, characterized in that the columnar database is capable of processing data as follows:
Step a: input data into the columnar database, the data being time series data that includes at least a time field and an event field;
Step b: set an index on the time field of the data input in step a;
Step c: determine the data type of the event field in the data; if the data type of the event field is text and there is data with more than 8 characters, go to step d; otherwise go to step g;
Step d: deduplicate the content of the event field;
Step e: perform hash coding according to the deduplication result of step d, and construct an event data serial number dictionary;
Step f: traverse the columnar database, assign the corresponding hash code value from the event data serial number dictionary to all data in the event field whose data type is text, and generate the corresponding event index column in the columnar database;
Step g: sort the columnar database in ascending time order according to the index set on the time field in step b.

2. The columnar database according to claim 1, characterized in that the data in the columnar database further includes any attribute field other than the event field and the time field.

3. A method for retrieving event sequences, applied to the columnar database according to any one of claims 1 to 2, characterized in that it specifically comprises the following steps:
Step 1: input the event sequence to be retrieved and convert it into a query statement;
Step 2: determine the data type of the event sequence to be retrieved; when the data type is numeric, a single character, or text in which every item has 8 characters or fewer, go to step 4; when the data type is text and there is data with more than 8 characters, go to step 3;
Step 3: hash-code the event sequence to be retrieved, so that its data type becomes numeric, then go to step 4;
Step 4: select the query type; for an exact query, use the event sequence to be retrieved as one array for exact matching retrieval; for a fuzzy query, perform fuzzy matching retrieval according to a preset fuzzy query distance, in which case the event sequence to be retrieved is treated as one array and its minimum distance to sequences formed from the data in the event field is calculated, the minimum distance being less than or equal to the preset fuzzy query distance;
Step 5: construct a result dictionary and add every matched sequence to it until all data in the event field have been traversed;
Step 6: output the result dictionary as the result.

4. The method for retrieving event sequences according to claim 3, characterized in that the event sequence to be retrieved is a sequence composed of multiple events arranged in order.

5. The method for retrieving event sequences according to claim 3, characterized in that the query statement in step 1 is an SQL-like statement.

6. The method for retrieving event sequences according to claim 5, characterized in that the query statement comprises a first parameter, a second parameter, and a third parameter arranged in that order, the first parameter being configured to select the fields to query, the second parameter being configured to specify the storage address to query, and the third parameter being configured to specify the retrieval condition.

7. The method for retrieving event sequences according to claim 6, characterized in that the third parameter includes the event sequence to be retrieved.

8. The method for retrieving event sequences according to claim 3, characterized in that the minimum distance for the fuzzy query in step 4 is calculated with the Levenshtein algorithm, which, when the data type is a string, computes the minimum number of steps needed to convert any string in the event sequence to be retrieved into any string in the event field by adding, deleting, or replacing characters.

9. The method for retrieving event sequences according to claim 8, characterized in that the minimum number of steps is calculated as follows:
Step 41: calculate the length n of any string strA in the event sequence to be retrieved, and calculate the length m of any string strB in the event sequence to be retrieved;
Step 42: if n=0, the minimum edit distance is m; if m=0, the minimum edit distance is n; if neither holds, go to step 43;
Step 43: construct an (m+1)×(n+1) matrix Arr, and initialize its first row to 0~n and its first column to 0~m;
Step 44: in a double loop, traverse strA and, within it, traverse strB; if strA[i]=strB[j], then cost=0, otherwise cost=1; take the minimum of Arr[j-1][i]+1, Arr[j][i-1]+1, and Arr[j-1][i-1]+cost, and assign it to Arr[j][i];
Step 45: after the loops finish, the last element of the matrix is the minimum edit distance, that is, the minimum number of steps.

10. The method for retrieving event sequences according to claim 3, characterized in that the result dictionary output in step 6 includes all matching sequences in the database that meet the conditions, each matching sequence outputting at least the time field and the event field.

CN202210437897.9A 2022-04-25 2022-04-25 Columnar database and method for retrieving sequence of events Pending CN114756553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210437897.9A CN114756553A (en) 2022-04-25 2022-04-25 Columnar database and method for retrieving sequence of events

Publications (1)

Publication Number Publication Date
CN114756553A true CN114756553A (en) 2022-07-15

Family

ID=82332464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210437897.9A Pending CN114756553A (en) 2022-04-25 2022-04-25 Columnar database and method for retrieving sequence of events

Country Status (1)

Country Link
CN (1) CN114756553A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination